It is quite common practice to carry everything that was in the initial data throughout the whole analysis
Reducing the data size would reduce the needed resources (storage space, RAM, etc.)
Usually, having a mapping between the initial data and the processed data is sufficient
“I hope I didn’t forget anything”-problem
This is especially problematic when working remotely, as transferring full datasets is slower than transferring increments
The problem can be minimized by using data analysis libraries such as pandas that make it easy to merge data from multiple sources (see the sketch below)
This is closely related to the “She’ll have the steak”-problem
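As a minimal sketch of the mapping idea with pandas (the column names and the processing step here are hypothetical), one can carry only an ID column and the columns under analysis, and merge everything else back from the initial data when it is actually needed:

```python
import pandas as pd

# Hypothetical initial data: an ID column, the column we analyse, and bulky
# columns that are not needed during the analysis itself.
initial = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "signal":    [0.10, 0.40, 0.35, 0.80],
    "metadata":  ["a", "b", "c", "d"],   # bulky columns not needed right now
})

# Work only with the columns the analysis actually uses.
processed = initial[["sample_id", "signal"]].copy()
processed["score"] = processed["signal"] * 2.0   # placeholder processing step

# Later, recover anything from the initial data via the sample_id mapping.
enriched = processed.merge(initial[["sample_id", "metadata"]], on="sample_id")
print(enriched)
```

The intermediate results then stay small, and the full initial data only needs to be touched at the merge step.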
“She’ll have the steak”-problem
Quite often workflows happen in the following way:
Data inspection tools are written to inspect data.
Coding framework & data format is chosen based on how easy data is to load for inspection.
Data needs to be used with a model / solver, but data format has already been chosen.
Data fits the model / solver poorly.
A better solution would be to look at the problem from the other end:
Model / solver is chosen.
Coding framework & data format is chosen based on how easy data is to use with a model.
Data inspection tools are written based on the framework.
Data fits well into the model / solver.
Major frameworks usually have a preferred way of working with data (e.g. tidy data with pandas, data loaders for deep learning toolkits, NetCDF for physics simulations, …)
The downside is that one might need to write visualization tools around the data formats required by these frameworks
The upside is that other people are doing the same thing: there are lots of existing tools (see the data loader sketch below)
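As one example of adapting to a framework’s preferred data format, here is a minimal sketch assuming PyTorch (the dataset class and the random data are made up): once the data is exposed through the Dataset interface, the framework’s existing DataLoader tooling (batching, shuffling, parallel workers) can be reused instead of writing custom loading code.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    """Wraps in-memory samples and labels (hypothetical data)."""

    def __init__(self, samples: torch.Tensor, labels: torch.Tensor):
        self.samples = samples
        self.labels = labels

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        return self.samples[idx], self.labels[idx]

# 1000 hypothetical samples with 16 features and binary labels.
dataset = ArrayDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# Batching and shuffling come from the framework, not from custom code.
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch_samples, batch_labels in loader:
    pass  # model / inspection code would go here
```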
Profiling vs understanding
Quite often when talking about I/O we’ll use terms such as megabytes per second (MB/s) or I/O operations per second (IOPS)
Profiling is good, but understanding what we’re trying to accomplish with the data is often more important
Knowing that you’re shoveling crap fast doesn’t help with the fact that you’re still shoveling crap
Computers will try to do what they are told to do, even if that is inefficient
Often we need to ask how the computer perceives what we’re telling it to do
Example in understanding a problem
A common problem in deep learning is related to randomness:
We want to randomize our data ordering for each epoch
This is problematic, as random access is much slower than sequential access (see the benchmark sketch below)
To fix this, we give up some degree of randomness for increased efficiency
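A rough way to see the gap is a micro-benchmark (illustrative only, not part of the original material): read the same fixed-size chunks of a temporary file first in order, then in a random order via seek(). On a spinning disk or with a cold cache the difference is dramatic; with a small file in a warm page cache it may be modest.

```python
import os
import random
import tempfile
import time

CHUNK = 64 * 1024     # 64 KiB per read
N_CHUNKS = 1024       # 64 MiB test file

# Create a throwaway test file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for _ in range(N_CHUNKS):
        f.write(os.urandom(CHUNK))
    path = f.name

def read_chunks(order):
    """Read the chunks in the given order, returning the elapsed time."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        for i in order:
            f.seek(i * CHUNK)
            f.read(CHUNK)
    return time.perf_counter() - start

sequential_time = read_chunks(range(N_CHUNKS))

shuffled = list(range(N_CHUNKS))
random.shuffle(shuffled)
random_time = read_chunks(shuffled)

print(f"sequential: {sequential_time:.3f} s, random: {random_time:.3f} s")
os.remove(path)
```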
Consider shuffling a deck of playing cards:
Instead of shuffling the whole deck, we split it into multiple batches.
We shuffle the batches randomly.
We shuffle the data within each batch.
Is this ordering random enough?
More often than not, it is.
This kind of I/O can be done sequentially.
We still get randomization, but not complete randomization.
The vast majority of big data analysis uses this approach.
This was used in the demo as well.
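The same idea as a minimal NumPy sketch (illustrative only, not the code from the demo): each batch covers a contiguous index range that can be read from disk sequentially, the order of the batches is randomized, and the samples are shuffled only within each batch.

```python
import numpy as np

rng = np.random.default_rng()

n_samples = 20   # illustrative sizes
batch_size = 5

# 1. Split the sample indices into contiguous batches (sequential on disk).
batches = np.arange(n_samples).reshape(-1, batch_size)

# 2. Visit the batches in a random order.
batch_order = rng.permutation(len(batches))

# 3. Shuffle the samples within each batch (in memory, after a sequential read).
epoch_order = []
for b in batch_order:
    batch = batches[b].copy()
    rng.shuffle(batch)
    epoch_order.extend(batch.tolist())

# Partially randomized order; each batch still maps to one contiguous read.
print(epoch_order)
```

Shuffle buffers in deep-learning data pipelines make a closely related trade-off between randomness and sequential reads.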
Conclusions
I/O problems are often tied to the way we’re working
To solve them, we need to look at our workflows
Looking at the problem from the computer’s point of view can help