Many new programming paradigms promise to be faster by leveraging more cores and more machines (and bringing more system administration headaches, though this is rarely stated). The reality is that these paradigms often do not take into account the growing mismatch between memory speed and CPU speed (see http://www.blosc.org/docs/StarvingCPUs-CISE-2010.pdf), and this mismatch is becoming utterly critical for getting maximum performance out of data handling applications.
During my tutorial, I will introduce different data containers for handling different kinds of data, and will propose experiments with them while explaining why some adapt better to the task at hand. I will start with a quick introduction to the built-in Python data containers (lists, dicts, arrays…), continue with the well-known in-memory NumPy and pandas containers as well as on-disk HDF5/PyTables, and end with bcolz (https://github.com/Blosc/bcolz), a novel way to store and quickly retrieve data that uses chunking and compression techniques to leverage the memory hierarchy of modern computer architectures.
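The chunking-plus-compression idea behind bcolz can be sketched with the standard library alone: store a large numeric array as independently compressed chunks, so that retrieving one element only decompresses the chunk that contains it. This is a minimal illustration of the concept, not the bcolz API; the chunk size and helper names are arbitrary choices for the sketch.

```python
import array
import zlib

CHUNK = 4096  # elements per chunk (bcolz tunes chunk sizes to fit CPU caches)

def compress_chunks(values, chunk=CHUNK):
    """Split a sequence of integers into independently zlib-compressed chunks."""
    chunks = []
    for start in range(0, len(values), chunk):
        # Pack the slice as 8-byte signed integers before compressing.
        buf = array.array('q', values[start:start + chunk]).tobytes()
        chunks.append(zlib.compress(buf))
    return chunks

def get(chunks, i, chunk=CHUNK):
    """Retrieve element i by decompressing only the chunk that holds it."""
    raw = zlib.decompress(chunks[i // chunk])
    return array.array('q', raw)[i % chunk]

data = list(range(100_000))
chunks = compress_chunks(data)
compressed_size = sum(len(c) for c in chunks)
print(compressed_size < len(data) * 8)  # True: regular data compresses well
print(get(chunks, 54_321))  # 54321
```

Because each chunk is compressed on its own, a point lookup touches a few kilobytes instead of the whole array, and the compressed chunks are small enough to travel through the cache hierarchy quickly; this is the trade-off the tutorial explores with bcolz.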
People attending will need a working Python setup with the IPython notebook, NumPy, pandas, PyTables and bcolz installed. The Anaconda or Enthought Canopy distributions are recommended, but any other means of installation (e.g. pip) will do.