I/O

The la package provides two ways to archive larrys: using archive functions such as save and load and using the dictionary-like interface of the IO class. Both I/O methods store larrys in HDF5 1.8 format and require h5py.

Not all data types can be saved to a HDF5 archive; see Data types for details.

Archive functions

One method to archive larrys is to use the archive functions (see IO class for a second, more powerful method).

To demonstrate, let’s start by creating a larry:

>>> import la
>>> y = la.larry([1, 2, 3])

Next let’s save the larry, y, in an archive using the save function:

>>> la.save('/tmp/data.hdf5', y, 'y')

The contents of the archive:

>>> la.archive_directory('/tmp/data.hdf5')
['y']

To load the larry we use the load function:

>>> z = la.load('/tmp/data.hdf5', 'y')

The entire larry is loaded from the archive. The load function does not have an option to load parts of a larry, such as a slice. (To load parts of a larry from the archive, see IO class.)

The name of the larry in save and load statements (and in all the other archive functions) must be a string. But the string may contain one or more forward slashes (‘/’), which is to say that larrys can be archived in a hierarchical structure:

>>> la.save('/tmp/data.hdf5', y, '/experiment/2/y')
>>> z = la.load('/tmp/data.hdf5', '/experiment/2/y')
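As a rough illustration of what the slashes mean, a key can be read as a path of nested groups that ends in the name of the larry. The helper below is hypothetical (split_key is not part of la) and uses only plain Python:

```python
def split_key(key):
    """Split an archive key into (parent groups, larry name).

    Hypothetical helper for illustration only; it is not part of la.
    """
    # Drop empty pieces so that a leading '/' is optional.
    parts = [p for p in key.split('/') if p]
    if not parts:
        raise ValueError('key must contain at least one name')
    return parts[:-1], parts[-1]


groups, name = split_key('/experiment/2/y')
# groups == ['experiment', '2'], name == 'y'
```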

Instead of passing a filename to the archive functions you can optionally pass a h5py File object:

>>> import h5py
>>> f = h5py.File('/tmp/data.hdf5', 'a')  # open read/write; needed for delete below
>>> z = la.load(f, 'y')

To check if a larry is in the archive:

>>> la.is_archived_larry(f, 'y')
True

To delete a larry from the archive:

>>> la.delete(f, 'y')
>>> la.is_archived_larry(f, 'y')
False

HDF5 does not keep track of the freespace in an archive across opening and closing of the archive. After repeatedly opening, closing and deleting larrys from the archive, the unused space in the archive may grow. The only way to reclaim the freespace is to repack the archive:

>>> f = la.repack(f)  # returns a new File object; the old one is no longer usable

To see how much space the archive occupies on disk and how much of it is freespace, see IO class.

For further information on the archive functions see Archive function reference.

IO class

The IO class provides a dictionary-like interface to the archive.

To demonstrate, let’s start by creating two larrys, a and b:

>>> import la
>>> a = la.larry([1.0, 2.0, 3.0, 4.0])
>>> b = la.larry([[1, 2],[3, 4]])

To work with an archive you need to create an IO object:

>>> io = la.IO('/tmp/data.hdf5')

where /tmp/data.hdf5 is the path to the archive used in this example.

Let’s add (save) two larrys, a and b, to the archive and then list the contents of the archive:

>>> io['a'] = a
>>> io['b'] = b
>>> io
larry  dtype    shape
----------------------
a      float64  (4,)
b      int64    (2, 2)

We can get a list of the keys (larrys) in the archive:

>>> io.keys()
['a', 'b']

>>> for key in io: print key
...
a
b

>>> len(io)
2

Are the larrys a (yes) and c (no) in the archive?

>>> 'a' in io
True
>>> 'c' in io
False

>>> list(set(io) & set(['a', 'c']))
['a']

When we load data from the archive using an IO object, we get a lara not a larry:

>>> z = io['a']
>>> type(z)
<class 'la.io.lara'>

Whereas larry stores his data in a numpy array and a list (labels), lara stores her data in a h5py Dataset object and a list (labels). The reason that an IO object returns a lara instead of a larry is that you may want to extract only part of a larry, such as a slice, from the archive.

To convert a lara object into a larry, just index into the lara (the indexing below is the slice [:2]):

>>> z = io['a'][:2]
>>> type(z)
<class 'la.deflarry.larry'>

>>> z
label_0
    0
    1
x
array([ 1.,  2.])

In the example above, only the first two items in the array were loaded from the archive—a feature that comes in handy when you only need a small part of a large larry.
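The idea of keeping the labels in memory while reading the data only when it is indexed can be pictured with a toy class. This is a sketch of the concept, not la's implementation: LazyLabeled and the read callable are made-up names, and a plain list stands in for the data on disk.

```python
class LazyLabeled(object):
    """Toy stand-in for the lara idea: labels in memory, data read on demand.

    Illustrative only; this is not la's implementation. `read` is any
    callable that takes a slice and returns the matching data.
    """

    def __init__(self, labels, read):
        self.labels = list(labels)   # held in memory, usable right away
        self._read = read            # called only when the object is indexed

    def labelindex(self, label):
        # Position of a label along the (single) axis.
        return self.labels.index(label)

    def __getitem__(self, index):
        if not isinstance(index, slice):
            raise TypeError('this sketch supports slices only')
        # Only the requested part of the data is read.
        return self.labels[index], self._read(index)


stored = [1.0, 2.0, 3.0, 4.0]        # pretend this lives on disk
reads = []                           # records every read request


def read(index):
    reads.append(index)
    return stored[index]


z = LazyLabeled(range(4), read)
idx = z.labelindex(1)                # uses the labels only; nothing is read
labels, data = z[:2]                 # the first read happens here
```

After the last line, reads holds a single entry, which shows that only one partial read was made.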

Although the data from a larry is not loaded until you index into the lara, the entire label is always loaded. That allows you to use the labels right away:

>>> z = io['a']
>>> type(z)
<class 'la.io.lara'>

>>> idx = z.labelindex(1, axis=0)
>>> type(z[:idx])
<class 'la.deflarry.larry'>

To delete the larry b from the archive:

>>> del io['b']

HDF5 does not keep track of the freespace in an archive across opening and closing of the archive. After repeatedly opening, closing and deleting larrys from the archive, the unused space in the archive may grow. The only way to reclaim the freespace is to repack the archive:

>>> io.repack()

Repack means to transfer all the larrys to a new archive (with the same name) and delete the old archive.

Before looking at the size of the archive, let’s add some bigger larrys:

>>> import numpy as np
>>> io['rand'] = la.rand(1000, 1000)
>>> io['randn'] = la.randn(1000, 1000)
>>> io
larry  dtype    shape
----------------------------
a      float64  (4,)
rand   float64  (1000, 1000)
randn  float64  (1000, 1000)

How many MB does the archive occupy on disk?

>>> io.space / 1e6
16.038903999999999  # MB

How much freespace is there?

>>> io.freespace / 1e6
0.0068399999999999997  # MB

Let’s delete randn from the archive and look at the space and freespace:

>>> del io['randn']
>>> io.space / 1e6
16.038903999999999  # MB
>>> io.freespace / 1e6
8.0228400000000004  # MB

So deleting a larry from the archive does not reduce the size of the archive unless you repack:

>>> io.repack()
>>> io.space / 1e6
8.02224  # MB
>>> io.freespace / 1e6
0.0061760000000000001  # MB

(Sometimes freespace will get reused when saving new larrys to the archive. If any HDF5 users are reading this, could you tell me when freespace is reused and when it is not?)

The IO class takes an optional argument that can be used to automatically repack the archive when the freespace after deleting a larry exceeds a specified amount. The following IO object will repack the archive every time a delete causes the freespace in the archive to exceed 100 MB:

>>> io = la.IO('/tmp/data.hdf5', max_freespace=100e6)
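The rule itself is a simple threshold test. The function below is a hypothetical restatement of it (needs_repack is not part of la), assuming that freespace and max_freespace are both measured in bytes and that the default of infinity means the archive is never repacked:

```python
def needs_repack(freespace, max_freespace=float('inf')):
    """Return True if freespace (in bytes) exceeds the threshold.

    Hypothetical helper that mirrors the rule described above; it is not
    la's code. The default threshold of infinity means "never repack".
    """
    return freespace > max_freespace


# Freespace reported earlier after deleting randn: about 8.02 MB.
needs_repack(8.02284e6)                       # False: default never repacks
needs_repack(8.02284e6, max_freespace=100e6)  # False: below 100 MB
needs_repack(8.02284e6, max_freespace=5e6)    # True: above 5 MB
```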

You can iterate through the keys or the values or the (key, value) pairs of an IO object:

>>> for key, value in io.iteritems():
...     print key, value.shape
...
a (4,)
rand (1000, 1000)

The keys (larrys) in an IO object (archive) must be strings. But the string may contain one or more forward slashes (‘/’), which is to say that larrys can be archived in a hierarchical structure:

>>> io['/experiment/2/y'] = la.larry([1, 2, 3])
>>> z = io['/experiment/2/y']

What filename is associated with the archive?

>>> io.filename
'/tmp/data.hdf5'

For further information on the IO class see IO class reference.

Limitations

There are several limitations of the archiving method used by the la package. In this section we will discuss two limitations:

  • By default, the freespace in the archive is not automatically reclaimed after deleting larrys.
  • In order to archive a larry, its data and labels must be of a type supported by HDF5.

Freespace

HDF5 does not keep track of the freespace in an archive across opening and closing of the archive. Therefore, after opening, closing and deleting larrys from the archive, the unused space in the archive may grow. The only way to reclaim the freespace is to repack the archive.

You can use the utility provided by HDF5 to repack the archive or you can use the repack method (see IO class) or function (see Archive functions) in the la package.

Data types

A larry can have labels of mixed type, for example strings and numbers. However, when archiving larrys in HDF5 format the labels are converted to Numpy arrays and the elements of a Numpy array must be of the same type. Therefore, to archive a larry the labels along any one dimension must be of the same type and that type must be one that is recognized by h5py and HDF5: strings and scalars. An exception is made for labels with dates of type datetime.date, datetime.time, and datetime.datetime: la automatically converts them to tuples of integers when saving and back to dates when loading.
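A quick way to see whether the labels along one axis meet the same-type requirement is to check them before saving. The sketch below is hypothetical and not part of la: common_label_type is a made-up pre-check, and the date round trip uses ordinal integers as one possible encoding, which may differ from what la does internally.

```python
import datetime


def common_label_type(labels):
    """Return the single type shared by all labels, or raise TypeError.

    Hypothetical pre-check for illustration; it is not part of la.
    """
    types = set(type(label) for label in labels)
    if len(types) != 1:
        names = sorted(t.__name__ for t in types)
        raise TypeError('labels must share one type, got: %s' % names)
    return types.pop()


def dates_to_ints(dates):
    # One possible integer encoding (proleptic Gregorian ordinals);
    # assumed here for illustration, la's own encoding may differ.
    return [d.toordinal() for d in dates]


def ints_to_dates(ints):
    return [datetime.date.fromordinal(i) for i in ints]


common_label_type(['a', 'b', 'c'])       # returns the type str

dates = [datetime.date(2010, 1, 1), datetime.date(2010, 1, 2)]
assert ints_to_dates(dates_to_ints(dates)) == dates   # lossless round trip
```

Labels of mixed type, such as ['a', 1], make common_label_type raise a TypeError.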

Archive format

An archive is constructed from two types of HDF5 objects: Groups and Datasets. Groups can contain Datasets and more Groups. Datasets can contain arrays.

larrys are stored in a HDF5 Group. The name of the group, often referred to in this manual as the key, is the name of the larry. The group is given an attribute called ‘larry’ and assigned the value True. Inside the group are several HDF5 Datasets. For a 2d larry, for example, there are three datasets: one to hold the data (named ‘x’) and two to hold the labels (named ‘0’ and ‘1’). In general, for an n-dimensional larry there are n+1 datasets. Each label Dataset is given an attribute called ‘isdate’ which is set to True if all labels along the given axis are dates of type datetime.date and False otherwise. If ‘isdate’ is True then the labels are converted to integers before saving and converted back to datetime.date objects when loading.
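To make the n+1 count concrete, the dataset names for a given key can be listed with a few lines of Python. The function is hypothetical (dataset_paths is not part of la) and only restates the layout described above:

```python
def dataset_paths(key, ndim):
    """List the dataset paths for a larry stored under `key`.

    Hypothetical helper based on the layout described above: one data
    dataset named 'x' plus one label dataset per axis ('0', '1', ...).
    """
    group = '/' + key.strip('/')
    paths = [group + '/x']
    paths.extend('%s/%d' % (group, axis) for axis in range(ndim))
    return paths


dataset_paths('price', 2)
# ['/price/x', '/price/0', '/price/1']
```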

Reference

This section contains the reference guide to the archive functions (Archive function reference) and to the IO class methods (IO class reference).

Archive function reference

la.save(file, lar, key)

Save a larry in HDF5 format.

Each larry is stored in a HDF5 group. The group is assigned an attribute named ‘larry’ which is set to True. Inside the group is a HDF5 dataset containing the data (named ‘x’) and one dataset for each dimension of the label (named str(dimension)). For example, a 2d larry named ‘price’ is stored in a group called ‘price’ that contains a dataset called ‘x’ (the price) and two datasets called ‘0’ and ‘1’ (the labels).

Before saving, the labels are converted to Numpy arrays, one array for each dimension. Therefore, to save a larry in HDF5 format, the elements of a label along any one dimension must be of the same type and that type must be supported by HDF5.

If all labels along an axis are dates of type datetime.date, then the dates are converted to integers before saving and the HDF5 Dataset used to store that label is assigned an attribute named ‘datetime_type’ which is set to ‘date’. When loading the larry, the dates will automatically be converted back to datetime.date dates.

Similarly, if the labels are of type datetime.time, then we convert to integers and set the attribute to ‘time’ when saving and automatically convert back to datetime.time when loading.

Finally, if the labels are datetime.datetime then the attribute is set to ‘datetime’ and the labels are converted to tuples when saving and back to datetime.datetime when loading.

Parameters :

file : str or h5py.File

Filename or h5py.File object of the archive.

lar : larry

Data to save.

key : str

Name of larry.

See also

la.load
Load larrys without a dictionary-like interface.
la.IO
A dictionary-like interface to the archive.

Examples

Create a larry:

>>> x = la.larry([1, 2, 3])

Save the larry:

>>> la.save('/tmp/x.hdf5', x, 'x')

la.load(file, key)

Load a larry from a HDF5 archive.

Each larry is stored in a HDF5 group. The group is assigned an attribute named ‘larry’ which is set to True. Inside the group is a HDF5 dataset containing the data (named ‘x’) and one dataset for each dimension of the label (named str(dimension)). For example, a 2d larry named ‘price’ is stored in a group called ‘price’ that contains a dataset called ‘x’ (the price) and two datasets called ‘0’ and ‘1’ (the labels).

Parameters :

file : str or h5py.File

Filename or h5py.File object of the archive.

key : str

Name of larry.

Returns :

out : larry

Returns the larry from the archive.

See also

la.save
Save larrys without a dictionary-like interface.
la.IO
A dictionary-like interface to the archive.

Examples

Create a larry:

>>> x = la.larry([1, 2, 3])

Save the larry:

>>> la.save('/tmp/x.hdf5', x, 'x')

Now load it:

>>> y = la.load('/tmp/x.hdf5', 'x')

la.io.delete(file, key)

Delete a larry from a HDF5 archive.

Parameters :

file : str or h5py.File

Filename or h5py.File object of the archive.

key : str

Name of larry.

Returns :

out : None

Nothing is returned, just None.

See also

la.save
Save larrys without a dictionary-like interface.
la.load
Load larrys without a dictionary-like interface.
la.IO
A dictionary-like interface to the archive.

Examples

Create a larry:

>>> x = la.larry([1, 2, 3])

Save the larry:

>>> la.save('/tmp/x.hdf5', x, 'x')

Now delete it:

>>> la.delete('/tmp/x.hdf5', 'x')

la.repack(file)

Repack archive to remove freespace.

Parameters :

file : h5py File or str

A h5py File instance of an archive, such as h5py.File('/tmp/data.hdf5'), or a filename.

Returns :

file : h5py File or None

If the input is a h5py.File then a h5py File instance of the repacked archive is returned. The input File instance will no longer be usable. If the input is a filename, then None is returned.


la.is_archived_larry(file, key)

True if the key (larry name) is in the archive, False otherwise.


la.archive_directory(file)

Return a list of the keys (larry names) in the archive.

IO class reference

class la.IO(filename, max_freespace=inf)

Save and load larrys in HDF5 format using a dictionary-like interface.

Methods

__init__(filename, max_freespace=inf)

Save and load larrys in HDF5 format using a dictionary-like interface.

Dictionaries are made up of (key, value) pairs. In an IO object, a key is the name of a larry. The value part of the dictionary is a larry when saving data and is a lara, a larry-like archive object, when loading data.

(h5py has the same duality. When saving, the values are Numpy arrays; when loading the values are h5py Dataset objects.)

To convert a lara into a larry just index into the lara.

The reason why loading does not return a larry is that you may not want to load the entire larry which could, for example, be very large.

A lara loads the labels but does not load the array data until you index into it.

Each larry is stored in a HDF5 group. The group is assigned an attribute named ‘larry’ which is set to True. Inside the group is a HDF5 dataset containing the data (named ‘x’) and one dataset for each dimension of the label (named str(dimension)). For example, a 2d larry named ‘price’ is stored in a group called ‘price’ that contains a dataset called ‘x’ (the price) and two datasets called ‘0’ and ‘1’ (the labels).

Before saving, the labels are converted to Numpy arrays, one array for each dimension. Therefore, to save a larry in HDF5 format, the elements of a label along any one dimension must be of the same type and that type must be supported by HDF5.

Parameters :

filename : str

The filename is the path to the archive. If the file does not exist, it will be created.

max_freespace : scalar

If the size of the freespace (unused archive space) exceeds max_freespace bytes after a larry is deleted from the archive, then the archive is repacked. The default (np.inf) is to never repack. Repack means to transfer all the larrys to a new archive (with the same name) and delete the old archive. HDF5 does not reuse the freespace across opening and closing of the archive.

Returns :

A dictionary-like IO object.

See also

la.save
Save larrys without a dictionary-like interface.
la.load
Load larrys without a dictionary-like interface.

Notes

  • Because the archive interface is dictionary-like, data will be overwritten when assigning a (key, value) pair if the key already exists in the archive.
  • Deleting a larry from the archive only unlinks it. You won’t be able to reuse the unlinked space if you close the connection. This is a limitation of the HDF5 format, not a limitation of the IO class or h5py. You can repack the archive with the repack method or have it done automatically for you: see freespace above.

Examples

Save a larry in the archive:

>>> import la
>>> io = la.IO('/tmp/dataset.hdf5')
>>> io['x'] = la.larry([1,2,3])  # <-- Save

Examine the contents of the archive:

>>> io
larry  dtype  shape
------------------
x      int64  (3,)

Overwrite the contents of x in the archive:

>>> io['x'] = la.larry([4.0])  # <-- Overwrite

Load from the archive:

>>> y = io['x']  # <-- Load
>>> type(y)
<class 'la.io.io.lara'>
>>> type(y[:])
<class 'la.deflarry.larry'>
>>> type(y[2:])
<class 'la.deflarry.larry'>

Test if x is in the archive:

>>> 'x' in io
True
>>> del io['x']  # <-- Delete (unlink)
>>> 'x' in io
False

keys()

Return a list of larry names (keys) in archive.

values()

Return a list of larry objects (values) in archive.

has_key(key)

True if key is in archive, False otherwise.

items()

Return a list of all (key, value) pairs.

iterkeys()

An iterator over the keys.

itervalues()

An iterator over the values.

iteritems()

An iterator over (key, value) items.

merge(key, lar, update=False)

Merge, or optionally update, a larry with a second larry.

See larry.merge for details.

Note: the entire larry is loaded from the archive, merged with lar and then the merged larry is saved back to the archive. The resize function of h5py is not used. In other words, this function might not be practical for very large larrys.
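The load, merge and save sequence described in the note can be sketched as below. This shows the pattern only and is not la's source: merge_into_archive is a made-up name, and io is assumed to be any dictionary-like archive whose values return an in-memory object with a merge method when indexed with [:].

```python
def merge_into_archive(io, key, lar, update=False):
    """Load, merge and save back, following the note above.

    A sketch of the pattern only, not la's source. `io` is any
    dictionary-like archive whose values return an in-memory object with
    a merge(other, update=...) method when indexed with [:].
    """
    stored = io[key][:]                         # loads the entire larry
    merged = stored.merge(lar, update=update)   # merge happens in memory
    io[key] = merged                            # overwrite the archived copy
    return merged
```

Because the whole larry passes through memory, the cost grows with the size of the archived larry rather than with the size of lar.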

space

The number of bytes used by the archive.

freespace

The number of bytes of freespace in the archive.

repack()

Repack archive to remove freespace.

clear()

Warning: this will delete (unlink) all larrys from the archive!
