Lazy datasets

This module provides lightweight wrappers that expose array-like, lazy access to datasets without loading full arrays into memory.

The interface is defined by DatasetLike and mimics that of h5py datasets, allowing users to slice and access data on demand. Specific implementations load the necessary data and apply transformations, such as stacking, scaling, summing, etc., only for the requested slices.

xyz = LazyStackedDataset([ds1, ds2, ds3])

# Loaded and computed only for this slice
xyz_slice = xyz[100:200]

combined_signal = LazySumDataset([ds1, ds2])
combined_flags = LazyBooleanOrDataset([flags1, flags2])

# Loaded and computed only for this slice (read-only)
signal_slice = combined_signal[0:1024]
flags_slice = combined_flags[0:1024]

Some operations also support writing. LazyStackedDataset distributes written values to the underlying datasets, and LazyScaledDataset applies the inverse scaling operation before writing:

xyz[100:200] = new_data

scaled = LazyScaledDataset(ds, scale_factor=2.0)
scaled[50:100] = new_values  # Writes new_values / 2.0 to ds[50:100]

Dataset interface

class mojito.lazy.DatasetLike(*args, **kwargs)[source]

Protocol for (potentially lazy) datasets with NumPy-style access.

Implementations must expose shape, ndim, and dtype, and support slicing through __getitem__ similarly to h5py.Dataset. Writing via __setitem__ is optional and may not be supported by all implementations.
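Any object providing these attributes satisfies the protocol; a plain NumPy array already does. As an illustration only (the class name below is hypothetical and not part of mojito), a minimal read-only implementation might look like this:

```python
import numpy as np

# Hypothetical minimal implementation of the DatasetLike protocol:
# exposes shape, ndim, dtype and NumPy-style slicing, but is read-only,
# so __setitem__ raises NotImplementedError as the protocol allows.
class ReadOnlyArrayDataset:
    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def shape(self):
        return self._data.shape

    @property
    def ndim(self):
        return self._data.ndim

    @property
    def dtype(self):
        return self._data.dtype

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        raise NotImplementedError("this dataset is read-only")

ds = ReadOnlyArrayDataset(np.arange(10, dtype=np.float64))
print(ds.shape, ds.ndim, ds.dtype)  # (10,) 1 float64
```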

__getitem__(*args, **kwargs)[source]

Return a sliced array.

Return type:

Any

__setitem__(*args, **kwargs)[source]

Write to the dataset.

Raises:

NotImplementedError – If writing is not supported by the implementation.

Return type:

None

property dtype: dtype[Any]

Data type of the dataset.

property ndim: int

Number of dimensions.

property shape: tuple[int, ...]

Shape of the dataset.

Operations on datasets

class mojito.lazy.LazyScaledDataset(dataset, scale_factor=1.0)[source]

Lazy-loaded dataset that applies scaling on slicing.

This is useful for providing access to scaled datasets without loading all data into memory at once, similar to h5py datasets.

Parameters:
  • dataset (DatasetLike) – Dataset to scale (any DatasetLike, e.g. an h5py dataset).

  • scale_factor (float, default: 1.0) – Scaling factor to apply to all values.

__getitem__(key)[source]

Slice and scale the dataset lazily.

Parameters:

key – Slicing key (integers, slices, lists, ellipsis).

Returns:

Scaled sliced array from the dataset.

Return type:

NDArray

__setitem__(key, value)[source]

Write scaled values to the dataset lazily.

The inverse scaling operation is applied before writing to the underlying dataset.

Parameters:
  • key (Any) – Slicing key (integers, slices, lists, ellipsis).

  • value (Any) – Data to write. Will be divided by scale_factor before writing.

Raises:

ValueError – If scale_factor is zero (inverse scaling undefined).

Return type:

None

property dtype: dtype

Data type of the dataset.

property ndim: int

Number of dimensions.

property shape: tuple[int, ...]

Shape of the dataset.
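The read/write symmetry described above can be illustrated with a small self-contained sketch (simplified, not mojito's implementation): reads multiply by scale_factor, writes divide by it, so a write followed by a read returns the original values.

```python
import numpy as np

# Sketch of the scaling behavior (not the mojito implementation):
# reads are scaled on the fly, writes apply the inverse scaling.
class ScaledView:
    def __init__(self, dataset, scale_factor=1.0):
        self._ds = dataset
        self._scale = scale_factor

    def __getitem__(self, key):
        # Only the requested slice is loaded and scaled.
        return self._ds[key] * self._scale

    def __setitem__(self, key, value):
        if self._scale == 0:
            raise ValueError("inverse scaling undefined for scale_factor=0")
        self._ds[key] = np.asarray(value) / self._scale

raw = np.arange(8, dtype=np.float64)
scaled = ScaledView(raw, scale_factor=2.0)
print(scaled[2:5])         # raw values times 2: [4. 6. 8.]
scaled[0:2] = [10.0, 20.0]
print(raw[0:2])            # inverse-scaled before writing: [ 5. 10.]
```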

class mojito.lazy.LazySumDataset(datasets)[source]

Lazy-loaded dataset that sums values across multiple datasets on slicing.

This is useful for lazily combining multiple datasets via element-wise summation without loading all data into memory at once.

Parameters:

datasets (Sequence[DatasetLike]) – Sequence of datasets to sum (any DatasetLike, e.g. h5py datasets).

Raises:

ValueError – If datasets list is empty.

__getitem__(key)[source]

Slice and sum all datasets lazily.

Parameters:

key – Slicing key (integers, slices, lists, ellipsis).

Returns:

Sum of sliced arrays from all datasets.

Return type:

NDArray

property dtype: dtype

Data type of the dataset.

property ndim: int

Number of dimensions.

property shape: tuple[int, ...]

Shape of the dataset.
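The lazy summation can be sketched as follows (a simplified stand-in, not mojito's code): each dataset is sliced with the same key and the slices are added, so only the requested region is ever materialized.

```python
import numpy as np

# Simplified sketch of lazy element-wise summation (not the mojito code).
class SumView:
    def __init__(self, datasets):
        if not datasets:
            raise ValueError("datasets list is empty")
        self._datasets = list(datasets)

    def __getitem__(self, key):
        # Slice each dataset with the same key, then sum the slices.
        result = self._datasets[0][key]
        for ds in self._datasets[1:]:
            result = result + ds[key]
        return result

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
total = SumView([a, b])
print(total[1:3])  # [22. 33.]
```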

class mojito.lazy.LazyBooleanOrDataset(datasets)[source]

Lazy-loaded dataset that applies Boolean OR across multiple datasets.

This is useful for lazily combining flag datasets via element-wise Boolean OR without loading all data into memory at once.

Parameters:

datasets (Sequence[DatasetLike]) – Sequence of datasets to OR (any DatasetLike, e.g. h5py datasets).

Raises:

ValueError – If datasets list is empty.

__getitem__(key)[source]

Slice and OR all datasets lazily.

Parameters:

key – Slicing key (integers, slices, lists, ellipsis).

Returns:

Boolean OR of sliced arrays from all datasets.

Return type:

NDArray

property dtype: dtype

Data type of the dataset.

property ndim: int

Number of dimensions.

property shape: tuple[int, ...]

Shape of the dataset.
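The combination of flag datasets works the same way as the lazy sum, with Boolean OR as the element-wise operation. A simplified sketch (not the mojito code):

```python
import numpy as np

# Simplified sketch of lazy element-wise Boolean OR over flag datasets
# (not the mojito code).
class BooleanOrView:
    def __init__(self, datasets):
        if not datasets:
            raise ValueError("datasets list is empty")
        self._datasets = list(datasets)

    def __getitem__(self, key):
        # Slice each flag dataset with the same key and OR the slices.
        result = np.asarray(self._datasets[0][key], dtype=bool)
        for ds in self._datasets[1:]:
            result = result | np.asarray(ds[key], dtype=bool)
        return result

flags1 = np.array([True, False, False, True])
flags2 = np.array([False, False, True, True])
combined = BooleanOrView([flags1, flags2])
print(combined[0:3])  # [ True False  True]
```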

Stacking datasets

class mojito.lazy.LazyStackedDataset(datasets, axis=-1)[source]

Lazy-loaded stacked dataset that does not load all data into memory.

This is useful for providing access to stacked datasets without loading all data into memory at once, similar to h5py datasets.

>>> stacked = LazyStackedDataset([ds1, ds2, ds3])
>>> print(stacked.shape)
(1000, 3)
>>> data_slice = stacked[100:200]  # Loads only the requested slice
>>> data_slice2 = stacked[100:200, ..., 0]  # Slicing along stacked axis
Parameters:
  • datasets (Sequence[DatasetLike]) – Sequence of datasets to stack (any DatasetLike, e.g. h5py datasets).

  • axis (int, default: -1) – Axis along which to stack the datasets.

Raises:

ValueError – If axis is out of bounds for stacking.

__getitem__(key)[source]

Slice the stacked datasets on demand without loading all data.

Supports slicing along all dimensions including the stacked axis. The stacked axis can be sliced using integers, slices, or lists of indices.

Note that fancy indexing (using arrays or lists of indices) on more than one non-stacked axis at a time is not supported, due to a limitation in h5py; a single non-stacked axis can, however, be fancy-indexed.

Insertion of None or np.newaxis is not supported.

Parameters:

key – Slicing key, can include integers (Python int or Numpy integer scalars), slices, lists of indices, and ellipsis.

Returns:

Sliced and stacked array.

Return type:

Any

Raises:
  • IndexError – If None or np.newaxis is used in the key.

  • IndexError – If too many indices are provided for the array.
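The key idea behind slicing along the stacked axis can be sketched for a simple 1-D, axis=-1 case (this is a simplification, not mojito's implementation, which handles many more key forms): an integer on the stacked axis selects one underlying dataset, while a slice selects a subset of datasets whose row-slices are then stacked.

```python
import numpy as np

# Simplified sketch of stacked-axis resolution for 1-D datasets stacked
# along the last axis (not the mojito implementation).
def stacked_getitem(datasets, row_key, stack_key):
    """Return the equivalent of stacked[row_key, stack_key]."""
    if isinstance(stack_key, int):
        # An integer on the stacked axis selects a single underlying
        # dataset; only its requested rows are loaded.
        return datasets[stack_key][row_key]
    # A slice on the stacked axis selects a subset of datasets; only the
    # requested rows of each are loaded, then stacked along the last axis.
    return np.stack([ds[row_key] for ds in datasets[stack_key]], axis=-1)

ds1 = np.arange(0, 5, dtype=np.float64)
ds2 = np.arange(10, 15, dtype=np.float64)
ds3 = np.arange(20, 25, dtype=np.float64)

full = stacked_getitem([ds1, ds2, ds3], slice(1, 3), slice(None))
print(full.shape)  # (2, 3)
col = stacked_getitem([ds1, ds2, ds3], slice(1, 3), 0)
print(col)  # [1. 2.]
```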

__setitem__(key, value)[source]

Write to stacked datasets on demand.

The value is unstacked along the stacked axis and distributed to the appropriate underlying datasets.

Parameters:
  • key (Any) – Slicing key, can include integers, slices, or lists of indices.

  • value (Any) – Data to write. Must be compatible with the sliced shape.

Raises:
  • IndexError – If None or np.newaxis is used in the key, or if too many indices are provided.

  • ValueError – If value shape is incompatible with the requested slice.

Return type:

None

property dtype: dtype

Data type of the stacked array.

property ndim: int

Number of dimensions of the stacked array.

property normalized_axis: int

The stacking axis normalized to a non-negative index.

property shape: tuple[int, ...]

Shape of the stacked array.