GSEGUtils.lazy_disk_cache

Lazy disk-backed cache primitives.

Exposes LazyDiskCache (per-array offload-to-disk wrapper), LazyDiskCacheConfig / LazyDiskCacheKw (configuration helpers), DiskBackedNDArray (single-pickle-file ndarray proxy), and DiskBackedStore (collection of named LazyDiskCache entries sharing a cache directory).

class GSEGUtils.lazy_disk_cache.LazyDiskCache

Bases: ABC

Abstract base for cache objects that transparently offload to a NumPy memmap.

Subclasses expose the live buffer through _describe_buffer(), _set_buffer(), and _drop_buffer(), and its shape/dtype metadata through _describe_shape_dtype(). The base class handles the offload/load state machine, the finalizer that removes the memmap file on garbage collection, and the pickle protocol.
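The split between base class and subclass hooks can be illustrated with a small stand-in. This is only a sketch under stated assumptions, not the GSEGUtils implementation: MiniCache and InMemoryArray are hypothetical names, and the sketch leaves out the finalizer, the pickle protocol, and all configuration handling.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path

import numpy as np


class MiniCache(ABC):
    """Hypothetical stand-in for the offload/load state machine."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._offloaded = False
        self._shape = None
        self._dtype = None

    @abstractmethod
    def _describe_buffer(self):
        """Return (shape, dtype, in_memory_array)."""

    @abstractmethod
    def _set_buffer(self, buf) -> None:
        """Adopt buf as the live buffer."""

    @abstractmethod
    def _drop_buffer(self) -> None:
        """Drop the in-memory reference."""

    @property
    def offloaded(self) -> bool:
        return self._offloaded

    def offload(self) -> None:
        # Copy the live buffer into a file-backed memmap, then let go of it.
        shape, dtype, arr = self._describe_buffer()
        mm = np.memmap(self._path, dtype=dtype, mode="w+", shape=shape)
        mm[...] = arr
        mm.flush()
        del mm
        self._shape, self._dtype = shape, dtype
        self._drop_buffer()
        self._offloaded = True

    def load(self, mode: str = "r+") -> None:
        # Reopen the file and hand the memmap back to the subclass.
        buf = np.memmap(self._path, dtype=self._dtype, mode=mode, shape=self._shape)
        self._set_buffer(buf)
        self._offloaded = False


class InMemoryArray(MiniCache):
    """Hypothetical subclass that owns a single array slot."""

    def __init__(self, data, path: Path) -> None:
        super().__init__(path)
        self._data = np.asarray(data)

    def _describe_buffer(self):
        return self._data.shape, self._data.dtype, self._data

    def _set_buffer(self, buf) -> None:
        self._data = buf

    def _drop_buffer(self) -> None:
        self._data = None


cache_file = Path(tempfile.mkdtemp()) / "demo.dat"
entry = InMemoryArray(np.arange(6, dtype=np.int64).reshape(2, 3), cache_file)
entry.offload()
state_after_offload = (entry.offloaded, entry._data is None)
entry.load()
restored = np.array(entry._data)
```

The subclass only knows where its array lives; the base class decides when it moves between RAM and disk.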

_MEMMAP_SUFFIX = '.dat'
__init__(**settings)
Parameters:

settings (Unpack[LazyDiskCacheKw])

Return type:

None

_abc_impl = <_abc._abc_data object>
_convert_to_memmap()

Allocate (or reopen) the persistent memmap and adopt it as the live buffer.

Notes

Phase 4 PERF-04 D-04 / D-06: small arrays take the fast in-RAM copy path (the existing behaviour). Arrays larger than the per-host chunk budget (about 10% of psutil.virtual_memory().available, or _MEMMAP_FALLBACK_CHUNK_BYTES when psutil is unavailable) are streamed to the memmap in row chunks, so the streaming path never materialises a full in-RAM copy of the source array.

Return type:

None
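The row-chunk streaming described in the notes could look roughly like the sketch below. The function names, the fallback constant's value, and the single-assignment fast path are assumptions made for illustration; the real method may differ in all of them. The sketch assumes a non-empty array with at least one dimension.

```python
import tempfile
from pathlib import Path

import numpy as np

# Assumed value, for illustration only; the real constant is not documented here.
FALLBACK_CHUNK_BYTES = 64 * 1024 * 1024


def chunk_budget_bytes(fraction: float = 0.10) -> int:
    """Return about `fraction` of available RAM, or the fallback without psutil."""
    try:
        import psutil
    except ImportError:
        return FALLBACK_CHUNK_BYTES
    return max(1, int(psutil.virtual_memory().available * fraction))


def copy_to_memmap(src: np.ndarray, path: Path, budget: int) -> np.memmap:
    """Copy src into a new memmap, streaming row chunks when src exceeds budget."""
    mm = np.memmap(path, dtype=src.dtype, mode="w+", shape=src.shape)
    if src.nbytes <= budget:
        mm[...] = src  # small array: one assignment
    else:
        row_bytes = max(1, src.nbytes // src.shape[0])
        rows_per_chunk = max(1, budget // row_bytes)
        for start in range(0, src.shape[0], rows_per_chunk):
            stop = start + rows_per_chunk
            mm[start:stop] = src[start:stop]  # only one chunk is in flight
    mm.flush()
    return mm


source = np.arange(1000, dtype=np.float64).reshape(100, 10)
target = Path(tempfile.mkdtemp()) / "streamed.dat"
# A deliberately tiny budget forces the chunked path (25 rows per chunk here).
streamed = copy_to_memmap(source, target, budget=2_000)
```

In practice the budget would come from chunk_budget_bytes() rather than a hard-coded number.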

_convert_to_ndarray()

Pull the memmap entirely into RAM as a plain ndarray.

Return type:

None

abstractmethod _describe_buffer()

Return (shape, dtype, in_memory_array) describing the current live buffer.

Return type:

tuple[tuple[int, …], type[Any] | dtype[Any] | _SupportsDType[dtype[Any]] | tuple[Any, Any] | list[Any] | _DTypeDict | str | None, ndarray[tuple[Any, …], dtype[_ScalarT]]]

abstractmethod _describe_shape_dtype()

Return (shape, dtype) without materialising the full array.

Return type:

tuple[tuple[int, …], type[Any] | dtype[Any] | _SupportsDType[dtype[Any]] | tuple[Any, Any] | list[Any] | _DTypeDict | str | None]

abstractmethod _drop_buffer()

Drop the in-memory array reference (e.g. set the subclass slot to None).

Return type:

None

_init_from_config(config)
Parameters:

config (LazyDiskCacheConfig)

Return type:

None

abstractmethod _set_buffer(buf)

Adopt buf as the new live buffer (typically a memmap returned from disk).

Parameters:

buf (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

None

property automatic_offloading: bool

Return whether the cache automatically offloads after each load.

property cache_enabled: bool

Return True when this instance is backed by a memmap.

property cache_path: Path | None

Return the memmap file path associated with this cache.

disable_caching()

Turn off memmap-backed caching, copying any memmap content back into RAM.

Return type:

None

disable_purge()

Disable automatic deletion of the cache file on garbage collection.

enable_caching()

Turn on memmap-backed caching, converting the live buffer if needed.

Return type:

None

enable_purge()

Enable automatic deletion of the cache file on garbage collection.

Registers a fresh weakref.finalize() callback if one is not currently alive.
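A minimal sketch of the purge-on-collection pattern, using only the standard library. PurgingHandle and _remove_file are hypothetical names; the only thing taken from the documentation above is that a weakref.finalize() callback is registered when none is currently alive.

```python
import gc
import os
import tempfile
import weakref
from pathlib import Path


def _remove_file(path: str) -> None:
    """Delete path, ignoring the case where it is already gone."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


class PurgingHandle:
    """Hypothetical owner of a cache file that is deleted on collection."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._finalizer = None

    def enable_purge(self) -> None:
        if self._finalizer is None or not self._finalizer.alive:
            # The callback receives the path as a plain string and must not
            # reference self, otherwise the object could never be collected.
            self._finalizer = weakref.finalize(self, _remove_file, str(self._path))

    def disable_purge(self) -> None:
        if self._finalizer is not None:
            self._finalizer.detach()
            self._finalizer = None


fd, name = tempfile.mkstemp(suffix=".dat")
os.close(fd)
handle = PurgingHandle(Path(name))
handle.enable_purge()
existed_before = os.path.exists(name)
del handle
gc.collect()
removed_after = not os.path.exists(name)
```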

static ensure_loaded(func)

Decorate func so the cache is materialised before the call and offloaded afterwards.

If self.automatic_offloading is True and the cache was offloaded on entry, it is re-offloaded once the wrapped function returns.
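The behaviour described above can be sketched with a plain decorator. FakeCache is a hypothetical test double, and re-offloading inside a finally block (so that it also happens when the wrapped function raises) is a choice made for this sketch, not something the documentation states.

```python
import functools


def ensure_loaded(func):
    """Load before the call; re-offload afterwards if auto-offloading is on."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        was_offloaded = self.offloaded
        if was_offloaded:
            self.load()
        try:
            return func(self, *args, **kwargs)
        finally:
            if was_offloaded and self.automatic_offloading:
                self.offload()

    return wrapper


class FakeCache:
    """Hypothetical stand-in that records load/offload calls."""

    def __init__(self, values, automatic_offloading: bool) -> None:
        self._values = list(values)
        self.offloaded = True
        self.automatic_offloading = automatic_offloading
        self.events = []

    def load(self) -> None:
        self.offloaded = False
        self.events.append("load")

    def offload(self) -> None:
        self.offloaded = True
        self.events.append("offload")

    @ensure_loaded
    def total(self) -> int:
        return sum(self._values)


auto = FakeCache([1, 2, 3], automatic_offloading=True)
auto_total = auto.total()
manual = FakeCache([1, 2, 3], automatic_offloading=False)
manual_total = manual.total()
```

With automatic offloading the entry ends up back on disk after the call; without it, the entry stays loaded.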

load(mode='r+')

Reload the buffer from disk (i.e. make self._mmap the active buffer).

Parameters:

mode (Literal['r', 'r+', 'w+', 'c'])

Return type:

None

offload()

Flush the current buffer to disk, drop the in-RAM array, and mark offloaded.

Return type:

None

property offloaded: bool

Return True when the live buffer currently lives on disk.

on_load()

Run after loading from disk; subclasses may override to reinitialise extra state.

Return type:

None

on_offload()

Run after offloading to disk; subclasses may override to prune extra resources.

Return type:

None

property purge_disk_on_gc: bool

Return whether the memmap file is deleted when this object is collected.

class GSEGUtils.lazy_disk_cache.LazyDiskCacheKw

Bases: TypedDict

Keyword-argument TypedDict for LazyDiskCache constructors.

enable_caching

When True the live buffer is backed by a numpy.memmap; when False it stays in plain RAM.

Type:

bool

cache_path

Destination path for the memmap file. When None a temporary file is created.

Type:

pathlib.Path, optional

purge_disk_on_gc

When True the memmap file is deleted via weakref.finalize() once the cache object is collected.

Type:

bool

automatic_offloading

When True the cache offloads after every load.

Type:

bool

enable_caching: bool
cache_path: Path | None
purge_disk_on_gc: bool
automatic_offloading: bool
class GSEGUtils.lazy_disk_cache.LazyDiskCacheConfig

Bases: object

Frozen pydantic dataclass mirroring LazyDiskCacheKw.

Useful as a single argument that can be threaded through factory chains and re-derived via updated() / extend_cache_path().

enable_caching

See LazyDiskCacheKw.

Type:

bool

cache_path

See LazyDiskCacheKw.

Type:

pathlib.Path, optional

purge_disk_on_gc

See LazyDiskCacheKw.

Type:

bool

automatic_offloading

See LazyDiskCacheKw.

Type:

bool

__init__(*args, **kwargs)
Parameters:
  • __dataclass_self__ (PydanticDataclass)

  • args (Any)

  • kwargs (Any)

Return type:

None

as_kwargs()

Return the configuration as a LazyDiskCacheKw mapping.

Return type:

LazyDiskCacheKw

automatic_offloading: bool = False
cache_path: Path | None = None
enable_caching: bool = False
extend_cache_path(new_folder)

Return a copy with cache_path extended by new_folder.

Parameters:

new_folder (str) – Sub-directory name to append to the existing cache_path.

Returns:

A new configuration. If self.cache_path is None the new configuration also carries cache_path=None (and an informational log entry is emitted).

Return type:

LazyDiskCacheConfig

classmethod from_kwargs(settings)

Construct a LazyDiskCacheConfig from a TypedDict-shaped mapping.

Parameters:

settings (LazyDiskCacheKw)

Return type:

Self

purge_disk_on_gc: bool = True
updated(**overrides)

Return a copy of this configuration with the given fields overridden.

Parameters:

overrides (LazyDiskCacheKw)

Return type:

Self
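How updated() and extend_cache_path() derive new configurations can be approximated with a frozen standard-library dataclass. SketchConfig is a hypothetical stand-in: the real class is a pydantic dataclass, and the informational log entry emitted when cache_path is None is omitted here. The field defaults are copied from the attribute list above; the directory names are made up.

```python
from dataclasses import asdict, dataclass, replace
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    enable_caching: bool = False
    cache_path: Optional[Path] = None
    purge_disk_on_gc: bool = True
    automatic_offloading: bool = False

    def as_kwargs(self) -> dict:
        return asdict(self)

    def updated(self, **overrides) -> "SketchConfig":
        # Frozen instances are never mutated; replace() builds a new one.
        return replace(self, **overrides)

    def extend_cache_path(self, new_folder: str) -> "SketchConfig":
        if self.cache_path is None:
            return replace(self)  # the copy keeps cache_path=None
        return replace(self, cache_path=self.cache_path / new_folder)


base = SketchConfig(enable_caching=True, cache_path=Path("cache"))
per_scan = base.extend_cache_path("scan_01").updated(automatic_offloading=True)
no_path = SketchConfig().extend_cache_path("scan_01")
```

Because every call returns a fresh object, one base configuration can be passed down a factory chain and specialised at each level without affecting the caller's copy.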

class GSEGUtils.lazy_disk_cache.DiskBackedNDArray

Bases: LazyDiskCache, NDArrayOperatorsMixin

Disk-backed view of a single numpy.ndarray.

Combines the offload-on-pressure semantics of LazyDiskCache with the NDArray operator surface of numpy.lib.mixins.NDArrayOperatorsMixin, so callers can index into the cache entry or pass it to a ufunc as if it were a regular ndarray. The underlying buffer is materialised on demand whenever an attribute marked with LazyDiskCache.ensure_loaded() is read.

Parameters:
  • data (NDArray) – The initial in-memory array. Its shape and dtype are cached so the cache entry can be described while the buffer is offloaded.

  • **settings (Unpack[LazyDiskCacheKw]) – Forwarded to LazyDiskCache.__init__.
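The operator surface mentioned above comes from numpy.lib.mixins.NDArrayOperatorsMixin, which derives the arithmetic operators from a single __array_ufunc__ hook. The sketch below shows that mechanism on a hypothetical ArrayProxy class; it has no disk backing and ignores the out= keyword, so it illustrates only the mixin, not DiskBackedNDArray itself.

```python
import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin


class ArrayProxy(NDArrayOperatorsMixin):
    """Hypothetical wrapper that behaves like an ndarray in expressions."""

    def __init__(self, data) -> None:
        self._data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # Called by np.asarray(proxy) and similar conversions.
        return np.asarray(self._data, dtype=dtype)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any proxies among the inputs, then defer to the real ufunc.
        unwrapped = tuple(
            x._data if isinstance(x, ArrayProxy) else x for x in inputs
        )
        return getattr(ufunc, method)(*unwrapped, **kwargs)

    def __getitem__(self, index):
        return self._data[index]


proxy = ArrayProxy([1, 2, 3])
shifted = proxy + 1               # operator supplied by the mixin
doubled = np.multiply(proxy, 2)   # direct ufunc call
second = proxy[1]
```

In a disk-backed variant, __array__ and __getitem__ would be the natural places to make sure the buffer is loaded first.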

__init__(data, **settings)

Initialize with data in memory and forward cache settings to the base class.

Parameters:
  • data (NDArray)

  • settings (Unpack[LazyDiskCacheKw])

Return type:

None

_abc_impl = <_abc._abc_data object>
_describe_buffer()

Return (shape, dtype, in_memory_array) describing the current live buffer.

Return type:

tuple[tuple[int, …], type[Any] | dtype[Any] | _SupportsDType[dtype[Any]] | tuple[Any, Any] | list[Any] | _DTypeDict | str | None, ndarray]

_describe_shape_dtype()

Return (shape, dtype) without materialising the full array.

Return type:

tuple[tuple[int, …], type[Any] | dtype[Any] | _SupportsDType[dtype[Any]] | tuple[Any, Any] | list[Any] | _DTypeDict | str | None]

_drop_buffer()

Delete the in-memory buffer; direct _data reads then raise AttributeError (BUG-02 fix).

The @LazyDiskCache.ensure_loaded decorators on __array__() and __getitem__(), together with the offloaded check in the data property, re-materialise the buffer on the next public access. Only callers that bypass those paths and read self._data directly see the AttributeError, which is the intended contract.

Return type:

None

_set_buffer(buf)

Adopt buf as the new live buffer (typically a memmap returned from disk).

Parameters:

buf (ndarray[tuple[Any, ...], dtype[_ScalarT]])

Return type:

None

property data

Return the underlying ndarray, loading it from disk on demand.

class GSEGUtils.lazy_disk_cache.DiskBackedStore

Bases: MutableMapping[str, T], Generic

Mapping of string keys to LazyDiskCache entries that share an offload directory.

Parameters:
  • config (LazyDiskCacheConfig, optional) – Shared cache configuration (cache dir, caching flag, offload policy, purge-on-gc policy). Defaults to LazyDiskCacheConfig().

  • factory (Factory[T]) – Callable used to wrap raw arrays into the concrete cache subtype T when add_data_to_store() is called.

  • value_type (type[T] or tuple of type[T], optional) – If set, every value inserted must be an instance of this type / one of these types.

  • validator (Validator[T], optional) – Additional runtime check executed on every insert.

Notes

Threading: this class has no instance lock; per-entry writes get their atomicity from LazyDiskCache’s own threading.RLock plus the os.replace semantics of offload(). Single-PCD multi-thread mutation is unsupported (see PROJECT.md threading constraint).

_DBNDArrayFileExt = '.npy'
_DBNDArrayMetaExt = '.meta.json'
_LegacyPickleExt = '.pkl'
__init__(*, config=LazyDiskCacheConfig(), factory, value_type=None, validator=None)
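Parameters:
  • config (LazyDiskCacheConfig)

  • factory (Factory[T])

  • value_type (type[T] | tuple[type[T], ...] | None)

  • validator (Validator[T] | None)
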
Return type:

None

_abc_impl = <_abc._abc_data object>
_check_T(value)
Parameters:

value (object)

Return type:

T

_get_legacy_pickle_path(feature)

Return the legacy pre-Phase-2 .pkl path for feature (refused on read).

Parameters:

feature (str)

Return type:

Path

_get_meta_path(feature)

Return the on-disk JSON sidecar path for feature.

Parameters:

feature (str)

Return type:

Path

_get_npy_path(feature)

Return the on-disk .npy path for feature.

Parameters:

feature (str)

Return type:

Path

_load_entry(key)

Load a cache entry from the <key>.npy + <key>.meta.json pair.

Refuses legacy .pkl files with a single INFO log (D-05). Raises KeyError on cache miss, ValueError on schema-version mismatch or unknown lazy_disk_cache_class.

Per W-5: the reconstructed instance's cache_path is set to the <key>.npy file path, so that LazyDiskCache.enable_purge() (the Plan-02-04 finalizer) reaches its registration branch instead of silently returning early at if not self._cache_path: return. Because LazyDiskCache._init_from_config() re-suffixes the provided cache_path with _MEMMAP_SUFFIX (.dat), the live self._cache_path on the reconstructed instance is <key>.dat rather than <key>.npy. The W-5 invariant (a non-None cache_path, so that enable_purge() can register) holds either way.

Parameters:

key (str)

Return type:

T

_store_entry(key, entry)

Atomically write a cache entry as .npy + .meta.json pair (D-04 + Pitfall 4).

Write order: .npy.tmp → flush+fsync → .meta.json.tmp → flush+fsync → os.replace(.npy.tmp → .npy) → os.replace(.meta.json.tmp → .meta.json) → POSIX dir-fsync. A torn write leaves only .tmp files, which the reader treats as a cache miss. Disk-full and permission errors are re-raised after best-effort .tmp cleanup.

Parameters:
Return type:

None
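The write order above can be sketched with the standard library alone. store_pair and its helpers are hypothetical names, the array payload is passed as raw bytes instead of being serialised with NumPy, and the schema_version field in the sidecar is an assumed example; only the ordering of the steps follows the description.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def _write_and_sync(path: Path, payload: bytes) -> None:
    """Write payload and force it to stable storage before returning."""
    with open(path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())


def store_pair(directory: Path, key: str, array_bytes: bytes, meta: dict) -> None:
    """Write <key>.npy and <key>.meta.json via temporary files and os.replace."""
    npy_path = directory / f"{key}.npy"
    meta_path = directory / f"{key}.meta.json"
    npy_tmp = directory / f"{key}.npy.tmp"
    meta_tmp = directory / f"{key}.meta.json.tmp"
    try:
        _write_and_sync(npy_tmp, array_bytes)
        _write_and_sync(meta_tmp, json.dumps(meta).encode("utf-8"))
        os.replace(npy_tmp, npy_path)    # atomic rename of the payload
        os.replace(meta_tmp, meta_path)  # atomic rename of the sidecar
    except OSError:
        # Best-effort cleanup of leftovers, then re-raise the original error.
        for tmp in (npy_tmp, meta_tmp):
            with contextlib.suppress(OSError):
                tmp.unlink()
        raise
    if os.name == "posix":
        # Sync the directory so the renames themselves are durable.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


demo_dir = Path(tempfile.mkdtemp())
store_pair(demo_dir, "intensity", b"\x00" * 16, {"schema_version": 1})
written = sorted(p.name for p in demo_dir.iterdir())
```

Because both final names appear only through os.replace, a reader never observes a half-written .npy or .meta.json file.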

add_data_to_store(key, data, *, enable_caching_override=None, automatic_offloading_override=None, purge_disk_on_gc_override=None)

Wrap data via the configured factory and insert it under key.

Parameters:
  • key (str) – Key under which the new cache entry is registered.

  • data (NDArray) – Raw array to be wrapped.

  • enable_caching_override (bool, optional) – Per-entry override for the store-level caching flag.

  • automatic_offloading_override (bool, optional) – Per-entry override for the store-level auto-offload flag.

  • purge_disk_on_gc_override (bool, optional) – Per-entry override for the store-level purge-on-gc flag.

Raises:

KeyError – If key is already present in the store.

Return type:

None
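The insert path (wrap with the factory, check the value, refuse duplicate keys) might be pictured as in the sketch below. SketchStore is a hypothetical, in-memory-only stand-in without any disk handling or per-entry overrides; raising TypeError on a value_type mismatch and letting the validator raise on its own are assumptions, since only the KeyError is documented above.

```python
from collections.abc import MutableMapping


class SketchStore(MutableMapping):
    """Hypothetical in-memory stand-in for the documented insert behaviour."""

    def __init__(self, *, factory, value_type=None, validator=None) -> None:
        self._store = {}
        self._factory = factory
        self._value_type = value_type
        self._validator = validator

    def _check(self, value):
        if self._value_type is not None and not isinstance(value, self._value_type):
            raise TypeError(f"unexpected value type: {type(value).__name__}")
        if self._validator is not None:
            self._validator(value)  # assumed to raise on invalid input
        return value

    def add_data_to_store(self, key: str, data) -> None:
        if key in self._store:
            raise KeyError(f"{key!r} is already present in the store")
        self._store[key] = self._check(self._factory(data))

    def __getitem__(self, key):
        return self._store[key]

    def __setitem__(self, key, value) -> None:
        self._store[key] = self._check(value)

    def __delitem__(self, key) -> None:
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self) -> int:
        return len(self._store)


# tuple() plays the role of the factory that wraps raw data.
store = SketchStore(factory=tuple, value_type=tuple)
store.add_data_to_store("xyz", [1.0, 2.0, 3.0])
try:
    store.add_data_to_store("xyz", [4.0])
    duplicate_rejected = False
except KeyError:
    duplicate_rejected = True
```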

property cache_dir: Path

Return the directory where offloaded codec pairs are written.

items()

Iterate over (key, value) pairs (value is None where offloaded).

Return type:

Iterator[tuple[str, T | None]]

keys()

Return a list of all tracked keys.

Return type:

list[str]

offload(keys=None, pickle_container=False)

Offload selected entries to disk.

When no keys are provided every cached entry is considered. Items with cache_enabled=False are skipped. When pickle_container is True (the legacy parameter name, kept for backward compatibility) the entire container entry is offloaded via the Phase-2 codec (.npy + JSON sidecar, no actual pickling), the in-memory reference is cleared, and the next access reloads it lazily via _load_entry().

Parameters:
  • keys (str or list[str], optional) – Specific keys to offload. Defaults to every tracked key.

  • pickle_container (bool, optional) – When True write the wrapping container via the codec; when False (default) delegate to each entry’s own offload() method. The name is retained for API stability — no pickle is used.

Return type:

None

property store: dict[str, T | None]

Return the internal mapping of keys to in-memory entries (None if offloaded).

values()

Iterate over the current in-memory entries (None where offloaded).

Return type:

Iterator[T | None]

_store: dict[str, T | None]
_cache_dir: Path
_enable_caching: bool
_automatic_offloading: bool
_purge_disk_on_gc: bool
_factory: Factory
_value_type: type[T] | tuple[type[T], ...] | None
_validator: Validator | None