GSEGUtils.lazy_disk_cache
Lazy disk-backed cache primitives.
Exposes LazyDiskCache (per-array offload-to-disk wrapper),
LazyDiskCacheConfig / LazyDiskCacheKw (configuration helpers),
DiskBackedNDArray (single-pickle-file ndarray proxy), and
DiskBackedStore (collection of named LazyDiskCache entries
sharing a cache directory).
- class GSEGUtils.lazy_disk_cache.LazyDiskCache
Bases:
ABC
Abstract base for cache objects that transparently offload to a NumPy memmap.
Subclasses provide the live buffer via _describe_buffer()/_set_buffer() and the shape/dtype metadata via _describe_shape_dtype(); the base class handles the offload/load state machine, the finalizer that cleans up the memmap on GC, and the pickle protocol.
- _MEMMAP_SUFFIX = '.dat'
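The split between subclass hooks and the base-class state machine can be illustrated with a small stand-in. This is only a sketch of the pattern, not the library's implementation: MiniOffloadCache and BytesCache are hypothetical names, the buffer is a bytearray written to an ordinary file instead of a NumPy memmap, and the hook signatures are simplified.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Optional


class MiniOffloadCache(ABC):
    """Hypothetical stand-in: the base class owns the offload/load state machine."""

    def __init__(self, cache_path: str) -> None:
        self._cache_path = cache_path
        self.offloaded = False

    @abstractmethod
    def _describe_buffer(self) -> bytearray:
        """Return the live in-memory buffer."""

    @abstractmethod
    def _set_buffer(self, buf: bytearray) -> None:
        """Adopt buf as the new live buffer."""

    @abstractmethod
    def _drop_buffer(self) -> None:
        """Drop the in-memory reference."""

    def offload(self) -> None:
        if self.offloaded:
            return
        with open(self._cache_path, "wb") as fh:
            fh.write(self._describe_buffer())
        self._drop_buffer()
        self.offloaded = True

    def load(self) -> None:
        if not self.offloaded:
            return
        with open(self._cache_path, "rb") as fh:
            self._set_buffer(bytearray(fh.read()))
        self.offloaded = False


class BytesCache(MiniOffloadCache):
    """Concrete subclass: it only knows where its buffer lives."""

    def __init__(self, data: bytes, cache_path: str) -> None:
        super().__init__(cache_path)
        self._data: Optional[bytearray] = bytearray(data)

    def _describe_buffer(self) -> bytearray:
        assert self._data is not None
        return self._data

    def _set_buffer(self, buf: bytearray) -> None:
        self._data = buf

    def _drop_buffer(self) -> None:
        self._data = None


path = os.path.join(tempfile.mkdtemp(), "buffer.dat")
cache = BytesCache(b"abc", path)
cache.offload()  # buffer written to disk, in-memory reference dropped
cache.load()     # buffer read back and adopted via _set_buffer()
```

The subclass never touches the file, and the base class never touches the subclass slot directly, which is the separation the three abstract hooks are meant to enforce.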
- __init__(**settings)
- Parameters:
settings (Unpack[LazyDiskCacheKw])
- Return type:
None
- _abc_impl = <_abc._abc_data object>
- _convert_to_memmap()
Allocate (or reopen) the persistent memmap and adopt it as the live buffer.
Notes
Phase 4 PERF-04 D-04 / D-06: small arrays use the fast in-RAM copy path (today’s behaviour); arrays exceeding the per-host chunk budget (~10% of psutil.virtual_memory().available, or _MEMMAP_FALLBACK_CHUNK_BYTES when psutil is unavailable) are streamed in row-chunks. The new code never materialises a full-RAM copy of the source array on the streaming path.
- Return type:
None
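A rough sketch of the budget-then-stream idea, using only the standard library. Everything here is illustrative: chunk_budget_bytes and stream_rows are hypothetical helpers, the 64 MiB fallback value is an assumption (the real _MEMMAP_FALLBACK_CHUNK_BYTES is not documented here), and a plain mmap over bytes stands in for the NumPy memmap.

```python
import mmap
import os
import tempfile

FALLBACK_CHUNK_BYTES = 64 * 1024 * 1024  # assumed value, for illustration only


def chunk_budget_bytes(fraction: float = 0.10) -> int:
    """Return about `fraction` of available RAM, or the fallback without psutil."""
    try:
        import psutil
    except ImportError:
        return FALLBACK_CHUNK_BYTES
    return max(1, int(psutil.virtual_memory().available * fraction))


def stream_rows(src: bytes, dst_path: str, row_bytes: int, budget: int) -> int:
    """Copy a non-empty src into a file-backed mmap in row-aligned chunks.

    Returns the number of chunks written.
    """
    total = len(src)
    rows_per_chunk = max(1, budget // row_bytes)  # always make progress
    step = rows_per_chunk * row_bytes
    with open(dst_path, "wb") as fh:
        fh.truncate(total)  # pre-size the destination file
    chunks = 0
    with open(dst_path, "r+b") as fh, mmap.mmap(fh.fileno(), total) as mm:
        for start in range(0, total, step):
            stop = min(start + step, total)
            mm[start:stop] = src[start:stop]  # only one chunk is copied at a time
            chunks += 1
        mm.flush()
    return chunks


data = bytes(range(256)) * 4  # 1024 bytes, treated as 64 rows of 16 bytes
dst = os.path.join(tempfile.mkdtemp(), "stream.dat")
n_chunks = stream_rows(data, dst, row_bytes=16, budget=100)
```

With a 100-byte budget and 16-byte rows, each chunk holds 6 rows (96 bytes), so the 1024-byte source is written in 11 chunks.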
- _convert_to_ndarray()
Pull the memmap entirely into RAM as a plain ndarray.
- Return type:
None
- abstractmethod _describe_buffer()
Return (shape, dtype, in_memory_array) describing the current live buffer.
- abstractmethod _describe_shape_dtype()
Return (shape, dtype) without materialising the full array.
- abstractmethod _drop_buffer()
Drop the in-memory array reference (e.g. set the subclass slot to None).
- Return type:
None
- _init_from_config(config)
- Parameters:
config (LazyDiskCacheConfig)
- Return type:
None
- abstractmethod _set_buffer(buf)
Adopt buf as the new live buffer (typically a memmap returned from disk).
- property automatic_offloading: bool
Return whether the cache automatically offloads after each load.
- disable_caching()
Turn off memmap-backed caching, copying any memmap content back into RAM.
- Return type:
None
- disable_purge()
Disable automatic deletion of the cache file on garbage collection.
- enable_caching()
Turn on memmap-backed caching, converting the live buffer if needed.
- Return type:
None
- enable_purge()
Enable automatic deletion of the cache file on garbage collection.
Registers a fresh weakref.finalize() callback if one is not currently alive.
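The purge-on-GC behaviour can be approximated with weakref.finalize from the standard library. The sketch below is an assumption about the general shape, not the library's code: PurgingHandle and _remove_quietly are hypothetical names.

```python
import gc
import os
import tempfile
import weakref


def _remove_quietly(path: str) -> None:
    """Delete path, ignoring the case where it is already gone."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


class PurgingHandle:
    """Hypothetical owner of a cache file that is deleted when the owner is collected."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._finalizer = None
        self.enable_purge()

    def enable_purge(self) -> None:
        # Register a fresh callback only if none is currently alive.
        if self._finalizer is None or not self._finalizer.alive:
            # The callback receives the path, never self, so it cannot keep
            # the object alive.
            self._finalizer = weakref.finalize(self, _remove_quietly, self.path)

    def disable_purge(self) -> None:
        if self._finalizer is not None:
            self._finalizer.detach()  # the callback will no longer run


fd, tmp_path = tempfile.mkstemp(suffix=".dat")
os.close(fd)
handle = PurgingHandle(tmp_path)
del handle    # last reference gone: the finalizer removes the file
gc.collect()  # only needed on interpreters without reference counting
```

A detached finalizer reports alive as False, so calling enable_purge() after disable_purge() registers a new callback rather than silently doing nothing.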
- static ensure_loaded(func)
Decorate func so the cache is materialised before the call and offloaded afterwards.
If self.automatic_offloading is True and the cache was offloaded on entry, it is re-offloaded once the wrapped function returns.
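A minimal sketch of such a decorator, assuming the decorated object exposes an offloaded flag, an automatic_offloading flag, and load()/offload() methods. FakeCache is a hypothetical stand-in, and re-offloading in a finally block (so it also happens when the wrapped call raises) is a design choice of this sketch rather than documented behaviour.

```python
import functools
from typing import Any, Callable


def ensure_loaded(func: Callable[..., Any]) -> Callable[..., Any]:
    """Load before the call; re-offload afterwards if the policy asks for it."""

    @functools.wraps(func)
    def wrapper(self, *args: Any, **kwargs: Any) -> Any:
        was_offloaded = self.offloaded
        if was_offloaded:
            self.load()
        try:
            return func(self, *args, **kwargs)
        finally:
            # Restore the offloaded state only if we were the ones who loaded.
            if was_offloaded and self.automatic_offloading:
                self.offload()

    return wrapper


class FakeCache:
    """Hypothetical stand-in that only tracks the offloaded flag."""

    def __init__(self, values, automatic_offloading=True):
        self._values = list(values)
        self.offloaded = True
        self.automatic_offloading = automatic_offloading
        self.events = []

    def load(self):
        self.offloaded = False
        self.events.append("load")

    def offload(self):
        self.offloaded = True
        self.events.append("offload")

    @ensure_loaded
    def total(self):
        return sum(self._values)


cache = FakeCache([1, 2, 3])
result = cache.total()  # loads, sums, then offloads again
```

Recording was_offloaded on entry is what keeps an already loaded cache from being offloaded behind the caller's back.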
- load(mode='r+')
Reload the buffer from disk into memory (i.e. make self._mmap your active buffer).
- Parameters:
mode (Literal['r', 'r+', 'w+', 'c'])
- Return type:
None
- offload()
Flush the current buffer to disk, drop the in-RAM array, and mark offloaded.
- Return type:
None
- on_load()
Run after loading from disk; subclasses may override to reinitialise extra state.
- Return type:
None
- on_offload()
Run after offloading to disk; subclasses may override to prune extra resources.
- Return type:
None
- class GSEGUtils.lazy_disk_cache.LazyDiskCacheKw
Bases:
TypedDict
Keyword-argument TypedDict for LazyDiskCache constructors.
- enable_caching
When True the live buffer is backed by a numpy.memmap; when False it stays in plain RAM.
- Type:
- cache_path
Destination path for the memmap file. When None a temporary file is created.
- Type:
pathlib.Path, optional
- purge_disk_on_gc
When True the memmap file is deleted via weakref.finalize() once the cache object is collected.
- Type:
- class GSEGUtils.lazy_disk_cache.LazyDiskCacheConfig
Bases:
object
Frozen pydantic dataclass mirroring LazyDiskCacheKw.
Useful as a single argument that can be threaded through factory chains and re-derived via updated()/extend_cache_path().
- enable_caching
See LazyDiskCacheKw.
- Type:
- cache_path
See LazyDiskCacheKw.
- Type:
pathlib.Path, optional
- purge_disk_on_gc
See LazyDiskCacheKw.
- Type:
- automatic_offloading
See LazyDiskCacheKw.
- Type:
- __init__(*args, **kwargs)
- Parameters:
__dataclass_self__ (PydanticDataclass)
args (Any)
kwargs (Any)
- Return type:
None
- as_kwargs()
Return the configuration as a LazyDiskCacheKw mapping.
- Return type:
- extend_cache_path(new_folder)
Return a copy with cache_path extended by new_folder.
- Parameters:
new_folder (str) – Sub-directory name to append to the existing cache_path.
- Returns:
A new configuration. If self.cache_path is None the new configuration also carries cache_path=None (and an informational log entry is emitted).
- Return type:
- classmethod from_kwargs(settings)
Construct a LazyDiskCacheConfig from a TypedDict-shaped mapping.
- Parameters:
settings (LazyDiskCacheKw)
- Return type:
- updated(**overrides)
Return a copy of this configuration with the given fields overridden.
- Parameters:
overrides (LazyDiskCacheKw)
- Return type:
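The copy-on-change style of updated() and extend_cache_path() can be sketched with a frozen standard-library dataclass. This swaps in dataclasses for pydantic, ConfigSketch is a hypothetical name, and the default values are assumptions; only the four field names come from the reference above.

```python
import logging
from dataclasses import asdict, dataclass, replace
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stdlib analogue of the config; defaults are assumptions."""

    enable_caching: bool = False
    cache_path: Optional[Path] = None
    purge_disk_on_gc: bool = True
    automatic_offloading: bool = False

    def updated(self, **overrides) -> "ConfigSketch":
        # replace() builds a new instance; the original stays untouched.
        return replace(self, **overrides)

    def extend_cache_path(self, new_folder: str) -> "ConfigSketch":
        if self.cache_path is None:
            logger.info("cache_path is None; nothing to extend")
            return replace(self)
        return replace(self, cache_path=self.cache_path / new_folder)

    def as_kwargs(self) -> dict:
        return asdict(self)

    @classmethod
    def from_kwargs(cls, settings: dict) -> "ConfigSketch":
        return cls(**settings)


base = ConfigSketch(enable_caching=True, cache_path=Path("cache"))
per_scan = base.extend_cache_path("scan_001").updated(automatic_offloading=True)
```

Because every method returns a new object, a shared base configuration can be handed to several factories without one of them changing the settings seen by the others.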
- class GSEGUtils.lazy_disk_cache.DiskBackedNDArray
Bases:
LazyDiskCache, NDArrayOperatorsMixin
Disk-backed view of a single numpy.ndarray.
Combines the offload-on-pressure semantics of LazyDiskCache with the NDArray operator surface of numpy.lib.mixins.NDArrayOperatorsMixin, so callers can index into the cache entry or pass it to a ufunc as if it were a regular ndarray. The underlying buffer is materialised on demand whenever an attribute marked with LazyDiskCache.ensure_loaded() is read.
- Parameters:
data (NDArray) – The initial in-memory array. Its shape and dtype are cached so the cache entry can be described while the buffer is offloaded.
**settings (Unpack[LazyDiskCacheKw]) – Forwarded to LazyDiskCache.__init__.
- __init__(data, **settings)
Initialize with data in memory and forward cache settings to the base class.
- _abc_impl = <_abc._abc_data object>
- _describe_buffer()
Return (shape, dtype, in_memory_array) describing the current live buffer.
- _describe_shape_dtype()
Return (shape, dtype) without materialising the full array.
- _drop_buffer()
Delete the in-memory buffer; direct _data reads then raise AttributeError (BUG-02 fix).
The @LazyDiskCache.ensure_loaded decorators on __array__() and __getitem__(), together with the offloaded-check in the data property, re-materialise the buffer on the next public access. Only callers that bypass those paths (and reach for self._data directly) will see the AttributeError, which is the intended contract.
- Return type:
None
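The contract described for _drop_buffer() (the attribute is deleted rather than set to None, and only the public path reloads it) can be sketched as follows. DroppableBuffer is a hypothetical class that stores raw bytes in a regular file; it illustrates the access pattern only.

```python
import os
import tempfile


class DroppableBuffer:
    """Hypothetical proxy: _data is deleted on offload, the data property reloads it."""

    def __init__(self, payload: bytes, path: str) -> None:
        self._path = path
        self._data = payload
        self.offloaded = False

    def offload(self) -> None:
        with open(self._path, "wb") as fh:
            fh.write(self._data)
        del self._data  # attribute removed, not set to None
        self.offloaded = True

    @property
    def data(self) -> bytes:
        if self.offloaded:  # public path: re-materialise on demand
            with open(self._path, "rb") as fh:
                self._data = fh.read()
            self.offloaded = False
        return self._data


buf = DroppableBuffer(b"\x01\x02", os.path.join(tempfile.mkdtemp(), "entry.dat"))
buf.offload()
try:
    buf._data  # bypasses the public path
    direct_read_failed = False
except AttributeError:
    direct_read_failed = True
restored = buf.data  # public access reloads from disk
```

Deleting the attribute makes a stale direct read fail loudly instead of handing back None, which would otherwise surface later as a confusing error far from the cause.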
- _set_buffer(buf)
Adopt buf as the new live buffer (typically a memmap returned from disk).
- property data
Return the underlying ndarray, loading it from disk on demand.
- class GSEGUtils.lazy_disk_cache.DiskBackedStore
Bases:
MutableMapping[str, T], Generic
Mapping of string keys to LazyDiskCache entries with a shared offload directory.
- Parameters:
config (LazyDiskCacheConfig, optional) – Shared cache configuration (cache dir, caching flag, offload policy, purge-on-gc policy). Defaults to LazyDiskCacheConfig().
factory (Factory[T]) – Callable used to wrap raw arrays into the concrete cache subtype T when add_data_to_store() is called.
value_type (type[T] or tuple of type[T], optional) – If set, every value inserted must be an instance of this type / one of these types.
validator (Validator[T], optional) – Additional runtime check executed on every insert.
Notes
Threading: this class has no instance lock; per-entry writes get their atomicity from LazyDiskCache’s own threading.RLock plus the os.replace semantics of offload(). Single-PCD multi-thread mutation is unsupported (see PROJECT.md threading constraint).
- _DBNDArrayFileExt = '.npy'
- _DBNDArrayMetaExt = '.meta.json'
- _LegacyPickleExt = '.pkl'
- __init__(*, config=LazyDiskCacheConfig(), factory, value_type=None, validator=None)
- Parameters:
config (LazyDiskCacheConfig)
factory (Factory)
validator (Validator | None)
- Return type:
None
- _abc_impl = <_abc._abc_data object>
- _get_legacy_pickle_path(feature)
Return the legacy pre-Phase-2 .pkl path for feature (refused on read).
- _get_meta_path(feature)
Return the on-disk JSON sidecar path for feature.
- _get_npy_path(feature)
Return the on-disk .npy path for feature.
- _load_entry(key)
Load a cache entry from the <key>.npy + <key>.meta.json pair.
Refuses legacy .pkl files with a single INFO log (D-05). Raises KeyError on cache miss, ValueError on schema-version mismatch or unknown lazy_disk_cache_class.
Per W-5: the reconstructed instance’s cache_path field is populated to the <key>.npy file path so the Plan-02-04 finalizer’s LazyDiskCache.enable_purge() reaches the registration branch instead of silently no-op-ing on if not self._cache_path: return. Note that LazyDiskCache._init_from_config() re-suffixes the provided cache_path with _MEMMAP_SUFFIX (.dat) internally, so the live self._cache_path on the reconstructed instance is <key>.dat rather than <key>.npy. The W-5 invariant (a non-None cache_path so enable_purge can register) holds either way.
- Parameters:
key (str)
- Return type:
T
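The documented failure modes of the loader (legacy pickle refused with an INFO log, KeyError on a miss, ValueError on a bad sidecar) might look roughly like this. It is a sketch under assumptions: load_entry is a hypothetical free function, the schema_version key name and its value 1 are invented for illustration, and the payload file is only located, not decoded.

```python
import json
import logging
import os
import tempfile

logger = logging.getLogger(__name__)
SCHEMA_VERSION = 1                     # assumed value
KNOWN_CLASSES = {"DiskBackedNDArray"}  # assumed registry of reconstructable classes


def load_entry(cache_dir: str, key: str) -> tuple:
    """Return (payload_path, meta) for key, following the documented failure modes."""
    npy_path = os.path.join(cache_dir, key + ".npy")
    meta_path = os.path.join(cache_dir, key + ".meta.json")
    if os.path.exists(os.path.join(cache_dir, key + ".pkl")):
        # Legacy files are never unpickled; they are only reported.
        logger.info("refusing legacy pickle for %r", key)
    if not (os.path.exists(npy_path) and os.path.exists(meta_path)):
        raise KeyError(key)  # cache miss, including half-written pairs
    with open(meta_path, encoding="utf-8") as fh:
        meta = json.load(fh)
    if meta.get("schema_version") != SCHEMA_VERSION:
        raise ValueError(f"schema version mismatch for {key!r}")
    if meta.get("lazy_disk_cache_class") not in KNOWN_CLASSES:
        raise ValueError(f"unknown lazy_disk_cache_class for {key!r}")
    return npy_path, meta


cache_dir = tempfile.mkdtemp()
open(os.path.join(cache_dir, "intensity.npy"), "wb").close()
with open(os.path.join(cache_dir, "intensity.meta.json"), "w", encoding="utf-8") as fh:
    json.dump({"schema_version": 1, "lazy_disk_cache_class": "DiskBackedNDArray"}, fh)
path, meta = load_entry(cache_dir, "intensity")
```

Treating a missing half of the pair as a plain cache miss keeps the reader simple: it never has to guess whether a lone file is trustworthy.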
- _store_entry(key, entry)
Atomically write a cache entry as .npy + .meta.json pair (D-04 + Pitfall 4).
Write order: .npy.tmp → flush+fsync → .meta.json.tmp → flush+fsync → os.replace(.npy.tmp → .npy) → os.replace(.meta.json.tmp → .meta.json) → POSIX dir-fsync. A torn write leaves only .tmp files which the reader treats as cache miss. Disk-full / permission errors are re-raised after best-effort .tmp cleanup.
- Parameters:
key (str)
entry (LazyDiskCache)
- Return type:
None
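The write order above can be sketched with standard-library calls. store_pair and _write_synced are hypothetical helpers, and the payload is raw bytes in a .bin file standing in for the real .npy encoding, so this shows the ordering and cleanup only.

```python
import json
import os
import tempfile


def _write_synced(path: str, payload: bytes) -> None:
    """Write payload and force it to stable storage before returning."""
    with open(path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())


def store_pair(data_path: str, data: bytes, meta_path: str, meta: dict) -> None:
    """Write both files to .tmp siblings, then publish them with os.replace."""
    tmp_paths = [data_path + ".tmp", meta_path + ".tmp"]
    try:
        _write_synced(tmp_paths[0], data)
        _write_synced(tmp_paths[1], json.dumps(meta).encode("utf-8"))
        os.replace(tmp_paths[0], data_path)  # atomic rename of the payload
        os.replace(tmp_paths[1], meta_path)  # atomic rename of the sidecar
        if os.name == "posix":
            # Persist the directory entries created by the renames.
            dir_fd = os.open(os.path.dirname(os.path.abspath(data_path)), os.O_RDONLY)
            try:
                os.fsync(dir_fd)
            finally:
                os.close(dir_fd)
    except OSError:
        for tmp in tmp_paths:  # best-effort cleanup, then re-raise
            try:
                os.remove(tmp)
            except OSError:
                pass
        raise


out_dir = tempfile.mkdtemp()
store_pair(
    os.path.join(out_dir, "feat.bin"),
    b"raw-bytes",
    os.path.join(out_dir, "feat.meta.json"),
    {"schema_version": 1},
)
```

Both temporary files are fully written and synced before either rename happens, so a crash during the write phase leaves only .tmp files behind.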
- add_data_to_store(key, data, *, enable_caching_override=None, automatic_offloading_override=None, purge_disk_on_gc_override=None)
Wrap data via the configured factory and insert it under key.
- Parameters:
key (str) – Key under which the new cache entry is registered.
data (NDArray) – Raw array to be wrapped.
enable_caching_override (bool, optional) – Per-entry override for the store-level caching flag.
automatic_offloading_override (bool, optional) – Per-entry override for the store-level auto-offload flag.
purge_disk_on_gc_override (bool, optional) – Per-entry override for the store-level purge-on-gc flag.
- Raises:
KeyError – If key is already present in the store.
- Return type:
None
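One plausible way to resolve the per-entry overrides against store-level defaults is shown below. TinyStore, pick, and make_entry are hypothetical, and only one of the three override flags is modelled to keep the sketch short.

```python
from typing import Callable, Dict, Optional


def pick(override: Optional[bool], default: bool) -> bool:
    """A per-entry override wins only when it is explicitly given (not None)."""
    return default if override is None else override


class TinyStore:
    """Hypothetical store: shared defaults, per-entry overrides, no silent overwrite."""

    def __init__(self, factory: Callable[..., object], enable_caching: bool = False) -> None:
        self._factory = factory
        self._enable_caching = enable_caching
        self._entries: Dict[str, object] = {}

    def add_data_to_store(self, key, data, *, enable_caching_override=None) -> None:
        if key in self._entries:
            raise KeyError(f"{key!r} is already present")
        self._entries[key] = self._factory(
            data, enable_caching=pick(enable_caching_override, self._enable_caching)
        )


def make_entry(data, *, enable_caching):
    """Hypothetical factory that records the resolved flag next to the data."""
    return {"data": data, "enable_caching": enable_caching}


store = TinyStore(make_entry, enable_caching=False)
store.add_data_to_store("xyz", [1.0, 2.0])                              # store default
store.add_data_to_store("rgb", [0, 255], enable_caching_override=True)  # per-entry override
```

Comparing against None rather than using `override or default` matters here, because an explicit False override must be able to switch off a store-level True.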
- items()
Iterate over (key, value) pairs (value is None where offloaded).
- offload(keys=None, pickle_container=False)
Offload selected entries to disk.
When no keys are provided every cached entry is considered. Items with cache_enabled=False are skipped. When pickle_container is True (the legacy parameter name, kept for backward compatibility) the entire container entry is offloaded via the Phase-2 codec (.npy + JSON sidecar, no actual pickling), the in-memory reference is cleared, and the next access reloads it lazily via _load_entry().
- Parameters:
keys (str or list[str], optional) – Specific keys to offload. Defaults to every tracked key.
pickle_container (bool, optional) – When True write the wrapping container via the codec; when False (default) delegate to each entry’s own offload() method. The name is retained for API stability; no pickle is used.
- Return type:
None
- property store: dict[str, T | None]
Return the internal mapping of keys to in-memory entries (None if offloaded).
- values()
Iterate over the current in-memory entries (None where offloaded).
- Return type:
Iterator[T | None]
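The difference between keyed access (which reloads) and values() (which reports None for offloaded entries) can be sketched with a small MutableMapping. LazyMapping and its loader callback are hypothetical, and a dict lookup stands in for reading from disk.

```python
from collections.abc import MutableMapping
from typing import Callable, Dict, Iterator, Optional


class LazyMapping(MutableMapping):
    """Hypothetical mapping whose stored values become None while offloaded."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._store: Dict[str, Optional[object]] = {}
        self._loader = loader

    def __getitem__(self, key: str) -> object:
        value = self._store[key]  # KeyError for unknown keys
        if value is None:         # offloaded: reload lazily
            value = self._store[key] = self._loader(key)
        return value

    def __setitem__(self, key: str, value: object) -> None:
        self._store[key] = value

    def __delitem__(self, key: str) -> None:
        del self._store[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._store)

    def __len__(self) -> int:
        return len(self._store)

    def values(self):
        # The inherited values() would go through __getitem__ and trigger
        # reloads; this override reports the in-memory state as it is.
        return iter(self._store.values())

    def offload(self, key: str) -> None:
        self._store[key] = None  # drop the in-memory reference, keep the key


disk = {"a": "payload-a"}  # stands in for the on-disk copy
mapping = LazyMapping(disk.__getitem__)
mapping["a"] = "payload-a"
mapping.offload("a")
peek = list(mapping.values())  # inspects without loading
loaded = mapping["a"]          # keyed access reloads the entry
```

Keeping iteration free of reloads lets a caller survey a large store without pulling every offloaded entry back into memory.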
- _factory: Factory