Proposed refactoring or deprecation
Introduce a real storage abstraction layer for Aim’s backend storage, so that the core repository/query layer is not tightly coupled to the current RocksDB-based implementation.
The goal is not to remove RocksDB or force a different default. The goal is to make the storage architecture extensible enough that the core maintainers can continue using and supporting RocksDB, while the community can implement alternative backends where needed.
Motivation
A number of existing issues suggest that the current storage architecture is creating operational and scaling pain, while also making it difficult for users to adopt alternative backends without forking or large internal changes.
Relevant issues include:
There are also related requests for storage pluggability on the artifact/object side:
In my own investigation of a `Too many open files` failure, the problem did not appear to be a simple OS limit issue. One Aim worker still had roughly 1000 regular files open, most of them `.sst` files under `.aim/meta/chunks/*`.
From reading the code, this appears to be tied to the current architecture:
- there is a generic container interface in `aim/storage/container.py`
- but `Repo` still directly imports and instantiates `RocksContainer` / `RocksUnionContainer`
- run storage is bound to chunk-local trees
- union read paths enumerate and open chunk DBs
- `max_open_files=-1` is set in the RocksDB container implementation
Taken together, this suggests that RocksDB is not just the default backend, but a core architectural assumption in the current implementation.
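As a rough back-of-the-envelope illustration of how this layout can exhaust file descriptors (the numbers below are hypothetical, chosen only to match the scale I observed, not measured from Aim):

```python
# With max_open_files=-1, RocksDB keeps every opened .sst file's descriptor
# cached indefinitely, so a union read path that opens one DB per run chunk
# pays a descriptor cost of roughly (chunk DBs) x (.sst files per DB).
# All numbers here are illustrative.
chunk_dbs = 200          # one RocksDB instance per run chunk
sst_files_per_db = 5     # a plausible count for a small run
fds_per_worker = chunk_dbs * sst_files_per_db
print(fds_per_worker)    # -> 1000, the same order of magnitude as observed
```

A bounded setting (e.g. a finite `max_open_files` per DB) would cap each DB's table cache, but with hundreds of chunk DBs the per-process total can still grow large, which is part of why a backend able to share one database across runs could help.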
Pitch
I would like to propose introducing a proper abstraction boundary above the current low-level container API, at the repository/storage-factory level.
Concretely, this would ideally mean:
- `Repo` and higher-level query/storage paths depend on a backend interface rather than directly on Rocks-specific classes
- RocksDB remains a first-class default backend
- users are not forced into the current chunk-local RocksDB layout as the only practical architecture
- the community can implement alternative backends for their own needs without requiring the maintainers to replace the default storage engine
- the same architectural principle can be applied to object/artifact storage as well
This would let the core maintainers keep the current storage model where it works well, while making Aim more adaptable for users whose workloads would benefit from a different backend.
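To make the proposed boundary concrete, here is a minimal sketch of what a backend interface plus a factory registry could look like. All names (`StorageBackend`, `register_backend`, `open_container`, the in-memory backend) are hypothetical illustrations, not Aim's actual API:

```python
# Hypothetical sketch of a repository-level storage abstraction.
from typing import Callable, Dict, Protocol


class StorageBackend(Protocol):
    """Minimal key-value container interface a backend would implement."""

    def get(self, key: bytes) -> bytes: ...
    def set(self, key: bytes, value: bytes) -> None: ...
    def close(self) -> None: ...


# Registry mapping backend names to factories. RocksDB would stay the
# registered default; third parties can add entries without touching Repo.
BACKENDS: Dict[str, Callable[[str], StorageBackend]] = {}


def register_backend(name: str, factory: Callable[[str], StorageBackend]) -> None:
    BACKENDS[name] = factory


def open_container(path: str, backend: str = 'rocksdb') -> StorageBackend:
    # Repo would call this instead of constructing RocksContainer directly.
    return BACKENDS[backend](path)


# Example: an in-memory backend registered by a third-party package.
class InMemoryBackend:
    def __init__(self, path: str) -> None:
        self._data: Dict[bytes, bytes] = {}

    def get(self, key: bytes) -> bytes:
        return self._data[key]

    def set(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def close(self) -> None:
        self._data.clear()


register_backend('memory', InMemoryBackend)
container = open_container('/tmp/demo', backend='memory')
container.set(b'run/1', b'metadata')
```

The point of the sketch is only the shape of the seam: `Repo` depends on `StorageBackend` and a factory, and the Rocks classes become one registered implementation among others.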
Additional context
Files that seem especially relevant:
- `aim/storage/container.py`
- `aim/storage/rockscontainer.pyx`
- `aim/storage/union.pyx`
- `aim/sdk/repo.py`
- `aim/sdk/base_run.py`
- `aim/sdk/index_manager.py`
Relevant code observations:
- `aim/storage/container.py` provides a generic storage interface
- `aim/sdk/repo.py` directly imports and constructs `RocksContainer` / `RocksUnionContainer`
- `aim/sdk/base_run.py` binds run data to chunk-local trees under `meta/chunks/<run>` and `seqs/chunks/<run>`
- `aim/storage/union.pyx` enumerates and opens chunk databases for read access
- `aim/storage/rockscontainer.pyx` sets `max_open_files=-1`
I think a real abstraction layer here would be valuable even if no new backend ships immediately, because it would reduce coupling and make future storage work much easier for both maintainers and the community.
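For completeness, one way community backends could be discovered without Aim importing them directly is Python package entry points. This is a hedged sketch under an assumed entry-point group name (`aim.storage_backends` is made up, not an existing Aim convention):

```python
import sys
from importlib.metadata import entry_points  # stdlib since Python 3.8


def discover_backends() -> dict:
    """Collect storage-backend factories advertised by installed packages."""
    group = 'aim.storage_backends'  # hypothetical entry-point group name
    if sys.version_info >= (3, 10):
        eps = entry_points(group=group)
    else:  # the pre-3.10 API returns a mapping of group -> entry points
        eps = entry_points().get(group, [])
    return {ep.name: ep.load() for ep in eps}


print(discover_backends())  # empty unless an installed package registers one
```

A third-party package would then declare its backend in its own packaging metadata, and Aim would pick it up at startup without any hard import dependency.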