The Road to 1.0
by @haadcode and @aphelionz
Big things are coming to OrbitDB. In this document, we describe our proposal for getting OrbitDB from alpha to 1.0.
As always, our roadmap combines the long-term vision for OrbitDB, features users have been asking for, issues the community wants to address, items the core developers have discussed separately, and bugs encountered while using OrbitDB. Please note that inclusion in the roadmap is not a promise of delivery!
Things here are subject to alteration or deletion. As of this writing, these should be considered proposals and open conversations, and anybody should feel welcome and able to provide feedback in the form of questions, comments, or suggestions.
As always, we welcome contributions from the community and would be happy to help land any of the discussed features or fixes.
If you would like to financially support OrbitDB, we now have an OpenCollective where you can contribute. Anything helps and we are forever grateful for your support, monetary or otherwise.
In general, the features and improvements proposed fall into three categories: Performance and Resource Consumption, User Experience, and Encryption. Without further ado, let's look at the specific items.
Checklist
Non-breaking changes
The changes in this section should be implementable without breaking backwards compatibility or public-facing APIs. Though certain application-level details might change and need to be addressed, by and large these changes should not require a new major version.
Replicator Refactoring
Use Case: I have a database that has been replicated locally. I want to get the current state of the db as fast as possible when opening the db (in order to return the first query as fast as possible).
As of right now, the store replicator uses the next field in a log entry to replicate, whereas it could use the new refs field, as loading now does.
This is by far the most effective improvement we can make, and perhaps the one most often requested and discussed in the community.
There are other possible ways to address the initial query and loading performance that we may want to pursue.
Further discussion: orbitdb-archive/SCPs#3
Work is happening here: orbitdb-archive/orbit-db-store#100
BTree Indexing for KVStore and DocStore
As it stands, all keys from the database index are kept in memory. This works well for most cases, but becomes a limit once you get to the order of 1M keys or more. The memory footprint can be reduced with the use of B-Trees.
@vasa-develop and @vaultec81 utilized this technique in AvionDB.
This is highly connected to "Hot/Cold Data Separation (in-memory vs. on-disk data)".
Snapshots
Use Case: I have a database that has been replicated locally. I want to get the current state of the db as fast as possible when opening the db (in order to return the first query as fast as possible).
A snapshot is the current state of the database, i.e. only the current data without the database oplog (history). The snapshot of the current state could potentially be a log db itself.
User Experience Improvements
A collection of "small" items that would improve UX for the OrbitDB user.
- Add a "merge fields" option to DocStore.put to merge the fields of the current doc and the updated doc
- Remove the need for a database name and just use the CID as the address. Move everything else to the manifest (which already contains the name of the database).
- Remove the need for a separate load() call (but keep it available) and provide a one-liner to start, e.g. OrbitDB.open(<address>) would instantiate the orbitdb object, open the database, and do what currently happens in load().
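The "merge fields" option from the list above could behave roughly as follows. This is a minimal sketch that uses a plain Map in place of a real docstore index; the putDoc helper and the merge option name are hypothetical and are not part of the current OrbitDB API.

```javascript
// Hypothetical stand-in for a docstore index: maps _id -> document.
const index = new Map()

// Hypothetical helper illustrating a "merge fields" option.
// With { merge: true }, fields of the stored doc are kept unless the
// updated doc overrides them; without it, the doc is replaced outright.
function putDoc (index, doc, { merge = false } = {}) {
  const current = index.get(doc._id)
  const next = merge && current ? { ...current, ...doc } : doc
  index.set(doc._id, next)
  return next
}

putDoc(index, { _id: 'a', name: 'Ada', role: 'admin' })
putDoc(index, { _id: 'a', role: 'user' }, { merge: true })
```

The merge here is shallow: nested objects in the updated doc replace the stored ones rather than being merged field by field. Whether a deep merge is preferable is an open design question.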
Developer Experience: The Publish Dance
One of the biggest hurdles to releases is what the contributors call the "publish dance", which requires a coordinated effort to publish around 20 different npm modules that together constitute an OrbitDB release. There's no need to enumerate them here, but the process generally starts from ipfs-log and moves upward to the top-level orbit-db.
The community has discussed solving this at the tooling level, such as using Lerna for module management, but a better alternative would be to address it at the architecture / implementation level by:
- removing the inheritance of stores and injecting the Store module into stores
- removing the ipfs-log dependency from Store and injecting it from OrbitDB
- generally switching all inheritance to dependency injection (e.g. feedstore takes an eventstore as a parameter instead of inheriting from it)
All of these would make it possible to configure the dependencies at the main package level, in orbit-db, giving users more flexibility in choosing which modules and versions they use.
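To illustrate the switch from inheritance to dependency injection, here is a minimal sketch. The EventStore and FeedStore classes below are simplified, hypothetical stand-ins (a plain array plays the role of the log); they do not reflect the actual orbit-db-eventstore or orbit-db-feedstore implementations.

```javascript
// Hypothetical event store that receives its log as a parameter instead of
// depending on a specific log implementation itself.
class EventStore {
  constructor (log) {
    this.log = log // injected dependency (here: a plain array)
  }

  // Appends a payload and returns its position in the log as an id.
  add (payload) {
    this.log.push(payload)
    return this.log.length - 1
  }

  entries () {
    return this.log.map((payload, id) => ({ id, payload }))
  }
}

// Feed store built by composition: it takes an event store instance as a
// parameter rather than extending the EventStore class.
class FeedStore {
  constructor (eventStore) {
    this.events = eventStore
    this.removed = new Set()
  }

  add (payload) {
    return this.events.add(payload)
  }

  remove (id) {
    this.removed.add(id)
  }

  all () {
    return this.events.entries()
      .filter(entry => !this.removed.has(entry.id))
      .map(entry => entry.payload)
  }
}

// The top-level package decides which pieces are wired together.
const feed = new FeedStore(new EventStore([]))
```

Because the wiring happens in one place, swapping in a different log or event store version would only require changing the last line, not the store classes.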
Community efforts
There are a number of community efforts that we'd like to focus on getting merged, for two reasons. First, we value our community's input and want to further streamline their contributions, and second, we want to make sure they are merged before the breaking changes in the next section.
See the GitHub project for more info.
Breaking changes
Ok, on to the main event.
Given the scale and impact of these changes, backwards compatibility may be abandoned, and we would release a new major version to signal the breaking changes.
Oplog Watermarks
As of right now, the kvstore and docstore reduce the full log any time updateIndex is called, which is on every write to the oplog. This is slow, gets slower as the log grows, and is ultimately unnecessary.
A solution to this could be to add a high/low watermark and only process new oplog entries. This applies especially to KVStore and DocStore.
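The watermark idea can be sketched as follows. This is a simplified illustration, not the actual store code: the WatermarkedIndex class is hypothetical, and the oplog is assumed to be an append-only array of { op, key, value } entries. A real oplog is a CRDT in which merged entries may be ordered before ones already processed, which a production design would have to account for.

```javascript
// Hypothetical key-value index that remembers how far into the oplog it has
// already processed (the "watermark") and only reduces entries after it.
class WatermarkedIndex {
  constructor () {
    this.state = new Map()
    this.watermark = 0 // number of oplog entries already applied
  }

  // `oplog` is assumed to be an append-only array of { op, key, value }.
  // Returns how many entries were processed by this call.
  updateIndex (oplog) {
    let processed = 0
    for (let i = this.watermark; i < oplog.length; i++) {
      const { op, key, value } = oplog[i]
      if (op === 'PUT') this.state.set(key, value)
      else if (op === 'DEL') this.state.delete(key)
      processed++
    }
    this.watermark = oplog.length
    return processed
  }
}
```

With this shape, the cost of each updateIndex call depends on the number of new entries rather than on the total length of the log.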
Database Encryption
It should be possible to encrypt the payload of an OrbitDB database. We have been pushing this back because the technologies involved were not yet mature and because we wanted to give the user flexibility. Admittedly, "suggesting" a default encryption scheme means taking on the onus of what that choice implies. However, it's become increasingly apparent that something like this is necessary.
This is another change that touches and affects everything in the architecture as well as the data formats, so this should be considered the beginning of the discussion.
- How many keys are there, and what is each used to encrypt (oplog entries vs. payloads)?
- Where are they stored?
- How does this affect AccessControllers?
- How does this tie into hot/cold data (see below)?
Many projects have rolled their own solutions, e.g. TallyLab using the nacl-js library, the proposal for dag-jose by @oed, and so on, so there are places to seek inspiration.
Async Iterators / Streaming
When applicable we should be using async iterators / generators (or streams) to process data and then "discard" it, allowing more real-time capabilities and the ability to return results as they become available, instead of waiting for the full log to be fetched or processed.
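As a rough illustration of the streaming approach, the sketch below uses async generators to return query results as entries arrive. The fetchEntries, query, and take functions are hypothetical names chosen for this example, and the entry source is simulated with an in-memory array.

```javascript
// Hypothetical entry source: yields log entries one at a time, as if each
// had just been fetched from the network or disk.
async function * fetchEntries (entries) {
  for (const entry of entries) {
    await Promise.resolve() // stands in for real asynchronous I/O
    yield entry
  }
}

// Yields matching payloads as soon as they are seen; nothing is buffered.
async function * query (source, predicate) {
  for await (const entry of source) {
    if (predicate(entry.payload)) yield entry.payload
  }
}

// Collects the first `limit` results, then stops pulling from the source.
async function take (iterable, limit) {
  const results = []
  if (limit <= 0) return results
  for await (const item of iterable) {
    results.push(item)
    if (results.length >= limit) break
  }
  return results
}
```

A caller could write take(query(fetchEntries(entries), p => p % 2 === 0), 1) and receive the first match without waiting for the rest of the log to be fetched.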
Hot/Cold Data Separation (in-memory vs. on-disk data)
Currently the entire database (log entries and the computed state) is loaded into memory before use. This takes time to load and uses more memory than necessary. Changing this, however, is another massive change that affects every other part of the system.
This will also positively affect perceived performance and user experience: entries would load almost instantly, and reasoning about the state of the db replication becomes easier.
The general idea here is to 1) read and compute database state on demand (i.e. upon query), 2) cache "warm" data (data that was recently used or is most likely to be used soon) in a configurable in-memory cache, and 3) fall back to reading from disk when the cache doesn't have the data available.
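The three steps above can be sketched with a small read-through cache. This is only one possible shape under stated assumptions: the WarmCache class is hypothetical, "warm" is approximated by least-recently-used eviction, and readFromDisk stands for whatever async function loads a value from the on-disk store.

```javascript
// Hypothetical read-through cache: keeps at most `maxSize` recently used
// values in memory and falls back to `readFromDisk` on a miss.
class WarmCache {
  constructor (readFromDisk, maxSize = 100) {
    this.readFromDisk = readFromDisk // async (key) => value
    this.maxSize = maxSize // configurable in-memory limit
    this.cache = new Map() // a Map iterates in insertion order, oldest first
  }

  async get (key) {
    if (this.cache.has(key)) {
      // Re-insert to mark the key as most recently used.
      const value = this.cache.get(key)
      this.cache.delete(key)
      this.cache.set(key, value)
      return value
    }
    // Cache miss: read from disk on demand and keep the value warm.
    const value = await this.readFromDisk(key)
    this.cache.set(key, value)
    if (this.cache.size > this.maxSize) {
      // Evict the least recently used key.
      this.cache.delete(this.cache.keys().next().value)
    }
    return value
  }
}
```

Predicting data that is likely to be used soon (rather than data that was recently used) would need a prefetching strategy on top of this, which the sketch leaves out.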
Misc. Cleanup
Some more items that are smaller in size / complexity.
- clean up all events and their semantics (e.g. only one "updated" event instead of "write" and "replicated"), perhaps remove some and use only a callback (e.g. "onLoadProgressCallback").
- separate identities/keys from the oplog entry, with one CID per identity/key. This cuts N bytes from each pubsub message and bitswap/ipfs/ipld block transfer.
- kvstore: keep only the keys in memory, make them point to the CID with the data, and fetch the data on query from cache/ipfs.
Potential Rust Integration
We also, during the course of this work, want to explore integrating Rust into the project, in two potential places:
- Specific places that can benefit from wasm performance, likely things like crypto verification and maybe CRDT calculation
- Implementing pieces of OrbitDB as separate Rust integrations, to be used with Rust projects like Rust IPFS
Conclusion
These are our plans for 2020 onward. With these features and changes implemented, we believe OrbitDB would be on par with our vision for it as well as the user needs, and would make an excellent version 1.0.
Let us know what you think, and again, if you find any of this valuable and want to help, the best way is via the OrbitDB Open Source Community or the OpenCollective.