The goals of this project are:
- Partition a collection based on `_id`, much like mongosync
- On each partition, run the following algorithm:
  1. Sample 100 documents
  2. Calculate their schema
  3. Find documents that do not match the schema
  4. Calculate the schema for any returned documents
  5. Merge the schemas
  6. Repeat 3-5 until no new documents are returned
- Merge each partition schema into a single schema
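
The goals above can be sketched in a few lines of Rust. This is an illustrative, in-memory sketch only, not the library's implementation: every name in it (`Value`, `Doc`, `Schema`, `schema_of`, `merge`, `conforms`, `partition_schema`, `collection_schema`) is hypothetical, a "schema" is reduced to the set of type names seen per field, partitioning is approximated by sorting on `_id` and chunking, and "sampling" simply takes the first `batch` documents instead of querying a server.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy document value (assumption: real BSON has many more types).
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Int(i64),
    Str(String),
    Bool(bool),
    Null,
}

impl Value {
    fn type_name(&self) -> &'static str {
        match self {
            Value::Int(_) => "int",
            Value::Str(_) => "string",
            Value::Bool(_) => "bool",
            Value::Null => "null",
        }
    }
}

/// Toy document: an `_id` plus named fields.
#[derive(Clone, Debug)]
pub struct Doc {
    pub id: i64,
    pub fields: BTreeMap<String, Value>,
}

/// Toy schema: for each field name, the set of type names observed.
pub type Schema = BTreeMap<String, BTreeSet<&'static str>>;

/// Calculate the schema of a batch of documents.
pub fn schema_of(docs: &[&Doc]) -> Schema {
    let mut schema = Schema::new();
    for doc in docs {
        for (name, value) in &doc.fields {
            schema.entry(name.clone()).or_default().insert(value.type_name());
        }
    }
    schema
}

/// Merge two schemas by taking the union of the types seen for each field.
pub fn merge(mut a: Schema, b: Schema) -> Schema {
    for (name, types) in b {
        a.entry(name).or_default().extend(types);
    }
    a
}

/// True if every field of `doc` is described by `schema`.
pub fn conforms(doc: &Doc, schema: &Schema) -> bool {
    doc.fields.iter().all(|(name, value)| {
        schema
            .get(name)
            .map_or(false, |types| types.contains(value.type_name()))
    })
}

/// Build the schema for one partition.
pub fn partition_schema(partition: &[Doc], batch: usize) -> Schema {
    // "Sample" documents (here: the first `batch`) and calculate their schema.
    let sample: Vec<&Doc> = partition.iter().take(batch).collect();
    let mut schema = schema_of(&sample);
    loop {
        // Find documents that do not match the current schema.
        let misses: Vec<&Doc> = partition
            .iter()
            .filter(|d| !conforms(d, &schema))
            .take(batch)
            .collect();
        if misses.is_empty() {
            return schema; // no new documents were returned
        }
        // Calculate the schema of the returned documents and merge it in.
        schema = merge(schema, schema_of(&misses));
    }
}

/// Partition on `_id`, build each partition's schema, then merge them all.
pub fn collection_schema(docs: &mut [Doc], partition_size: usize, batch: usize) -> Schema {
    docs.sort_by_key(|d| d.id);
    docs.chunks(partition_size.max(1)) // `chunks` panics on a size of 0
        .map(|p| partition_schema(p, batch))
        .fold(Schema::new(), merge)
}
```

Calling `collection_schema(&mut docs, partition_size, 100)` mirrors the sample size of 100 used above. The loop terminates because every non-empty batch of mismatching documents is merged into the schema, so those documents match on the next pass.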

The schema-builder-library integration tests check the correctness of the library's methods against a
live database. These tests require a running enterprise mongod server (version 6.0 or later). They also
require that the test data be loaded before the tests are run. The data is stored here.

After decompressing the data, load it into the database with the sql-engines-common-test-infra
data-loader tool. See `cargo run --bin data-loader -- --help` in that repo for more details.

With the data loaded, the schema-builder-library integration tests can be run from the root of
the mongosql repository via:

`cargo test --features=integration --package=schema-builder-library --lib internal_integration_tests -- --test-threads=1`