Hex clusters Discworld's stories.
A clustering and search tool applied to the plots of Discworld novels. Given an input sentence, it currently finds the most similar parts of Discworld books based on their plot summaries from Wikipedia.
This is just a tiny proof-of-concept of using FAISS with transformer language models that could be easily extended to cover much larger datasets.
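To make the idea concrete, here is a minimal, standard-library-only sketch of the search step. It is not the project's code: the real tool uses a transformer model for embeddings and FAISS for the index, whereas this sketch swaps in normalised word counts and a brute-force inner-product search. The names `embed`, `search`, and `passages`, and the one-line summaries, are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a transformer embedding: unit-length word counts."""
    counts = Counter(text.lower().split())
    # Fall back to 1.0 so an empty string does not divide by zero.
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def search(index, query, k=2):
    """Brute-force top-k by inner product (cosine, as vectors are unit length)."""
    q = embed(query)
    scored = [
        (sum(weight * vec.get(word, 0.0) for word, weight in q.items()), label)
        for label, vec in index
    ]
    return sorted(scored, reverse=True)[:k]


# Hypothetical passages; the real tool reads plot summaries from Wikipedia.
passages = {
    "passage-1": "a wizard is sent to guard a naive tourist",
    "passage-2": "the city watch investigates a dragon",
    "passage-3": "three witches meddle in royal politics",
}
index = [(label, embed(text)) for label, text in passages.items()]

for score, label in search(index, "a dragon threatens the city", k=2):
    print(f"{score:.3f}  {label}")
```

An exhaustive scan like this scales linearly with the number of passages, which is the cost a dedicated index library is meant to manage on larger datasets.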
Should work out of the box with bash and a couple of prerequisites:
( cd conda && source bootstrap.sh )
conda activate discworld-hex
poetry install
TL;DR (when poetry is installed and the discworld-hex conda env is activated):
build
search
To only fetch data and build and export the index:
build
# is just a shortcut for:
poetry run build
To use the index to search:
search
# is just a shortcut for:
poetry run search
To run any python script in this project:
poetry run python src/discworld_hex/any_file.py
To run all checks:
poetry run pre-commit
(What the user would notice.)
- Allow custom wikipedia queries on the input (and thus custom libraries)
- Fine-tune (e.g., standard (masked) language modelling) on the specific subdomains
- Aggregate search results per-book
- Allow merging libraries
- Better CLI: allow changing k, passing in multiple sentences, etc., either:
- Support other (faster, less accurate) indexes
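As one possible shape for the per-book aggregation item above, the sketch below groups passage-level hits by book and ranks books by their best-scoring passage. This is an assumption about how it could work, not something the project implements: the `(book, score)` hit format and the function name `aggregate_per_book` are hypothetical, and it assumes a higher score means more similar (for a distance-based index the ordering would be reversed).

```python
def aggregate_per_book(hits):
    """Collapse (book, score) hits into one row per book, best score first."""
    best = {}
    count = {}
    for book, score in hits:
        # Keep the highest score seen for each book and count its hits.
        best[book] = max(best.get(book, float("-inf")), score)
        count[book] = count.get(book, 0) + 1
    rows = [(book, best[book], count[book]) for book in best]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical search output: several passages may come from the same book.
hits = [("Book A", 0.81), ("Book B", 0.77), ("Book A", 0.64)]
for book, score, n in aggregate_per_book(hits):
    print(f"{book}: best={score:.2f} from {n} passage(s)")
```

Ranking by the best passage is only one choice; summing or averaging scores would instead favour books with many moderately similar passages.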
(What the user shouldn't notice.)
- Less redundant library serialization
- More tests
- Rebuilding Library and the FAISS index