duplikeit: Vision Statement
duplikeit aims to keep a body of knowledge coherent, minimal, and future-friendly, while respecting its past.
The guiding principle is DRY + SSOT: a single source of truth that’s elegant and non-redundant, but still provides a lens into history for context.
Core Functions
- Vector representations
  Each document (or part of a document) is mapped into a compact vector space so similarity can be measured consistently.
- Similarity and distance
  - Items that are current and active should appear closer in that space, so the system can strongly suggest when they overlap too much.
  - Items that are historical or context-only should appear further away in that space, so they remain accessible but don’t compete with the single source of truth.
- Thresholds for action
  When two or more documents fall within a configured distance of each other, the system can signal to the author of new content that it may be time to refactor (merging, consolidating, or replacing content) to maintain the “Don’t Repeat Yourself” (DRY) ideal.
- Assist both author and reviewer roles
  The system not only warns authors about duplication during writing, but also acts as a curation assistant for reviewers/maintainers, guiding them in periodically tidying, refactoring, and simplifying the collection.
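The core functions above can be illustrated with a minimal sketch. It is not duplikeit's implementation: the word-count "embedding", the `Doc` class, the status labels, and the `HISTORICAL_PENALTY` and `REFACTOR_THRESHOLD` values are all assumptions chosen for illustration. A real system would likely use a learned embedding model and tuned thresholds.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Doc:
    doc_id: str
    text: str
    status: str = "active"  # hypothetical labels: "active" or "historical"


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; 1.0 when either vector is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


HISTORICAL_PENALTY = 0.5  # assumed: pushes historical items further away
REFACTOR_THRESHOLD = 0.3  # assumed: pairs at or below this distance get flagged


def effective_distance(x: Doc, y: Doc) -> float:
    """Text distance, widened when either document is historical."""
    d = cosine_distance(embed(x.text), embed(y.text))
    if "historical" in (x.status, y.status):
        d = min(1.0, d + HISTORICAL_PENALTY)
    return d


def refactor_candidates(docs, threshold=REFACTOR_THRESHOLD):
    """Return (id, id, distance) for pairs close enough to suggest a refactor."""
    pairs = []
    for x, y in combinations(docs, 2):
        d = effective_distance(x, y)
        if d <= threshold:
            pairs.append((x.doc_id, y.doc_id, round(d, 3)))
    return sorted(pairs, key=lambda p: p[2])


docs = [
    Doc("a", "How to reset your password in the admin console"),
    Doc("b", "How to reset your password in the admin console quickly"),
    Doc("c", "How to reset your password in the admin console", status="historical"),
    Doc("d", "Quarterly budget planning checklist"),
]

for left, right, dist in refactor_candidates(docs):
    print(f"Consider merging {left} and {right} (distance {dist})")
```

In this sketch the two active password articles are flagged, while the historical copy with identical text is not, because the penalty keeps it outside the threshold. Whether a fixed offset is the right way to separate historical items is an open design choice; a status-aware embedding or a per-status threshold would be alternatives.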
Extended Goal: Dynamic Table of Contents (TOC)
- As documents are created or edited, duplikeit should suggest where they belong in a Table of Contents (TOC).
- The TOC could manifest as:
  - A hierarchical directory structure,
  - A kanban board, or
  - Another organizational schema with a linear or cardinal order (e.g., priority, chronology, conceptual flow).
- This way, an author who starts with “I have something to say or report” gets immediate guidance about where it should live in the larger knowledge structure, reducing drift and improving discoverability.
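One way the TOC suggestion could work is to rank existing sections by how similar a new draft is to the documents each section already holds. The sketch below assumes that approach; the `suggest_section` function, the section paths, and the word-count vectors are hypothetical and only illustrate the idea.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 0.0 if norm == 0 else dot / norm


def suggest_section(new_text: str, toc: dict) -> list:
    """Rank TOC sections by similarity between the draft and each section's
    combined content. `toc` maps a section path to its document texts."""
    new_vec = embed(new_text)
    scored = []
    for section, texts in toc.items():
        combined = Counter()
        for t in texts:
            combined.update(embed(t))  # sum of member vectors (same direction as the centroid)
        scored.append((cosine_similarity(new_vec, combined), section))
    scored.sort(reverse=True)
    return [(section, round(score, 3)) for score, section in scored]


# Hypothetical TOC expressed as a directory-like structure.
toc = {
    "guides/accounts": ["Reset your password", "Change your account email address"],
    "guides/billing": ["Update your billing card", "Download an invoice for your records"],
    "reference/api": ["Authentication tokens for the REST API", "Rate limits for the REST API"],
}

ranking = suggest_section("How to download a past invoice", toc)
print(ranking[0][0])  # best-matching section for the draft
```

The same ranking could feed a kanban column or an ordered list instead of a directory path; only the keys of `toc` would change.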
Example Use Cases
- Trouble tickets: Group or merge overlapping reports while still preserving historical issues for context.
- Medical records: Tune similarity so that “duplicate” means duplication relative to a patient, not just text overlap, while still surfacing related but distinct cases.
- Knowledge bases / policy libraries: Suggest when new articles should replace older ones or link to them, helping maintain a concise SSOT.
- Source code repositories: Treat each file in a directory structure as part of the collection.
  - duplikeit can highlight near-duplicate modules or functions across files.
  - It can suggest refactoring opportunities (e.g., consolidating repeated logic into shared libraries).
  - It can even help guide directory and module structure, ensuring new code finds its proper place in the project’s “TOC” (the directory tree).
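For the source code use case, a rough sketch of near-duplicate function detection might compare the structure of functions rather than their text, so that renamed variables do not hide a copy. The approach below (sequences of AST node types compared with `difflib.SequenceMatcher`) is an assumption for illustration, not duplikeit's method, and the file contents and `min_ratio` value are made up. It handles Python sources only and will produce false positives for very small functions.

```python
import ast
from difflib import SequenceMatcher
from itertools import combinations


def function_shapes(files: dict) -> dict:
    """Map 'path::name' to the list of AST node type names in each function.
    Identifier names are ignored, so renamed copies share the same shape."""
    shapes = {}
    for path, source in files.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                shapes[f"{path}::{node.name}"] = [
                    type(n).__name__ for n in ast.walk(node)
                ]
    return shapes


def near_duplicate_functions(files: dict, min_ratio: float = 0.9) -> list:
    """Return (function, function, ratio) for structurally similar pairs."""
    shapes = function_shapes(files)
    hits = []
    for (a, shape_a), (b, shape_b) in combinations(shapes.items(), 2):
        ratio = SequenceMatcher(None, shape_a, shape_b).ratio()
        if ratio >= min_ratio:
            hits.append((a, b, round(ratio, 2)))
    return hits


# Hypothetical repository contents, keyed by path.
files = {
    "billing/utils.py": (
        "def total(items):\n"
        "    s = 0\n"
        "    for i in items:\n"
        "        s += i.price\n"
        "    return s\n"
    ),
    "orders/helpers.py": (
        "def order_sum(lines):\n"
        "    acc = 0\n"
        "    for line in lines:\n"
        "        acc += line.price\n"
        "    return acc\n"
    ),
    "orders/report.py": (
        "def title(order):\n"
        "    return 'Order ' + str(order.id)\n"
    ),
}

hits = near_duplicate_functions(files)
for a, b, ratio in hits:
    print(f"{a} and {b} look alike (ratio {ratio}); consider a shared helper")
```

Here `total` and `order_sum` differ only in names, so they are reported as a refactoring candidate, while `title` is not.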
What duplikeit is not
- Not a general-purpose vector database: It may use embeddings and vector search internally, but its purpose is curation, refactoring, and guidance, not raw similarity search as a service.
- Not just full-text search: It’s not meant to retrieve documents like a search engine; it’s meant to organize, compress, and guide authorship.
- Not a replacement for domain expertise: It won’t decide what should be merged or refactored without human oversight — it surfaces opportunities and recommendations.
- Not static: Unlike a one-time deduplication tool, duplikeit operates continuously, shaping the evolution of the collection as new content arrives.