StorySphere (敘事領域) is an intelligent novel analysis system built on LLMs and a vector database, providing text summarization, keyword extraction, knowledge graph construction, and character analysis.
- Support for PDF/DOCX novel file loading
- Automatic chapter identification and segmentation (supports Chinese and English chapter titles)
- Text chunk splitting and vector storage
- Qdrant vector database integration
- Summary Generation: Hierarchical summaries (chapter-level, book-level)
- Keyword Extraction: Based on the MultipartiteRank algorithm (a usage sketch follows this feature list)
- Knowledge Graph: Entity relationship extraction and normalization
- Character Analysis: Character attributes and relationship network analysis
- Entity identification and attribute extraction (characters, locations, organizations, etc.)
- Relationship identification and triplet construction
- Entity normalization and deduplication
- Graph visualization
- Character behavior trajectory analysis (in progress)
- Theme and sentiment analysis (in progress)
- Writing style analysis (in progress)
- Hierarchical content aggregation (in progress)
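MultipartiteRank is the one algorithm named above; the project's own wrapper lives in `src/core/nlp/keyword_extractor.py` and is not shown in this README. Purely as a sketch, the algorithm is typically driven through the `pke` library, which provides a reference implementation (the parameter values are `pke`'s documented defaults, not values taken from this repo):

```python
import pke

chapter_text = "Harry walked into the Great Hall..."  # any chapter-sized text

# Illustrative use of pke's MultipartiteRank; the repo's actual wrapper
# is src/core/nlp/keyword_extractor.py.
extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input=chapter_text, language="en")

# Select noun-ish candidates, then weight them on the multipartite graph
extractor.candidate_selection(pos={"NOUN", "PROPN", "ADJ"})
extractor.candidate_weighting(alpha=1.1, threshold=0.74, method="average")

# Top-10 keyphrases as (phrase, score) pairs
keyphrases = extractor.get_n_best(n=10)
```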
storysphere/
├── src/ # Core source code
│ ├── core/ # Core modules
│ │ ├── indexing/ # Vector database operations
│ │ │ └── vector_store.py # Qdrant vector storage
│ │ ├── llm/ # LLM clients
│ │ │ ├── gemini_client.py # Google Gemini
│ │ │ ├── ollama_client.py # Ollama local models
│ │ │ └── openai_client.py # OpenAI API
│ │ ├── nlp/ # NLP tools
│ │ │ ├── llm_operator.py # LLM task wrapper
│ │ │ └── keyword_extractor.py # Keyword extraction
│ │ ├── kg/ # Knowledge graph
│ │ │ ├── loader.py # Data loading
│ │ │ ├── entity_linker.py # Entity linking
│ │ │ ├── graph_builder.py # Graph construction
│ │ │ └── kg_retriever.py # Data retrieval
│ │ ├── validators/ # Data validation
│ │ │ ├── kg_schema_validator.py # KG structure validation
│ │ │ └── nlp_utils_validator.py # NLP output validation
│ │ └── utils/ # Utility functions
│ │ ├── id_generator.py # ID generator
│ │ └── output_extractor.py # Output parsing
│ ├── pipelines/ # Data processing pipelines
│ │ ├── preprocessing/ # Preprocessing
│ │ │ ├── loader.py # Document loading
│ │ │ ├── chapter_splitter.py # Chapter splitting
│ │ │ └── chunk_splitter.py # Text chunking
│ │ ├── feature_extraction/ # Feature extraction
│ │ │ └── run_llm_tasks.py # LLM task execution
│ │ ├── vector_indexing/ # Vector indexing
│ │ │ └── embed_and_store.py # Vectorization & storage
│ │ ├── nlp/ # NLP pipeline
│ │ │ ├── hierarchical_process.py # Hierarchical processing
│ │ │ └── keyword_aggregator.py # Keyword aggregation
│ │ └── kg/ # Knowledge graph pipeline
│ │ ├── canonical_entity_pipeline.py # Entity normalization
│ │ ├── graph_construction_pipeline.py # Graph construction
│ │ └── entity_attribute_extraction_pipeline.py # Attribute extraction
│ └── workflows/ # Workflows
│ ├── indexing/ # Indexing workflows
│ │ └── run_doc_ingestion.py # Document ingestion flow
│ ├── nlp/ # NLP workflows
│ │ ├── generate_hierarchical_summary.py # Hierarchical summary
│ │ └── generate_hierarchical_keywords.py # Hierarchical keywords
│ ├── kg/ # Knowledge graph workflows
│ │ └── run_full_kg_workflow.py # Full KG workflow
│ └── character_analysis/ # Character analysis workflows
│ └── character_analysis.py # Character analysis
├── tests/ # Test code
├── data/ # Data directory
│ ├── novella/ # Novel files
│ ├── kg_storage/ # Knowledge graph data
│ └── art/ # Analysis results
├── config/ # Configuration files
│ └── schema_kg_story.yaml # KG structure definition
├── main_test_*.py # Test scripts
└── requirements.txt # Dependencies
- Python 3.12.9
- Qdrant vector database
- LLM access: a Gemini or OpenAI API key, or a local Ollama model
- Clone the project
git clone <repository-url>
cd storysphere

- Install dependencies

pip install -r requirements.txt

- Configure environment variables

# Copy environment template
cp .env.example .env
# Edit .env file and set the following variables:
GEMINI_API_KEY=your_gemini_api_key
GEMINI_MODEL=gemini-1.5-flash
QDRANT_HOST=localhost
QDRANT_PORT=6333
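How the project actually reads these variables is not shown here; a common approach (an assumption, not something confirmed by this README) is python-dotenv:

```python
import os

from dotenv import load_dotenv  # python-dotenv; assumed, check requirements.txt

load_dotenv()  # read .env into the process environment

GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]  # raises KeyError if unset
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-1.5-flash")
QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
```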
- Start Qdrant database

docker run -p 6333:6333 qdrant/qdrant

from src.workflows.indexing.run_doc_ingestion import run_ingestion_pipeline
# Process novel files and generate vector index
run_ingestion_pipeline(
input_dir="./data/novella",
collection_name="MyNovel",
api_key="your_api_key",
model_name="gemini-1.5-flash",
limit_pages=10 # Limit processing pages (optional)
)

from src.workflows.nlp.generate_hierarchical_summary import (
gen_hierarchical_chapter_summary,
gen_hierarchical_book_summary
)
# Generate chapter-level summaries
gen_hierarchical_chapter_summary(
target_collection_doc_id="doc_uuid",
source_collection="MyNovel"
)
# Generate book-level summary
gen_hierarchical_book_summary(
target_collection_doc_id="doc_uuid",
omni_chapters_collection="chapter_summaries",
omni_books_collection="book_summaries"
)

from src.workflows.kg.run_full_kg_workflow import run_full_kg_workflow
# Execute complete knowledge graph construction workflow
run_full_kg_workflow(
entity_path="./data/kg_storage/kg_entity_set.json",
relation_path="./data/kg_storage/kg_relation_set.json"
)

from src.workflows.character_analysis.character_analysis import (
run_character_analysis_workflow
)
# Analyze specific characters
run_character_analysis_workflow(
target_role=["Harry Potter", "Hermione Granger"],
kg_entity_path="./data/kg_storage/kg_entity_set.json"
)

- Location: src/core/indexing/vector_store.py
- Function: Qdrant vector database operations wrapper
- Main Methods: store_chunk(), search(), get_by_id()
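The exact signatures live in `vector_store.py` and are not reproduced here. As a hypothetical illustration only, a thin wrapper over `qdrant-client` with the same three method names might look like this (everything beyond the method names is assumed):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

class VectorStoreSketch:
    """Hypothetical stand-in for src/core/indexing/vector_store.py."""

    def __init__(self, host: str = "localhost", port: int = 6333,
                 collection: str = "MyNovel"):
        self.client = QdrantClient(host=host, port=port)
        self.collection = collection

    def store_chunk(self, chunk_id: str, vector: list[float], payload: dict) -> None:
        # Upsert one embedded chunk; Qdrant IDs must be UUID strings or unsigned ints
        self.client.upsert(
            collection_name=self.collection,
            points=[PointStruct(id=chunk_id, vector=vector, payload=payload)],
        )

    def search(self, query_vector: list[float], limit: int = 5):
        # Nearest-neighbour search over stored chunk vectors
        return self.client.search(
            collection_name=self.collection,
            query_vector=query_vector,
            limit=limit,
        )

    def get_by_id(self, chunk_id: str):
        # Fetch a stored point (payload included) by its ID
        return self.client.retrieve(
            collection_name=self.collection,
            ids=[chunk_id],
        )
```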

- Location: src/core/nlp/llm_operator.py
- Function: Unified LLM task interface
- Supported Tasks: Summarization, keyword extraction, knowledge graph extraction

- Location: src/pipelines/kg/
- Function: Entity extraction, normalization, graph construction
- Output: Entity and relation data in JSON (an illustrative shape follows)
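The authoritative schema is `config/schema_kg_story.yaml`; the field names below are hypothetical and only convey the general shape of entity and relation records:

```python
# Hypothetical record shapes; the real schema is config/schema_kg_story.yaml.
example_entity = {
    "entity_id": "e_0001",
    "name": "Harry Potter",
    "type": "Person",
    "aliases": ["Harry", "The Boy Who Lived"],
}

# Relations are (subject, predicate, object) triples over entity IDs
example_relation = {
    "subject": "e_0001",
    "predicate": "friend_of",
    "object": "e_0002",
}
```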

- Location: src/pipelines/nlp/hierarchical_process.py
- Function: Multi-level content aggregation (chunk → chapter → book)
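Conceptually this is a two-level map-reduce: chunk texts are merged into chapter summaries, which are merged into a book summary. A minimal sketch, with a stand-in `summarize` callable in place of the project's actual LLM call:

```python
from typing import Callable

def aggregate_book(
    chapters: dict[str, list[str]],   # chapter title -> ordered chunk texts
    summarize: Callable[[str], str],  # stand-in for the project's LLM call
) -> str:
    """Two-level aggregation: chunks -> chapter summaries -> book summary."""
    chapter_summaries = [
        summarize("\n\n".join(chunks)) for chunks in chapters.values()
    ]
    return summarize("\n\n".join(chapter_summaries))
```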
The project provides multiple test scripts for quick functionality verification:
- main_test_run_doc_ingestion.py: Test the document ingestion workflow
- main_test_kg.py: Test knowledge graph construction
- main_test_hierarchical.py: Test hierarchical summaries and keywords
- main_test_entity_retriever.py: Test entity retrieval and character analysis
- PDF: Parsed using PyPDF2/pypdf
- DOCX: Processed using python-docx
- Chapter Recognition: Supports Chinese and English chapter title patterns
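The actual title patterns live in `src/pipelines/preprocessing/chapter_splitter.py`; as a rough sketch of the kind of patterns this implies (hypothetical, not taken from the repo):

```python
import re

# Hypothetical chapter-title patterns; the real ones are in chapter_splitter.py.
CHAPTER_PATTERNS = [
    re.compile(r"^\s*Chapter\s+\d+\b", re.IGNORECASE),         # "Chapter 12"
    re.compile(r"^\s*第[0-9一二三四五六七八九十百千]+[章回節]"),    # "第十二章", "第三回"
]

def looks_like_chapter_title(line: str) -> bool:
    return any(p.match(line) for p in CHAPTER_PATTERNS)
```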
- 📚 Literary Research: Automated text analysis and summarization
- 🎭 Character Studies: Character relationship networks and behavior analysis
- 📖 Content Organization: Structured organization of large novels
- 🔍 Semantic Search: Vector similarity-based content retrieval
- API Quotas: LLM API calls incur costs; monitor your usage
- Memory Usage: Large document processing requires sufficient memory
- Vector Database: Ensure Qdrant service is running properly
- Model Selection: Different LLM models may have varying performance
There are still many features I'd like to implement.
Here are some interesting ones:
- Add inference capabilities for relationship prediction in knowledge graphs
- Character story simulation: let users define hypothetical situations and play out "what if" scenarios using existing character information
- Image generation using text-to-image models
  - Generate character images in a specified style
  - Generate images for plot points or specific passages
  - (though this could be a big undertaking)
- Provide more analysis tools and explanations to help users better understand what they're looking at
💡 Tip: This is a backend analysis framework aimed at developers who need deep text analysis capabilities. If you need a frontend, build your own or integrate an existing UI framework.