An accurate Retrieval-Augmented Generation (RAG) system that analyzes multi-language codebases using Tree-sitter, builds comprehensive knowledge graphs, and enables natural language querying of codebase structure and relationships.
- Multi-Language Support: Python, JavaScript, TypeScript, Rust, Go, Scala, and Java codebases
- Tree-sitter Parsing: Uses Tree-sitter for robust, language-agnostic AST parsing
- Knowledge Graph Storage: Uses Memgraph to store codebase structure as an interconnected graph
- Natural Language Querying: Ask questions about your codebase in plain English
- AI-Powered Cypher Generation: Supports both cloud models (Google Gemini) and local models (Ollama) for natural-language-to-Cypher translation
- Code Snippet Retrieval: Retrieves actual source code snippets for found functions/methods
- File System Operations: Full agentic control over file content creation, reading, and editing
- Shell Command Execution: Can execute terminal commands for tasks like running tests or using CLI tools
- Dependency Analysis: Parses `pyproject.toml` to understand external dependencies
- Nested Function Support: Handles complex nested functions and class hierarchies
- Language-Agnostic Design: Unified graph schema across all supported languages
The system consists of two main components:
- Multi-language Parser: Tree-sitter based parsing system that analyzes codebases and ingests data into Memgraph
- RAG System (`codebase_rag/`): Interactive CLI for querying the stored knowledge graph
- Tree-sitter Integration: Language-agnostic parsing using Tree-sitter grammars
- Graph Database: Memgraph for storing code structure as nodes and relationships
- LLM Integration: Supports Google Gemini (cloud) and Ollama (local) for natural language processing
- Code Analysis: Advanced AST traversal for extracting code elements across languages
- Query Tools: Specialized tools for graph querying and code retrieval
- Language Configuration: Configurable mappings for different programming languages
Prerequisites:
- Python 3.12+
- Docker & Docker Compose (for Memgraph)
- For cloud models: Google Gemini API key
- For local models: Ollama installed and running
- `uv` package manager
- Clone the repository:

```bash
git clone https://github.com/vitali87/code-graph-rag.git
cd code-graph-rag
```

- Install dependencies:

For basic Python support:

```bash
uv sync
```

For full multi-language support:

```bash
uv sync --extra treesitter-full
```

For development (including tests):

```bash
uv sync --extra treesitter-full --extra test
```

This installs Tree-sitter grammars for:
- Python (.py)
- JavaScript (.js, .jsx)
- TypeScript (.ts, .tsx)
- Rust (.rs)
- Go (.go)
- Scala (.scala, .sc)
- Java (.java)
- Set up environment variables:

```bash
cp .env.example .env
# Edit .env with your configuration (see options below)
```

For cloud models (Gemini):

```
# .env file
LLM_PROVIDER=gemini
GEMINI_API_KEY=your_gemini_api_key_here
```

Get your free API key from Google AI Studio.
For local models (Ollama):

```
# .env file
LLM_PROVIDER=local
LOCAL_MODEL_ENDPOINT=http://localhost:11434/v1
LOCAL_ORCHESTRATOR_MODEL_ID=llama3
LOCAL_CYPHER_MODEL_ID=llama3
LOCAL_MODEL_API_KEY=ollama
```

Install and run Ollama:

```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull required models
ollama pull llama3

# Or try other models like:
# ollama pull llama3.1
# ollama pull mistral
# ollama pull codellama

# Ollama will automatically start serving on localhost:11434
```

Note: Local models provide privacy and no API costs, but may have lower accuracy than cloud models like Gemini.
- Start the Memgraph database:

```bash
docker-compose up -d
```

Parse and ingest a multi-language repository into the knowledge graph.

For the first repository (clean start):

```bash
python -m codebase_rag.main start --repo-path /path/to/repo1 --update-graph --clean
```

For additional repositories (preserve existing data):

```bash
python -m codebase_rag.main start --repo-path /path/to/repo2 --update-graph
python -m codebase_rag.main start --repo-path /path/to/repo3 --update-graph
```

Supported Languages: The system automatically detects and processes files based on extensions:
- Python: `.py` files
- JavaScript: `.js`, `.jsx` files
- TypeScript: `.ts`, `.tsx` files
- Rust: `.rs` files
- Go: `.go` files
- Scala: `.scala`, `.sc` files
- Java: `.java` files
Start the interactive RAG CLI:

```bash
python -m codebase_rag.main start --repo-path /path/to/your/repo
```

You can switch between cloud and local models at runtime using CLI arguments.

Use Local Models:

```bash
python -m codebase_rag.main start --repo-path /path/to/your/repo --llm-provider local
```

Use Cloud Models:

```bash
python -m codebase_rag.main start --repo-path /path/to/your/repo --llm-provider gemini
```

Specify Custom Models:

```bash
# Use specific local models
python -m codebase_rag.main start --repo-path /path/to/your/repo \
  --llm-provider local \
  --orchestrator-model llama3.1 \
  --cypher-model codellama

# Use specific Gemini models
python -m codebase_rag.main start --repo-path /path/to/your/repo \
  --llm-provider gemini \
  --orchestrator-model gemini-2.0-flash-thinking-exp-01-21 \
  --cypher-model gemini-2.5-flash-lite-preview-06-17
```

Available CLI Arguments:
- `--llm-provider`: Choose `gemini` or `local`
- `--orchestrator-model`: Specify the model for main RAG orchestration
- `--cypher-model`: Specify the model for Cypher query generation
Example queries (these work across all supported languages):
- "Show me all classes that contain 'user' in their name"
- "Find functions related to database operations"
- "What methods does the User class have?"
- "Show me functions that handle authentication"
- "List all TypeScript components"
- "Find Rust structs and their methods"
- "Show me Go interfaces and implementations"
For programmatic access and integration with other tools, you can export the entire knowledge graph to JSON.

Export during graph update:

```bash
python -m codebase_rag.main start --repo-path /path/to/repo --update-graph --clean -o my_graph.json
```

Export an existing graph without updating:

```bash
python -m codebase_rag.main export -o my_graph.json
```

Working with exported data:
```python
from codebase_rag.graph_loader import load_graph

# Load the exported graph
graph = load_graph("my_graph.json")

# Get summary statistics
summary = graph.summary()
print(f"Total nodes: {summary['total_nodes']}")
print(f"Total relationships: {summary['total_relationships']}")

# Find specific node types
functions = graph.find_nodes_by_label("Function")
classes = graph.find_nodes_by_label("Class")

# Analyze relationships
for func in functions[:5]:
    relationships = graph.get_relationships_for_node(func.node_id)
    print(f"Function {func.properties['name']} has {len(relationships)} relationships")
```

Example analysis script:

```bash
python examples/graph_export_example.py my_graph.json
```

This provides a reliable, programmatic way to access your codebase structure without involving the LLM, which makes it perfect for:
- Integration with other tools
- Custom analysis scripts
- Building documentation generators
- Creating code metrics dashboards
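For example, a minimal metrics sketch built on the loader API shown above (assuming the export is named `my_graph.json`):

```python
from codebase_rag.graph_loader import load_graph

graph = load_graph("my_graph.json")

# Count nodes per label -- a simple input for a metrics dashboard
for label in ["Module", "Class", "Function", "Method"]:
    print(f"{label}: {len(graph.find_nodes_by_label(label))}")

# Rank functions by how connected they are in the graph
functions = graph.find_nodes_by_label("Function")
ranked = sorted(
    functions,
    key=lambda f: len(graph.get_relationships_for_node(f.node_id)),
    reverse=True,
)
for func in ranked[:10]:
    print(func.properties["name"])
```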
The knowledge graph uses the following node types and relationships:
- Project: Root node representing the entire repository
- Package: Language packages (Python: `__init__.py`, etc.)
- Module: Individual source code files (`.py`, `.js`, `.jsx`, `.ts`, `.tsx`, `.rs`, `.go`, `.scala`, `.sc`, `.java`)
- Class: Class/struct/enum definitions across all languages
- Function: Module-level functions and standalone functions
- Method: Class methods and associated functions
- Folder: Regular directories
- File: All files (source code and others)
- ExternalPackage: External dependencies
Tracked AST node types per language:
- Python: `function_definition`, `class_definition`
- JavaScript/TypeScript: `function_declaration`, `arrow_function`, `class_declaration`
- Rust: `function_item`, `struct_item`, `enum_item`, `impl_item`
- Go: `function_declaration`, `method_declaration`, `type_declaration`
- Scala: `function_definition`, `class_definition`, `object_definition`, `trait_definition`
- Java: `method_declaration`, `class_declaration`, `interface_declaration`, `enum_declaration`
Relationships:
- `CONTAINS_PACKAGE/MODULE/FILE/FOLDER`: Hierarchical containment
- `DEFINES`: Module defines classes/functions
- `DEFINES_METHOD`: Class defines methods
- `DEPENDS_ON_EXTERNAL`: Project depends on external packages
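For direct programmatic access to this schema (outside the RAG CLI), a minimal sketch using the `pymgclient` dependency might look like the following; it assumes the default Memgraph setup from `docker-compose.yaml` and the node/relationship names above:

```python
import mgclient

# Connect to the Memgraph instance started via docker-compose
conn = mgclient.connect(host="localhost", port=7687)
conn.autocommit = True
cursor = conn.cursor()

# List each module and the classes it defines, per the schema above
cursor.execute(
    """
    MATCH (m:Module)-[:DEFINES]->(c:Class)
    RETURN m.name AS module, collect(c.name) AS classes
    LIMIT 10
    """
)
for module, classes in cursor.fetchall():
    print(module, classes)
```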
Configuration is managed through environment variables in the `.env` file.

Provider selection:
- `LLM_PROVIDER`: Set to `"gemini"` for cloud models or `"local"` for local models

Gemini (cloud) settings:
- `GEMINI_API_KEY`: Required when `LLM_PROVIDER=gemini`
- `GEMINI_MODEL_ID`: Main model for orchestration (default: `gemini-2.5-pro-preview-06-05`)
- `MODEL_CYPHER_ID`: Model for Cypher generation (default: `gemini-2.5-flash-lite-preview-06-17`)

Local (Ollama) settings:
- `LOCAL_MODEL_ENDPOINT`: Ollama endpoint (default: `http://localhost:11434/v1`)
- `LOCAL_ORCHESTRATOR_MODEL_ID`: Model for main RAG orchestration (default: `llama3`)
- `LOCAL_CYPHER_MODEL_ID`: Model for Cypher query generation (default: `llama3`)
- `LOCAL_MODEL_API_KEY`: API key for local models (default: `ollama`)

Database and repository settings:
- `MEMGRAPH_HOST`: Memgraph hostname (default: `localhost`)
- `MEMGRAPH_PORT`: Memgraph port (default: `7687`)
- `TARGET_REPO_PATH`: Default repository path (default: `.`)
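As a sketch of how these variables are typically consumed (illustrative only, not the project's actual `config.py`), using the `python-dotenv` dependency:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file described above

provider = os.getenv("LLM_PROVIDER", "gemini")
memgraph_host = os.getenv("MEMGRAPH_HOST", "localhost")
memgraph_port = int(os.getenv("MEMGRAPH_PORT", "7687"))
```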
```
code-graph-rag/
├── codebase_rag/              # RAG system package
│   ├── main.py                # CLI entry point
│   ├── config.py              # Configuration management
│   ├── graph_updater.py       # Tree-sitter based multi-language parser
│   ├── language_config.py     # Language-specific configurations
│   ├── prompts.py             # LLM prompts and schemas
│   ├── schemas.py             # Pydantic models
│   ├── services/              # Core services
│   │   └── llm.py             # Gemini LLM integration
│   └── tools/                 # RAG tools
│       ├── codebase_query.py  # Graph querying tool
│       ├── code_retrieval.py  # Code snippet retrieval
│       ├── file_reader.py     # File content reading
│       ├── file_writer.py     # File content creation
│       ├── file_editor.py     # File content editing
│       └── shell_command.py   # Shell command execution
├── docker-compose.yaml        # Memgraph setup
├── pyproject.toml             # Project dependencies & language extras
└── README.md                  # This file
```
- tree-sitter: Core Tree-sitter library for language-agnostic parsing
- tree-sitter-{language}: Language-specific grammars (Python, JS, TS, Rust, Go, Scala, Java)
- pydantic-ai: AI agent framework for RAG orchestration
- pymgclient: Memgraph Python client for graph database operations
- loguru: Advanced logging with structured output
- python-dotenv: Environment variable management
The agent is designed with a deliberate workflow to ensure it acts with context and precision, especially when modifying the file system.
The agent has access to a suite of tools to understand and interact with the codebase:
- `query_codebase_knowledge_graph`: The primary tool for understanding the repository. It queries the graph database to find files, functions, classes, and their relationships based on natural language.
- `get_code_snippet`: Retrieves the exact source code for a specific function or class.
- `read_file_content`: Reads the entire content of a specified file.
- `create_new_file`: Creates a new file with specified content.
- `edit_existing_file`: Overwrites an existing file with new content.
- `execute_shell_command`: Executes a shell command in the project's environment.
To prevent errors and misplaced code, the agent is explicitly instructed to follow a strict workflow before any write or edit operation:
1. Understand Goal: First, it clarifies the user's objective.
2. Query & Explore: It uses the `query_codebase_knowledge_graph` and `read_file_content` tools to explore the codebase. This step is crucial for finding the correct location and understanding the existing architectural patterns for any new code. The agent can also use `execute_shell_command` to run checks or use other CLI tools.
3. Formulate a Plan: Based on its exploration, the agent formulates a plan. It states which file it intends to create or edit and provides a summary of the changes.
4. Execute: Only after this analysis does the agent use the `create_new_file` or `edit_existing_file` tools to execute the plan.
This ensures the agent is a reliable assistant for both analyzing and modifying your codebase.
Important: All file system operations (`create_new_file`, `edit_existing_file`) are strictly sandboxed to the project's root directory. The agent cannot write to or edit files outside of the repository it was tasked to analyze, preventing potential harm from path traversal attacks.
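A minimal sketch of how this kind of sandboxing is typically enforced (illustrative only, not the project's actual implementation; `PROJECT_ROOT` is a placeholder):

```python
from pathlib import Path

PROJECT_ROOT = Path("/path/to/your/repo").resolve()

def resolve_inside_root(user_path: str) -> Path:
    """Resolve a path and reject anything that escapes the project root."""
    candidate = (PROJECT_ROOT / user_path).resolve()
    # resolve() collapses "../" segments, so traversal attempts are caught here
    if not candidate.is_relative_to(PROJECT_ROOT):
        raise PermissionError(f"Path escapes the project root: {candidate}")
    return candidate
```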
| Language | Extensions | Functions | Classes/Structs | Modules | Package Detection |
|---|---|---|---|---|---|
| Python | `.py` | ✅ | ✅ | ✅ | `__init__.py` |
| JavaScript | `.js`, `.jsx` | ✅ | ✅ | ✅ | - |
| TypeScript | `.ts`, `.tsx` | ✅ | ✅ | ✅ | - |
| Rust | `.rs` | ✅ | ✅ (structs/enums) | ✅ | - |
| Go | `.go` | ✅ | ✅ (structs) | ✅ | - |
| Scala | `.scala`, `.sc` | ✅ | ✅ (classes/objects/traits) | ✅ | package declarations |
| Java | `.java` | ✅ | ✅ (classes/interfaces/enums) | ✅ | package declarations |
- Python: Full support including nested functions, methods, classes, and package structure
- JavaScript/TypeScript: Functions, arrow functions, classes, and method definitions
- Rust: Functions, structs, enums, impl blocks, and associated functions
- Go: Functions, methods, type declarations, and struct definitions
- Scala: Functions, methods, classes, objects, traits, case classes, and Scala 3 syntax
- Java: Methods, constructors, classes, interfaces, enums, and annotation types
```bash
# Basic Python-only support
uv sync

# Full multi-language support
uv sync --extra treesitter-full

# Individual language support (if needed)
uv add tree-sitter-python tree-sitter-javascript tree-sitter-typescript tree-sitter-rust tree-sitter-go tree-sitter-scala tree-sitter-java
```

The system uses a configuration-driven approach for language support. Each language is defined in `codebase_rag/language_config.py` with:
- File extensions: Which files to process
- AST node types: How to identify functions, classes, etc.
- Module structure: How modules/packages are organized
- Name extraction: How to extract names from AST nodes
Adding support for new languages requires only configuration changes, no code modifications.
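For illustration, an entry in this style might look like the sketch below; the field names here are hypothetical and do not necessarily match the actual ones in `codebase_rag/language_config.py`:

```python
from dataclasses import dataclass

@dataclass
class LanguageConfig:
    """Hypothetical shape of a per-language configuration entry."""
    name: str
    file_extensions: list[str]
    function_node_types: list[str]  # AST node types that introduce functions
    class_node_types: list[str]     # AST node types that introduce classes/structs

# Example entry mirroring the Rust support described above
RUST = LanguageConfig(
    name="rust",
    file_extensions=[".rs"],
    function_node_types=["function_item"],
    class_node_types=["struct_item", "enum_item"],
)
```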
- Check the Memgraph connection:
  - Ensure Docker containers are running: `docker-compose ps`
  - Verify Memgraph is accessible on port 7687
- View the database in Memgraph Lab:
  - Open http://localhost:3000
  - Connect to memgraph:7687
- For local models:
  - Verify Ollama is running: `ollama list`
  - Check if models are downloaded: `ollama pull llama3`
  - Test the Ollama API: `curl http://localhost:11434/v1/models`
  - Check Ollama logs: `ollama logs`
- Follow the established code structure
- Keep files under 100 lines (project convention)
- Use type annotations
- Follow conventional commit messages
- Use DRY principles
For issues or questions:
- Check the logs for error details
- Verify Memgraph connection
- Ensure all environment variables are set
- Review the graph schema matches your expectations