Athena is an open-source CLI search agent that combines web search, intelligent scraping, semantic retrieval, and local LLM reasoning to answer user questions with up-to-date information from the web. Unlike traditional search engines that return lists of links, Athena reads and understands web content to provide direct, well-sourced answers to your questions.
- Python 3.8+
- Ollama installed and running
- At least one LLM model pulled (e.g., `gemma3:1b` or `llama3:1b`)
- Clone the repository:

  ```
  git clone <repository-url>
  cd athena
  ```
- Install dependencies:

  ```
  pip install -r requirements.txt
  ```
- Pull an LLM model using Ollama:

  ```
  ollama pull gemma3:1b   # or: ollama pull llama3:1b
  ```

- Ensure Ollama is running:

  ```
  ollama serve
  ```
Run the application:

```
python run.py
```

You'll see:

```
Initializing sentencepiece...
Ask Athena:
```
Enter your question when prompted. Athena will:
- Search DuckDuckGo for relevant results
- Scrape and extract content from the top pages
- Find the most relevant information using semantic search
- Generate an answer using the local LLM
- Display the response with source attribution
- Save the query to history
To exit, press Ctrl+C. Your conversation history will be saved automatically.
- Uses DuckDuckGo (the `ddgs` library) to search for the user query
- Returns top results with URLs, titles, and snippets
- For each URL:
  - First attempts static scraping using `requests` with proper headers
  - If static scraping fails (non-200 status, Cloudflare protection, etc.), falls back to dynamic scraping using Selenium in headless mode
- Uses semaphores to limit concurrent requests (10 global, 2 dynamic)
- Implements timeouts to prevent hanging requests
- HTML content is converted to clean text using Trafilatura
- Text is split into overlapping chunks (300 words each)
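The chunking step amounts to sliding a fixed-size word window over the extracted text. A minimal sketch (the 50-word overlap here is an assumption; the source only states 300-word chunks):

```python
# Split text into overlapping word windows for embedding.
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text("word " * 700)  # 700 words of input
```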
- Sentence-transformers (`all-MiniLM-L6-v2`) creates embeddings for all chunks
- A FAISS index is built for efficient similarity search
- The user query is embedded and used to search the FAISS index
- Top 3 most relevant text chunks are retrieved as context
- Context + question + system prompt are formatted for the LLM
- Ollama generates a response using the specified model
- Response is streamed back to the user in real-time
- Sources are deduplicated and displayed (top 5 unique URLs)
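The generation step boils down to assembling a prompt from the retrieved chunks and streaming the model's reply. The prompt wording below is an assumption, not the project's exact system prompt:

```python
# Sketch of prompt assembly for the local LLM.
def build_prompt(chunks: list[str], question: str) -> str:
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(["Paris is the capital of France."],
                      "What is France's capital?")

# Streaming call (requires a running Ollama server):
# import ollama
# for part in ollama.generate(model="gemma3:1b", prompt=prompt, stream=True):
#     print(part["response"], end="", flush=True)
```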
- Each query and its timestamp is stored in `History/history.json`
- History is loaded on startup and saved on exit
- Currently, history is not used in the LLM prompt but is available for future enhancement
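The history mechanism is plain JSON persistence, which could look like this sketch (a temporary directory is used here to keep the demo hermetic; the real path is `History/history.json`):

```python
# Sketch of saving/loading query history as JSON.
import json
import tempfile
import time
from pathlib import Path

def save_history(entries: list[dict], path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(entries, indent=2))

def load_history(path: Path) -> list[dict]:
    return json.loads(path.read_text()) if path.exists() else []

with tempfile.TemporaryDirectory() as tmp:
    hist_path = Path(tmp) / "History" / "history.json"
    save_history([{"query": "what is faiss?", "ts": time.time()}], hist_path)
    restored = load_history(hist_path)
```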
To change the LLM model, modify these files:
- In `agent.py`: change the `model_name` parameter in the `agent()` function call
- In `run.py`: change the model name in the `agent()` call (currently `"gemma3:1b"`)
Adjust these values in the code:
- `max_results` in `seeker/search.py` (default: 10)
- `k` in the `retrieve` function in `run.py` (default: 3 chunks)
- `chunk_size` in `run.py` (default: 300 words)
- Static scrape timeout: 8 seconds
- Dynamic scrape timeout: 15 seconds
- Global concurrency limit: 10 URLs
- Dynamic concurrency limit: 2 URLs (to reduce strain on Selenium)
Key dependencies include:
- `ddgs`: DuckDuckGo search
- `requests` & `selenium`: web scraping
- `trafilatura`: HTML-to-text conversion
- `sentence-transformers` & `torch`: text embeddings
- `faiss-cpu`: vector similarity search
- `ollama`: LLM interface
- `rich`: beautiful terminal output
- `numpy`: numerical operations
See requirements.txt for the complete list.
Modify `seeker/search.py` to use different search APIs (Google, Bing, etc.) while maintaining the same return format.
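Whatever backend you swap in, the key is normalizing its raw output into the same result shape the rest of the pipeline expects. A sketch, where the `title`/`href`/`body` fields mirror `ddgs` output and the raw keys (`name`, `url`, `snippet`) are purely illustrative of a Bing-style response:

```python
# Adapter sketch: map another search API's results to the ddgs-style shape.
def normalize_results(raw: list[dict]) -> list[dict]:
    return [
        {
            "title": r.get("name", ""),    # hypothetical source field names
            "href": r.get("url", ""),
            "body": r.get("snippet", ""),
        }
        for r in raw
    ]

raw = [{"name": "FAISS docs", "url": "https://faiss.ai", "snippet": "A library..."}]
normalized = normalize_results(raw)
```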
Adjust `scrapion/scrape.py` to:
- Add more headers or cookies
- Implement different waiting strategies for dynamic content
- Add proxy support
Change the model in `run.py`:

```
embed_model = SentenceTransformer("your-model-name")
```

Modify `utils/model.py` to work with different LLM APIs (OpenAI, Anthropic, etc.) while keeping the same interface.
Athena is designed for privacy:
- All processing happens locally on your machine
- No data is sent to external services beyond the web search and the page fetches themselves
- LLMs run locally via Ollama
- History is stored only on your local machine
- Scraped content is processed in memory and not persisted
- "No results found"
  - Check your internet connection
  - Try a different query
  - Verify DuckDuckGo is accessible
- "No usable content extracted"
  - The search results may be from sites that block scraping
  - Try a query likely to return text-heavy results (news, Wikipedia, etc.)
- Model loading errors
  - Ensure Ollama is running: `ollama serve`
  - Verify the model is pulled: `ollama list`
  - Check that you have enough RAM/VRAM for the model
- Selenium issues
  - Ensure Chrome/Chromium is installed
  - Try updating `selenium` and `webdriver-manager`
- CUDA/GPU issues with sentence-transformers
  - The package uses the CPU by default
  - For GPU support, install a PyTorch build with CUDA
This project is open source and available under the MIT License.
- Built with Ollama for local LLM inference
- Uses sentence-transformers for embeddings
- Powered by FAISS for vector search
- Scraping powered by requests and Selenium
- Content extraction via trafilatura
- Search via DuckDuckGo (through the `ddgs` library)
- Terminal UI enhanced by Rich
Start exploring the web with Athena - your private, intelligent search agent! 🚀