Aura is a search engine written in Python.
Follow these steps to set up your development environment.
It's recommended to use a virtual environment to manage project dependencies.
python3 -m venv venv
source venv/bin/activate
Install the required Python packages using pip:
pip install -r requirements.txt
Note: You will need to create a requirements.txt file first. You can generate it using:
pip freeze > requirements.txt
NLTK requires specific data packages for tokenization and stemming. Run the following commands:
python3 -c "import nltk; nltk.download('punkt')"
python3 -c "import nltk; nltk.download('punkt_tab')"
Use the run.py script to manage different aspects of the project.
This command starts the web crawler, which will automatically trigger the indexer periodically.
python run.py crawl
Important: If you are running this for the first time or after changing how data is processed (e.g., stemming), delete any old data files before crawling:
rm files/crawled_data.jsonl files/inverted_index.json files/documents.json
This command starts the Flask web server, allowing you to interact with the search engine via a web interface.
python run.py web
Seeds are used as the starting point. Every URL found on a page is appended to a queue. The crawler also makes random jumps after 5 seconds so the results have some variety. It skips all files and only considers simple site data for now. It reads robots.txt to respect each site's rules. All crawled data is saved in a JSONL file.
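The crawl loop described above could be sketched roughly like this. It is a minimal, offline illustration and not the project's actual crawler: the `Frontier` class, the helper names, the list of skipped extensions, and the exact meaning of "random jump" (here: pick a random queued URL once 5 seconds have passed) are all assumptions. The robots.txt check uses the standard library's `urllib.robotparser`.

```python
import random
import time
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical values for this sketch; the project may use different ones.
JUMP_INTERVAL_SECONDS = 5
SKIPPED_EXTENSIONS = (".pdf", ".zip", ".png", ".jpg", ".mp4")


def is_page_url(url):
    """Return True for http(s) URLs that do not look like file downloads."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    return not parsed.path.lower().endswith(SKIPPED_EXTENSIONS)


def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Check a URL against the already-downloaded lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


class Frontier:
    """Queue of URLs to visit, with an occasional random jump."""

    def __init__(self, seeds, clock=time.monotonic):
        self.queue = deque(seeds)
        self.seen = set(seeds)
        self.clock = clock
        self.last_jump = clock()

    def add(self, url):
        """Append a newly discovered URL unless it was seen or is a file."""
        if url not in self.seen and is_page_url(url):
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        """Pop the next URL; pick a random queued one once the interval elapses."""
        if not self.queue:
            return None
        now = self.clock()
        if now - self.last_jump >= JUMP_INTERVAL_SECONDS:
            self.last_jump = now
            index = random.randrange(len(self.queue))
            url = self.queue[index]
            del self.queue[index]
            return url
        return self.queue.popleft()
```

A real crawler would fetch each site's robots.txt over the network and cache the parser per host. Passing the lines in directly keeps this sketch runnable without network access.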
A simple inverted index of words is built, along with icon, title, and description info for each URL.
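As a rough illustration of the indexing step, the sketch below builds a word-to-postings mapping from JSONL records, storing each posting as a `[doc_id, term_frequency]` pair. The record field names (`url`, `title`, `desc`, `icon`, `text`), the `build_index` function, and the whitespace tokenizer are assumptions made for this example; the real indexer tokenizes and stems with NLTK.

```python
import json
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for NLTK tokenizing and stemming."""
    return text.lower().split()


def build_index(jsonl_lines):
    """Build an inverted index and per-document metadata from JSONL records."""
    inverted_index = defaultdict(list)  # word -> list of [doc_id, term_frequency]
    documents = {}  # doc_id -> metadata shown next to each search result
    for doc_id, line in enumerate(jsonl_lines):
        record = json.loads(line)
        documents[str(doc_id)] = {
            "url": record["url"],
            "title": record.get("title", ""),
            "desc": record.get("desc", ""),
            "icon": record.get("icon", ""),
        }
        # Count each word once per document and store its frequency.
        for word, tf in Counter(tokenize(record.get("text", ""))).items():
            inverted_index[word].append([str(doc_id), tf])
    return dict(inverted_index), documents
```

Keeping the metadata in a separate mapping from the postings mirrors the two output files named above (inverted_index.json and documents.json), though their actual layout may differ.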
Simple TF-IDF Search
idf = math.log(total_documents / (1 + num_docs_with_word))
# Go through the list of [doc_id, term_frequency] for the word
for doc_id_str, tf in inverted_index[word]:
    scores[str(doc_id_str)] += tf * idf
Words are stemmed both while crawling and while searching for better results. For example, "throwing" and "throws" both reduce to "throw".
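To put the TF-IDF scoring snippet in context, here is a small self-contained sketch of a full query. The `search` function, its `top_k` parameter, and the choice to skip unknown words are assumptions for illustration. The IDF formula and the `[doc_id, term_frequency]` postings follow the snippet above. Query words are assumed to be stemmed already, the same way as the indexed text.

```python
import math
from collections import defaultdict


def search(query_words, inverted_index, total_documents, top_k=10):
    """Rank documents by summed tf * idf over the query words."""
    scores = defaultdict(float)
    for word in query_words:
        if word not in inverted_index:
            continue  # Words that were never indexed contribute nothing.
        postings = inverted_index[word]
        num_docs_with_word = len(postings)
        idf = math.log(total_documents / (1 + num_docs_with_word))
        # Go through the list of [doc_id, term_frequency] for the word.
        for doc_id_str, tf in postings:
            scores[str(doc_id_str)] += tf * idf
    # Highest score first, limited to the top_k results.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
```

One consequence of the `1 +` in the denominator is that a word appearing in every document gets a slightly negative IDF, so very common words can lower a document's score instead of leaving it unchanged.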
The front end is pretty much vibe-coded, as there isn't much to it right now.
The crawler, the indexer, and everything else are under heavy development, so things can change very rapidly.
I do not have a clear plan. For now, I will improve the page indexing and crawling, and add some structure to the search results, potentially with more info such as images and YouTube links. Machine learning seems like a nice experiment, but only time will tell what we do.