AI-Assisted Web Crawling Platform

Bootcamp Prototype for AI-Powered Company Values Analysis

A prototype platform that uses AI to intelligently crawl company websites, extract their values, and analyze them. Built as a learning project for the Agentic AI Bootcamp.

🎯 What This Does

This platform provides:

  1. 🤖 AI-Powered Web Crawling: Navigate websites intelligently using AI (not just fetching HTML)
  2. 🔍 Two Input Methods:
    • Search-based: Enter a search term (e.g., "Software development consultancy Finland")
    • CSV-based: Upload a CSV file with company URLs
  3. 📊 Value Analysis: Extract and classify company values as:
    • Soft Values: People/culture-oriented (caring, openness, collaboration, etc.)
    • Hard Values: Business/performance-oriented (efficiency, innovation, results, etc.)
  4. 📈 Generate Reports:
    • Individual site reports (detailed analysis per company)
    • Aggregate table (Excel/CSV with all companies)
    • Summary insights and statistics

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher
  • An API key for LLM access (OpenAI, Azure OpenAI, Anthropic, etc.)

Installation

  1. Clone or download this project

  2. Install dependencies:

    pip install -r requirements.txt
  3. Install Playwright browsers (required for web crawling):

    playwright install chromium
  4. Configure API key:

    # Copy the example env file
    cp .env.example .env
    
    # Edit .env and add your API key
    # For OpenAI:
    OPENAI_API_KEY=your_key_here
    
    # Or for Azure OpenAI:
    # AZURE_API_KEY=your_key_here
    # AZURE_API_BASE=https://your-resource.openai.azure.com
    # AZURE_API_VERSION=2024-02-15-preview
  5. Run the application:

    streamlit run app.py
  6. Open your browser to http://localhost:8501

💡 Example Use Case

As per the bootcamp requirements:

Search Term: "Software development consultancy Finland"
Goal: Analyze how companies describe their values
Result: A table where each row is a company with:

  • Company name / website
  • Extracted "values" text
  • Soft values (e.g., caring, openness)
  • Hard values (e.g., efficiency, innovation)

📖 Usage Guide

Option 1: Search-Based Crawling

  1. Open the Search-Based tab
  2. Enter a search term (e.g., "Software development consultancy Finland")
  3. Choose how many results to analyze (default: 5)
  4. Click "Start Search-Based Crawl"
  5. Wait for AI to crawl and analyze
  6. View results and download reports

Option 2: CSV-Based Crawling

  1. Open the CSV-Based tab
  2. Download the sample CSV template (optional)
  3. Prepare your CSV with columns (see the sample below):
    • url (required): Company website URL
    • company (optional): Company name
  4. Upload your CSV file
  5. Click "Start CSV-Based Crawl"
  6. View results and download reports
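
For reference, a minimal companies.csv could look like this (the company value may be left blank for any row):

url,company
https://www.example.com,Example Corp
https://www.example.org,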

Understanding the Results

The platform generates:

  1. Results Table: Shows all companies with:

    • Company name and website
    • List of soft values identified
    • List of hard values identified
    • Value counts
    • Overall orientation (People-Focused / Business-Focused / Balanced)
    • Summary analysis
    • Confidence score
  2. Individual Reports: Detailed markdown reports for each company in ./reports/

  3. Aggregate Reports:

    • Excel file with formatted results
    • CSV file for further analysis
    • Markdown summary with insights

πŸ—οΈ Architecture

Core Components

ai-web-crawler-bootcamp/
├── app.py                  # Streamlit web interface
├── orchestrator.py         # Main pipeline coordinator
├── crawler.py              # AI-powered web crawler
├── analyzer.py             # Values extraction & classification
├── input_handler.py        # Search & CSV input handling
├── report_generator.py     # Report creation
├── config.py               # Configuration management
├── requirements.txt        # Python dependencies
└── .env                    # API keys (create from .env.example)

How It Works

  1. Input Stage:

    • Search Handler: Uses Google search to find company websites
    • CSV Handler: Reads URLs from uploaded CSV
  2. Crawling Stage (The AI Magic):

    • Uses Playwright to render pages like a real browser
    • AI decides which links to follow (values, about us, mission pages); see the sketch after this list
    • Extracts text content while filtering noise
    • Navigates intelligently (not just simple HTML fetching)
  3. Analysis Stage:

    • LLM extracts company name and values section
    • Classifies values into soft (culture) vs hard (business)
    • Generates summary and confidence score
  4. Reporting Stage:

    • Creates individual reports (markdown)
    • Aggregates results into Excel/CSV table
    • Generates summary statistics and insights
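
To make the link-selection step concrete, here is a minimal sketch of the idea. It is not the project's actual crawler.py code; the prompt wording and function name are illustrative assumptions:

import litellm

async def pick_links_to_follow(links: list[str], model: str = "gpt-4-turbo-preview") -> list[str]:
    """Ask the LLM which links likely lead to values/about/mission pages."""
    prompt = (
        "You are crawling a company website to find its stated values.\n"
        "From the links below, return only those (one per line) likely to lead\n"
        "to values, 'about us', culture, or mission pages:\n\n"
        + "\n".join(links)
    )
    # litellm.acompletion is the async variant of litellm.completion
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content or ""
    # Keep only lines that match links actually present on the page,
    # ignoring any extra commentary the model adds
    chosen = {line.strip() for line in answer.splitlines()}
    return [link for link in links if link in chosen]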

🔧 Configuration

Edit .env file to customize:

# LLM Model (using LiteLLM format)
LLM_MODEL=gpt-4-turbo-preview    # Or: gpt-3.5-turbo, claude-3-opus-20240229, etc.

# Crawling behavior
MAX_CRAWL_DEPTH=3                # How deep to crawl
DEFAULT_SEARCH_RESULTS=5         # Default number of search results
CRAWL_TIMEOUT=30                 # Timeout per page (seconds)

# Output
OUTPUT_DIR=./outputs
REPORTS_DIR=./reports
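
A minimal sketch of how config.py might load these settings, assuming it uses python-dotenv (the actual implementation may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4-turbo-preview")
MAX_CRAWL_DEPTH = int(os.getenv("MAX_CRAWL_DEPTH", "3"))
DEFAULT_SEARCH_RESULTS = int(os.getenv("DEFAULT_SEARCH_RESULTS", "5"))
CRAWL_TIMEOUT = int(os.getenv("CRAWL_TIMEOUT", "30"))  # seconds per page
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "./outputs")
REPORTS_DIR = os.getenv("REPORTS_DIR", "./reports")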

🧪 Running Without the Web Interface

You can also run the example directly:

python orchestrator.py

This will run the example use case: "Software development consultancy Finland"

Or use it programmatically:

import asyncio
from orchestrator import CrawlOrchestrator

async def main():
    orchestrator = CrawlOrchestrator()
    
    # Search-based
    results = await orchestrator.run_search_based_crawl(
        search_term="Software companies Helsinki",
        num_results=5
    )
    
    # Or CSV-based
    from pathlib import Path
    results = await orchestrator.run_csv_based_crawl(
        csv_path=Path("companies.csv")
    )
    
    print(f"Analyzed {results['num_companies']} companies")

asyncio.run(main())

🔑 Why LiteLLM?

This project uses LiteLLM (as recommended in the bootcamp) because it:

  • ✅ Provides a unified interface for multiple LLM providers
  • ✅ Makes it easy to switch between OpenAI, Azure, Anthropic, etc.
  • ✅ Handles API differences automatically
  • ✅ Includes built-in retry logic and error handling
  • ✅ Avoids vendor lock-in

Supported providers (just change the model name):

  • OpenAI: gpt-4-turbo-preview, gpt-3.5-turbo
  • Azure OpenAI: azure/gpt-4, azure/gpt-35-turbo
  • Anthropic: claude-3-opus-20240229, claude-3-sonnet-20240229
  • And many more: https://docs.litellm.ai/docs/providers
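
As a minimal illustration, the analysis stage might call LiteLLM like this. The prompt is illustrative, not the project's actual analyzer.py prompt; switching providers only requires changing the model string:

import litellm

def classify_values(values_text: str, model: str = "gpt-4-turbo-preview") -> str:
    """Classify extracted values text into soft vs hard values."""
    response = litellm.completion(
        model=model,  # e.g. "azure/gpt-4" or "claude-3-opus-20240229" to switch providers
        messages=[
            {
                "role": "system",
                "content": "Classify each company value as soft (people/culture) "
                           "or hard (business/performance). Return one value per line.",
            },
            {"role": "user", "content": values_text},
        ],
    )
    return response.choices[0].message.content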

📊 Output Examples

Results Table (Excel/CSV)

Company      | Website     | Soft Values             | Hard Values            | Orientation | Summary
Example Corp | example.com | Caring, Openness, Trust | Innovation, Efficiency | Balanced    | Emphasizes both culture and performance

Individual Report (Markdown)

See ./reports/ directory for detailed per-company analyses including:

  • Full values text extracted
  • Categorized soft/hard values
  • Crawl statistics
  • Confidence scoring
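
For further analysis, the aggregate CSV can be loaded with pandas. The file name below is an assumption; check your OUTPUT_DIR for the actual name:

import pandas as pd

# Load the aggregate results table written by the reporting stage
df = pd.read_csv("outputs/results.csv")  # adjust path/name to your run

# Example: how many companies lean people-focused vs business-focused?
print(df["Orientation"].value_counts())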

πŸ› Troubleshooting

"No API key found"

  • Make sure you created .env from .env.example
  • Add your API key to .env
  • Restart the application

"Playwright browser not found"

  • Run: playwright install chromium

"Search results are blocked"

  • Google may rate-limit searches
  • Use CSV-based method instead
  • Add delays between searches (already implemented)

"Low confidence scores"

  • Some companies don't have clear values sections
  • AI does its best to infer from available content
  • Check individual reports for details

πŸ“ Project Structure

This is a bootcamp prototype demonstrating:

  • ✅ AI-powered web navigation (not just HTML parsing)
  • ✅ Two input methods (search + CSV)
  • ✅ Intelligent value extraction and classification
  • ✅ Comprehensive reporting
  • ✅ Production-style code structure
  • ✅ Proper error handling and logging

Not included (would be needed for production):

  • Advanced rate limiting
  • Distributed crawling
  • Database storage
  • API endpoints
  • Authentication
  • Monitoring/alerting

🎓 Learning Outcomes

This bootcamp project teaches:

  1. Agentic AI: AI making decisions about navigation
  2. Web Automation: Using Playwright for browser control
  3. LLM Integration: Using LiteLLM for flexible AI access
  4. Pipeline Design: Orchestrating complex multi-step workflows
  5. Report Generation: Creating useful outputs from AI analysis

📄 License

MIT License - This is a bootcamp learning project.

🤝 Contributing

This is a learning project, but feel free to:

  • Report issues
  • Suggest improvements
  • Fork and experiment
  • Share your bootcamp results!

📧 Support

For bootcamp participants:

  • Check the bootcamp Slack channel
  • Review the RFI document
  • Consult bootcamp slides

Built with: Python, Streamlit, Playwright, LiteLLM, BeautifulSoup, Pandas

Purpose: Agentic AI Bootcamp - Learning Project

Status: ✅ Prototype Complete - Ready for Demo
