AI-Assisted Web Crawling Platform

Bootcamp Prototype for AI-Powered Company Values Analysis

A prototype platform that uses AI to intelligently crawl company websites, extract their values, and analyze them. Built as a learning project for the Agentic AI Bootcamp.

🎯 What This Does

This platform provides:

  1. 🤖 AI-Powered Web Crawling: Navigate websites intelligently using AI (not just fetching HTML)
  2. 🔍 Two Input Methods:
    • Search-based: Enter a search term (e.g., "Software development consultancy Finland")
    • CSV-based: Upload a CSV file with company URLs
  3. 📊 Value Analysis: Extract and classify company values as:
    • Soft Values: People/culture-oriented (caring, openness, collaboration, etc.)
    • Hard Values: Business/performance-oriented (efficiency, innovation, results, etc.)
  4. 📈 Generate Reports:
    • Individual site reports (detailed analysis per company)
    • Aggregate table (Excel/CSV with all companies)
    • Summary insights and statistics

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher
  • An API key for LLM access (OpenAI, Azure OpenAI, Anthropic, etc.)

Installation

  1. Clone or download this project

  2. Install dependencies:

    pip install -r requirements.txt
  3. Install Playwright browsers (required for web crawling):

    playwright install chromium
  4. Configure API key:

    # Copy the example env file
    cp .env.example .env
    
    # Edit .env and add your API key
    # For OpenAI:
    OPENAI_API_KEY=your_key_here
    
    # Or for Azure OpenAI:
    # AZURE_API_KEY=your_key_here
    # AZURE_API_BASE=https://your-resource.openai.azure.com
    # AZURE_API_VERSION=2024-02-15-preview
  5. Run the application:

    streamlit run app.py
  6. Open your browser to http://localhost:8501

💡 Example Use Case

As per the bootcamp requirements:

Search Term: "Software development consultancy Finland"
Goal: Analyze how companies describe their values
Result: A table where each row is a company with:

  • Company name / website
  • Extracted "values" text
  • Soft values (e.g., caring, openness)
  • Hard values (e.g., efficiency, innovation)

📖 Usage Guide

Option 1: Search-Based Crawling

  1. Open the Search-Based tab
  2. Enter a search term (e.g., "Software development consultancy Finland")
  3. Choose how many results to analyze (default: 5)
  4. Click "Start Search-Based Crawl"
  5. Wait for AI to crawl and analyze
  6. View results and download reports

Option 2: CSV-Based Crawling

  1. Open the CSV-Based tab
  2. Download the sample CSV template (optional)
  3. Prepare your CSV with columns (see the sample below):
    • url (required): Company website URL
    • company (optional): Company name
  4. Upload your CSV file
  5. Click "Start CSV-Based Crawl"
  6. View results and download reports
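
For reference, a minimal companies.csv could look like this (the company value may be left blank for any row):

url,company
https://www.example.com,Example Corp
https://www.example.org,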

Understanding the Results

The platform generates:

  1. Results Table: Shows all companies with:

    • Company name and website
    • List of soft values identified
    • List of hard values identified
    • Value counts
    • Overall orientation (People-Focused / Business-Focused / Balanced)
    • Summary analysis
    • Confidence score
  2. Individual Reports: Detailed markdown reports for each company in ./reports/

  3. Aggregate Reports:

    • Excel file with formatted results
    • CSV file for further analysis
    • Markdown summary with insights

πŸ—οΈ Architecture

Core Components

ai-web-crawler-bootcamp/
├── app.py                  # Streamlit web interface
├── orchestrator.py         # Main pipeline coordinator
├── crawler.py              # AI-powered web crawler
├── analyzer.py             # Values extraction & classification
├── input_handler.py        # Search & CSV input handling
├── report_generator.py     # Report creation
├── config.py               # Configuration management
├── requirements.txt        # Python dependencies
└── .env                    # API keys (create from .env.example)

How It Works

  1. Input Stage:

    • Search Handler: Uses Google search to find company websites
    • CSV Handler: Reads URLs from uploaded CSV
  2. Crawling Stage (The AI Magic):

    • Uses Playwright to render pages like a real browser
    • AI decides which links to follow (values, about us, mission pages); see the sketch after this list
    • Extracts text content while filtering noise
    • Navigates intelligently (not just simple HTML fetching)
  3. Analysis Stage:

    • LLM extracts company name and values section
    • Classifies values into soft (culture) vs hard (business)
    • Generates summary and confidence score
  4. Reporting Stage:

    • Creates individual reports (markdown)
    • Aggregates results into Excel/CSV table
    • Generates summary statistics and insights
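
To make the link-selection step concrete, here is a minimal sketch of the idea. It is not the project's actual crawler.py code; the prompt wording and function name are illustrative assumptions:

import litellm

async def pick_links_to_follow(links: list[str], model: str = "gpt-4-turbo-preview") -> list[str]:
    """Ask the LLM which links likely lead to values/about/mission pages."""
    prompt = (
        "You are crawling a company website to find its stated values.\n"
        "From the links below, return only those (one per line) likely to lead\n"
        "to values, 'about us', culture, or mission pages:\n\n"
        + "\n".join(links)
    )
    # litellm.acompletion is the async variant of litellm.completion
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content or ""
    # Keep only lines that match links actually present on the page,
    # ignoring any extra commentary the model adds
    chosen = {line.strip() for line in answer.splitlines()}
    return [link for link in links if link in chosen]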

🔧 Configuration

Edit .env file to customize:

# LLM Model (using LiteLLM format)
LLM_MODEL=gpt-4-turbo-preview    # Or: gpt-3.5-turbo, claude-3-opus-20240229, etc.

# Crawling behavior
MAX_CRAWL_DEPTH=3                # How deep to crawl
DEFAULT_SEARCH_RESULTS=5         # Default number of search results
CRAWL_TIMEOUT=30                 # Timeout per page (seconds)

# Output
OUTPUT_DIR=./outputs
REPORTS_DIR=./reports
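
A minimal sketch of how config.py might load these settings, assuming it uses python-dotenv (the actual implementation may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4-turbo-preview")
MAX_CRAWL_DEPTH = int(os.getenv("MAX_CRAWL_DEPTH", "3"))
DEFAULT_SEARCH_RESULTS = int(os.getenv("DEFAULT_SEARCH_RESULTS", "5"))
CRAWL_TIMEOUT = int(os.getenv("CRAWL_TIMEOUT", "30"))  # seconds per page
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "./outputs")
REPORTS_DIR = os.getenv("REPORTS_DIR", "./reports")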

🧪 Running Without the Web Interface

You can also run the example directly:

python orchestrator.py

This will run the example use case: "Software development consultancy Finland"

Or use it programmatically:

import asyncio
from orchestrator import CrawlOrchestrator

async def main():
    orchestrator = CrawlOrchestrator()
    
    # Search-based
    results = await orchestrator.run_search_based_crawl(
        search_term="Software companies Helsinki",
        num_results=5
    )
    
    # Or CSV-based
    from pathlib import Path
    results = await orchestrator.run_csv_based_crawl(
        csv_path=Path("companies.csv")
    )
    
    print(f"Analyzed {results['num_companies']} companies")

asyncio.run(main())

🔑 Why LiteLLM?

This project uses LiteLLM (as recommended in the bootcamp) because it:

  • ✅ Provides a unified interface for multiple LLM providers
  • ✅ Makes it easy to switch between OpenAI, Azure, Anthropic, etc.
  • ✅ Handles API differences automatically
  • ✅ Includes built-in retry logic and error handling
  • ✅ Avoids vendor lock-in

Supported providers (just change the model name):

  • OpenAI: gpt-4-turbo-preview, gpt-3.5-turbo
  • Azure OpenAI: azure/gpt-4, azure/gpt-35-turbo
  • Anthropic: claude-3-opus-20240229, claude-3-sonnet-20240229
  • And many more: https://docs.litellm.ai/docs/providers
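
As a minimal illustration, the analysis stage might call LiteLLM like this. The prompt is illustrative, not the project's actual analyzer.py prompt; switching providers only requires changing the model string:

import litellm

def classify_values(values_text: str, model: str = "gpt-4-turbo-preview") -> str:
    """Classify extracted values text into soft vs hard values."""
    response = litellm.completion(
        model=model,  # e.g. "azure/gpt-4" or "claude-3-opus-20240229" to switch providers
        messages=[
            {
                "role": "system",
                "content": "Classify each company value as soft (people/culture) "
                           "or hard (business/performance). Return one value per line.",
            },
            {"role": "user", "content": values_text},
        ],
    )
    return response.choices[0].message.content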

📊 Output Examples

Results Table (Excel/CSV)

Company      | Website     | Soft Values             | Hard Values            | Orientation | Summary
Example Corp | example.com | Caring, Openness, Trust | Innovation, Efficiency | Balanced    | Emphasizes both culture and performance

Individual Report (Markdown)

See ./reports/ directory for detailed per-company analyses including:

  • Full values text extracted
  • Categorized soft/hard values
  • Crawl statistics
  • Confidence scoring
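
For further analysis, the aggregate CSV can be loaded with pandas. The file name below is an assumption; check your OUTPUT_DIR for the actual name:

import pandas as pd

# Load the aggregate results table written by the reporting stage
df = pd.read_csv("outputs/results.csv")  # adjust path/name to your run

# Example: how many companies lean people-focused vs business-focused?
print(df["Orientation"].value_counts())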

πŸ› Troubleshooting

"No API key found"

  • Make sure you created .env from .env.example
  • Add your API key to .env
  • Restart the application

"Playwright browser not found"

  • Run: playwright install chromium

"Search results are blocked"

  • Google may rate-limit searches
  • Use CSV-based method instead
  • Add delays between searches (already implemented)

"Low confidence scores"

  • Some companies don't have clear values sections
  • AI does its best to infer from available content
  • Check individual reports for details

πŸ“ Project Structure

This is a bootcamp prototype demonstrating:

  • ✅ AI-powered web navigation (not just HTML parsing)
  • ✅ Two input methods (search + CSV)
  • ✅ Intelligent value extraction and classification
  • ✅ Comprehensive reporting
  • ✅ Production-style code structure
  • ✅ Proper error handling and logging

Not included (would be needed for production):

  • Advanced rate limiting
  • Distributed crawling
  • Database storage
  • API endpoints
  • Authentication
  • Monitoring/alerting

🎓 Learning Outcomes

This bootcamp project teaches:

  1. Agentic AI: AI making decisions about navigation
  2. Web Automation: Using Playwright for browser control
  3. LLM Integration: Using LiteLLM for flexible AI access
  4. Pipeline Design: Orchestrating complex multi-step workflows
  5. Report Generation: Creating useful outputs from AI analysis

📄 License

MIT License - This is a bootcamp learning project.

🤝 Contributing

This is a learning project, but feel free to:

  • Report issues
  • Suggest improvements
  • Fork and experiment
  • Share your bootcamp results!

📧 Support

For bootcamp participants:

  • Check the bootcamp Slack channel
  • Review the RFI document
  • Consult bootcamp slides

Built with: Python, Streamlit, Playwright, LiteLLM, BeautifulSoup, Pandas

Purpose: Agentic AI Bootcamp - Learning Project

Status: ✅ Prototype Complete - Ready for Demo
