Bootcamp Prototype for AI-Powered Company Values Analysis
A prototype platform that uses AI to intelligently crawl company websites, extract their values, and analyze them. Built as a learning project for the Agentic AI Bootcamp.
This platform can:
- **AI-Powered Web Crawling**: Navigate websites intelligently using AI (not just fetching HTML)
- **Two Input Methods**:
  - Search-based: Enter a search term (e.g., "Software development consultancy Finland")
  - CSV-based: Upload a CSV file with company URLs
- **Value Analysis**: Extract and classify company values as:
  - Soft Values: People/culture-oriented (caring, openness, collaboration, etc.)
  - Hard Values: Business/performance-oriented (efficiency, innovation, results, etc.)
- **Generate Reports**:
  - Individual site reports (detailed analysis per company)
  - Aggregate table (Excel/CSV with all companies)
  - Summary insights and statistics
Prerequisites:
- Python 3.9 or higher
- An API key for LLM access (OpenAI, Azure OpenAI, Anthropic, etc.)
1. Clone or download this project

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install Playwright browsers (required for web crawling):

   ```bash
   playwright install chromium
   ```

4. Configure your API key:

   ```bash
   # Copy the example env file
   cp .env.example .env
   # Edit .env and add your API key
   # For OpenAI:
   # OPENAI_API_KEY=your_key_here
   # Or for Azure OpenAI:
   # AZURE_API_KEY=your_key_here
   # AZURE_API_BASE=https://your-resource.openai.azure.com
   # AZURE_API_VERSION=2024-02-15-preview
   ```

5. Run the application:

   ```bash
   streamlit run app.py
   ```

6. Open your browser to http://localhost:8501
As per the bootcamp requirements:
- **Search Term**: "Software development consultancy finland"
- **Goal**: Analyze how companies describe their values
- **Result**: A table where each row is a company with:
- Company name / website
- Extracted "values" text
- Soft values (e.g., caring, openness)
- Hard values (e.g., efficiency, innovation)
- Open the Search-Based tab
- Enter a search term (e.g., "Software development consultancy finland")
- Choose how many results to analyze (default: 5)
- Click "Start Search-Based Crawl"
- Wait for AI to crawl and analyze
- View results and download reports
- Open the CSV-Based tab
- Download the sample CSV template (optional)
- Prepare your CSV with columns:
  - `url` (required): Company website URL
  - `company` (optional): Company name
- Upload your CSV file
- Click "Start CSV-Based Crawl"
- View results and download reports
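Before uploading, the CSV can be sanity-checked for the required column. A minimal sketch using Python's standard `csv` module; the helper name `validate_company_csv` and the fallback rule are illustrative assumptions, not part of the project code:

```python
import csv
import io

def validate_company_csv(text: str) -> list[dict]:
    """Parse CSV text and ensure the required 'url' column is present."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or "url" not in reader.fieldnames:
        raise ValueError("CSV must contain a 'url' column")
    rows = []
    for row in reader:
        # 'company' is optional; fall back to the URL when it is blank
        rows.append({"url": row["url"], "company": row.get("company") or row["url"]})
    return rows

sample = "url,company\nhttps://example.com,Example Corp\nhttps://foo.fi,\n"
print(validate_company_csv(sample))
```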
The platform generates:
- **Results Table**: Shows all companies with:
  - Company name and website
  - List of soft values identified
  - List of hard values identified
  - Value counts
  - Overall orientation (People-Focused / Business-Focused / Balanced)
  - Summary analysis
  - Confidence score
- **Individual Reports**: Detailed markdown reports for each company in `./reports/`
- **Aggregate Reports**:
  - Excel file with formatted results
  - CSV file for further analysis
  - Markdown summary with insights
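The orientation column can be derived directly from the value counts. A minimal sketch of one plausible rule; the function name and the simple count comparison are assumptions, not the project's actual logic:

```python
def orientation(soft_values: list[str], hard_values: list[str]) -> str:
    """Classify a company by comparing soft vs hard value counts."""
    soft, hard = len(soft_values), len(hard_values)
    if soft > hard:
        return "People-Focused"
    if hard > soft:
        return "Business-Focused"
    return "Balanced"

print(orientation(["caring", "openness", "trust"], ["innovation", "efficiency"]))
```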
```
ai-web-crawler-bootcamp/
├── app.py               # Streamlit web interface
├── orchestrator.py      # Main pipeline coordinator
├── crawler.py           # AI-powered web crawler
├── analyzer.py          # Values extraction & classification
├── input_handler.py     # Search & CSV input handling
├── report_generator.py  # Report creation
├── config.py            # Configuration management
├── requirements.txt     # Python dependencies
└── .env                 # API keys (create from .env.example)
```
1. **Input Stage**:
   - Search Handler: Uses Google search to find company websites
   - CSV Handler: Reads URLs from uploaded CSV
2. **Crawling Stage** (the AI magic):
   - Uses Playwright to render pages like a real browser
   - AI decides which links to follow (values, about us, mission pages)
   - Extracts text content while filtering noise
   - Navigates intelligently (not just simple HTML fetching)
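A cheap keyword pre-filter can complement the AI's link decisions, so the LLM only ranks promising candidates. A sketch of such a heuristic; the keyword list and function name are illustrative assumptions:

```python
VALUE_KEYWORDS = ("values", "about", "mission", "culture", "who-we-are")

def score_link(href: str, text: str) -> int:
    """Score a link by how many values-related keywords appear in its URL or anchor text."""
    haystack = f"{href} {text}".lower()
    return sum(kw in haystack for kw in VALUE_KEYWORDS)

links = [
    ("/careers", "Careers"),
    ("/about/our-values", "Our Values"),
    ("/products", "Products"),
]
best = max(links, key=lambda link: score_link(*link))
print(best)  # ('/about/our-values', 'Our Values')
```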
3. **Analysis Stage**:
   - LLM extracts company name and values section
   - Classifies values into soft (culture) vs hard (business)
   - Generates summary and confidence score
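The soft/hard split is done by the LLM, but a tiny deterministic version illustrates the target output shape. The keyword sets below are illustrative assumptions, not the project's classification logic:

```python
SOFT = {"caring", "openness", "collaboration", "trust", "respect"}
HARD = {"efficiency", "innovation", "results", "quality", "growth"}

def classify_values(values: list[str]) -> dict[str, list[str]]:
    """Bucket extracted value words into soft (culture) vs hard (business)."""
    out = {"soft": [], "hard": [], "unknown": []}
    for v in values:
        key = v.lower()
        bucket = "soft" if key in SOFT else "hard" if key in HARD else "unknown"
        out[bucket].append(v)
    return out

print(classify_values(["Openness", "Innovation", "Sisu"]))
```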
4. **Reporting Stage**:
   - Creates individual reports (markdown)
   - Aggregates results into Excel/CSV table
   - Generates summary statistics and insights
Edit the `.env` file to customize:

```bash
# LLM Model (using LiteLLM format)
LLM_MODEL=gpt-4-turbo-preview  # Or: gpt-3.5-turbo, claude-3-opus-20240229, etc.

# Crawling behavior
MAX_CRAWL_DEPTH=3              # How deep to crawl
DEFAULT_SEARCH_RESULTS=5       # Default number of search results
CRAWL_TIMEOUT=30               # Timeout per page (seconds)

# Output
OUTPUT_DIR=./outputs
REPORTS_DIR=./reports
```

You can also run the example directly:

```bash
python orchestrator.py
```

This will run the example use case: "Software development consultancy finland"
Or use it programmatically:

```python
import asyncio
from pathlib import Path

from orchestrator import CrawlOrchestrator

async def main():
    orchestrator = CrawlOrchestrator()

    # Search-based
    results = await orchestrator.run_search_based_crawl(
        search_term="Software companies Helsinki",
        num_results=5,
    )

    # Or CSV-based
    results = await orchestrator.run_csv_based_crawl(
        csv_path=Path("companies.csv")
    )

    print(f"Analyzed {results['num_companies']} companies")

asyncio.run(main())
```

This project uses LiteLLM as recommended because it:
- ✅ Provides a unified interface for multiple LLM providers
- ✅ Makes it easy to switch between OpenAI, Azure, Anthropic, etc.
- ✅ Handles API differences automatically
- ✅ Has built-in retry logic and error handling
- ✅ Avoids vendor lock-in
Supported providers (just change the model name):
- OpenAI: `gpt-4-turbo-preview`, `gpt-3.5-turbo`
- Azure OpenAI: `azure/gpt-4`, `azure/gpt-35-turbo`
- Anthropic: `claude-3-opus-20240229`, `claude-3-sonnet-20240229`
- And many more: https://docs.litellm.ai/docs/providers
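Switching providers really is just a model-name change. A sketch of how such a call might look; the prompt wording is illustrative, and to keep this runnable without an API key only the message construction is executed (the actual `litellm.completion` call is shown in a comment):

```python
import os

def build_values_messages(page_text: str) -> list[dict]:
    """Build the chat messages for a values-extraction call."""
    return [
        {"role": "system", "content": "Extract company values and label each as soft or hard."},
        {"role": "user", "content": page_text},
    ]

model = os.getenv("LLM_MODEL", "gpt-4-turbo-preview")
messages = build_values_messages("We believe in openness and efficiency.")

# With a key configured, the call would be:
# from litellm import completion
# response = completion(model=model, messages=messages)

print(model, len(messages))
```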
Example aggregate table:

| Company | Website | Soft Values | Hard Values | Orientation | Summary |
|---|---|---|---|---|---|
| Example Corp | example.com | Caring, Openness, Trust | Innovation, Efficiency | Balanced | Emphasizes both culture and performance |
See the `./reports/` directory for detailed per-company analyses including:
- Full values text extracted
- Categorized soft/hard values
- Crawl statistics
- Confidence scoring
"No API key found"
- Make sure you created
.envfrom.env.example - Add your API key to
.env - Restart the application
"Playwright browser not found"
- Run:
playwright install chromium
"Search results are blocked"
- Google may rate-limit searches
- Use CSV-based method instead
- Add delays between searches (already implemented)
"Low confidence scores"
- Some companies don't have clear values sections
- AI does its best to infer from available content
- Check individual reports for details
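For the rate-limiting issue, the delay between searches can be as simple as exponential backoff with jitter. A sketch under that assumption; the function name and constants are illustrative, not the project's actual implementation:

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter: ~2s, ~4s, ~8s... capped at 60s."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, 1)  # jitter avoids synchronized retries

for attempt in range(3):
    print(f"attempt {attempt}: wait ~{backoff_delay(attempt):.1f}s")
```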
This is a bootcamp prototype demonstrating:
- ✅ AI-powered web navigation (not just HTML parsing)
- ✅ Two input methods (search + CSV)
- ✅ Intelligent value extraction and classification
- ✅ Comprehensive reporting
- ✅ Production-ready code structure
- ✅ Proper error handling and logging
Not included (would be needed for production):
- Advanced rate limiting
- Distributed crawling
- Database storage
- API endpoints
- Authentication
- Monitoring/alerting
This bootcamp project teaches:
- Agentic AI: AI making decisions about navigation
- Web Automation: Using Playwright for browser control
- LLM Integration: Using LiteLLM for flexible AI access
- Pipeline Design: Orchestrating complex multi-step workflows
- Report Generation: Creating useful outputs from AI analysis
MIT License - This is a bootcamp learning project.
This is a learning project, but feel free to:
- Report issues
- Suggest improvements
- Fork and experiment
- Share your bootcamp results!
For bootcamp participants:
- Check the bootcamp Slack channel
- Review the RFI document
- Consult bootcamp slides
**Built with**: Python, Streamlit, Playwright, LiteLLM, BeautifulSoup, Pandas
**Purpose**: Agentic AI Bootcamp - Learning Project
**Status**: ✅ Prototype Complete - Ready for Demo