AI Web Scraper API

A powerful Flask-based web scraper API that can extract content from multiple URLs concurrently and process it using Google's Gemini AI.

Features

  • Concurrent Processing: Scrapes up to 4 URLs simultaneously per batch
  • AI-Powered Content Extraction: Uses Google Gemini to extract the useful content from each page and, optionally, summarize it
  • Raw HTML Support: Option to return raw HTML instead of processed content
  • API Key Protection: Secure your API with custom API keys
  • Docker Ready: Fully containerized for easy deployment
  • Error Handling: Comprehensive error handling with detailed response codes
  • Health Check: Built-in health monitoring endpoint

API Endpoints

POST /scrape

Main scraping endpoint that processes URLs and returns extracted content.

Headers:

  • X-API-Key: Your API key for authentication
  • Content-Type: application/json

Request Body:

{
  "links": ["https://example.com", "https://another-site.com"],
  "raw": false,
  "summarize": false
}

Parameters:

  • links (array, required): Array of URLs to scrape
  • raw (boolean, optional, default: false): Return raw HTML instead of AI-processed content
  • summarize (boolean, optional, default: false): Summarize the extracted content (applies only when raw is false)

Response:

{
  "results": [
    {
      "link": "https://example.com",
      "content": "Processed content here...",
      "status": "success"
    },
    {
      "link": "https://failed-site.com",
      "status": "error",
      "error": "HTTP 404",
      "status_code": 404
    }
  ],
  "total_processed": 2,
  "successful": 1,
  "failed": 1
}
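
For programmatic access, the same request can be made from Python. A minimal client sketch, assuming the server is running on localhost:5000 and the requests library is installed:

import requests

API_URL = "http://localhost:5000/scrape"
API_KEY = "your-secret-api-key"  # must match the server's API_KEY

payload = {
    "links": ["https://example.com", "https://another-site.com"],
    "raw": False,
    "summarize": True,
}

resp = requests.post(
    API_URL,
    json=payload,  # sets Content-Type: application/json automatically
    headers={"X-API-Key": API_KEY},
    timeout=120,
)
resp.raise_for_status()

for result in resp.json()["results"]:
    if result["status"] == "success":
        print(result["link"], "->", result["content"][:80])
    else:
        print(result["link"], "failed:", result.get("error"))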

GET /health

Health check endpoint to monitor API status.

Response:

{
  "status": "healthy",
  "timestamp": 1693958400.123,
  "gemini_configured": true
}
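
The endpoint can be polled directly; assuming health checks are unauthenticated (only scraping operations require the API key, per the security notes below):

curl http://localhost:5000/health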

Setup and Installation

Option 1: Docker (Recommended)

  1. Clone the repository:
git clone <repository-url>
cd ai-web-scraper
  2. Create the environment file:
cp .env.example .env
# Edit .env with your API keys
  3. Build and run with Docker:
docker-compose up --build
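
For reference, a compose setup along these lines exposes port 5000 and reads the keys from .env. The repository ships its own docker-compose.yml; this is only an illustrative sketch:

version: "3.8"
services:
  web:
    build: .
    ports:
      - "5000:5000"
    env_file:
      - .env
    restart: unless-stopped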

Option 2: Local Development

  1. Clone the repository:
git clone <repository-url>
cd ai-web-scraper
  2. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set environment variables:
export API_KEY="your-secret-api-key"
export GEMINI_API_KEY="your-gemini-api-key"
  5. Run the application:
python app.py

Configuration

Environment Variables

  • API_KEY: Your custom API key for protecting the endpoint
  • GEMINI_API_KEY: Google Gemini API key (get from Google AI Studio)
  • FLASK_ENV: Flask environment (development/production)
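
A matching .env file (mirroring .env.example) would therefore contain something like:

API_KEY=your-secret-api-key
GEMINI_API_KEY=your-gemini-api-key
FLASK_ENV=production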

Getting Gemini API Key

  1. Go to Google AI Studio (https://aistudio.google.com)
  2. Sign in with your Google account
  3. Create a new API key
  4. Copy the key to your .env file
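
To confirm the key works before wiring it into the scraper, a quick smoke test with the google-generativeai client (assuming that is the Gemini library in use; the model name is illustrative):

import os
import google.generativeai as genai

# read the key from the environment, as the app does
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model name
print(model.generate_content("Say hello").text)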

Usage Examples

Basic scraping (processed content):

curl -X POST http://localhost:5000/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-api-key" \
  -d '{
    "links": ["https://example.com"]
  }'

Raw HTML extraction:

curl -X POST http://localhost:5000/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-api-key" \
  -d '{
    "links": ["https://example.com"],
    "raw": true
  }'

Summarized content:

curl -X POST http://localhost:5000/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-api-key" \
  -d '{
    "links": ["https://example.com"],
    "raw": false,
    "summarize": true
  }'

Multiple URLs:

curl -X POST http://localhost:5000/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-api-key" \
  -d '{
    "links": [
      "https://example.com",
      "https://another-site.com",
      "https://third-site.com"
    ]
  }'

Performance Features

  • Concurrent Processing: Processes up to 4 URLs simultaneously (see the sketch after this list)
  • Batch Processing: Large numbers of URLs are processed in batches
  • Timeout Handling: 30-second timeout per request
  • Memory Efficient: Content is processed in streams where possible
  • Error Resilience: Individual URL failures don't affect other URLs
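
The concurrency model described above maps naturally onto a thread pool with four workers and a 30-second per-request timeout. A minimal sketch of that pattern (not the project's actual code):

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # 30-second timeout per request, matching the limit described above
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

def scrape_batch(urls, workers=4):
    results = []
    # up to 4 URLs are fetched simultaneously; one failure never
    # affects the remaining futures
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results.append({"link": url, "status": "success", "content": future.result()})
            except Exception as exc:
                results.append({"link": url, "status": "error", "error": str(exc)})
    return results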

Error Handling

The API provides detailed error information:

  • 400: Invalid request format or missing required fields
  • 401: Invalid or missing API key
  • 404: Target URL not found (surfaced per result, as in the response example above)
  • 408: Request timeout
  • 500: Internal server error

Security Features

  • API key authentication required for all scraping operations (a sketch of the check follows this list)
  • Non-root user in Docker container
  • Input validation and sanitization
  • Rate limiting through concurrent request limits
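
The API-key check itself can be implemented as a small Flask decorator. A sketch of that pattern, with illustrative names rather than the project's actual code:

import os
from functools import wraps

from flask import jsonify, request

def require_api_key(view):
    @wraps(view)
    def wrapped(*args, **kwargs):
        # compare the X-API-Key header against the configured key
        if request.headers.get("X-API-Key") != os.environ.get("API_KEY"):
            return jsonify({"error": "Invalid or missing API key"}), 401
        return view(*args, **kwargs)
    return wrapped

A constant-time comparison such as hmac.compare_digest is a safer choice than != if timing attacks are a concern.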

Monitoring

  • Health check endpoint at /health
  • Comprehensive logging
  • Docker health checks included
  • Request/response tracking

Development

Running Tests

# Install test dependencies
pip install pytest pytest-asyncio

# Run tests
pytest
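
A starting point for a test, assuming app.py exposes the Flask instance as app, is to exercise the health endpoint through the test client:

# test_health.py -- illustrative; assumes `app` is importable from app.py
from app import app

def test_health_returns_healthy():
    client = app.test_client()
    resp = client.get("/health")
    assert resp.status_code == 200
    assert resp.get_json()["status"] == "healthy"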

Code Structure

├── app.py              # Main Flask application
├── requirements.txt    # Python dependencies
├── Dockerfile          # Docker configuration
├── docker-compose.yml  # Docker Compose setup
├── .env.example        # Environment variables template
└── README.md           # This file

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

MIT License - see LICENSE file for details.
