# AI Web Scraper

A powerful Flask-based web scraper API that can extract content from multiple URLs concurrently and process it using Google's Gemini AI.

## Features
- Concurrent Processing: Scrapes up to 4 URLs simultaneously per batch
- AI-Powered Content Extraction: Uses Google Gemini to extract useful content and summarize
- Raw HTML Support: Option to return raw HTML instead of processed content
- API Key Protection: Secure your API with custom API keys
- Docker Ready: Fully containerized for easy deployment
- Error Handling: Comprehensive error handling with detailed response codes
- Health Check: Built-in health monitoring endpoint
## API Endpoints

### POST /scrape

Main scraping endpoint that processes URLs and returns extracted content.
Headers:

- `X-API-Key`: Your API key for authentication
- `Content-Type`: `application/json`
Request Body:

```json
{
  "links": ["https://example.com", "https://another-site.com"],
  "raw": false,
  "summarize": false
}
```

Parameters:

- `links` (array, required): Array of URLs to scrape
- `raw` (boolean, optional, default: `false`): Return raw HTML instead of AI-processed content
- `summarize` (boolean, optional, default: `false`): Summarize content (only applies when `raw` is `false`)
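Assuming the server is running locally on port 5000, the same request can be issued from Python using only the standard library. `build_scrape_request` is an illustrative helper, not part of the API itself:

```python
import json
import urllib.request

API_URL = "http://localhost:5000/scrape"  # adjust host/port as needed

def build_scrape_request(links, raw=False, summarize=False,
                         api_key="your-secret-api-key"):
    """Build a POST request matching the /scrape contract documented above."""
    payload = json.dumps({"links": links, "raw": raw, "summarize": summarize})
    return urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )

if __name__ == "__main__":
    req = build_scrape_request(["https://example.com"], summarize=True)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    print(f"{body['successful']}/{body['total_processed']} URLs succeeded")
```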
Response:

```json
{
  "results": [
    {
      "link": "https://example.com",
      "content": "Processed content here...",
      "status": "success"
    },
    {
      "link": "https://failed-site.com",
      "status": "error",
      "error": "HTTP 404",
      "status_code": 404
    }
  ],
  "total_processed": 2,
  "successful": 1,
  "failed": 1
}
```

### GET /health

Health check endpoint to monitor API status.
Response:

```json
{
  "status": "healthy",
  "timestamp": 1693958400.123,
  "gemini_configured": true
}
```

## Installation

### Docker (Recommended)

- Clone the repository:
  ```bash
  git clone <repository-url>
  cd ai-web-scraper
  ```

- Create the environment file:

  ```bash
  cp .env.example .env
  # Edit .env with your API keys
  ```

- Build and run with Docker:

  ```bash
  docker-compose up --build
  ```

### Local Development

- Clone the repository:
  ```bash
  git clone <repository-url>
  cd ai-web-scraper
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set environment variables:

  ```bash
  export API_KEY="your-secret-api-key"
  export GEMINI_API_KEY="your-gemini-api-key"
  ```

- Run the application:

  ```bash
  python app.py
  ```

## Configuration

### Environment Variables

- `API_KEY`: Your custom API key for protecting the endpoint
- `GEMINI_API_KEY`: Google Gemini API key (get one from Google AI Studio)
- `FLASK_ENV`: Flask environment (`development`/`production`)
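For the Docker setup, the same variables go in `.env` (the values below are placeholders):

```env
# .env — do not commit real keys
API_KEY=your-secret-api-key
GEMINI_API_KEY=your-gemini-api-key
FLASK_ENV=production
```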
### Getting a Gemini API Key

- Go to Google AI Studio
- Sign in with your Google account
- Create a new API key
- Copy the key into your `.env` file
## Usage Examples

Basic scraping (AI-processed content):

```bash
curl -X POST http://localhost:5000/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-api-key" \
  -d '{
    "links": ["https://example.com"]
  }'
```

Raw HTML (skip AI processing):

```bash
curl -X POST http://localhost:5000/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-api-key" \
  -d '{
    "links": ["https://example.com"],
    "raw": true
  }'
```

Summarized content:

```bash
curl -X POST http://localhost:5000/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-api-key" \
  -d '{
    "links": ["https://example.com"],
    "raw": false,
    "summarize": true
  }'
```

Multiple URLs:

```bash
curl -X POST http://localhost:5000/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret-api-key" \
  -d '{
    "links": [
      "https://example.com",
      "https://another-site.com",
      "https://third-site.com"
    ]
  }'
```

## Performance Features

- Concurrent Processing: Processes up to 4 URLs simultaneously
- Batch Processing: Large numbers of URLs are processed in batches
- Timeout Handling: 30-second timeout per request
- Memory Efficient: Content is processed in streams where possible
- Error Resilience: Individual URL failures don't affect other URLs
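The batching and error-isolation behavior described above can be sketched with `concurrent.futures`. This is a simplified outline, not the app's actual code; `fetch` is a stand-in for the real request logic (which would also apply the 30-second timeout):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 4  # matches the 4-URL concurrency limit

def fetch(url):
    # Placeholder fetcher; the real app performs the HTTP request
    # (with a 30-second timeout) and optional Gemini processing here.
    return {"link": url, "status": "success"}

def scrape_all(urls, batch_size=MAX_WORKERS):
    """Process URLs in batches of `batch_size`, up to MAX_WORKERS at once."""
    results = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        for i in range(0, len(urls), batch_size):
            batch = urls[i:i + batch_size]
            futures = {pool.submit(fetch, u): u for u in batch}
            for fut in as_completed(futures):
                try:
                    results.append(fut.result())
                except Exception as exc:
                    # One URL failing does not affect the rest of the batch.
                    results.append({"link": futures[fut],
                                    "status": "error",
                                    "error": str(exc)})
    return results
```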
## Error Handling

The API provides detailed error information:
- 401: Invalid or missing API key
- 400: Invalid request format or missing required fields
- 408: Request timeout
- 404: URL not found
- 500: Internal server error
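A client can map per-URL `status_code` values back to these meanings when reporting failures. `explain_failures` is a hypothetical helper, shown here only to illustrate consuming the response shape documented above:

```python
# Meanings taken from the error-code list above.
ERROR_MEANINGS = {
    401: "Invalid or missing API key",
    400: "Invalid request format or missing required fields",
    408: "Request timeout",
    404: "URL not found",
    500: "Internal server error",
}

def explain_failures(response):
    """Summarize per-URL failures from a /scrape response body."""
    lines = []
    for result in response.get("results", []):
        if result.get("status") == "error":
            code = result.get("status_code")
            meaning = ERROR_MEANINGS.get(code, "Unknown error")
            lines.append(f"{result['link']}: {code} ({meaning})")
    return lines
```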
## Security Features

- API key authentication required for all scraping operations
- Non-root user in Docker container
- Input validation and sanitization
- Rate limiting through concurrent request limits
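Input validation of the kind listed above might look like the following sketch; the function name and the per-request cap are illustrative, not taken from the actual implementation:

```python
from urllib.parse import urlparse

def validate_links(links, max_links=50):
    """Reject anything that isn't a well-formed http(s) URL.

    `max_links` is an illustrative cap, not a documented limit.
    """
    if not isinstance(links, list) or not links:
        raise ValueError("'links' must be a non-empty array")
    if len(links) > max_links:
        raise ValueError(f"at most {max_links} links per request")
    for link in links:
        parsed = urlparse(str(link))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"invalid URL: {link!r}")
    return links
```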
## Monitoring

- Health check endpoint at `/health`
- Comprehensive logging
- Docker health checks included
- Request/response tracking
## Testing

```bash
# Install test dependencies
pip install pytest pytest-asyncio

# Run tests
pytest
```

## Project Structure

```
├── app.py              # Main Flask application
├── requirements.txt    # Python dependencies
├── Dockerfile          # Docker configuration
├── docker-compose.yml  # Docker Compose setup
├── .env.example        # Environment variables template
└── README.md           # This file
```
## Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
## License

MIT License - see LICENSE file for details.