A serverless web scraper that extracts content from specified websites and returns structured data for downstream processing. This service focuses solely on scraping and organizing responses; other services handle summarization, keyword matching, and storage.
- EventBridge triggers the Lambda function daily using the schedule expression rate(1 day)
- Lambda Function pulls the Docker container image from Amazon ECR and executes it
- Docker Container runs inside Lambda's execution environment containing:
  - Playwright with headless Chromium browser to scrape configured websites
  - Scraper Service that extracts content, cleans it, and structures the response
- Notification Service receives real-time updates via HTTP POST to BOTLINE_ENDPOINT
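The "extracts content, cleans it, and structures the response" step can be illustrated with a minimal sketch. The function names `cleanContent` and `structureContent` are hypothetical, and the cleaning rule (collapse whitespace runs, then trim) is an assumption; the actual Scraper Service may clean content differently.

```typescript
// Hypothetical structured form of one page; field names follow the sample response.
interface StructuredContent {
  title: string;
  content: string;
  contentLength: number;
}

// Assumed cleaning rule: collapse any run of whitespace into one space, then trim.
function cleanContent(raw: string): string {
  return raw.replace(/\s+/g, " ").trim();
}

// Clean the raw title and body text and record the cleaned body length.
function structureContent(rawTitle: string, rawBody: string): StructuredContent {
  const content = cleanContent(rawBody);
  return {
    title: cleanContent(rawTitle),
    content,
    contentLength: content.length,
  };
}
```

In a sketch like this, the raw strings would come from whatever the Playwright page returns, and only the cleaned text is passed downstream.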
- Node.js 22+
- pnpm
- Docker
- AWS CLI configured
# Install dependencies
pnpm install
# Build the project
pnpm build
# Build the Docker image
chmod +x ./scripts/docker-build.sh
./scripts/docker-build.sh
# Run the container locally
./scripts/docker-run.sh
# Test the function
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
- Build the project
pnpm build
- Authenticate with AWS, create the ECR repository, build and tag the Docker image, and push it to ECR.
# Bash
IMAGE_TAG=your-tag ./scripts/_main-deploy-to-ecr.sh
OR
# PowerShell
.\scripts\_main-deploy-to-ecr-win.ps1 [-Tag <string>]
- Deploy the Lambda function using Serverless Framework.
pnpm deploy
src/
├── index.ts # Main Lambda handler
├── scraper/ # Web scraping logic
│   ├── scraper-service.ts # Playwright-based scraper
│   └── mock-scraper-service.ts # Mock scraper for local development
└── utils/ # Utilities and configurations
    └── website-configs.ts # Website scraping configurations
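The project layout above lists a Playwright-based scraper alongside a mock scraper for local development. One way the two could share a contract is sketched below. Everything here is illustrative: the `WebsiteConfig` shape, the `Scraper` interface, the `MockScraper` class, and the `USE_MOCK_SCRAPER` flag are assumptions, not the repository's actual definitions.

```typescript
// Hypothetical config entry; the real shape in website-configs.ts may differ.
interface WebsiteConfig {
  name: string;
  url: string;
  keywords: string[];
}

interface PageContent {
  title: string;
  content: string;
}

// Common contract so the handler does not care which implementation it receives.
interface Scraper {
  scrape(site: WebsiteConfig): Promise<PageContent>;
}

// Stand-in for a mock scraper: returns canned content, no browser required.
class MockScraper implements Scraper {
  async scrape(site: WebsiteConfig): Promise<PageContent> {
    return {
      title: `Mock page for ${site.name}`,
      content: `Mock content from ${site.url}`,
    };
  }
}

// Hypothetical selection helper; USE_MOCK_SCRAPER is an assumed flag name.
// The real scraper is built lazily so local runs never launch Chromium.
function createScraper(
  env: Record<string, string | undefined>,
  createReal: () => Scraper,
): Scraper {
  return env["USE_MOCK_SCRAPER"] === "true" ? new MockScraper() : createReal();
}
```

Passing the real scraper as a factory keeps the browser dependency out of the local development path.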
- Lambda has ECR permissions to pull the container image
- Outbound HTTPS connections to target websites (no VPC required)
- External notification service accessed via environment variables (BOTLINE_TOKEN)
- Container runs in AWS Lambda's managed runtime environment
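As a rough illustration of how the notification call might be wired from the environment variables, the sketch below builds an HTTP POST from BOTLINE_ENDPOINT and BOTLINE_TOKEN. The Bearer authorization header, the `{ message }` payload, and the choice to skip notifications when either variable is missing are all assumptions; the helper names `buildNotificationRequest` and `notify` are hypothetical.

```typescript
interface NotificationRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build the POST request, or return null when the service is not configured.
function buildNotificationRequest(
  env: Record<string, string | undefined>,
  message: string,
): NotificationRequest | null {
  const endpoint = env["BOTLINE_ENDPOINT"];
  const token = env["BOTLINE_TOKEN"];
  if (!endpoint || !token) return null; // assumption: notifications are optional
  return {
    url: endpoint,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`, // assumed auth scheme
      },
      body: JSON.stringify({ message }), // assumed payload shape
    },
  };
}

// Minimal fetch-compatible signature so the sender can be injected and tested.
type FetchLike = (
  url: string,
  init: NotificationRequest["init"],
) => Promise<{ ok: boolean; status: number }>;

// Send the update; a notification failure is reported but never thrown.
async function notify(
  env: Record<string, string | undefined>,
  message: string,
  fetchFn: FetchLike,
): Promise<boolean> {
  const request = buildNotificationRequest(env, message);
  if (request === null) return false;
  try {
    const response = await fetchFn(request.url, request.init);
    return response.ok;
  } catch {
    return false;
  }
}
```

On Node.js 22 the global `fetch` should be usable as the `fetchFn` argument, with `process.env` as `env`. Swallowing errors here is a design choice so that a notification outage does not fail the scrape itself.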
The service returns a structured JSON response with scraped content:
{
  "success": true,
  "timestamp": "2025-08-21T10:30:00.000Z",
  "sitesProcessed": 2,
  "totalSitesConfigured": 2,
  "results": [
    {
      "name": "Australian Embassy in Argentina - Twitter",
      "url": "https://x.com/EmbAustraliaBA",
      "title": "Page Title",
      "content": "Scraped content...",
      "keywords": ["keyword1", "keyword2"],
      "contentLength": 1250,
      "scrapedAt": "2025-08-21T10:30:00.000Z",
      "status": "success"
    }
  ],
  "executionTime": 5432
}
This service is responsible for:
- Web scraping using Playwright
- Content extraction and cleaning
- Structured response formatting
- Error handling and resilience
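The structured response formatting and error handling responsibilities can be sketched together: each site is scraped inside its own try/catch so one failure does not abort the run, and the outcomes are folded into a response with the same top-level fields as the sample JSON. This is a simplified sketch, not the service's implementation. The `scrapeAll` and `ScrapeFn` names are hypothetical, per-site fields are reduced, and several semantics are assumed: `executionTime` is in milliseconds, `sitesProcessed` counts every attempted site, and `success` is true when at least one site succeeds.

```typescript
interface SiteTarget {
  name: string;
  url: string;
}

interface SiteResult {
  name: string;
  url: string;
  status: "success" | "error";
  content?: string;
  error?: string;
}

interface ScrapeResponse {
  success: boolean;
  timestamp: string;
  sitesProcessed: number;
  totalSitesConfigured: number;
  results: SiteResult[];
  executionTime: number; // assumed to be milliseconds
}

// Injected scraping function, e.g. a wrapper around a Playwright page visit.
type ScrapeFn = (url: string) => Promise<string>;

async function scrapeAll(sites: SiteTarget[], scrape: ScrapeFn): Promise<ScrapeResponse> {
  const startedAt = Date.now();
  const results: SiteResult[] = [];
  // Sites are visited one at a time; a failure is recorded and the loop continues.
  for (const site of sites) {
    try {
      const content = await scrape(site.url);
      results.push({ name: site.name, url: site.url, status: "success", content });
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      results.push({ name: site.name, url: site.url, status: "error", error });
    }
  }
  return {
    success: results.some((r) => r.status === "success"),
    timestamp: new Date().toISOString(),
    sitesProcessed: results.length,
    totalSitesConfigured: sites.length,
    results,
    executionTime: Date.now() - startedAt,
  };
}
```

Recording the error message per site, rather than throwing, lets downstream services see which sources failed while still receiving the content that was scraped.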
This service is NOT responsible for:
- Content summarization (handled by downstream services)
- Keyword matching (handled by downstream services)
- Data storage (handled by downstream services)
- File system operations