Web Scraper Service

A serverless web scraper that extracts content from specified websites and returns structured data for downstream processing. This service focuses solely on scraping and organizing responses; other services handle summarization, keyword matching, and storage.

Communication Flow

EventBridge triggers the Lambda function once a day using the schedule expression rate(1 day)
Lambda Function pulls the Docker container image from Amazon ECR and executes it
Docker Container runs inside Lambda's execution environment containing:
- Playwright with headless Chromium browser to scrape configured websites
- Scraper Service that extracts content, cleans it, and structures the response
Notification Service receives real-time updates via HTTP POST to BOTLINE_ENDPOINT
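
The "cleans it" step in the flow above is not specified in this README. As a rough illustration only, a cleaning helper could look like the sketch below. The function name `cleanContent`, the specific rules, and the `maxLength` limit are all assumptions, not the service's actual implementation.

```typescript
// Hypothetical helper: the real cleaning rules are not documented in this README.
function cleanContent(raw: string, maxLength: number = 5000): string {
  const cleaned = raw
    .replace(/[\u200B-\u200D\uFEFF]/g, "") // drop zero-width characters
    .replace(/\s+/g, " ") // collapse runs of whitespace and newlines into one space
    .trim();
  // Truncate very long pages so the response stays small (the limit is an assumption).
  return cleaned.length > maxLength ? cleaned.slice(0, maxLength) : cleaned;
}
```

The text extracted by Playwright would be passed through a helper like this before it is placed in the structured response.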

🚀 Quick Start

Prerequisites

Node.js 22+
pnpm
Docker
AWS CLI configured

Installation

# Install dependencies
pnpm install

# Build the project
pnpm build

# Make the Docker build script executable, then build the image
chmod +x ./scripts/docker-build.sh
./scripts/docker-build.sh

# Run the container locally
./scripts/docker-run.sh

# Test the function
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

AWS Deployment

Build the project

pnpm build

Run the deploy script, which authenticates with AWS, creates the ECR repository, builds and tags the Docker image, and pushes it to ECR.

# Bash
IMAGE_TAG=your-tag ./scripts/_main-deploy-to-ecr.sh

OR

# PowerShell
.\scripts\_main-deploy-to-ecr-win.ps1 [-Tag <string>]

Deploy the Lambda function using Serverless Framework.

pnpm run deploy

📁 Project Structure

src/
├── index.ts                 # Main Lambda handler
├── scraper/                 # Web scraping logic
│   ├── scraper-service.ts   # Playwright-based scraper
│   └── mock-scraper-service.ts # Mock scraper for local development
└── utils/                   # Utilities and configurations
    └── website-configs.ts   # Website scraping configurations
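
The contents of website-configs.ts are not shown in this README. The sketch below is one plausible shape for a configuration entry, with field names borrowed from the JSON response where possible. The `WebsiteConfig` interface, the `selector` field, and the `validConfigs` helper are hypothetical.

```typescript
// Hypothetical shape for an entry in website-configs.ts (not confirmed by the source).
interface WebsiteConfig {
  name: string; // human-readable label for the site
  url: string; // page to scrape
  selector?: string; // optional CSS selector for the main content (assumption)
  keywords: string[]; // keywords passed through for downstream matching (assumption)
}

const websiteConfigs: WebsiteConfig[] = [
  {
    name: "Australian Embassy in Argentina - Twitter",
    url: "https://x.com/EmbAustraliaBA",
    keywords: ["keyword1", "keyword2"],
  },
];

// Keep only entries that have a name and an HTTPS URL,
// since the Lambda function makes outbound HTTPS connections.
function validConfigs(configs: WebsiteConfig[]): WebsiteConfig[] {
  return configs.filter(
    (c) => c.name.trim().length > 0 && /^https:\/\//.test(c.url)
  );
}
```

Filtering invalid entries up front would let the handler report how many sites are configured separately from how many were actually processed.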

Network & Security

Lambda has ECR permissions to pull the container image
Outbound HTTPS connections to target websites (no VPC required)
External notification service configured via environment variables (BOTLINE_ENDPOINT, BOTLINE_TOKEN)
Container runs in AWS Lambda's managed runtime environment
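
This README only states that notifications are sent as an HTTP POST to BOTLINE_ENDPOINT and that BOTLINE_TOKEN is involved. The sketch below shows one way such a request could be assembled. The JSON body shape, the Bearer authorization header, and the name `buildNotificationRequest` are assumptions.

```typescript
// Hypothetical request shape: the body format and the Bearer header are assumptions.
interface NotificationRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildNotificationRequest(
  endpoint: string,
  token: string | undefined,
  message: string
): NotificationRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token) {
    // Only attach credentials when a token is configured.
    headers["Authorization"] = `Bearer ${token}`;
  }
  return {
    url: endpoint,
    init: { method: "POST", headers, body: JSON.stringify({ message }) },
  };
}
```

Inside the handler, the endpoint and token would typically be read from process.env.BOTLINE_ENDPOINT and process.env.BOTLINE_TOKEN, and the request could be sent with the fetch function built into Node.js 22, for example fetch(req.url, req.init). Keeping the secret in an environment variable keeps it out of the container image.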

📊 Response Format

The service returns a structured JSON response with scraped content:

{
  "success": true,
  "timestamp": "2025-08-21T10:30:00.000Z",
  "sitesProcessed": 2,
  "totalSitesConfigured": 2,
  "results": [
    {
      "name": "Australian Embassy in Argentina - Twitter",
      "url": "https://x.com/EmbAustraliaBA",
      "title": "Page Title",
      "content": "Scraped content...",
      "keywords": ["keyword1", "keyword2"],
      "contentLength": 1250,
      "scrapedAt": "2025-08-21T10:30:00.000Z",
      "status": "success"
    }
  ],
  "executionTime": 5432
}
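
For consumers written in TypeScript, the response above could be described with types along these lines. The field names mirror the sample. The "error" status, the optional `error` field, the millisecond unit for executionTime, and the rule used to compute `success` are assumptions, since only a successful case is shown.

```typescript
// Field names follow the sample response; anything marked "assumption" is not in the README.
interface SiteResult {
  name: string;
  url: string;
  title: string;
  content: string;
  keywords: string[];
  contentLength: number;
  scrapedAt: string; // ISO 8601 timestamp
  status: "success" | "error"; // "error" is an assumption
  error?: string; // assumption: failure reason, if any
}

interface ScrapeResponse {
  success: boolean;
  timestamp: string;
  sitesProcessed: number;
  totalSitesConfigured: number;
  results: SiteResult[];
  executionTime: number; // assumed to be milliseconds
}

// Hypothetical helper that assembles the top-level response from per-site results.
function buildResponse(
  results: SiteResult[],
  totalSitesConfigured: number,
  startedAt: number,
  finishedAt: number = Date.now()
): ScrapeResponse {
  return {
    // Assumption: the run counts as successful if at least one site was scraped.
    success: results.some((r) => r.status === "success"),
    timestamp: new Date(finishedAt).toISOString(),
    sitesProcessed: results.length,
    totalSitesConfigured,
    results,
    executionTime: finishedAt - startedAt,
  };
}
```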

🎯 Service Boundaries

This service is responsible for:

Web scraping using Playwright
Content extraction and cleaning
Structured response formatting
Error handling and resilience
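
How the service achieves resilience is not described here. One common approach, sketched below under that caveat, is to scrape each site independently and convert failures into error entries instead of aborting the whole run. The types and the `toOutcome` helper are hypothetical; the `Settled` type matches the shape of the values returned by Promise.allSettled.

```typescript
// Hypothetical types: only the "success" status appears in this README.
interface Site {
  name: string;
  url: string;
}
interface ScrapedPage {
  title: string;
  content: string;
}
interface SiteOutcome {
  name: string;
  url: string;
  status: "success" | "error";
  title?: string;
  content?: string;
  error?: string;
}

// Same shape as the entries produced by Promise.allSettled.
type Settled<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

// Turn one settled scrape attempt into a result entry, so a failing site
// shows up as an "error" entry rather than failing the whole invocation.
function toOutcome(site: Site, settled: Settled<ScrapedPage>): SiteOutcome {
  if (settled.status === "fulfilled") {
    return {
      name: site.name,
      url: site.url,
      status: "success",
      title: settled.value.title,
      content: settled.value.content,
    };
  }
  const reason = settled.reason;
  const message = reason instanceof Error ? reason.message : String(reason);
  return { name: site.name, url: site.url, status: "error", error: message };
}
```

A handler could run every scrape with Promise.allSettled and then map each settled value through toOutcome alongside its site, which keeps one slow or blocked page from discarding the results of the others.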

This service is NOT responsible for:

Content summarization (handled by downstream services)
Keyword matching (handled by downstream services)
Data storage (handled by downstream services)
File system operations

agustinlozano/webscanner

Folders and files

Latest commit

History

Repository files navigation

Web Scraper Service

Communication Flow

🚀 Quick Start

Prerequisites

Installation

AWS Deployment

📁 Project Structure

Network & Security

📊 Response Format

🎯 Service Boundaries

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages