
Web Crawler API

This project is a simple web service for fetching and storing content from given URLs. It is built using NestJS, MongoDB, and Docker.

🚀 Features

  • POST /urls: Submit a list of URLs to crawl. The response includes a submissionId for use with the GET requests below. You can also pass your own submissionId in the request body.
  • GET /urls: Returns the 20 most recent crawled results.
  • GET /urls/:submissionId: Returns results for a specific submission batch, also limited to 20 results (see the controller sketch below).
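
For orientation, here is a minimal sketch of the NestJS controller behind these routes, assuming the layout shown in the Project Structure section. The service method names (fetchAndStore, findRecent, findBySubmission) are illustrative assumptions; see src/url/url.controller.ts for the actual implementation.

import { Body, Controller, Get, Param, Post } from '@nestjs/common';
import { UrlService } from './url.service';
import { FetchUrlsDto } from './dto/fetch-urls.dto';

@Controller('urls')
export class UrlController {
  constructor(private readonly urlService: UrlService) {}

  // POST /urls: crawl the submitted URLs and store the results under a submissionId
  // (method names on urlService are assumptions)
  @Post()
  fetchUrls(@Body() dto: FetchUrlsDto) {
    return this.urlService.fetchAndStore(dto);
  }

  // GET /urls: the 20 most recent results across all submissions
  @Get()
  findRecent() {
    return this.urlService.findRecent();
  }

  // GET /urls/:submissionId: results for one submission batch
  @Get(':submissionId')
  findBySubmission(@Param('submissionId') submissionId: string) {
    return this.urlService.findBySubmission(submissionId);
  }
}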

🐳 Running with Docker

📦 Build the Docker image:

docker build -t web-crawler .

▶️ Run the container:

docker run -d -p 27017:27017 --name mongo mongo:latest
docker run -d -p 8080:3000 --name web-crawler web-crawler

ℹ️ The app listens on port 3000 inside the container, so we expose it as 8080 on the host.
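
With the two docker run commands above, both containers land on Docker's default bridge network, where container-name DNS is not available, so the app cannot reach MongoDB by hostname. One way to connect them is a user-defined network, sketched below; the MONGODB_URI variable name is an assumption, so check the app's configuration for the actual name it reads.

# Put both containers on a shared network so the app can reach MongoDB as "mongo"
docker network create crawler-net
docker run -d --network crawler-net --name mongo mongo:latest
docker run -d --network crawler-net -p 8080:3000 \
  -e MONGODB_URI=mongodb://mongo:27017/web-crawler \
  --name web-crawler web-crawler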


🐳 Running with Docker Compose

docker-compose up --build -d
docker-compose down
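
The repository ships its own compose file, and the commands above use it as-is. Purely for reference, a minimal docker-compose.yml for this setup would look roughly like the sketch below; the environment variable name is an assumption, and the actual file in the repository may differ.

services:
  mongo:
    image: mongo:latest
    ports:
      - "27017:27017"
  web-crawler:
    build: .
    ports:
      - "8080:3000"
    environment:
      # Assumed variable name; check the app's configuration
      - MONGODB_URI=mongodb://mongo:27017/web-crawler
    depends_on:
      - mongo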

🔌 API Endpoints

📤 POST /urls

Submit URLs to be fetched.

Request:

{
  "urls": ["https://example.com", "https://google.com"]
}

Response:

{
  "submissionId": "c1f6b5b3-d93b-46b2-a8dc-b80e367f2a59",
  "results": [
    {
      "url": "https://example.com",
      "finalUrl": "https://example.com",
      "status": "success",
      "content": "<!doctype html>...",
      "submissionId": "c1f6b5b3-d93b-46b2-a8dc-b80e367f2a59"
    }
  ]
}

Curl Example:

curl -X POST http://localhost:8080/urls \
  -H "Content-Type: application/json" \
  -d '{"urls":["https://example.com","https://google.com"]}'

📥 GET /urls/:submissionId

Returns results for a specific submission.

Curl Example:

curl http://localhost:8080/urls/c1f6b5b3-d93b-46b2-a8dc-b80e367f2a59

📚 GET /urls

Returns the most recent 20 URL results from all submissions.

Curl Example:

curl http://localhost:8080/urls

🧪 Running Tests

Run all unit and e2e tests with coverage:

npm run test:cov
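
For reference, a minimal e2e test along the lines NestJS projects typically use; this is a sketch rather than the repository's actual test file, and it assumes a reachable MongoDB instance.

import { Test, TestingModule } from '@nestjs/testing';
import { INestApplication } from '@nestjs/common';
import * as request from 'supertest';
import { AppModule } from '../src/app.module';

describe('UrlController (e2e)', () => {
  let app: INestApplication;

  beforeAll(async () => {
    const moduleFixture: TestingModule = await Test.createTestingModule({
      imports: [AppModule],
    }).compile();
    app = moduleFixture.createNestApplication();
    await app.init();
  });

  afterAll(async () => {
    await app.close();
  });

  it('POST /urls returns a submissionId', () => {
    return request(app.getHttpServer())
      .post('/urls')
      .send({ urls: ['https://example.com'] })
      .expect(201)
      .expect((res) => expect(res.body.submissionId).toBeDefined());
  });
});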

📁 Project Structure

src/
├── url/
│   ├── url.controller.ts
│   ├── url.service.ts
│   ├── url.schema.ts
│   └── dto/
│       └── fetch-urls.dto.ts
├── app.module.ts
└── main.ts
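
Based on the response fields shown above (url, finalUrl, status, content, submissionId), url.schema.ts plausibly looks something like the sketch below; the actual schema may define additional fields or options.

import { Prop, Schema, SchemaFactory } from '@nestjs/mongoose';
import { HydratedDocument } from 'mongoose';

// timestamps assumed so that "most recent 20" queries have a sort key
@Schema({ timestamps: true })
export class Url {
  @Prop({ required: true })
  url: string;

  @Prop()
  finalUrl: string;

  // e.g. "success" or a failure status
  @Prop({ required: true })
  status: string;

  @Prop()
  content: string;

  // Index assumed, since results are queried per submission
  @Prop({ required: true, index: true })
  submissionId: string;
}

export type UrlDocument = HydratedDocument<Url>;
export const UrlSchema = SchemaFactory.createForClass(Url);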

📦 Requirements

  • Node.js (v18+)
  • Docker (for deployment)
  • MongoDB (when running with Docker Compose, the app connects to the bundled MongoDB container; otherwise a reachable MongoDB instance is required)

📝 License

MIT
