TonyHou0925/search-engine-cloud

Video link

https://drive.google.com/drive/folders/1JiYwIzF31NI75TKS56Q_CZ2nA-ZUnt8b?usp=share_link

Team Members

Tony Hou (kuanminh) & Judy Yen (judyy)

App Structure

Lightweight App Implementation: docker pull tony0925/frontend:v23

Frontend

Users interact with the system through a Flask-based web frontend. The app forwards user requests to the backend via Kafka.

Users can:

  • Upload .tar.gz compressed datasets.
  • Trigger inverted index construction.
  • Search for a specific keyword.
  • Request a Top-N frequent term list.
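
Each of these actions can be serialized into a Kafka message. The sketch below is illustrative only: the `build_request` helper and its field names are hypothetical stand-ins for what `app/producer.py` actually does, and `kafka-python` is an assumed client library.

```python
import json

# Hypothetical serializer for the four frontend actions; field names are
# illustrative, not taken from the project's producer.py.
def build_request(action, **params):
    """Serialize a user action into a JSON-encoded Kafka message value."""
    allowed = {"upload", "build_index", "search", "top_n"}
    if action not in allowed:
        raise ValueError(f"unsupported action: {action}")
    return json.dumps({"action": action, **params}).encode("utf-8")

# With a running broker and the kafka-python client (an assumption), the
# frontend would publish to the search-requests topic roughly like this:
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="kafka:9092")
# producer.send("search-requests", build_request("search", term="sir"))
```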

Message Queue

Kafka acts as a bridge between the frontend and the backend, using two topics:

  • search-requests: Receives requests from the frontend.
  • search-responses: Sends processed results back from the backend.
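
One way to picture the two topics is as a request/reply pair. The envelopes below are hypothetical: the `id` correlation field is an assumption for illustration, not something taken from the project's code.

```python
import uuid

# Illustrative message envelopes for the two topics. The id field, used to
# match a reply on search-responses with its request on search-requests,
# is an assumed design, not the project's actual schema.
def make_request(action, **params):
    return {"id": str(uuid.uuid4()), "action": action, **params}

def make_response(request, result):
    # Echo the request id so the frontend can pair the reply with its request.
    return {"id": request["id"], "action": request["action"], "result": result}
```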

Backend

The backend listens to Kafka topics using a Kafka consumer.

Once a message is received, the consumer:

  • Parses the request type (e.g., build index, search, top-n).
  • Triggers the corresponding Apache Spark job on the cloud (Dataproc).
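
The parse-and-dispatch step can be sketched as a small dispatch table. This is a hedged sketch, not the actual `consumer.py`: the handler names are hypothetical placeholders for the real Spark-job triggers.

```python
import json

# Hypothetical dispatch step mirroring the consumer described above: parse
# the request type, then hand the request to the matching Spark-job trigger.
def handle_message(raw_value, handlers):
    request = json.loads(raw_value)
    action = request.get("action")
    if action not in handlers:
        raise ValueError(f"unknown request type: {action}")
    return handlers[action](request)

# In the real consumer this would sit inside a Kafka polling loop, e.g.:
# for msg in KafkaConsumer("search-requests", bootstrap_servers="kafka:9092"):
#     handle_message(msg.value, {"build_index": ..., "search": ..., "top_n": ...})
```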

Three types of Spark jobs:

  • build index: Parses uploaded files and creates an inverted index.
  • search term: Searches for the frequency and locations of a term.
  • top n: Aggregates and returns the most frequent terms across documents.
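
The core of the build-index job has a simple map/reduce shape. The sketch below shows that shape in pure Python for clarity; the real `backend/build_index.py` runs on Spark, and the `{doc_id: text}` input format is an assumption.

```python
from collections import defaultdict

# Pure-Python sketch of the map/reduce shape behind the build-index job.
# docs is assumed to be {doc_id: text}; the actual job runs on Spark/Dataproc.
def build_inverted_index(docs):
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] += 1  # term -> {document: frequency}
    return {term: dict(postings) for term, postings in index.items()}
```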

Infrastructure Deployment with Terraform

  1. GKE Cluster: Provisioned via Terraform with a custom e2-medium node pool, based on the region in terraform.tfvars. Terraform waits for readiness before continuing.
  2. Kafka (via Helm): Deployed using the Bitnami Helm chart with single-replica Kafka and Zookeeper. Helm connects to GKE using dynamic credentials.
  3. Frontend and Backend Deployment: Kubernetes manifests deploy the frontend (frontend-deployment.yaml), backend consumer (consumer-deployment.yaml), and service (service.yaml) using kubectl_manifest.

Steps to run

  1. Create a GCS Bucket to store the remote Terraform state.
  2. Deploy the infrastructure using Terraform with the following commands:

terraform apply -target=google_container_cluster.gke_cluster
terraform apply -target=google_container_node_pool.gke_nodes
terraform apply

  3. Get the external IP of the frontend Service from the GKE cluster.
  4. Access the web interface at that IP.

Assumptions

  1. The application accepts only .tar.gz files for upload.
  2. Search input must be a single word (e.g., sir, good).
  3. After uploading a file, wait 3–4 minutes before performing a search, even if the web interface displays “Inverted indices were constructed successfully!”; backend processing may still be in progress.

Development

app/ — Frontend Web Application (Flask)

Contains the lightweight Flask-based web interface that allows users to upload files, initiate index construction, and perform search or Top-N queries.

  • app.py: Entry point that launches the Flask server and handles routing.
  • producer.py: Kafka producer that sends user actions (e.g., search or indexing requests) to the Kafka topic.
  • requirements.txt: Lists Python dependencies for the frontend application.
  • static/: (Optional) Static assets such as CSS or JavaScript.
  • templates/: HTML templates used to render the UI.

backend/

Handles request processing, Spark job execution, and integration with Kafka.

  • consumer.py: Kafka consumer that listens for requests and triggers the appropriate Spark job.
  • build_index.py: Constructs an inverted index from the uploaded datasets.
  • search_term.py: Searches the index for a given keyword and retrieves its frequency and location.
  • top_n.py: Identifies and returns the top N most frequent terms.
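
The two query jobs can be sketched against an inverted index of assumed shape `term -> {doc_id: count}`. These are standalone illustrations of what `search_term.py` and `top_n.py` compute; the real jobs run on Spark/Dataproc.

```python
from collections import Counter

# Sketches of the two query jobs over an in-memory inverted index whose
# assumed shape is term -> {doc_id: count}.
def search_term(index, term):
    postings = index.get(term, {})
    return {"frequency": sum(postings.values()), "locations": sorted(postings)}

def top_n_terms(index, n):
    # Sum each term's counts across documents, then take the n largest.
    totals = Counter({t: sum(p.values()) for t, p in index.items()})
    return totals.most_common(n)
```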

terraform/

  • main.tf: Main Terraform configuration for provisioning GKE, Kafka, and other cloud resources.
  • output.tf: Specifies the outputs to be displayed after infrastructure provisioning.
  • terraform.tfvars: Contains actual values (e.g., project ID, region) for the declared variables.
  • variables.tf: Declares all configurable variables used throughout the Terraform scripts.
  • deployment/: Contains three YAML manifests: frontend-deployment.yaml and consumer-deployment.yaml define the frontend and consumer deployments, and service.yaml exposes them as a Kubernetes service.

Dockerfile

  • Dockerfile.consumer: Builds a Spark-based Kafka consumer image that listens for requests and triggers Spark jobs such as inverted-index construction or Top-N term processing. It is based on the Bitnami Spark image and runs consumer.py with spark-submit.

  • Dockerfile.frontend: Builds the Flask web-app container that serves the user interface for file uploads and search queries. It installs the dependencies and runs the app on port 5001.