https://drive.google.com/drive/folders/1JiYwIzF31NI75TKS56Q_CZ2nA-ZUnt8b?usp=share_link
Tony Hou (kuanminh) & Judy Yen (judyy)
Lightweight App Implementation: docker pull tony0925/frontend:v23
Users interact with the system through a Flask-based frontend web app, which forwards their requests to the backend via Kafka.
Users can:
- Upload .tar.gz compressed datasets.
- Trigger inverted index construction.
- Search for a specific keyword.
- Request a Top-N frequent term list.
Kafka acts as a bridge between the frontend and backend through two topics:
- search-requests: Receives requests from the frontend.
- search-responses: Returns processed results from the backend to the frontend.
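The request messages exchanged over these topics can be sketched as plain JSON. Note that the exact schema (field names like `type` and `payload`) is an assumption for illustration, not taken from the actual producer.py:

```python
import json

# Hypothetical request schema; the real field names in producer.py may differ.
def make_request(request_type, **payload):
    """Serialize a frontend request for the search-requests topic."""
    if request_type not in {"build_index", "search", "top_n"}:
        raise ValueError(f"unknown request type: {request_type}")
    return json.dumps({"type": request_type, "payload": payload}).encode("utf-8")

if __name__ == "__main__":
    # Sending would use kafka-python's KafkaProducer (broker address assumed):
    # from kafka import KafkaProducer
    # producer = KafkaProducer(bootstrap_servers="kafka:9092")
    # producer.send("search-requests", make_request("search", term="sir"))
    print(make_request("search", term="sir"))
```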
The backend listens to Kafka topics using a Kafka consumer.
Once a message is received, the consumer:
- Parses the request type (e.g., build index, search, top-n).
- Triggers the corresponding Apache Spark job on the cloud (Dataproc).
Three types of Spark jobs:
- build index: Parses uploaded files and creates an inverted index.
- search term: Searches for the frequency and locations of a term.
- top n: Aggregates and returns the most frequent terms across documents.
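The consumer's parse-and-dispatch step can be sketched as a mapping from request type to job trigger. The job names and callable signatures here are assumptions; in the real backend each trigger submits a Spark job to Dataproc via spark-submit:

```python
import json

def dispatch(raw_message, jobs):
    """Parse a Kafka message and invoke the matching Spark job trigger."""
    request = json.loads(raw_message)
    request_type = request["type"]
    if request_type not in jobs:
        raise ValueError(f"unsupported request type: {request_type}")
    return jobs[request_type](**request.get("payload", {}))

if __name__ == "__main__":
    # Stub triggers standing in for the real Dataproc job submissions.
    jobs = {
        "build_index": lambda **kw: "build_index job submitted",
        "search": lambda **kw: f"searching for {kw['term']}",
        "top_n": lambda **kw: f"top {kw['n']} terms requested",
    }
    print(dispatch('{"type": "search", "payload": {"term": "good"}}', jobs))
```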
- GKE Cluster: Provisioned via Terraform with a custom e2-medium node pool, based on the region in terraform.tfvars. Terraform waits for readiness before continuing.
- Kafka (via Helm): Deployed using the Bitnami Helm chart with single-replica Kafka and Zookeeper. Helm connects to GKE using dynamic credentials.
- Frontend and Backend Deployment: Kubernetes manifests deploy the frontend (frontend-deployment.yaml), backend consumer (consumer-deployment.yaml), and service (service.yaml) using kubectl_manifest.
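A minimal node-pool definition consistent with the description above might look like the following; the resource names and variables are assumptions, and the actual main.tf may differ:

```hcl
resource "google_container_node_pool" "gke_nodes" {
  name       = "custom-pool"
  cluster    = google_container_cluster.gke_cluster.name
  location   = var.region        # taken from terraform.tfvars
  node_count = 1

  node_config {
    machine_type = "e2-medium"   # custom node pool machine type
  }
}
```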
- Create a GCS Bucket to store the remote Terraform state.
- Deploy Infrastructure Using Terraform with the following commands:
  terraform apply -target=google_container_cluster.gke_cluster
  terraform apply -target=google_container_node_pool.gke_nodes
  terraform apply
- Get the external IP of the frontend Service on the GKE cluster
- Access the Web Interface
- The application accepts only .tar.gz files for upload.
- Search input must be a single word (e.g., sir, good).
- After uploading a file, please wait 3–4 minutes before performing a search — even if the web interface displays “Inverted indices were constructed successfully!” The backend processing may still be in progress.
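The .tar.gz-only upload restriction can be sketched as a simple filename check; the helper name is hypothetical, and the real frontend may validate uploads differently:

```python
def is_allowed_upload(filename):
    """Accept only .tar.gz compressed datasets, per the upload restriction."""
    return filename.lower().endswith(".tar.gz")

if __name__ == "__main__":
    for name in ("shakespeare.tar.gz", "data.zip", "notes.txt"):
        print(name, is_allowed_upload(name))
```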
Contains the lightweight Flask-based web interface that allows users to upload files, initiate index construction, and perform search or Top-N queries.
- app.py: Entry point that launches the Flask server and handles routing.
- producer.py: Kafka producer that sends user actions (e.g., search or indexing requests) to the Kafka topic.
- requirements.txt: Lists Python dependencies for the frontend application.
- static/: (Optional) Folder for static assets such as CSS or JavaScript.
- templates/: HTML templates used to render the UI.
Handles request processing, Spark job execution, and integration with Kafka.
- consumer.py: Kafka consumer that listens for requests and triggers the appropriate Spark job.
- build_index.py: Constructs an inverted index from the uploaded datasets.
- search_term.py: Searches the index for a given keyword and retrieves its frequency and locations.
- top_n.py: Identifies and returns the top N most frequent terms.
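The core logic of the three backend jobs can be sketched in plain Python. The real scripts run as Spark jobs on Dataproc; this is a local, single-machine approximation with assumed function names:

```python
import re
from collections import Counter, defaultdict

def build_index(documents):
    """Build an inverted index: term -> {doc_id: frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(re.findall(r"[a-z']+", text.lower())).items():
            index[term][doc_id] = count
    return index

def search_term(index, term):
    """Return the total frequency and per-document locations of a term."""
    postings = index.get(term.lower(), {})
    return {"frequency": sum(postings.values()), "documents": postings}

def top_n(index, n):
    """Return the n most frequent terms across all documents."""
    totals = Counter({term: sum(p.values()) for term, p in index.items()})
    return totals.most_common(n)

if __name__ == "__main__":
    docs = {"d1": "good sir, good night", "d2": "the good night"}
    idx = build_index(docs)
    print(search_term(idx, "good"))   # -> frequency 3, in d1 and d2
    print(top_n(idx, 2))
```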
- main.tf: Main Terraform configuration for provisioning GKE, Kafka, and other cloud resources.
- output.tf: Specifies the outputs to be displayed after infrastructure provisioning.
- terraform.tfvars: Contains actual values (e.g., project ID, region) for the declared variables.
- variables.tf: Declares all configurable variables used throughout the Terraform scripts.
- deployment folder: Contains three YAML files: frontend-deployment.yaml (deploys the frontend), consumer-deployment.yaml (deploys the backend consumer), and service.yaml (exposes the frontend and consumer deployments as Kubernetes services).
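A minimal service.yaml consistent with this description might look like the following; the names, ports, and selector labels are assumptions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend-service
spec:
  type: LoadBalancer       # provides the external IP used to reach the web UI
  selector:
    app: frontend          # assumed label on the frontend Deployment's pods
  ports:
    - port: 80
      targetPort: 5001     # Flask app listens on 5001 per Dockerfile.frontend
```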
- Dockerfile.consumer: Sets up a Spark-based Kafka consumer that listens for requests and triggers Spark jobs such as inverted index construction or Top-N term processing. It uses the Bitnami Spark image and runs consumer.py with spark-submit.
- Dockerfile.frontend: Builds a Flask web app container that serves the user interface for file uploads and search queries. It installs dependencies and runs the app on port 5001.
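A Dockerfile.frontend along these lines would match the description; the base image and file layout are assumptions:

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5001
# app.py launches the Flask server on port 5001
CMD ["python", "app.py"]
```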