Ultimate Monitoring Using Prometheus: Ensuring Optimal Performance & Reliability
Components:
   •   Prometheus: An open-source monitoring and alerting toolkit. It works by scraping: at regular
       intervals it contacts each target system's metrics endpoint and pulls in the data (a minimal
       scrape configuration is sketched after this list).
   •   Node exporter: A monitoring agent installed on each target machine to expose host-level
       metrics (CPU, memory, disk, and so on) on an endpoint that Prometheus can scrape.
   •   Blackbox exporter: Probes endpoints from the outside over HTTP/HTTPS (and other protocols),
       so Prometheus can tell whether a website is reachable and responding.
   •   Alert manager: The Alertmanager handles alerts sent by client applications such as the
       Prometheus server. We use it to send notifications when alert conditions are met, for example
       when the website has been down continuously for 1-5 minutes or when a service is unavailable.
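To make the scraping model concrete, here is a minimal prometheus.yml sketch (the job name and target address are placeholders for illustration, not values from this project):
global:
  scrape_interval: 15s                    # how often Prometheus pulls metrics from each target

scrape_configs:
  - job_name: "example_job"               # placeholder job name
    static_configs:
      - targets: ["<target-host>:9100"]   # placeholder host exposing a /metrics endpoint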
Pre-requisites to start:
Created a security group with the following ports open:
      • 22 for SSH
      • 80 for HTTP
      • 443 for HTTPS
      • 25 for SMTP
      • 465 for SMTPS
      • 587 for SMTP (submission)
      • 9090 for Prometheus
      • 9093 for Alert manager
      • 9115 for Blackbox Exporter
      • 9100 for Node Exporter
      • 8080 for the Boardgame application (used in Step 4)
Project steps:
Step 1: Launched two EC2 instances with the Ubuntu AMI (instance type = t2.medium, storage = 20 GB)
and named them Virtual Machine 1 and Virtual Machine 2.
The Prometheus component and exporter tar files are available at: https://prometheus.io/download/
Step 2: On Virtual Machine 1:
Download and start Node Exporter.
→ sudo apt update
## Download Node Exporter
→ wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
## Extract Node Exporter
→ tar xvfz node_exporter-1.8.1.linux-amd64.tar.gz
→ mv node_exporter-1.8.1.linux-amd64 node_exporter
## Start Node Exporter
→ cd node_exporter
→ ./node_exporter &
Step 3:
On Virtual Machine 2, install Prometheus, Alertmanager, and Blackbox Exporter.
Install Prometheus
→ sudo apt update
→ wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
→ tar xvfz prometheus-2.52.0.linux-amd64.tar.gz
→ mv prometheus-2.52.0.linux-amd64 prometheus
→ cd prometheus
→ ./prometheus --config.file=prometheus.yml &
Alert Manager
→ wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
→ tar xvfz alertmanager-0.27.0.linux-amd64.tar.gz
→ mv alertmanager-0.27.0.linux-amd64 alertmanager
→ cd alertmanager
→ ./alertmanager --config.file=alertmanager.yml &
Blackbox Exporter
→ wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
→ tar xvfz blackbox_exporter-0.25.0.linux-amd64.tar.gz
→ mv blackbox_exporter-0.25.0.linux-amd64 blackbox_exporter
→ cd blackbox_exporter
→ ./blackbox_exporter &
Once the above steps are complete, the prometheus, alertmanager, and blackbox_exporter folders are all in place on VM-2.
Once the Node Exporter on VM-1 is up and running, its web page is reachable on port 9100.
Step 4:
Now let's run a simple game application to monitor.
To build and run the Boardgame application (with its source already cloned onto the VM), we need Java
and Maven, so install them using the commands below:
→ cd Boardgame
→ sudo apt install openjdk-11-jdk-headless -y     // JDK (not just the JRE) so Maven can compile
→ sudo apt install maven -y
→ mvn package                                    // to build the project
We can then execute the jar file to run the application and open it in a browser:
→ cd target
→ ls          // can see .jar file
→ java -jar database_service_project-0.0.4.jar
Now we can access the game application at: http://3.135.20.106:8080/
Step 5:
Next, go to VM-2 to configure the Prometheus server by defining alert rules for the different
scenarios; based on these rules we will receive alerts.
→ cd prometheus
→ ./prometheus &
The Prometheus server can be accessed at: http://3.145.128.69:9090/graph
For now we can't see any alert rules, so let's create a new alert_rules.yaml file to configure alert
rules on the Prometheus server.
vi alert_rules.yaml
groups:
- name: alert_rules                       # Name of the alert rules group
  rules:
    - alert: InstanceDown
      expr: up == 0                       # Expression to detect instance down
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Endpoint {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

    - alert: WebsiteDown
      expr: probe_success == 0            # Expression to detect website down
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Website down"
        description: "The website at {{ $labels.instance }} is down."

    - alert: HostOutOfMemory
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 25   # Expression to detect low memory
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Host out of memory (instance {{ $labels.instance }})"
        description: "Node memory is filling up (< 25% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

    - alert: HostOutOfDiskSpace
      expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 50   # Expression to detect low disk space
      for: 1s
      labels:
        severity: warning
      annotations:
        summary: "Host out of disk space (instance {{ $labels.instance }})"
        description: "Disk is almost full (< 50% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

    - alert: HostHighCpuLoad
      expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80   # Expression to detect high CPU load
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Host high CPU load (instance {{ $labels.instance }})"
        description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

    - alert: ServiceUnavailable
      expr: up{job="node_exporter"} == 0   # Expression to detect service unavailability
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Service Unavailable (instance {{ $labels.instance }})"
        description: "The service {{ $labels.job }} is not available\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

    - alert: HighMemoryUsage
      expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) * 100 > 90   # Expression to detect high memory usage
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "High Memory Usage (instance {{ $labels.instance }})"
        description: "Memory usage is > 90%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

    - alert: FileSystemFull
      expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10   # Expression to detect file system almost full
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "File System Almost Full (instance {{ $labels.instance }})"
        description: "File system has < 10% free space\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Now we need to point the Prometheus server at the above rules file by updating the prometheus.yml
file (see the snippet below).
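A minimal sketch of the change, assuming alert_rules.yaml sits in the same directory as the prometheus binary (adjust the path if it is stored elsewhere):
rule_files:
  - "alert_rules.yaml"      # load the alert rules created above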
Now, to view these alert rules on the Prometheus web page, we need to restart the Prometheus server:
→ pgrep prometheus                    // to get the process id
→ kill <process-id>
→ ./prometheus &
Step 6:
Now we need to connect both the Alertmanager and the VM-1 Node Exporter to the Prometheus server by
updating the prometheus.yml file (a sketch follows).
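A sketch of the relevant prometheus.yml sections, assuming Alertmanager runs locally on VM-2 (port 9093) and Node Exporter runs on VM-1 (port 9100); <VM-1-IP> is a placeholder for the actual address:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]      # Alertmanager on VM-2

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]        # Prometheus scrapes itself
  - job_name: "node_exporter"              # job name referenced by the ServiceUnavailable rule
    static_configs:
      - targets: ["<VM-1-IP>:9100"]        # Node Exporter on VM-1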
After restarting the Prometheus server, we should be able to see the Node Exporter in the Prometheus
Targets section.
Next, we need to configure the Blackbox Exporter to probe the website application, so let's update
the scrape configs in the prometheus.yml file (a sketch follows):
vi prometheus.yml
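A sketch of the additional job appended under scrape_configs, assuming the Blackbox Exporter runs locally on VM-2 (port 9115) with its default http_2xx module and probes the Boardgame application on VM-1 (the URL is a placeholder):
  - job_name: "blackbox"
    metrics_path: /probe
    params:
      module: [http_2xx]                   # probe over HTTP and expect a 2xx response
    static_configs:
      - targets:
          - http://<VM-1-IP>:8080          # the Boardgame application
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target       # pass the target URL as the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance             # show the probed URL as the instance label
      - target_label: __address__
        replacement: localhost:9115        # the Blackbox Exporter's own address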
Restart the Prometheus server to reflect the changes
We also need the Blackbox Exporter running.
When we start Alertmanager, we won't see any notifications yet, because Alertmanager has not been
configured.
So, let's configure it.
Now we need to configure email notifications so that we get emails when the defined alert conditions
are met.
To receive email notifications via Gmail, we need to enable 2-step verification on the Gmail account.
Step 7:
Next, go to https://myaccount.google.com/apppasswords, enter an app name, and generate an app
password, which will be used in the Alertmanager routing configuration.
cd alertmanager
vi alertmanager.yml
---
route:
  group_by:
    - alertname
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: email-notifications

receivers:
  - name: email-notifications
    email_configs:
      - to: jayasample1234@gmail.com
        from: monitor@example.com
        smarthost: smtp.gmail.com:587
        auth_username: jayasample1234@gmail.com
        auth_identity: jayasample1234@gmail.com
        auth_password: "xxxx xxxx xxxx xxxx"   # the 16-character Gmail app password from Step 7
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal:
      - alertname
      - dev
      - instance
Now, restart the Alertmanager and check.
Hurray, the monitoring setup is complete!
Everything looks fine now.
Step 8:
Next, we will test the end-to-end functionality by shutting down the game application.
The alert status first shows as Pending.
After 1 minute the status changes to Firing, and soon we receive an email notification.
The notification can also be viewed in Alertmanager.
Next, we will try terminating the Node Exporter.
Terminating the Node Exporter sends notifications for both the EC2 instance (InstanceDown) and the
service (ServiceUnavailable).