A Prometheus exporter that gathers Operating System Network Socket Statistics as metrics, providing deep insights into network performance and connection behavior.
This exporter leverages pyroute2 (specifically the ss2 module) to collect detailed TCP socket statistics from the Linux kernel, exposing them as Prometheus metrics for monitoring and alerting.
Key features:

- Real-time TCP metrics: Round-trip time, congestion window, delivery rates
- Histogram support: Latency distributions and flow statistics
- Flexible filtering: Filter by process, network, or port ranges
- Label compression: Reduce metric cardinality with configurable folding

Typical use cases:

- Network performance monitoring and troubleshooting
- TCP connection health tracking
- Application network behavior analysis
- Infrastructure capacity planning
- SLA monitoring and alerting
Each flow becomes its own flow label value, and therefore its own set of time series. Although the number of concurrent flows is bounded by the flow constraints configured at the kernel level, highly dynamic traffic, with many flows being created and torn down, can still produce an overwhelming label space, even over short periods.
Therefore, depending on your specific granularity requirements and available resources, consider the following mitigation strategies:
1. Prometheus retention configuration: Use the --storage.tsdb.retention.time=<duration> flag on your Prometheus server (the older --storage.tsdb.retention flag is deprecated) to bound how long metric label data is retained. This can significantly relieve storage pressure and suits both ad-hoc introspection deployments and larger-scale setups within Prometheus federations.
2. Flow selection filtering: Reduce flow label cardinality by using the exporter's selection configuration to collect only flows with specific characteristics (process, network, port ranges, etc.).
3. Label compression: Configure label folding to compress flow identifiers and reduce metric cardinality while preserving essential information.
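To get a feel for the numbers, the sketch below estimates how many distinct series the exporter can create while a given number of distinct flows is observed (for example within one retention window). It assumes, from the metric tables below, that each of the five per-flow metrics (tcp_rtt, tcp_cwnd, tcp_delivery_rate, tcp_data_segs_in, tcp_data_segs_out) produces one series per flow label, and that the RTT histogram carries only the le label. The function and its parameters are hypothetical, not part of the exporter.

```python
from typing import Optional


def estimate_series(
    flows_seen: int,
    per_flow_metrics: int = 5,
    histogram_bounds: Optional[int] = 18,
) -> int:
    """Rough estimate of distinct series created while `flows_seen`
    different flow labels are observed.

    Assumptions, not taken from the exporter's source:
      * every enabled per-flow metric emits one series per flow label;
      * the RTT histogram has no flow label, so it adds a fixed number of
        series: one per bucket bound plus +Inf, _count and _sum;
      * histogram_bounds=None means the histogram is disabled.
    """
    if flows_seen < 0 or per_flow_metrics < 0:
        raise ValueError("flow and metric counts must be non-negative")
    histogram_series = 0 if histogram_bounds is None else histogram_bounds + 3
    return flows_seen * per_flow_metrics + histogram_series


if __name__ == "__main__":
    # 50,000 distinct flows, all five per-flow metrics, 18 default bucket bounds
    print(estimate_series(50_000))  # 250021
    # Same churn with only the RTT gauge enabled and the histogram disabled
    print(estimate_series(50_000, per_flow_metrics=1, histogram_bounds=None))  # 50000
```

Flow selection shrinks flows_seen, while disabling metrics shrinks per_flow_metrics; both act on the dominant first term.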
The exporter exposes the following Prometheus metrics:

Gauges

| Metric | Description | Labels |
|---|---|---|
| tcp_rtt | TCP round-trip time in milliseconds | flow - Connection identifier with source/destination IP and port |
| tcp_cwnd | TCP congestion window size (in segments) | flow - Connection identifier with source/destination IP and port |
| tcp_delivery_rate | TCP delivery rate in bytes per second | flow - Connection identifier with source/destination IP and port |

Histograms

| Metric | Description | Labels | Buckets |
|---|---|---|---|
| tcp_rtt_hist_ms | TCP round-trip time distribution histogram | le - bucket boundary | Configurable (default: 0.001 ms to 1000 ms) |

Counters

| Metric | Description | Labels |
|---|---|---|
| tcp_data_segs_in | Number of TCP data segments received | flow - Connection identifier with source/destination IP and port |
| tcp_data_segs_out | Number of TCP data segments sent | flow - Connection identifier with source/destination IP and port |
The flow label contains connection information in the format:

```
flow="(SRC#<source_ip>|<source_port>)(DST#<dest_ip>|<dest_port>)"
```

Example:

```
flow="(SRC#192.168.10.58|39366)(DST#104.19.199.151|443)"
```

Example output from the metrics endpoint:

```
# TYPE tcp_rtt gauge
tcp_rtt{flow="(SRC#192.168.10.58|39366)(DST#104.19.199.151|443)"} 53.376
tcp_rtt{flow="(SRC#192.168.10.58|50484)(DST#212.227.17.170|993)"} 39.129
tcp_rtt{flow="(SRC#127.0.0.1|58334)(DST#127.0.0.1|22)"} 2.178
tcp_rtt{flow="(SRC#127.0.0.1|43908)(DST#127.0.0.1|8020)"} 0.038
tcp_rtt{flow="(SRC#192.168.10.58|36918)(DST#104.16.233.151|443)"} 51.017
tcp_rtt{flow="(SRC#192.168.10.58|36534)(DST#172.217.22.6|443)"} 44.335
tcp_rtt{flow="(SRC#127.0.0.1|58640)(DST#127.0.0.1|5037)"} 0.011
tcp_rtt{flow="(SRC#192.168.10.58|47192)(DST#194.25.134.114|993)"} 36.689
tcp_rtt{flow="(SRC#192.168.10.58|42410)(DST#216.58.205.226|443)"} 40.242
tcp_rtt{flow="(SRC#192.168.10.58|54412)(DST#104.16.22.133|443)"} 56.069
tcp_rtt{flow="(SRC#192.168.10.58|50626)(DST#172.217.18.2|443)"} 46.931
tcp_rtt{flow="(SRC#127.0.0.1|5037)(DST#127.0.0.1|58640)"} 0.01
tcp_rtt{flow="(SRC#192.168.10.58|45014)(DST#192.30.253.124|443)"} 120.324
tcp_rtt{flow="(SRC#127.0.0.1|8020)(DST#127.0.0.1|43908)"} 0.015
# HELP tcp_cwnd tcp socket perflow congestionwindow stats
# TYPE tcp_cwnd gauge
tcp_cwnd{flow="(SRC#192.168.10.58|39366)(DST#104.19.199.151|443)"} 10.0
tcp_cwnd{flow="(SRC#192.168.10.58|50484)(DST#212.227.17.170|993)"} 10.0
tcp_cwnd{flow="(SRC#127.0.0.1|58334)(DST#127.0.0.1|22)"} 10.0
tcp_cwnd{flow="(SRC#127.0.0.1|43908)(DST#127.0.0.1|8020)"} 10.0
tcp_cwnd{flow="(SRC#192.168.10.58|36918)(DST#104.16.233.151|443)"} 10.0
tcp_cwnd{flow="(SRC#192.168.10.58|36534)(DST#172.217.22.6|443)"} 10.0
tcp_cwnd{flow="(SRC#127.0.0.1|58640)(DST#127.0.0.1|5037)"} 10.0
tcp_cwnd{flow="(SRC#192.168.10.58|47192)(DST#194.25.134.114|993)"} 10.0
tcp_cwnd{flow="(SRC#192.168.10.58|42410)(DST#216.58.205.226|443)"} 10.0
tcp_cwnd{flow="(SRC#192.168.10.58|54412)(DST#104.16.22.133|443)"} 10.0
tcp_cwnd{flow="(SRC#192.168.10.58|50626)(DST#172.217.18.2|443)"} 10.0
tcp_cwnd{flow="(SRC#127.0.0.1|5037)(DST#127.0.0.1|58640)"} 10.0
tcp_cwnd{flow="(SRC#192.168.10.58|45014)(DST#192.30.253.124|443)"} 10.0
tcp_cwnd{flow="(SRC#127.0.0.1|8020)(DST#127.0.0.1|43908)"} 10.0
# HELP tcp_rtt_hist_ms tcp flowslatency outline
# TYPE tcp_rtt_hist_ms histogram
tcp_rtt_hist_ms_bucket{le="0.1"} 4.0
tcp_rtt_hist_ms_bucket{le="0.5"} 0.0
tcp_rtt_hist_ms_bucket{le="1.0"} 0.0
tcp_rtt_hist_ms_bucket{le="5.0"} 1.0
tcp_rtt_hist_ms_bucket{le="10.0"} 0.0
tcp_rtt_hist_ms_bucket{le="50.0"} 5.0
tcp_rtt_hist_ms_bucket{le="100.0"} 3.0
tcp_rtt_hist_ms_bucket{le="200.0"} 1.0
tcp_rtt_hist_ms_bucket{le="500.0"} 0.0
tcp_rtt_hist_ms_bucket{le="+Inf"} 10.0
tcp_rtt_hist_ms_count 10.0
tcp_rtt_hist_ms_sum 24.0
```
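When post-processing scraped data it can help to split the flow label back into its endpoints. The sketch below is a minimal parser written only from the raw_endpoint format documented above; it is not code from the exporter, and pid_condensed labels such as "(20005)(DST#...)" are deliberately rejected with ValueError because they no longer carry a source address.

```python
import re
from typing import NamedTuple

# Pattern derived from the documented raw_endpoint format:
#   (SRC#<source_ip>|<source_port>)(DST#<dest_ip>|<dest_port>)
FLOW_RE = re.compile(
    r"^\(SRC#(?P<src_ip>[^|()]+)\|(?P<src_port>\d+)\)"
    r"\(DST#(?P<dst_ip>[^|()]+)\|(?P<dst_port>\d+)\)$"
)


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def parse_flow(label: str) -> Flow:
    """Split a raw_endpoint flow label value into its four components."""
    m = FLOW_RE.match(label)
    if m is None:
        raise ValueError(f"not a raw_endpoint flow label: {label!r}")
    return Flow(m["src_ip"], int(m["src_port"]), m["dst_ip"], int(m["dst_port"]))


if __name__ == "__main__":
    print(parse_flow("(SRC#192.168.10.58|39366)(DST#104.19.199.151|443)"))
    # Flow(src_ip='192.168.10.58', src_port=39366, dst_ip='104.19.199.151', dst_port=443)
```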
The exporter is configured via a YAML file that controls metric collection, flow filtering, and label compression. Use the --config command line argument to specify the configuration file path.
Example configuration files:

- example/config.yml - Full-featured configuration with all metrics
- example/config_minimal_disabled.yml - All metrics disabled (minimal output)
- example/config_partial.yml - Selective metrics enabled
- example/config_minimal.yml - Basic configuration with essential metrics
```yaml
---
# Core metrics collection configuration
logic:
  metrics:
    histograms:
      active: true
      rtt:
        active: true
        bucketBounds: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0]
    gauges:
      active: true
      rtt: { active: true }
      cwnd: { active: true }
      deliveryRate: { active: true }
    counters:
      active: true
      dataSegsIn: { active: true }
      dataSegsOut: { active: true }
  compression:
    labelFolding: "raw_endpoint"  # or "pid_condensed"
  selection:
    # [Optional] filtering rules
    process:
      pids: [1000, 2000]  # Specific process IDs
      cmds: ["nginx", "apache2"]  # Process names
    peering:
      addresses: ["8.8.8.8", "1.1.1.1"]  # Specific IPs
      networks: ["10.0.0.0/8", "192.168.0.0/16"]  # CIDR networks
      hosts: ["example.com"]  # Hostnames
    portRanges:
      - lower: 80
        upper: 443
      - lower: 8000
        upper: 9000
```

The metrics section controls which metrics are collected and exposed.
Histograms
- active: Enable/disable histogram collection globally
- rtt.active: Enable the TCP round-trip time histogram
- rtt.bucketBounds: Array of latency bucket boundaries in milliseconds
Gauges
- active: Enable/disable gauge collection globally
- rtt.active: Enable the TCP round-trip time gauge
- cwnd.active: Enable the TCP congestion window gauge
- deliveryRate.active: Enable the TCP delivery rate gauge
Counters
- active: Enable/disable counter collection globally
- dataSegsIn.active: Enable the incoming data segments counter
- dataSegsOut.active: Enable the outgoing data segments counter
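To illustrate what rtt.bucketBounds means, the sketch below sorts RTT samples into buckets using the standard Prometheus cumulative semantics: each le bucket counts every sample less than or equal to its bound, and +Inf counts all samples. This is a generic illustration of Prometheus histograms, not the exporter's own bucketing code; the sample values are taken from the example output above.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def cumulative_buckets(samples_ms: Sequence[float],
                       bounds_ms: Sequence[float]) -> List[Tuple[float, int]]:
    """Return (le, count) pairs with Prometheus cumulative-bucket semantics."""
    bounds = list(bounds_ms)
    if bounds != sorted(bounds):
        raise ValueError("bucket bounds must be sorted ascending")
    per_bucket = [0] * (len(bounds) + 1)  # last slot is the +Inf bucket
    for v in samples_ms:
        # Index of the first bound >= v, i.e. the smallest bucket with v <= le.
        per_bucket[bisect_left(bounds, v)] += 1
    result, running = [], 0
    for le, n in zip(bounds + [float("inf")], per_bucket):
        running += n  # each bucket also includes everything below it
        result.append((le, running))
    return result


if __name__ == "__main__":
    rtts = [53.376, 39.129, 2.178, 0.038, 120.324]
    bounds = [0.1, 0.5, 1, 5, 10, 50, 100, 200]
    print([count for _, count in cumulative_buckets(rtts, bounds)])
    # [1, 1, 1, 2, 2, 3, 4, 5, 5]
```

Bounds should span the latencies you expect: with the default 0.001 ms to 1000 ms bounds, loopback flows land in the lowest buckets while internet flows spread across the 10 to 250 ms range.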
The compression section reduces metric cardinality by compressing flow labels.
Label Folding
- labelFolding: Folding strategy
  - "raw_endpoint": Full IP and port information
  - "pid_condensed": Replace the source IP/port with the process ID
Effect on Labels:
```
# raw_endpoint:  flow="(SRC#192.168.10.58|39366)(DST#104.19.199.151|443)"
# pid_condensed: flow="(20005)(DST#104.19.199.151|443)"
```

Note: To enable pid_condensed folding, run the container with root privileges (--user 0:0) so the exporter can read process information from the host /proc filesystem.
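Why folding helps: a single process typically opens many connections to the same destination from different ephemeral source ports. The sketch below renders labels in the two formats shown above and counts the distinct values. The pid_of dictionary is a hypothetical stand-in for however the exporter resolves the owning process via /proc; the label strings are copied from the examples, not from the exporter's code.

```python
from typing import Dict, Iterable, Set, Tuple

# A flow as (src_ip, src_port, dst_ip, dst_port).
FlowTuple = Tuple[str, int, str, int]


def fold(flow: FlowTuple, strategy: str, pid_of: Dict[Tuple[str, int], int]) -> str:
    """Render a flow label in the style of the 'Effect on labels' example."""
    src_ip, src_port, dst_ip, dst_port = flow
    dst = f"(DST#{dst_ip}|{dst_port})"
    if strategy == "raw_endpoint":
        return f"(SRC#{src_ip}|{src_port}){dst}"
    if strategy == "pid_condensed":
        # Hypothetical lookup standing in for the exporter's /proc access.
        return f"({pid_of[(src_ip, src_port)]}){dst}"
    raise ValueError(f"unknown labelFolding strategy: {strategy!r}")


def distinct_labels(flows: Iterable[FlowTuple], strategy: str,
                    pid_of: Dict[Tuple[str, int], int]) -> Set[str]:
    """Distinct flow label values; each becomes its own series per flow metric."""
    return {fold(f, strategy, pid_of) for f in flows}


if __name__ == "__main__":
    ports = (39366, 39368, 39370)  # three connections from one hypothetical process
    flows = [("192.168.10.58", p, "104.19.199.151", 443) for p in ports]
    pid_of = {("192.168.10.58", p): 20005 for p in ports}
    print(len(distinct_labels(flows, "raw_endpoint", pid_of)))   # 3
    print(len(distinct_labels(flows, "pid_condensed", pid_of)))  # 1
```

The trade-off is that pid_condensed loses the source address and port, so individual connections from the same process to the same destination can no longer be told apart.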
The selection section filters which flows are monitored. All of its subsections are optional.
Process Filtering
```yaml
process:
  pids: [1000, 2000, 3000]  # Monitor flows from specific process IDs
  cmds: ["nginx", "apache2"]  # Monitor flows from specific command names
```

Network/Address Filtering
```yaml
peering:
  addresses: ["8.8.8.8", "1.1.1.1"]  # Specific IP addresses
  networks: ["10.0.0.0/8", "192.168.0.0/16"]  # CIDR networks (IPv4 only)
  hosts: ["api.example.com", "db.internal"]  # Hostnames
```

Port Range Filtering
```yaml
portRanges:
  - lower: 80  # Port 80
    upper: 443  # Up to port 443
  - lower: 8000  # Port 8000
    upper: 9000  # Up to port 9000
```

Minimal Configuration
```yaml
---
logic:
  metrics:
    gauges:
      active: true
      rtt: { active: true }
    counters:
      active: true
```

All Metrics Disabled (Minimal Output)
```yaml
---
logic:
  metrics:
    gauges:
      active: false
      rtt: { active: false }
      cwnd: { active: false }
      deliveryRate: { active: false }
    histograms:
      active: false
      rtt: { active: false, bucketBounds: [] }
    counters:
      active: false
      dataSegsIn: { active: false }
      dataSegsOut: { active: false }
  compression:
    labelFolding: "raw_endpoint"
  selection:
    process:
      pids: []
      cmds: []
    peering:
      addresses: []
      networks: []
      hosts: []
    portRanges: []
```

Note: Even when a metric group is disabled (active: false), its individual metric fields (such as rtt and cwnd) must still be present in the configuration file with their own active: false setting. This is required for the configuration to be parsed correctly.
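The rule in the note above can be checked before deploying. The sketch below is a hypothetical pre-flight check, not part of the exporter: it takes the configuration already loaded into a dict (for example with PyYAML's yaml.safe_load) and lists per-metric entries missing from disabled groups. The expected field names are copied from the full reference configuration in this README.

```python
from typing import Any, Dict, List

# Field names taken from the full reference configuration in this README.
EXPECTED_FIELDS = {
    "histograms": ["rtt"],
    "gauges": ["rtt", "cwnd", "deliveryRate"],
    "counters": ["dataSegsIn", "dataSegsOut"],
}


def missing_metric_fields(config: Dict[str, Any]) -> List[str]:
    """Dotted paths of metric entries absent under a disabled metric group."""
    metrics = (config.get("logic") or {}).get("metrics") or {}
    missing = []
    for group, fields in EXPECTED_FIELDS.items():
        section = metrics.get(group)
        if not section or section.get("active", True):
            continue  # only groups explicitly set to active: false are checked
        for name in fields:
            if name not in section:
                missing.append(f"logic.metrics.{group}.{name}")
    return missing


if __name__ == "__main__":
    cfg = {"logic": {"metrics": {"gauges": {"active": False, "rtt": {"active": False}}}}}
    print(missing_metric_fields(cfg))
    # ['logic.metrics.gauges.cwnd', 'logic.metrics.gauges.deliveryRate']
```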
Web Server Monitoring
```yaml
---
logic:
  metrics:
    gauges:
      active: true
      rtt: { active: true }
      deliveryRate: { active: true }
    histograms:
      active: true
      rtt:
        active: true
        bucketBounds: [0.1, 0.5, 1, 5, 10, 50, 100, 200]
  selection:
    process:
      cmds: ["nginx", "apache2", "httpd"]
    portRanges:
      - lower: 80
        upper: 443
```

Selective Metrics (Partial Configuration)
```yaml
---
logic:
  metrics:
    gauges:
      active: true
      rtt: { active: true }  # Enable only RTT gauge
      cwnd: { active: false }  # Disable other gauges
      deliveryRate: { active: false }
    histograms:
      active: true
      rtt:
        active: true
        bucketBounds: [0.1, 0.5, 1, 5, 10, 50, 100, 200]
    counters:
      active: false  # Disable all counters
      dataSegsIn: { active: false }
      dataSegsOut: { active: false }
```

High-Cardinality Environment
```yaml
---
logic:
  metrics:
    gauges:
      active: true
      rtt: { active: true }
    histograms:
      active: true
      rtt:
        active: true
        bucketBounds: [1, 5, 10, 50, 100, 500]
  compression:
    labelFolding: "raw_endpoint"  # or "pid_condensed"
  selection:
    peering:
      networks: ["10.0.0.0/8", "192.168.0.0/16"]
```

The exporter is available as a Docker container from the GitHub Container Registry. Because it needs access to kernel socket statistics, the container requires elevated privileges.
```bash
# Set configuration
YOUR_CONFIG_FILE=/path/to/your/config.yml
RELEASE_TAG=2.1.1
IMAGE="ghcr.io/cherusk/prometheus_ss_exporter:${RELEASE_TAG}"

# Run the container
# With --network host the exporter listens on host port 8020 directly,
# so no -p port mapping is needed (Docker discards it in host network mode).
docker run --privileged --network host --pid host --rm \
  -v "${YOUR_CONFIG_FILE}:/config.yml:ro" \
  --name=prometheus_ss_exporter \
  "${IMAGE}" --port=8020 --config=/config.yml
```

For pid_condensed label folding, run with root privileges to access process information:
```bash
docker run --privileged --network host --pid host --rm \
  -v "${YOUR_CONFIG_FILE}:/config.yml:ro" \
  --user 0:0 \
  --name=prometheus_ss_exporter \
  "${IMAGE}" --port=8020 --config=/config.yml
```

Docker Compose:

```yaml
version: '3.8'
services:
  prometheus_ss_exporter:
    image: ghcr.io/cherusk/prometheus_ss_exporter:2.1.1
    container_name: prometheus_ss_exporter
    privileged: true
    # Host networking: the exporter listens on host port 8020 directly,
    # and a "ports:" mapping must not be combined with network_mode: host.
    network_mode: host
    pid: host
    restart: unless-stopped
    command: ["./prometheus_ss_exporter", "--port=8020", "--config=/config.yml"]
    volumes:
      - ./config.yml:/config.yml:ro
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8020/health"]
      interval: 30s
      timeout: 10s
      retries: 3
```

Why privileged mode is required:
- The exporter needs access to /proc/net/tcp and other kernel network statistics
- Socket information is only accessible with elevated privileges
- The container needs to attach to the host network namespace to see all connections
Alternative security approaches:
```bash
# More restricted capabilities (if your kernel supports it)
# -v needs an absolute host path on older Docker versions, hence $(pwd).
docker run --cap-add=NET_RAW --cap-add=NET_ADMIN \
  --network host --pid host \
  -v "$(pwd)/config.yml:/config.yml:ro" \
  ghcr.io/cherusk/prometheus_ss_exporter:2.1.1 \
  --port=8020 --config=/config.yml
```

Kubernetes DaemonSet:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prometheus-ss-exporter
  labels:
    app: prometheus-ss-exporter
spec:
  selector:
    matchLabels:
      app: prometheus-ss-exporter
  template:
    metadata:
      labels:
        app: prometheus-ss-exporter
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: exporter
          image: ghcr.io/cherusk/prometheus_ss_exporter:2.1.1
          securityContext:
            privileged: true
          ports:
            - containerPort: 8020
              hostPort: 8020
              protocol: TCP
          command: ["./prometheus_ss_exporter", "--port=8020", "--config=/config.yml"]
          volumeMounts:
            - name: config
              mountPath: /config.yml
              # Mount the single ConfigMap key as a file, not as a directory
              subPath: config.yml
              readOnly: true
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
      volumes:
        - name: config
          configMap:
            # e.g. kubectl create configmap ss-exporter-config --from-file=config.yml
            name: ss-exporter-config
```

Add the following to your prometheus.yml:
```yaml
scrape_configs:
  - job_name: 'socket-stats'
    static_configs:
      - targets: ['localhost:8020']
    scrape_interval: 15s
    metrics_path: /metrics
    # Optional: Relabel to add instance information
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'socket-exporter'
```

Monitor HTTP/HTTPS traffic to your web servers:
```yaml
---
logic:
  metrics:
    gauges:
      active: true
      rtt: { active: true }
      cwnd: { active: true }
      deliveryRate: { active: true }
    histograms:
      active: true
      rtt:
        active: true
        bucketBounds: [0.1, 0.5, 1, 5, 10, 50, 100, 200]
    counters:
      active: true
      dataSegsIn: { active: true }
      dataSegsOut: { active: true }
  selection:
    process:
      cmds: ["nginx", "apache2", "httpd"]
    portRanges:
      - lower: 80
        upper: 443
```

Use case: Track web server response times and connection health for SLA monitoring.
Monitor database connection patterns:
```yaml
---
logic:
  metrics:
    gauges:
      active: true
      rtt: { active: true }
      deliveryRate: { active: true }
    histograms:
      active: true
      rtt:
        active: true
        bucketBounds: [0.5, 1, 2, 5, 10, 25, 50, 100]
  compression:
    labelFolding: "raw_endpoint"  # or "pid_condensed"
  selection:
    process:
      cmds: ["postgres", "mysqld", "mongod", "oracle"]
    portRanges:
      - lower: 3306  # MySQL
        upper: 3306
      - lower: 5432  # PostgreSQL
        upper: 5432
      - lower: 27017  # MongoDB
        upper: 27017
```

Use case: Monitor database connection latency and throughput for performance tuning.
Monitor internal service communication:
```yaml
---
logic:
  metrics:
    gauges:
      active: true
      rtt: { active: true }
    histograms:
      active: true
      rtt:
        active: true
        bucketBounds: [0.01, 0.05, 0.1, 0.5, 1, 5, 10]
  compression:
    labelFolding: "raw_endpoint"  # or "pid_condensed"
  selection:
    peering:
      networks: ["10.0.0.0/8", "192.168.0.0/16", "172.16.0.0/12"]
    portRanges:
      - lower: 8000
        upper: 9999
      - lower: 3000
        upper: 3999
```

Use case: Track microservice-to-service communication patterns in a Kubernetes cluster.
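The sketch below shows one plausible way a flow could be matched against the peering.networks and portRanges rules of the example above, using Python's ipaddress module. The semantics are assumptions, not documented exporter behavior: both rules are applied to the remote (peer) side, range bounds are inclusive, an empty list imposes no restriction, and all non-empty rules must match.

```python
import ipaddress
from typing import Iterable, Sequence, Tuple


def peer_in_networks(peer_ip: str, networks: Iterable[str]) -> bool:
    """True if peer_ip is inside any CIDR; IPv6 peers never match IPv4 networks."""
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in ipaddress.ip_network(net) for net in networks)


def port_in_ranges(port: int, ranges: Sequence[Tuple[int, int]]) -> bool:
    """True if port lies in any (lower, upper) range; bounds assumed inclusive."""
    return any(lower <= port <= upper for lower, upper in ranges)


def is_selected(peer_ip: str, peer_port: int,
                networks: Sequence[str],
                port_ranges: Sequence[Tuple[int, int]]) -> bool:
    """Hypothetical rule: empty lists do not filter; non-empty rules are ANDed."""
    if networks and not peer_in_networks(peer_ip, networks):
        return False
    if port_ranges and not port_in_ranges(peer_port, port_ranges):
        return False
    return True


if __name__ == "__main__":
    nets = ["10.0.0.0/8", "192.168.0.0/16", "172.16.0.0/12"]
    ranges = [(8000, 9999), (3000, 3999)]
    print(is_selected("10.1.2.3", 8080, nets, ranges))  # True
    print(is_selected("8.8.8.8", 8080, nets, ranges))   # False
```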
Monitor edge servers with high connection volumes:
```yaml
---
logic:
  metrics:
    gauges:
      active: true
      rtt: { active: true }
    histograms:
      active: true
      rtt:
        active: true
        bucketBounds: [1, 5, 10, 25, 50, 100, 250]
  compression:
    labelFolding: "raw_endpoint"  # or "pid_condensed"
  selection:
    # Monitor only external traffic (skip internal networks)
    peering:
      networks:
        exclude: ["10.0.0.0/8", "192.168.0.0/16", "172.16.0.0/12"]
    portRanges:
      - lower: 80
        upper: 443
      - lower: 8080
        upper: 8080
```

Use case: Monitor CDN or edge server performance while managing metric cardinality.
Lightweight monitoring for development:
```yaml
---
logic:
  metrics:
    gauges:
      active: true
      rtt: { active: true }
    counters:
      active: false
      dataSegsIn: { active: false }
      dataSegsOut: { active: false }
    histograms:
      active: false
      rtt: { active: false, bucketBounds: [] }
  selection:
    process:
      cmds: ["node", "python", "java", "go"]
    portRanges:
      - lower: 3000
        upper: 9000
```

Use case: Development debugging with minimal overhead and focused metrics.
Example Prometheus alerting rules:

```yaml
groups:
  - name: socket-stats  # example group name
    rules:
      # Alert on high latency
      - alert: HighSocketLatency
        expr: histogram_quantile(0.95, tcp_rtt_hist_ms_bucket) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High socket latency detected"
          description: "95th percentile latency is {{ $value }}ms"

      # Alert on connection count
      - alert: TooManyConnections
        expr: count(tcp_rtt) > 10000
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Excessive number of TCP connections"
          description: "{{ $value }} TCP connections detected"

      # Alert on delivery rate issues
      - alert: LowDeliveryRate
        expr: avg(tcp_delivery_rate) < 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low TCP delivery rate"
          description: "Average delivery rate is {{ $value }} bytes/sec"
```
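Before loading the rules, it can be useful to evaluate their expressions ad hoc. The sketch below calls the Prometheus HTTP API instant-query endpoint (GET /api/v1/query) using only the standard library; the server address is an assumption, and the helpers are illustrative rather than part of the exporter.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> list:
    """Evaluate promql once and return the 'result' list from the JSON reply.

    urlopen raises urllib.error.HTTPError for non-2xx replies, which
    Prometheus returns for malformed expressions.
    """
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(body.get("error", "query failed"))
    return body["data"]["result"]
```

For example, assuming a Prometheus server at http://localhost:9090 that scrapes the exporter, instant_query("http://localhost:9090", "count(tcp_rtt)") returns the current flow count that the TooManyConnections rule compares against its threshold.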