If you run a homelab with multiple Docker services, VMs, and physical hosts but no monitoring, you have no real visibility into what is happening. A container gets OOM-killed, a disk fills up, or a service stops responding, and you only find out when something visibly breaks.

A Prometheus Grafana monitoring stack built on Docker Compose gives you metric collection, alerting, and visualization for your entire infrastructure. This guide walks through deploying Prometheus for time-series storage, Grafana for dashboards, Node Exporter for host metrics, cAdvisor for container metrics, and Alertmanager for push notifications — all behind a single Traefik reverse proxy.


Architecture Overview

The stack consists of five components:

| Component     | Role                                                   | Exposes   |
|---------------|--------------------------------------------------------|-----------|
| Prometheus    | Time-series database and metric scraper                | Port 9090 |
| Grafana       | Dashboard and visualization layer                      | Port 3000 |
| Node Exporter | Host-level metrics (CPU, RAM, disk, network)           | Port 9100 |
| cAdvisor      | Container-level metrics (per-container resource usage) | Port 8080 |
| Alertmanager  | Handles and routes alerts from Prometheus              | Port 9093 |

Prometheus scrapes Node Exporter and cAdvisor at configurable intervals. Grafana queries Prometheus as a data source. Alertmanager receives firing alerts and sends notifications via email, Telegram, or webhook.


Docker Compose Configuration

Create a project directory and compose.yaml:

# /opt/monitoring/compose.yaml
services:
  prometheus:
    image: prom/prometheus:v2.55.1
    container_name: prometheus
    hostname: prometheus
    restart: unless-stopped
    command:
      - '--config.file=/etc/prometheus/prometheus.yaml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yaml:ro
      - ./alert-rules.yaml:/etc/prometheus/alert-rules.yaml:ro
      - prometheus_data:/prometheus
    ports:
      - "127.0.0.1:9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.4.0
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=admin
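      # Fail fast if the password is missing instead of silently falling back to admin/admin
      - "GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:?set GRAFANA_PASSWORD in .env}"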
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
      - GF_SERVER_ROOT_URL=https://grafana.gntech.home
      - GF_AUTH_ANONYMOUS_ENABLED=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/datasources:/etc/grafana/provisioning/datasources:ro
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./grafana/dashboards-json:/var/lib/grafana/dashboards:ro
    ports:
      - "127.0.0.1:3000:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus

  node_exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node_exporter
    restart: unless-stopped
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    pid: host
    volumes:
      - /:/host:ro,rslave
    ports:
      - "127.0.0.1:9100:9100"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "127.0.0.1:8080:8080"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yaml'
      - '--storage.path=/alertmanager'
    volumes:
      - ./alertmanager.yaml:/etc/alertmanager/alertmanager.yaml:ro
      - alertmanager_data:/alertmanager
    ports:
      - "127.0.0.1:9093:9093"
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    name: monitoring
    external: false

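Note: All ports are bound to 127.0.0.1, so the services are reachable only from the Docker host itself and not from the rest of the network. External access goes through a reverse proxy (Traefik or Nginx) with authentication. For a single-host setup where the proxy is attached to the monitoring network, you can omit the ports sections and use the internal Docker network exclusively.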


Prometheus Configuration

Create prometheus.yaml with the scrape targets, the Alertmanager endpoint, and a reference to the alert rules file:

# /opt/monitoring/prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'homelab'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - 'alert-rules.yaml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter on this host
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']

  # cAdvisor for container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Additional Node Exporters on other hosts
  # - job_name: 'node-proxmox'
  #   static_configs:
  #     - targets:
  #       - '10.0.20.30:9100'   # SRV1
  #       - '10.0.20.31:9100'   # SRV2

For Node Exporter on remote hosts (Proxmox nodes, other VMs), install the agent:

# On each remote host
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
tar xzf node_exporter-1.8.2.linux-amd64.tar.gz
sudo install node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service << 'SVC'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Type=simple
User=nobody
Group=nogroup
# No --path.rootfs here: that flag is only needed when the exporter runs in a container
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|etc)($$|/)
Restart=always

[Install]
WantedBy=multi-user.target
SVC

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Uncomment the remote targets in prometheus.yaml after installing Node Exporter on each remote host.


Alerting Rules

Create alert-rules.yaml with practical homelab alerts:

# /opt/monitoring/alert-rules.yaml
groups:
  - name: homelab_node
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% for 5 minutes."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%."

      - alert: DiskSpaceRunningOut
        expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Only {{ $value }}% available on {{ $labels.mountpoint }}."

      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space warning on {{ $labels.instance }}"
          description: "Only {{ $value }}% available on {{ $labels.mountpoint }}."

      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node {{ $labels.instance }} has been unreachable for 1 minute."

  - name: homelab_container
    interval: 30s
    rules:
      - alert: ContainerHighCPU
        expr: rate(container_cpu_usage_seconds_total{name!=""}[2m]) * 100 > 200
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU in container {{ $labels.name }}"
          description: "Container {{ $labels.name }} is using {{ $value }}% CPU."

      - alert: ContainerHighMemory
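        # Containers without a memory limit report a limit of 0; filter them out to avoid dividing by zero
        expr: container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0) * 100 > 90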
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory in container {{ $labels.name }}"
          description: "Container {{ $labels.name }} at {{ $value }}% of memory limit."

      - alert: ContainerRestarting
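        # container_last_seen changes on every scrape; the start time only changes when the container restarts
        expr: changes(container_start_time_seconds{name!=""}[15m]) > 3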
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} restarting"
          description: "Container {{ $labels.name }} restarted {{ $value }} times in 15 minutes."

      - alert: ContainerOOMKilled
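        # The raw counter stays above 0 forever after the first OOM; alert only on recent increases
        expr: increase(container_oom_events_total{name!=""}[10m]) > 0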
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} OOM killed"
          description: "Container {{ $labels.name }} was killed by OOM."

These rules cover the most common homelab failure scenarios: a node going offline, disk filling up, a container in a restart loop, or a process hitting its memory limit.
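
To make the HighCPUUsage expression concrete, here is a small worked calculation in shell. The helper name cpu_usage_pct and the sample numbers are made up for illustration; it mirrors the arithmetic of 100 - rate(idle) * 100 for a single averaged value and is not how Prometheus itself evaluates the rule.

```shell
#!/bin/sh
# cpu_usage_pct IDLE_SECONDS_DELTA WINDOW_SECONDS
# Mirrors: 100 - (avg rate of node_cpu_seconds_total{mode="idle"}) * 100
# IDLE_SECONDS_DELTA is how much the idle counter grew during the window,
# averaged across cores (hypothetical input for this sketch).
cpu_usage_pct() {
  awk -v idle="$1" -v window="$2" \
    'BEGIN { printf "%.1f\n", 100 - (idle / window) * 100 }'
}

# A host that was idle for 45 of the last 300 seconds was busy 85% of the
# time, which is above the 80% threshold used by HighCPUUsage.
cpu_usage_pct 45 300   # prints 85.0
```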


Alertmanager Notification Configuration

Create alertmanager.yaml to route alerts to Telegram:

# /opt/monitoring/alertmanager.yaml
route:
  receiver: 'telegram'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'telegram'
      repeat_interval: 1h

receivers:
  - name: 'telegram'
    telegram_configs:
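      # Alertmanager does not expand ${VAR} placeholders in this file,
      # so the token and chat ID must be literal values (chat_id is an integer).
      - bot_token: '123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11'
        chat_id: -1001234567890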
        message: |-
          {{ range .Alerts }}
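          🔴 <b>{{ .Labels.alertname }}</b>
          <b>Instance:</b> {{ .Labels.instance }}
          <b>Severity:</b> {{ .Labels.severity }}
          <b>Summary:</b> {{ .Annotations.summary }}
          <b>Description:</b> {{ .Annotations.description }}
          {{ end }}
        # HTML mode: MarkdownV2 would require escaping characters such as "." and "-" in alert text
        parse_mode: 'HTML'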
        send_resolved: true

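Set the variables in a .env file next to compose.yaml. Compose uses it for GRAFANA_PASSWORD; Alertmanager does not expand ${VAR} references in its own configuration, so the Telegram entries only help if you render alertmanager.yaml from a template before deploying: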

# /opt/monitoring/.env
GRAFANA_PASSWORD=your-strong-password
TELEGRAM_BOT_TOKEN=123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11
TELEGRAM_CHAT_ID=-1001234567890
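
Because Alertmanager reads its configuration literally, one way to keep the Telegram values in .env is to treat the file with ${...} placeholders as a template and render it before starting the stack. The following is a minimal sketch under that assumption; the function name render_alertmanager_config and the template file name alertmanager.yaml.tmpl are hypothetical and not part of Alertmanager.

```shell
#!/bin/sh
# render_alertmanager_config TEMPLATE OUTPUT
# Replaces the literal placeholders ${TELEGRAM_BOT_TOKEN} and
# ${TELEGRAM_CHAT_ID} in TEMPLATE with the values of the same-named
# environment variables and writes the result to OUTPUT.
render_alertmanager_config() {
  template=$1
  output=$2
  : "${TELEGRAM_BOT_TOKEN:?TELEGRAM_BOT_TOKEN is not set}"
  : "${TELEGRAM_CHAT_ID:?TELEGRAM_CHAT_ID is not set}"
  # Escape characters that are special on the replacement side of sed's s command.
  token=$(printf '%s' "$TELEGRAM_BOT_TOKEN" | sed 's/[&|\\]/\\&/g')
  chat=$(printf '%s' "$TELEGRAM_CHAT_ID" | sed 's/[&|\\]/\\&/g')
  sed -e 's|[$]{TELEGRAM_BOT_TOKEN}|'"$token"'|g' \
      -e 's|[$]{TELEGRAM_CHAT_ID}|'"$chat"'|g' \
      "$template" > "$output"
}

# Example, run from /opt/monitoring before "docker compose up -d":
#   set -a; . ./.env; set +a
#   render_alertmanager_config alertmanager.yaml.tmpl alertmanager.yaml
```

Keep the rendered alertmanager.yaml out of version control, since it contains the bot token in clear text.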

For email alerts, add a receiver like this and point the route at it:

  - name: 'email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'mail.gntech.dev:587'
        auth_username: '[email protected]'
        auth_password: 'your-smtp-password'  # literal value; Alertmanager does not expand ${VAR}

Grafana Provisioning — Automatic Dashboards

Avoid clicking around in the Grafana UI. Use provisioning files so the dashboards survive container recreations.

Datasource Provisioning

# /opt/monitoring/grafana/datasources/datasources.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Dashboard Provisioning

# /opt/monitoring/grafana/dashboards/dashboards.yaml
apiVersion: 1

providers:
  - name: 'Homelab'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: true
    editable: false
    options:
      path: /var/lib/grafana/dashboards

Import Community Dashboards

Download the most popular dashboards for Node Exporter and cAdvisor:

# Run from /opt/monitoring
mkdir -p grafana/dashboards-json
# Node Exporter Full (ID 1860)
curl -sL https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o grafana/dashboards-json/node-exporter-full.json

# Docker Monitoring (ID 17906)
curl -sL https://grafana.com/api/dashboards/17906/revisions/latest/download \
  -o grafana/dashboards-json/docker-monitoring.json

Grafana loads these on startup. No manual importing needed.
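
Dashboards downloaded from grafana.com are sometimes exported for manual import and can contain ${DS_...} placeholders that file provisioning does not resolve, which shows up as datasource errors in panels. Whether a given revision is affected is something to check rather than assume. The sketch below uses a hypothetical helper, find_unresolved_inputs, to list files that still contain such a placeholder.

```shell
#!/bin/sh
# find_unresolved_inputs DIR
# Prints the dashboard files under DIR that still contain import-time
# placeholders of the form ${DS_...}; returns 1 if any are found.
find_unresolved_inputs() {
  matches=$(grep -l '[$]{DS_[A-Za-z0-9_]*}' "$1"/*.json 2>/dev/null)
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
    return 1
  fi
  return 0
}

# Example:
#   find_unresolved_inputs grafana/dashboards-json
```

If a file is listed, replacing the placeholder with the provisioned datasource name (Prometheus in this guide) is usually enough.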


Reverse Proxy with Traefik

If you use Traefik (covered in earlier posts), add labels to the grafana and prometheus services to expose them securely. Traefik must be attached to the monitoring network to reach the containers:

  grafana:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.grafana.rule=Host(`grafana.gntech.home`)"
      - "traefik.http.routers.grafana.entrypoints=https"
      - "traefik.http.routers.grafana.tls=true"
      - "traefik.http.services.grafana.loadbalancer.server.port=3000"

  prometheus:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.prometheus.rule=Host(`prometheus.gntech.home`)"
      - "traefik.http.routers.prometheus.entrypoints=https"
      - "traefik.http.routers.prometheus.tls=true"
      - "traefik.http.services.prometheus.loadbalancer.server.port=9090"
      - "traefik.http.routers.prometheus.middlewares=auth@file"

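Security: Always put Prometheus behind authentication. The Prometheus UI has no auth by default, and with --web.enable-lifecycle set, anyone who can reach it can also reload or shut it down. Use Traefik forward auth (Authentik or a basic auth middleware), or enable Prometheus's built-in basic auth by passing --web.config.file with a web configuration file that lists bcrypt-hashed passwords under basic_auth_users.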
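
For the built-in option, Prometheus reads basic auth users from a web configuration file passed with --web.config.file. The following is a minimal sketch of generating such a file; the helper name write_web_config, the file name web.yaml, and the htpasswd invocation in the comments are assumptions for illustration, and the hash must already be bcrypt-encoded.

```shell
#!/bin/sh
# write_web_config USER BCRYPT_HASH OUTPUT
# Writes a Prometheus web configuration file that enables basic auth
# for a single user.
write_web_config() {
  cat > "$3" <<EOF
basic_auth_users:
  $1: '$2'
EOF
}

# Example (assumes htpasswd from apache2-utils is installed):
#   hash=$(htpasswd -nbBC 10 "" 'your-password' | tr -d ':\n')
#   write_web_config admin "$hash" web.yaml
# Then mount web.yaml into the container and add
#   --web.config.file=/etc/prometheus/web.yaml
# to the prometheus command list in compose.yaml.
```

Once basic auth is enabled, Grafana's datasource and any curl checks against port 9090 need the same credentials.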


Deployment

Start the stack:

cd /opt/monitoring
docker compose up -d

Verify all services are running:

docker compose ps

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels.job'

# Query a metric
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result[] | {job: .metric.job, instance: .metric.instance, up: .value[1]}'

# List the loaded alert rules and their current state
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'

Access Grafana at http://localhost:3000 (or your Traefik URL). Log in with admin and your configured GRAFANA_PASSWORD. The Prometheus datasource and dashboards are preconfigured.
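
Right after docker compose up -d the services may need a few seconds before they answer, so the checks above can fail on the first try. A small retry helper avoids guessing at sleep times. This is a sketch; wait_for is a hypothetical name, and the example relies on Prometheus's /-/ready endpoint.

```shell
#!/bin/sh
# wait_for RETRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or RETRIES attempts have failed.
# Returns 0 on the first success, 1 if every attempt failed.
wait_for() {
  retries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait up to about 60 seconds for Prometheus to report ready.
#   wait_for 30 2 curl -fsS http://localhost:9090/-/ready
```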


Adding Monitoring to a Multi-Node Homelab

For a Proxmox cluster with multiple hosts, install Node Exporter on every node and point Prometheus at each:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
        - 'node_exporter:9100'        # Docker host
        - '10.0.20.30:9100'           # SRV1
        - '10.0.20.31:9100'           # SRV2
        - '10.0.20.32:9100'           # SRV3 (backup node)

For Proxmox-specific metrics (which Node Exporter does not cover), install the Proxmox VE Exporter:

docker run -d \
  --name=pve-exporter \
  --network=monitoring \
  --restart=unless-stopped \
  -e PVE_USER=monitoring@pve \
  -e PVE_PASSWORD=secret \
  -e PVE_HOST=10.0.20.30 \
  -p 127.0.0.1:9221:9221 \
  prompve/prometheus-pve-exporter:latest

Add it as a scrape target in prometheus.yaml:

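  - job_name: 'proxmox'
    metrics_path: /pve              # the exporter serves Proxmox metrics here, not on /metrics
    params:
      target: ['10.0.20.30']        # Proxmox node to query
    static_configs:
      - targets: ['pve-exporter:9221']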

Disk Usage Considerations

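Prometheus can consume significant disk depending on retention and scrape volume. The Prometheus documentation gives a rule of thumb: disk ≈ retention in seconds × samples ingested per second × bytes per sample, at roughly 1 to 2 bytes per sample. At a 15s scrape interval and 30d retention:

  • ~500 time series = ~33 samples/s = roughly 0.1-0.2 GB
  • ~2000 time series (multi-node) = ~133 samples/s = roughly 0.35-0.7 GB
  • Real stacks usually carry far more series than this (Node Exporter and cAdvisor often expose a thousand or more per host), so budgeting ~8-10 GB for a multi-node setup leaves comfortable headroom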
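
To run the Prometheus sizing rule of thumb (retention seconds × samples per second × bytes per sample) against your own numbers, a small shell helper is enough. The name estimate_disk_gb is hypothetical, and the calculation ignores WAL and index overhead, so treat the output as a floor rather than a budget.

```shell
#!/bin/sh
# estimate_disk_gb SERIES SCRAPE_INTERVAL_S RETENTION_DAYS BYTES_PER_SAMPLE
# Applies: retention_seconds * samples_per_second * bytes_per_sample
# and prints the result in GB (10^9 bytes).
estimate_disk_gb() {
  awk -v series="$1" -v interval="$2" -v days="$3" -v bps="$4" 'BEGIN {
    samples_per_second = series / interval
    retention_seconds = days * 86400
    printf "%.2f\n", retention_seconds * samples_per_second * bps / 1000000000
  }'
}

# 2000 series scraped every 15s, kept for 30 days, at 2 bytes per sample
estimate_disk_gb 2000 15 30 2   # prints 0.69
```

The actual series count of a running instance can be read from the prometheus_tsdb_head_series metric.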

Allocate a dedicated Docker volume on fast storage. For ZFS users, put the data on its own dataset; a smaller recordsize such as 16K is sometimes suggested for Prometheus write patterns, but treat it as a starting point and measure on your own pool:

zfs create -o recordsize=16K -o compression=lz4 tank/monitoring

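Then bind-mount it in place of the prometheus_data named volume. Create the prometheus directory first and make it writable by the container user (the official image runs as nobody, UID 65534):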

volumes:
  - /tank/monitoring/prometheus:/prometheus

Maintenance and Updates

Backing Up Grafana

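Grafana stores dashboards, datasources, and users in a SQLite database by default. Back up the database and the provisioning files. The commands below assume that the sqlite3 CLI is available inside the container and that a ./backups directory already exists; if sqlite3 is missing, stop Grafana and copy /var/lib/grafana/grafana.db out with docker cp instead: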

docker exec grafana sqlite3 /var/lib/grafana/grafana.db ".backup /tmp/grafana-backup.db"
docker cp grafana:/tmp/grafana-backup.db ./backups/

Or use provisioning (recommended) — export dashboards as JSON and store them in your git repo.
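
If you do keep database backups, timestamped file names and a simple retention rule stop the backup directory from growing without bound. This is a sketch only; backup_name and prune_backups are hypothetical helpers, and the grafana-YYYYMMDD-HHMMSS.db naming scheme is an assumption of this example.

```shell
#!/bin/sh
# backup_name prints a timestamped file name such as grafana-20250131-235959.db
backup_name() {
  printf 'grafana-%s.db\n' "$(date +%Y%m%d-%H%M%S)"
}

# prune_backups DIR KEEP
# Deletes all but the KEEP newest grafana-*.db files in DIR.
prune_backups() {
  dir=$1
  keep=$2
  total=0
  for f in "$dir"/grafana-*.db; do
    if [ -e "$f" ]; then
      total=$((total + 1))
    fi
  done
  excess=$((total - keep))
  # Globs expand in sorted order, so the oldest timestamps come first.
  for f in "$dir"/grafana-*.db; do
    [ "$excess" -gt 0 ] || break
    rm -- "$f"
    excess=$((excess - 1))
  done
}

# Example:
#   docker cp grafana:/tmp/grafana-backup.db "./backups/$(backup_name)"
#   prune_backups ./backups 7
```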

Updating Images

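# Image tags are pinned in compose.yaml, so bump the versions there first;
# pull only fetches the tags that are listed.
docker compose pull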
docker compose up -d
docker image prune -f

Adding a New Host

  1. Install Node Exporter on the host (systemd service above).
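  2. Open port 9100 in the host firewall, ideally only for the Prometheus host.
  3. Add the host's IP and port to the node job under scrape_configs in prometheus.yaml.
  4. Reload Prometheus (possible because --web.enable-lifecycle is set): curl -X POST http://localhost:9090/-/reload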

Summary

This Prometheus Grafana monitoring stack gives you broad visibility into your homelab with minimal overhead. Five Docker containers provide:

  • Host metrics — CPU, RAM, disk, network from every node
  • Container metrics — per-container resource usage and health from cAdvisor
  • Alerting — Telegram notifications when things go wrong
  • Preconfigured dashboards — visual monitoring on day one
  • Multi-node support — scrape any host running Node Exporter

The stack runs on any Docker host, uses minimal resources (~500 MB RAM total for the full stack), and alerts you before problems become outages. Deploy it today and you will catch your next disk-full or container-OOM before it takes down your services.