Every homelab needs observability. Not because you’re running a production SLA — because you can’t fix what you can’t see. Running out of disk on the ZFS pool at 3 AM, a Docker container silently OOM-killed, or the Frigate NVR eating 100% CPU for hours — these are the things you catch with a monitoring stack, not by noticing the UI feels sluggish.

This post covers a full Prometheus + Grafana + Loki stack deployed on Docker in a Proxmox LXC, with metrics from the host, Docker containers, and system logs collected into one dashboard.

┌──────────────────────────────────────────────────────────┐
│                   Docker Host (LXC/VM)                    │
│                                                           │
│   ┌────────────┐    ┌────────────┐    ┌────────────┐     │
│   │ Prometheus │    │    Loki    │    │  Grafana   │     │
│   │   :9090    │    │   :3100    │    │   :3000    │     │
│   └─────┬──────┘    └─────▲──────┘    └────────────┘     │
│         │ scrapes         │ pushes    (queries both)     │
│   ┌─────▼──────────┐      │                              │
│   │ node_exporter  │      │                              │
│   │  (host) :9100  │      │                              │
│   ├────────────────┤      │                              │
│   │ docker_exporter│      │                              │
│   │ (docker) :9101 │      │                              │
│   └────────────────┘      │                              │
│                           │                              │
│   ┌───────────────────────┴─────────────────────┐        │
│   │          promtail (log collector)           │        │
│   │        /var/log + docker logs → Loki        │        │
│   └─────────────────────────────────────────────┘        │
└──────────────────────────────────────────────────────────┘

Stack Overview

Component            Role                                               Port
Prometheus           Time-series metrics database and alert evaluator  9090
Grafana              Visualization, dashboards, alerting UI             3000
Loki                 Log aggregation (Prometheus-like, but for logs)    3100
promtail             Ships container and system logs to Loki            9080
node_exporter        Host metrics (CPU, RAM, disk, network, ZFS)        9100
docker_exporter      Container-level resource metrics                   9101
cadvisor (optional)  Alternative source of container metrics            8080

All run as Docker containers via a single Compose file. The only exception is node_exporter, which runs directly on the Proxmox host (or in a privileged LXC — I’ll cover both).


Directory Layout

/opt/docker/monitoring/
├── compose.yml
├── .env
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       └── alerts.yml
├── grafana/
│   ├── grafana.ini
│   ├── dashboards/      (provisioned JSON)
│   └── datasources/     (provisioned YAML)
├── loki/
│   └── loki-config.yml
└── promtail/
    └── promtail-config.yml

1. Compose File

# /opt/docker/monitoring/compose.yml
services:
  prometheus:
    image: prom/prometheus:v2.54.1
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=${PROM_RETENTION:-30d}'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring

  loki:
    image: grafana/loki:3.0.0
    container_name: loki
    restart: unless-stopped
    volumes:
      - ./loki:/etc/loki
      - loki_data:/loki
    command:
      - '-config.file=/etc/loki/loki-config.yml'
    ports:
      - "3100:3100"
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:3.0.0
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail:/etc/promtail
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command:
      - '-config.file=/etc/promtail/promtail-config.yml'
    networks:
      - monitoring
    depends_on:
      - loki

  grafana:
    image: grafana/grafana:11.3.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=${GF_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GF_ADMIN_PASSWORD:-admin}
      - GF_INSTALL_PLUGINS=${GF_PLUGINS:-}
      - GF_SERVER_HTTP_PORT=3000
    ports:
      - "3000:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus

  docker_exporter:
    image: prometheuscommunity/docker-exporter:latest
    container_name: docker_exporter
    restart: unless-stopped
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "9101:9101"
    networks:
      - monitoring

volumes:
  prometheus_data:
  loki_data:
  grafana_data:

networks:
  monitoring:
    name: monitoring
    external: false

Notes:

  • Retention defaults to 30 days — tune PROM_RETENTION in .env.
  • Promtail needs access to /var/lib/docker/containers to scrape Docker container logs. On hosts with SELinux or AppArmor, you may need additional rules.
  • The docker_exporter exposes container CPU/mem/network stats at port 9101.

The matching .env:

# /opt/docker/monitoring/.env
PROM_RETENTION=30d
GF_ADMIN_USER=admin
GF_ADMIN_PASSWORD=changeme!
GF_PLUGINS=    # extra plugin IDs, comma-separated; pie charts are a built-in panel in current Grafana

2. Prometheus Configuration

# /opt/docker/monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "rules/alerts.yml"

scrape_configs:
  # Local node_exporter (on the Proxmox host)
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.20.30:9100']
        labels:
          host: srv1

  # Docker exporter (running in compose)
  - job_name: 'docker'
    static_configs:
      - targets: ['docker_exporter:9101']
        labels:
          host: srv1

  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
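
Because --web.enable-lifecycle is set in the compose file, config edits can be validated and applied without restarting the container. A quick check using promtool, which ships inside the prom/prometheus image:

# Validate the config (referenced rule files are checked too), then hot-reload
docker run --rm --entrypoint promtool \
  -v /opt/docker/monitoring/prometheus:/etc/prometheus \
  prom/prometheus:v2.54.1 check config /etc/prometheus/prometheus.yml

curl -X POST http://localhost:9090/-/reload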

Why node_exporter runs outside Docker: inside a container, node_exporter sees the container's namespaces (mounts, network interfaces, PIDs) rather than the host's, so host-level metrics come out wrong or incomplete. Running node_exporter directly on the Proxmox host (or inside a privileged LXC with host networking) gives you real CPU, disk, and network numbers.

Installing node_exporter on Proxmox (or LXC)

# Download and install (replace version as needed)
NODE_EXPORTER_VER=1.8.2
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VER}/node_exporter-${NODE_EXPORTER_VER}.linux-amd64.tar.gz
tar xzf node_exporter-${NODE_EXPORTER_VER}.linux-amd64.tar.gz
sudo cp node_exporter-${NODE_EXPORTER_VER}.linux-amd64/node_exporter /usr/local/bin/
rm -rf node_exporter-${NODE_EXPORTER_VER}.linux-amd64*

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Type=simple
User=nobody
Group=nogroup
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
  --collector.textfile.directory=/var/lib/node_exporter/textfile
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo mkdir -p /var/lib/node_exporter/textfile  # textfile collector dir must exist
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# Verify
curl http://localhost:9100/metrics | head -5

If you run this inside an unprivileged LXC, you’ll hit issues with ZFS and disk metrics. For full host metrics, deploy node_exporter on the Proxmox host itself — it’s lightweight (~30 MB RAM, negligible CPU).
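
A quick way to confirm the ZFS collector is actually reporting, assuming the pools are visible to the kernel where node_exporter runs:

# Non-zero count means node_zfs_* metrics are being exported
curl -s http://localhost:9100/metrics | grep -c '^node_zfs'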


3. Loki Configuration

# /opt/docker/monitoring/loki/loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 8
  retention_period: 720h  # 30 days

# With the tsdb store, retention is enforced by the compactor
# (the old table_manager does not apply here)
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem

This is a single-binary, single-instance Loki config — fine for a homelab. If you want to scale later, Loki supports S3/GCS backends and horizontal sharding.
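
For reference, a sketch of what the storage block could look like against S3-compatible object storage (the endpoint, bucket, and credentials below are placeholders):

# Hypothetical S3 variant of the common.storage block
common:
  storage:
    s3:
      endpoint: s3.example.com         # placeholder (e.g. a MinIO instance)
      bucketnames: loki-chunks         # placeholder bucket name
      access_key_id: YOUR_ACCESS_KEY
      secret_access_key: YOUR_SECRET_KEY
      s3forcepathstyle: true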


4. Promtail Configuration

Promtail runs as a Docker container but needs access to Docker’s log files to extract container names and labels.

# /opt/docker/monitoring/promtail/promtail-config.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log

  - job_name: docker
    pipeline_stages:
      - docker: {}
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log

What this does:

  • system: Scrapes all .log files in /var/log/ — syslog, auth, kern, daemon, etc.
  • docker: Tails the JSON log files Docker writes for each container; the docker pipeline stage parses out the log line, stream (stdout/stderr), and timestamp. Note that this path-glob approach does not attach container names as labels (see the sketch below).
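
If you want container names as first-class labels instead, promtail also supports Docker service discovery. A minimal sketch, assuming /var/run/docker.sock is additionally mounted (read-only) into the promtail container:

# Alternative docker job using service discovery
  - job_name: docker_sd
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 15s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container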

5. Grafana Provisioning

Provision datasources automatically so Grafana is ready on first boot.

# /opt/docker/monitoring/grafana/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false

For dashboards, drop JSON exports from Grafana.com into:

/opt/docker/monitoring/grafana/dashboards/

The two I use daily:

  • Node Exporter Full (ID 1860) — Host metrics (CPU, RAM, disk, network, temp)
  • Docker Monitoring (ID 12220) — Container resource usage

To provision them automatically:

# Download dashboard JSONs at deploy time
mkdir -p /opt/docker/monitoring/grafana/dashboards
curl -s -o /opt/docker/monitoring/grafana/dashboards/node_exporter.json \
  "https://grafana.com/api/dashboards/1860/revisions/38/download"
curl -s -o /opt/docker/monitoring/grafana/dashboards/docker_monitoring.json \
  "https://grafana.com/api/dashboards/12220/revisions/5/download"

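One gotcha: dashboards exported from Grafana.com usually reference a templated datasource input (often named ${DS_PROMETHEUS}) that file-based provisioning does not resolve. A blunt but effective workaround, assuming the datasource name from the provisioning file above:

# Hard-code the provisioned datasource name into the downloaded JSON
sed -i 's/${DS_PROMETHEUS}/Prometheus/g' /opt/docker/monitoring/grafana/dashboards/*.json
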
Then add a dashboard provider:

# /opt/docker/monitoring/grafana/dashboards/dashboard.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: true
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards

6. Alerts — Catching Problems Before You Wake Up

# /opt/docker/monitoring/prometheus/rules/alerts.yml
groups:
  - name: host_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU > 80% on {{ $labels.instance }} for 10m"

      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|devtmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|devtmpfs"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk > 85% on {{ $labels.instance }} - {{ $labels.mountpoint }}"

      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is unreachable"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "RAM > 90% on {{ $labels.instance }}"

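After a reload, confirm the rules actually registered:

# List loaded alert rule names via the Prometheus API
curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].rules[].name'
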
Prometheus doesn’t send notifications on its own — for that, add Alertmanager:

# Quick Alertmanager to Telegram if you already have a bot
services:
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - "9093:9093"
    networks:
      - monitoring

And its config:

# /opt/docker/monitoring/alertmanager/alertmanager.yml
route:
  receiver: telegram
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: telegram
    telegram_configs:
      - bot_token: YOUR_BOT_TOKEN
        chat_id: YOUR_CHAT_ID
        parse_mode: HTML

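One wiring step is easy to miss: Prometheus has to be told where Alertmanager lives, or the rules will fire silently. Append to prometheus.yml:

# Append to /opt/docker/monitoring/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
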
7. Deploying the Stack

cd /opt/docker/monitoring

# Pull images and start
docker compose pull
docker compose up -d

# Check all running
docker compose ps

# Verify endpoints
curl -s http://localhost:9090/-/ready    # Prometheus
curl -s http://localhost:3100/ready      # Loki
curl -s http://localhost:3000/api/health # Grafana

# Prometheus targets (should show UP for node, docker, prometheus)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].labels'

First-time login to Grafana at http://<host>:3000 with admin / changeme! (overridable in .env). Prometheus and Loki datasources are pre-configured. Open the provisioned dashboards and confirm data is flowing.

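Two sanity-check queries to run from Grafana's Explore view, using the job labels configured earlier:

# PromQL: current CPU usage per host, from node_exporter
100 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# LogQL: recent error lines from container logs shipped by promtail
{job="docker"} |= "error"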

8. Adding the Proxmox Host Itself

To also scrape Proxmox VE metrics, enable the Proxmox API exporter:

services:
  proxmox_exporter:
    image: prompve/prometheus-pve-exporter:4.1
    container_name: proxmox_exporter
    restart: unless-stopped
    volumes:
      # pve_exporter reads /etc/prometheus/pve.yml by default
      - ./pve-exporter/pve.yml:/etc/prometheus/pve.yml:ro
    ports:
      - "9221:9221"
    networks:
      - monitoring

And its credentials file:

# /opt/docker/monitoring/pve-exporter/pve.yml
default:
  user: prometheus@pam
  password: your-pve-password
  # Or use an API token via token_name / token_value instead of a password
  verify_ssl: false

Add a Prometheus scrape job. The exporter proxies the PVE API, so the node to query is passed as a ?target= parameter via relabeling:

  - job_name: 'proxmox'
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets: ['10.0.20.30']   # the PVE node to query
        labels:
          host: srv1
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: proxmox_exporter:9221

This exposes PVE-specific metrics: VM/LXC state, QEMU guest agent status, node uptime, storage pool usage, and cluster health.

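A natural first alert on top of these, assuming prometheus-pve-exporter's pve_up metric (1 for a running guest or node, 0 otherwise):

# Hypothetical addition to rules/alerts.yml, under the existing group
      - alert: PVEGuestDown
        expr: pve_up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PVE guest {{ $labels.id }} is down"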

Resource Usage

Grafana + Prometheus + Loki with 30-day retention consumes roughly:

Component      RAM      Disk (30d)
Prometheus     ~200 MB  ~2-5 GB (depends on scrape targets)
Loki           ~150 MB  ~3-8 GB (depends on log volume)
Grafana        ~100 MB  ~100 MB (dashboards, DB)
promtail       ~30 MB   n/a
node_exporter  ~30 MB   n/a
Total          ~500 MB  ~5-15 GB

Tiny footprint for what you get. An SSD is recommended for the Prometheus and Loki data volumes; their constant small writes are rough on spinning disks.


What This Catches

Real problems I’ve caught with this stack (example queries after the list):

  • ZFS pool at 97% → Prometheus alerted → found a 200 GB Docker overlay directory from an abandoned container.
  • Docker container restart loop → promtail showed OOMKilled in frigate logs → increased RAM limit in the LXC.
  • Network interface saturation → node_exporter + Grafana graph showed the enp3s0 interface hitting 950 Mbps → found a Sonarr import hammering NFS.
  • Proxmox storage filling → PVE exporter alerted on rpool usage → pruned old PBS backups.
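
Roughly the queries behind the first two, using the labels this stack assigns (the /tank mountpoint is a stand-in for your pool):

# PromQL: pool usage of the kind that trips the 85% disk alert
(1 - node_filesystem_avail_bytes{mountpoint="/tank"}
   / node_filesystem_size_bytes{mountpoint="/tank"}) * 100

# LogQL: kernel OOM-killer messages collected from /var/log
{job="varlogs"} |= "Out of memory"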

Without the stack, every single one of these would have been discovered when something broke, not when it was trending that direction.


Summary

# Fast deploy — copy, edit .env, run
git clone https://github.com/yourfork/homelab-monitoring /opt/docker/monitoring
cd /opt/docker/monitoring
# Install node_exporter on host (not in Docker)
# Edit .env with your admin password
docker compose up -d
# Browse to http://<host>:3000  (admin / your-password)

A monitoring stack isn’t optional in a serious homelab. It’s the difference between managing by glance and managing by data. Prometheus + Grafana + Loki run on a single Docker host with barely 500 MB of RAM and give you full observability — metrics from the host, from Docker containers, from Proxmox itself, and all logs searchable in one place.