You have five services running in Docker. Some are critical — your DNS resolver, your reverse proxy, your database. If one of them crashes or a disk fills up, how do you know?
SSHing into every host and running htop does not scale past one
machine. You need a monitoring stack that collects metrics from
every host and service, visualizes them in dashboards, and alerts
you when something needs attention.
Prometheus scrapes metrics on a schedule. Grafana turns those metrics into dashboards. Node Exporter exposes Linux kernel and hardware counters. cAdvisor exposes Docker container resource usage. Alertmanager routes notifications to email, Slack, or Telegram. Together they form the standard observability stack for Linux infrastructure.
This guide deploys the entire stack with a single Docker Compose file, configures it for a multi-host homelab, and walks through the first dashboards and alerts.
Architecture Overview
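The stack has five components. Prometheus, Grafana, and Alertmanager run via Docker Compose on one “monitoring” host, while Node Exporter and cAdvisor run on every host you want to monitor: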
┌─────────────────────────────────────┐
│ Docker Network │
│ monitoring-net │
│ │
│ ┌──────────┐ ┌───────────┐ │
│ │ Grafana │──│ Prometheus │ │
│ │ :3000 │ │ :9090 │ │
│ └────┬─────┘ └─────┬─────┘ │
│ │ │ │
│ │ ┌──────────┴──────────┐ │
│ └───┤ Alertmanager :9093 │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌──────────┴──────────┐ │
│ │ Notification │ │
│ │ (Email/Slack/TG) │ │
│ └─────────────────────┘ │
└─────────────────────────────────────┘
▲ ▲
│ scrape │ scrape
│ :9100 │ :8080
┌────┴─────┐ ┌────┴─────┐
│Node │ │ cAdvisor │
│Exporter │ │ :8080 │
│(each host)│ │(per host)│
└──────────┘ └──────────┘
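Prometheus is the central collector. It pulls metrics from each Node Exporter and cAdvisor instance on a configurable interval (15 seconds in this guide; Prometheus's built-in default is 1 minute). Grafana queries Prometheus to render dashboards. Prometheus also evaluates the alert rules and forwards firing alerts to Alertmanager, which groups and deduplicates them and sends the notifications.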
Step 1 — Create the Directory Structure
Set ownership so Grafana and Prometheus can write their data:
The UIDs are the default user inside each container (grafana=472, prometheus=nobody=65534). Mismatch these and the containers will crash on startup with permission errors.
Step 2 — Prometheus Configuration
Write /opt/monitoring/prometheus/prometheus.yml:
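A minimal sketch of what this file could contain, assuming a 15-second scrape interval, Alertmanager reachable as alertmanager:9093 on the Compose network, and the rules file mounted at /etc/prometheus/alerts.yml inside the container. The 192.168.1.x addresses are placeholders for your own hosts, and the job names are illustrative.

```yaml
global:
  scrape_interval: 15s       # how often targets are scraped
  evaluation_interval: 15s   # how often alert rules are evaluated

rule_files:
  - /etc/prometheus/alerts.yml   # assumed mount path inside the container

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # Compose service name (assumption)

scrape_configs:
  - job_name: prometheus                   # Prometheus scraping itself
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node                         # Node Exporter on each host
    static_configs:
      - targets:
          - "192.168.1.10:9100"            # placeholder address
          - "192.168.1.11:9100"            # placeholder address

  - job_name: cadvisor                     # cAdvisor on each Docker host
    static_configs:
      - targets:
          - "192.168.1.10:8080"            # placeholder address
          - "192.168.1.11:8080"            # placeholder address
```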
Alert Rules
Write /opt/monitoring/prometheus/alerts.yml:
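A sketch of a rules file covering unreachable targets, CPU, memory, disk space, and vanished containers. The thresholds, `for` durations, severities, and alert names are illustrative assumptions to tune for your own lab; the CPU and memory expressions reuse queries from the PromQL table in this guide, and the container rule relies on cAdvisor's container_last_seen metric.

```yaml
groups:
  - name: homelab            # group name is arbitrary
    rules:
      - alert: InstanceDown
        expr: up == 0        # the scrape of this target failed
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"

      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory above 90% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        # Pseudo filesystems are excluded so tmpfs mounts do not trigger alerts.
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100 > 85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is over 85% full"

      - alert: ContainerDown
        # Fires when cAdvisor has not seen a named container for over a minute.
        expr: time() - container_last_seen{name!=""} > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} has disappeared"
```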
Step 3 — Alertmanager Configuration
Write /opt/monitoring/alertmanager/alertmanager.yml:
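A minimal sketch with a single email receiver. The SMTP host, addresses, and timing values are placeholders, not values from the original setup. The `${SMTP_PASSWORD}` string is only a marker: Alertmanager does not substitute environment variables in this file, so it has to be rendered into the file before startup or replaced with a file-based secret option.

```yaml
global:
  smtp_smarthost: "smtp.example.com:587"            # placeholder SMTP server
  smtp_from: "alertmanager@example.com"             # placeholder sender
  smtp_auth_username: "alertmanager@example.com"    # placeholder login
  smtp_auth_password: "${SMTP_PASSWORD}"            # marker only, not expanded by Alertmanager

route:
  receiver: email                 # default receiver for every alert
  group_by: ["alertname", "instance"]
  group_wait: 30s                 # delay before the first notification of a group
  group_interval: 5m              # delay before notifying about new alerts in a group
  repeat_interval: 4h             # re-send interval for alerts that keep firing

receivers:
  - name: email
    email_configs:
      - to: "you@example.com"     # placeholder recipient
        send_resolved: true       # also notify when the alert clears
```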
For Telegram notifications, add a webhook receiver:
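Recent Alertmanager releases also include a native Telegram receiver (telegram_configs), which avoids running a separate webhook bridge. The following is a sketch under that assumption; the receiver name, bot token, and chat ID are placeholders, and the receiver still has to be referenced from a route.

```yaml
receivers:
  - name: telegram                           # hypothetical receiver name
    telegram_configs:
      - bot_token: "REPLACE_WITH_BOT_TOKEN"  # placeholder, keep the real token out of git
        chat_id: 123456789                   # placeholder numeric chat ID
        send_resolved: true                  # also notify when the alert clears
```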
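Use environment variables for secrets. The Compose file passes $SMTP_PASSWORD into Alertmanager. Alertmanager does not expand environment variables inside alertmanager.yml on its own, so render the value into the file at startup (for example with envsubst) or point a file-based option such as smtp_auth_password_file at a secret file. Never hardcode tokens in config files committed to git.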
Step 4 — Docker Compose Configuration
Write /opt/monitoring/docker-compose.yml:
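A sketch of a Compose file for the three services on the monitoring host, using the flags and variables this guide mentions (30-day retention, --web.enable-lifecycle, $GRAFANA_PASSWORD, $SMTP_PASSWORD). The host directory layout, the container mount paths, and the untagged images are assumptions; pin image versions you have tested.

```yaml
services:
  prometheus:
    image: prom/prometheus               # pin a specific tag in practice
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
      - --web.enable-lifecycle           # enables POST /-/reload
    volumes:
      - ./prometheus:/etc/prometheus:ro  # prometheus.yml and alerts.yml
      - ./prometheus/data:/prometheus    # TSDB blocks (assumed host path)
    ports:
      - "9090:9090"
    networks: [monitoring-net]
    restart: unless-stopped

  grafana:
    image: grafana/grafana               # pin a specific tag in practice
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - ./grafana/data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/etc/grafana/dashboards:ro   # assumed mount point
    ports:
      - "3000:3000"
    networks: [monitoring-net]
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager             # pin a specific tag in practice
    environment:
      - SMTP_PASSWORD=${SMTP_PASSWORD}   # not expanded inside alertmanager.yml by itself
    volumes:
      - ./alertmanager:/etc/alertmanager:ro
    ports:
      - "9093:9093"
    networks: [monitoring-net]
    restart: unless-stopped

networks:
  monitoring-net:
    driver: bridge
```

Compose reads GRAFANA_PASSWORD and SMTP_PASSWORD from the .env file next to docker-compose.yml, which is why the values never appear in this file.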
Data Retention
The --storage.tsdb.retention.time=30d flag keeps 30 days of
metrics. For a homelab collecting ~1000 series per host, this
uses roughly 500 MB–1 GB of disk per month. Adjust to 7d or
90d based on your storage budget.
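As a rough cross-check, the Prometheus documentation estimates disk usage as retention time multiplied by ingested samples per second and by bytes per sample, with roughly 1 to 2 bytes per sample after compression. A worked example for one host, assuming 1000 series and the 15-second scrape interval used in this guide:

```latex
\begin{align*}
\text{samples per second} &= \frac{1000\ \text{series}}{15\ \text{s}} \approx 66.7 \\
\text{retention} &= 30 \times 86400\ \text{s} = 2{,}592{,}000\ \text{s} \\
\text{disk} &\approx 2{,}592{,}000 \times 66.7 \times (1\ \text{to}\ 2)\ \text{bytes}
            \approx 173\ \text{to}\ 346\ \text{MB}
\end{align*}
```

Under these assumptions, the 500 MB to 1 GB figure corresponds to roughly three hosts. The write-ahead log and index add some overhead on top, so treat the result as an order-of-magnitude estimate.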
Step 5 — Deploy Node Exporter and cAdvisor on Each Host
You need Node Exporter and cAdvisor running on every machine you want to monitor. Node Exporter provides host-level metrics (CPU, RAM, disk, network). cAdvisor provides per-container resource usage.
Node Exporter (systemd service)
Create /etc/systemd/system/node-exporter.service:
Install the binary:
Verify it is collecting metrics:
cAdvisor (Docker container)
cAdvisor requires --privileged to access kernel cgroup metrics.
On Proxmox LXC containers, you may need lxc.cgroup2: "" in the
container config for cAdvisor to work properly.
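One way to run cAdvisor with these settings is a small Compose file on each monitored host. The sketch below follows the mounts commonly shown in the cAdvisor README; treat the exact volume list and the untagged image as assumptions to check against the release you deploy.

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor      # pin a specific release tag in practice
    privileged: true                     # needed for cgroup access, as noted above
    devices:
      - /dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"                      # the port Prometheus scrapes
    restart: unless-stopped
```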
Firewall Rules
If you use UFW or nftables, open ports 9100 and 8080 to your monitoring host only:
Never expose Node Exporter or cAdvisor ports to the internet. They serve unauthenticated metrics endpoints that leak system information.
Step 6 — Start the Stack
Check that everything is running:
Check Prometheus targets:
All targets should show "health": "up". If any show "down",
check firewall rules and verify that Node Exporter / cAdvisor are
running on the target host.
Step 7 — Import Grafana Dashboards
Log into Grafana at http://your-host:3000 (default credentials:
admin / admin — change immediately on first login).
Add Prometheus as a Data Source
- Configuration → Data Sources → Add data source
- Select Prometheus
- Set URL to http://prometheus:9090 (Docker internal DNS)
- Click Save & Test
Import the Node Exporter Full Dashboard
The best dashboard for host metrics is Node Exporter Full (dashboard ID: 1860):
- Dashboards → Import
- Enter dashboard ID 1860
- Select the Prometheus data source
- Click Import
Import the Docker Monitoring Dashboard
For container metrics, use the Docker Monitoring dashboard (ID: 193):
- Dashboards → Import
- Enter dashboard ID 193
- Select the Prometheus data source
- Click Import
Create a Simple CPU Dashboard (Manual)
If you prefer a minimal dashboard over importing large ones:
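- Dashboards → New Dashboard → Add visualization
- Select Prometheus as data source
- Enter this PromQL query for CPU: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- Set title to “CPU Usage by Host”
- Add another panel with this memory query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- Add a disk panel: (1 - (node_filesystem_free_bytes / node_filesystem_size_bytes)) * 100
- Save the dashboard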
Key PromQL Queries for Homelab Monitoring
| What to measure | PromQL query |
|---|---|
| CPU per host (%) | 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) |
| Memory used (%) | (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 |
| Disk used per mount (%) | (1 - (node_filesystem_free_bytes / node_filesystem_size_bytes)) * 100 |
| Disk IOPS (reads) | rate(node_disk_reads_completed_total[5m]) |
| Disk IOPS (writes) | rate(node_disk_writes_completed_total[5m]) |
| Network throughput (rx) | rate(node_network_receive_bytes_total{device!="lo"}[5m]) |
| Network throughput (tx) | rate(node_network_transmit_bytes_total{device!="lo"}[5m]) |
| Container CPU per name | sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) |
| Container memory per name | container_memory_usage_bytes{name!=""} |
| Uptime (days) | (time() - node_boot_time_seconds) / 86400 |
| Load average (1m) | node_load1 |
Step 8 — Test an Alert
Trigger a test alert to verify Alertmanager is working:
Check Alertmanager status:
If an email or Slack receiver is configured, you should receive the test notification once the route's group_wait period has elapsed (30 seconds by default).
Step 9 — Long-Term Data and Backup
Prometheus stores metrics as TSDB blocks on disk. After 30 days (default from this guide), old blocks are deleted. If you want longer retention, either increase the retention period or add remote write to VictoriaMetrics or Thanos.
Back up Grafana
Grafana dashboards and data sources live in the SQLite database
at /opt/monitoring/grafana/data/grafana.db. Back it up daily:
Better approach: use Grafana’s file-based provisioning to define data sources and dashboard providers as YAML files under ./grafana/provisioning/, with the dashboards themselves stored as JSON files. This way, dashboards are version-controlled and survive container recreations.
Grafana Provisioning Example
Create /opt/monitoring/grafana/provisioning/datasources/prometheus.yml:
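A minimal sketch of a data source provisioning file, assuming Prometheus is reachable as prometheus:9090 on the Compose network (the same URL used when adding the data source by hand). The data source name is an arbitrary choice.

```yaml
apiVersion: 1

datasources:
  - name: Prometheus                # display name in Grafana (arbitrary)
    type: prometheus
    access: proxy                   # Grafana's backend makes the requests
    url: http://prometheus:9090     # Docker internal DNS name
    isDefault: true
```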
Create /opt/monitoring/grafana/provisioning/dashboards/dashboard.yml:
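A sketch of a dashboard provider definition. The provider name is arbitrary, and `path` must match wherever ./grafana/dashboards is mounted inside the Grafana container; /etc/grafana/dashboards is an assumed mount point, not something the guide specifies.

```yaml
apiVersion: 1

providers:
  - name: default                    # arbitrary provider name
    orgId: 1
    folder: ""                       # empty string places dashboards in the root folder
    type: file
    disableDeletion: false
    options:
      path: /etc/grafana/dashboards  # assumed container path of ./grafana/dashboards
```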
Drop exported JSON dashboard files into ./grafana/dashboards/
and they will load automatically on container restart.
Step 10 — Adding More Hosts
Adding a new host to the monitoring stack takes two minutes:
The --web.enable-lifecycle flag on Prometheus enables the /-/reload endpoint, which reloads the configuration in response to an HTTP POST without a full restart. Use it whenever you add or remove scrape targets.
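On the Prometheus side, adding a host comes down to appending its exporters to the existing target lists and then sending an HTTP POST to /-/reload. A sketch of the edited jobs, assuming job names node and cadvisor; the addresses are placeholders and 192.168.1.12 stands for the new host.

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "192.168.1.10:9100"   # existing host (placeholder)
          - "192.168.1.12:9100"   # new host: Node Exporter

  - job_name: cadvisor
    static_configs:
      - targets:
          - "192.168.1.10:8080"   # existing host (placeholder)
          - "192.168.1.12:8080"   # new host: cAdvisor
```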
Security Considerations
- Never expose ports 9090 (Prometheus), 9093 (Alertmanager), or 3000 (Grafana) to the internet. These are management interfaces. Route Grafana through your reverse proxy with HTTPS and authentication.
- Node Exporter and cAdvisor serve unauthenticated metrics. Firewall them to your monitoring subnet only. A reverse proxy is not enough — these should not be accessible from outside your LAN at all.
- Change Grafana admin password on first login. In the Compose file, use $GRAFANA_PASSWORD from an .env file instead of the default.
- Alertmanager SMTP password should be in an .env file, not committed to git.
Create /opt/monitoring/.env:
Add .env to .gitignore if you version-control this directory.
Troubleshooting
Prometheus targets show “down”
Grafana says “No data”
Container metrics missing in cAdvisor
On Proxmox LXC containers, cAdvisor needs cgroup v2 access:
Then restart the container.
Prometheus storage growing too fast
Reduce retention or limit scraped metrics:
This keeps only CPU, memory, and filesystem metrics from Node Exporter, dropping disk scheduler stats, network hardware info, and other rarely-used metrics.
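A filter with that effect could look like the following sketch, which applies a `keep` rule to the metric name after each scrape. The job name and target address are placeholders, and the regex is an assumption about which metric families to retain.

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["192.168.1.10:9100"]   # placeholder address
    metric_relabel_configs:
      # Keep only series whose name starts with node_cpu_, node_memory_,
      # or node_filesystem_; every other scraped series is dropped.
      - source_labels: [__name__]
        regex: "node_(cpu|memory|filesystem)_.*"
        action: keep
```

A filter this strict also drops node_load1, node_boot_time_seconds, and the disk and network counters used in the query table, so extend the regex if you rely on those panels. The automatically generated `up` series is not affected by metric relabeling.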
Summary
The Prometheus + Grafana stack gives you full observability into every host and container in your homelab. With today’s setup you get:
- Host metrics — CPU, RAM, disk, network, and load from every machine via Node Exporter
- Container metrics — per-container CPU, memory, network, and filesystem via cAdvisor
- Dashboards — pre-built Node Exporter and Docker monitoring dashboards in Grafana
- Alerts — disk space, CPU, memory, and container-down alerts routed to email or Slack via Alertmanager
- Easy expansion — add hosts by deploying Node Exporter + cAdvisor and adding one line to prometheus.yml
Start with the single-host Compose stack, add Node Exporters to your other machines one at a time, and within an afternoon you will have full visibility into your entire lab. When a disk fills up at 3 AM, your phone will buzz instead of a service crashing silently.