Docker Container Security — Non-Root Users, Capabilities, and Runtime Hardening

Most homelabs run Docker containers as root by default. The Dockerfile says FROM nginx:latest, and the nginx master process runs as root inside the container. Add a container escape bug, a vulnerable web app, or a malicious image, and an attacker can end up with root on your Docker host.

This isn’t theoretical. CVEs in container runtimes and in the software running inside containers (CVE-2024-21626 for runc, CVE-2026-42945 for Nginx) are published regularly. The fix is straightforward: stop running containers as root, drop unnecessary capabilities, lock down the filesystem, and apply runtime security profiles.

This guide covers the exact configurations you need for a hardened Docker setup in a homelab — with real docker-compose.yml examples you can apply today.

Why Default Docker Permissions Are Dangerous

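Docker containers share the host kernel. Unless user namespace remapping or rootless mode is enabled, a process running as root (UID 0) inside a container is also UID 0 as far as the host kernel is concerned. What separates it from the host is namespace isolation, cgroups, and the default capability and seccomp restrictions, all of which are enforced by that same shared kernel.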

Check what user your containers are running as:

docker exec my-container whoami
# root ← this is a problem

docker exec my-container id
# uid=0(root) gid=0(root) groups=0(root)
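
To run the same check across every running container at once, a small shell helper along these lines may help. It is only a sketch: the function names is_root_user and audit_container_users are made up for illustration, and it assumes that an empty .Config.User in docker inspect output means no USER was set, so the container runs as root.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a container's configured user is root.
# Takes the value of .Config.User from `docker inspect` ("", "0", "root",
# "1001:1001", ...). An empty value means no USER was set, so root applies.
is_root_user() {
  case "${1%%:*}" in
    ""|0|root) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print one line per running container.
audit_container_users() {
  docker ps --format '{{.Names}}' | while read -r name; do
    user=$(docker inspect --format '{{.Config.User}}' "$name")
    if is_root_user "$user"; then
      echo "ROOT $name"
    else
      echo "ok   $name ($user)"
    fi
  done
}

# Usage: audit_container_users
```

Containers listed as ROOT are the ones to retrofit first.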

If an attacker exploits a kernel or container runtime vulnerability from inside that container, they get root on your Docker host. The runc escape CVEs (CVE-2019-5736, CVE-2024-21626) show that the risk is real: this class of bug has been exploited in the wild against CI pipelines and cloud deployments.

The fix: every container should run as a non-root user with the minimum Linux capabilities it actually needs.

Hardening Layer 1: Non-Root Users in Dockerfiles

The first and most important layer: define a non-root user in your Dockerfile and switch to it with the USER directive.

Bad Dockerfile:

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
CMD ["node", "server.js"]
# Runs as root — one exploit = full host compromise

Good Dockerfile:

FROM node:20-alpine

# Create a non-root user with known UID/GID
RUN addgroup -S -g 1001 appgroup && \
    adduser -S appuser -G appgroup -u 1001

WORKDIR /app
COPY package*.json ./
RUN npm install && \
    chown -R appuser:appgroup /app
COPY --chown=appuser:appgroup . .

# Switch to non-root user
USER appuser

CMD ["node", "server.js"]

The chown is critical — if the app writes files (logs, uploads, caches), the non-root user must own them or the container crashes with “Permission denied.”

For Docker Compose, you can also enforce the user without modifying the image:

services:
  app:
    image: some-image-that-runs-as-root
    user: "1001:1001"    # Force container to run as UID 1001

But this breaks if the image writes to files owned by root. Fixing it at the image level with a proper Dockerfile is the correct approach.

Hardening Layer 2: Dropping Linux Capabilities

Linux capabilities break the root user’s blanket privileges into discrete units: CAP_NET_BIND_SERVICE (bind to ports <1024), CAP_SYS_ADMIN (mount filesystems, access kernel features), and many more.

Docker gives containers a default set of capabilities. Most are unnecessary for a standard web app.

Check what capabilities your container has:

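# Read the capability bitmasks of PID 1 (works in most images, no capsh needed inside)
docker exec my-container grep Cap /proc/1/status
# Decode a mask on the host (capsh ships with libcap2-bin on Debian/Ubuntu)
capsh --decode=00000000a80425fb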
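
If capsh is not installed on the host either, the bitmask from /proc/1/status can be decoded with plain shell arithmetic. The sketch below is illustrative only: decode_caps is a made-up name, and the bit order is assumed to follow the kernel's linux/capability.h numbering (bit 0 is CAP_CHOWN, bit 10 is CAP_NET_BIND_SERVICE, and so on).

```shell
#!/bin/sh
# Hypothetical helper: turn a hex capability mask (as shown in the CapEff or
# CapBnd line of /proc/<pid>/status) into a comma-separated list of names.
# Bit positions follow linux/capability.h.
decode_caps() {
  mask=$((0x$1))
  names="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"
  bit=0
  out=""
  for name in $names; do
    # Test whether this bit is set in the mask
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out,}cap_$name"
    fi
    bit=$((bit + 1))
  done
  echo "$out"
}

# Usage: decode_caps 0000000000000400
```

A mask of 0000000000000400 decodes to cap_net_bind_service alone, which is what a root container with cap_drop ALL plus NET_BIND_SERVICE should report.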

Drop all capabilities, then add back only what’s needed:

services:
  nginx:
    image: nginx:alpine
    container_name: web
    cap_drop:
      - ALL                         # Remove every capability
    cap_add:
      - NET_BIND_SERVICE            # Only allow binding to port 80/443
      - CHOWN                       # Allow changing file ownership
      - SETGID                      # Allow setting group ID
      - SETUID                      # Allow setting user ID

For a typical web app (Node.js, Python, Go), you usually only need:

services:
  webapp:
    image: my-web-app
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE   # Optional — only if binding to ports <1024

If your app binds to port 8080 (not 80), you don’t even need NET_BIND_SERVICE. Drop everything:

services:
  webapp:
    image: my-web-app
    cap_drop:
      - ALL
    cap_add: []    # No capabilities at all — cleanest config

What about databases?

services:
  postgres:
    image: postgres:16-alpine
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE     # Needed for file permission operations
      - SETUID
      - SETGID
      # NET_BIND_SERVICE is not needed: port 5432 is above 1024

Signs a capability is missing:

nginx: [emerg] bind() to 0.0.0.0:80 failed (13: Permission denied)
→ Add NET_BIND_SERVICE

change_attributes: Operation not permitted
→ Add CHOWN or DAC_OVERRIDE

fork failed: Resource temporarily unavailable
→ Check pids_limit, not a capability issue

Hardening Layer 3: Read-Only Root Filesystem

Most containers only need to write to specific directories (uploads, caches, databases). Everything else should be read-only. This prevents attackers from modifying binaries or writing malicious files.

services:
  my-app:
    image: my-web-app
    cap_drop:
      - ALL
    read_only: true
    tmpfs:                          # Writable temp directories in RAM
      - /tmp:noexec,nosuid,size=64M
      - /var/run:noexec,nosuid,size=32M
    volumes:
      - app_data:/app/data          # Persistent writable directory

The read_only: true flag makes the entire container filesystem read-only. The tmpfs mounts provide small RAM-backed writable directories for runtime data (pids, sockets, temp files). The named volume provides a persistent writable directory for application data.
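
Working out which directories need a tmpfs or volume is easier if you let the container run without read_only for a while and then look at what it wrote. docker diff lists added (A), changed (C), and deleted (D) paths. The helper below is a rough sketch with a made-up name, added_file_dirs, that condenses that output into the directories that received new files.

```shell
#!/bin/sh
# Hypothetical helper: read `docker diff` output on stdin (lines such as
# "A /var/run/nginx.pid") and print the directories that received new files.
# Paths containing spaces are not handled in this sketch.
added_file_dirs() {
  awk '$1 == "A" {
    path = $2
    sub("/[^/]*$", "", path)        # strip the file name, keep the directory
    print (path == "" ? "/" : path)
  }' | LC_ALL=C sort -u
}

# Usage, after the container has run WITHOUT read_only for a while:
#   docker diff my-app | added_file_dirs
```

Each directory it prints is a candidate for a tmpfs entry (runtime data) or a named volume (data that must persist).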

For PostgreSQL:

services:
  postgres:
    image: postgres:16-alpine
    cap_drop:
      - ALL
    cap_add:
      - CHOWN
      - DAC_OVERRIDE
      - SETUID
      - SETGID
    read_only: true
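    tmpfs:
      - /tmp:noexec,nosuid,size=64M
      - /var/run/postgresql:noexec,nosuid,size=16M  # Unix socket and lock file
    volumes:
      - pg_data:/var/lib/postgresql/data  # Data dir must be writable
    environment:
      POSTGRES_INITDB_ARGS: "--data-checksums"

PostgreSQL writes to /var/lib/postgresql/data, so that volume gives it write access. It also creates its Unix socket and lock file under /var/run/postgresql, which the second tmpfs entry covers. Everything else (binaries, libraries) stays read-only.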

For Nginx with read-only root:

services:
  nginx:
    image: nginx:alpine
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE
      - CHOWN
      - SETGID                      # Master process switches workers to the nginx user
      - SETUID
    read_only: true
    tmpfs:
      - /var/cache/nginx:noexec,nosuid,size=128M
      - /var/run:noexec,nosuid,size=64M
      - /tmp:noexec,nosuid,size=64M
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./html:/usr/share/nginx/html:ro
    ports:
      - "80:80"
      - "443:443"

Nginx needs writable /var/cache/nginx for its cache and /var/run for the PID file. Everything else is read-only — including the config and static files mounted with :ro.

Hardening Layer 4: Security Options (seccomp, AppArmor, no-new-privileges)

Docker applies a default seccomp profile that blocks ~44 of ~300 Linux syscalls. You can tighten it further, or apply a custom profile.

Start with these security options for every container:

services:
  my-app:
    image: my-web-app
    security_opt:
      - no-new-privileges:true      # Prevent privilege escalation via setuid binaries
      # Docker's default seccomp profile applies automatically; no entry is needed.
      # To load a custom profile instead:
      # - seccomp=./profiles/custom.json
      # ⚠️ Only for debugging, never in production:
      # - seccomp=unconfined

The no-new-privileges:true flag is one of the most impactful security options for the effort involved. It prevents processes inside the container from gaining privileges when they execute a new program, so setuid/setgid binaries and file capabilities no longer elevate them.

For a full security-optimized stack:

services:
  app:
    image: my-app
    cap_drop:
      - ALL
    read_only: true
    tmpfs:
      - /tmp:noexec,nosuid,size=64M
    security_opt:
      - no-new-privileges:true
    # Use default seccomp profile (Docker applies it automatically)

Hardening Layer 5: Resource Limits (prevent DoS)

A single compromised container shouldn’t be able to exhaust host resources. Set explicit limits:

services:
  my-app:
    image: my-web-app
    deploy:
      resources:
        limits:
          cpus: "1.0"              # Max 1 CPU core
          memory: 512M             # Max 512MB RAM
        reservations:
          cpus: "0.25"            # Reserve 0.25 CPU
          memory: 256M            # Reserve 256MB RAM

    # Legacy compose syntax (v2):
    # mem_limit: 512m
    # cpus: "1.0"

    # PID limit — prevent fork bombs inside the container
    pids_limit: 200

The pids_limit: 200 is often overlooked. A process that fork()s in an infinite loop can exhaust the host PID table. Limiting to 200 PIDs prevents this while being more than enough for most applications.

Restart policies with limits:

    restart: unless-stopped
    stop_grace_period: 30s       # Give the app time to shut down cleanly
    oom_kill_disable: false      # Allow OOM killer — true can hide memory leaks

Hardening Layer 6: Volume Mount Safety

Bind mounts are inherently risky. If you mount /:/host (don’t do this), the container can modify anything. Safer patterns:

services:
  my-app:
    volumes:
      # Read-only bind mounts — the container cannot modify them
      - ./config:/app/config:ro

      # Named volumes — safer than bind mounts, managed by Docker
      - app_data:/app/data

      # Never mount the Docker socket unless you absolutely need it
      # - /var/run/docker.sock:/var/run/docker.sock:ro  # ⚠️ Container escape risk

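If you must mount the Docker socket (Portainer, Traefik, Watchtower), be aware that the :ro flag only stops the container from replacing the socket file. It does not restrict API calls, so socket access is still effectively root on the host: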

services:
  watchtower:
    image: containrrr/watchtower
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro  # Still full API access despite :ro

  # Better alternative: use a Docker socket proxy
  docker-proxy:
    image: tecnativa/docker-socket-proxy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      CONTAINERS: 1     # Only allow container operations
      NETWORKS: 0       # Block network operations
      IMAGES: 1         # Allow image operations
      SERVICES: 0       # Block service operations
      TASKS: 0          # Block task operations
      POST: 1           # Allow POST requests

The docker-socket-proxy pattern (covered in depth in the Docker Socket Proxy post) lets you whitelist specific API endpoints instead of giving full socket access. For security-minded homelabs, this is the correct approach.

Hardening Layer 7: Image Hygiene

Your container is only as secure as the base image you build from.

# Scan for vulnerabilities
docker scout quickstart
docker scout cves nginx:alpine    # Check base image for known CVEs

# Use Trivy as a second scanner (it downloads its vulnerability database on first run)
docker pull aquasec/trivy:latest
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  aquasec/trivy:latest image my-app

# Use specific tags, not "latest"
FROM node:20-alpine        # Specific major version — good
FROM node:alpine           # No version — bad, "latest" changes
FROM node:20.14.0-alpine   # Pinned to patch — best for production

# Distroless images have no shell, no package manager, no utilities
FROM gcr.io/distroless/nodejs20-debian12:latest
# If an attacker gets code execution, there's no bash, no curl, no wget

For a homelab, distroless images are overkill for most services. But pinning versions and scanning for known CVEs is table stakes.
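
To see which images on a host still float on latest, a check along these lines can be run over the image references in use. It is a sketch under stated assumptions: tag_status is a made-up name, and it only looks at the reference string, so it cannot tell that a tag such as node:alpine carries no version.

```shell
#!/bin/sh
# Hypothetical helper: classify an image reference by how tightly it is pinned.
# Prints "pinned-digest", "tagged", or "floating" (no tag, or :latest).
tag_status() {
  ref=$1
  case "$ref" in
    *@sha256:*) echo "pinned-digest"; return 0 ;;
  esac
  # Keep only the last path component so a registry port such as
  # localhost:5000 is not mistaken for a tag.
  last=${ref##*/}
  case "$last" in
    *:latest) echo "floating" ;;
    *:*)      echo "tagged" ;;
    *)        echo "floating" ;;
  esac
}

# Usage:
#   docker ps --format '{{.Image}}' | sort -u | while read -r img; do
#     echo "$(tag_status "$img") $img"
#   done
```

Anything reported as floating is worth replacing with an explicit version tag before the next pull changes it underneath you.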

Dockerfile security checklist:

FROM node:20-alpine AS build
# Build stage: has all the build tools
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .

FROM node:20-alpine AS production
# Runtime stage: minimal dependencies
RUN addgroup -S -g 1001 appgroup && adduser -S appuser -G appgroup -u 1001

WORKDIR /app
COPY --from=build --chown=appuser:appgroup /app /app
USER appuser

# Health check lets Docker (or an orchestrator) mark dead containers as unhealthy
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD wget -qO- http://localhost:3000/health || exit 1

# Explicit EXPOSE documents the port
EXPOSE 3000

ENTRYPOINT ["node", "server.js"]

The Complete Hardened Compose Template

Combining all layers into a single production-ready template:

version: "3.9"

services:
  web:
    image: my-web-app:1.0.0           # Pinned version, not "latest"
    container_name: my-web-app
    restart: unless-stopped
    stop_grace_period: 30s

    # === User ===
    user: "1001:1001"                  # Non-root user

    # === Capabilities ===
    cap_drop:
      - ALL
    # No cap_add: the app listens on port 3000 as a non-root user,
    # so NET_BIND_SERVICE is not needed

    # === Filesystem ===
    read_only: true
    tmpfs:
      - /tmp:noexec,nosuid,size=64M
    volumes:
      - app_data:/app/data             # Persistent data
      - ./config/app.conf:/etc/app.conf:ro  # Read-only config

    # === Security Options ===
    security_opt:
      - no-new-privileges:true

    # === Resource Limits ===
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 256M
    pids_limit: 200

    # === Networking ===
    networks:
      - internal-net
    ports:
      - "127.0.0.1:8080:3000"          # Bind to localhost only — reverse proxy handles external

    # === Health Check ===
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

    # === Logging ===
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

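networks:
  internal-net:
    # internal: true blocks outbound traffic, but published ports are not
    # reachable on an internal-only network. Enable it only when the reverse
    # proxy is a container attached to this network and "ports" is removed.
    internal: false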

volumes:
  app_data:

Verifying Your Hardening

After applying these changes, verify each layer is active:

# Check user inside container
docker exec my-web-app id -u
# Should output: 1001 (not 0)

# Check capabilities (reads the kernel's view, no capsh needed in the image)
docker exec my-web-app grep CapEff /proc/1/status
# CapEff: 0000000000000000
# (A non-root process holds no effective capabilities)

# Check read-only filesystem
docker exec my-web-app touch /test
# touch: /test: Read-only file system ✓

# Check writable paths
docker exec my-web-app touch /app/data/test
# Should succeed ✓

# Check security options
docker inspect my-web-app --format '{{.HostConfig.SecurityOpt}}'
# [no-new-privileges:true]

# Check resource limits
docker stats my-web-app --no-stream
# CPU %    MEM USAGE / LIMIT
# 0.05%    128MiB / 512MiB
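
The individual checks above can be rolled into one pass over all running containers. The following is a hedged sketch, not a finished tool: check_record and audit_hardening are made-up names, and the finding labels are arbitrary. It reads the relevant fields from docker inspect and flags whichever layers are missing.

```shell
#!/bin/sh
# Hypothetical helper: evaluate one "|"-separated record of the form
#   name|user|readonly_rootfs|cap_drop|security_opt|memory|pids_limit
# and print the container name followed by any findings (or "ok").
check_record() {
  printf '%s\n' "$1" | {
    IFS='|' read -r name user ro capdrop secopt mem pids
    findings=""
    case "${user%%:*}" in
      ""|0|root) findings="$findings runs-as-root" ;;
    esac
    [ "$ro" = "true" ] || findings="$findings writable-rootfs"
    case "$capdrop" in
      *ALL*|*all*) ;;
      *) findings="$findings caps-not-dropped" ;;
    esac
    case "$secopt" in
      *no-new-privileges:false*) findings="$findings no-new-privileges-missing" ;;
      *no-new-privileges*) ;;
      *) findings="$findings no-new-privileges-missing" ;;
    esac
    case "$mem" in
      ""|0) findings="$findings no-memory-limit" ;;
    esac
    case "$pids" in
      ""|0|-1|"<nil>"|"<no value>") findings="$findings no-pids-limit" ;;
    esac
    echo "$name:${findings:- ok}"
  }
}

# Hypothetical helper: run check_record for every running container.
audit_hardening() {
  fmt='{{.Name}}|{{.Config.User}}|{{.HostConfig.ReadonlyRootfs}}|{{.HostConfig.CapDrop}}|{{.HostConfig.SecurityOpt}}|{{.HostConfig.Memory}}|{{.HostConfig.PidsLimit}}'
  docker ps -q | while read -r id; do
    check_record "$(docker inspect --format "$fmt" "$id")"
  done
}

# Usage: audit_hardening
```

A fully hardened container prints its name followed by ok. Anything else names the layer that still needs attention.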

Common Pitfalls When Hardening Containers

“Permission denied” on startup: You set read_only: true but the app needs to write to a directory you didn’t map. Fix: add a tmpfs or volume for the required path.

“Operation not permitted” for database: Databases need CHOWN, SETUID, SETGID to manage file permissions. Drop them and the container crashes on initdb.

App runs but can’t serve traffic: Port 80 requires NET_BIND_SERVICE. Switch to port 8080 or add the capability.

“Bad system call” with seccomp: Some apps use syscalls blocked by the default seccomp profile (rare, but happens with Chrome/Puppeteer). Generate a custom profile instead of going unconfined:

# Run the app unconfined once, only to observe which syscalls it makes
docker run --rm -it --security-opt seccomp=unconfined my-app
# Capture strace output and build an allowlist profile from the observed syscalls,
# then load it with: --security-opt seccomp=custom-profile.json

Summary: The Minimal Hardening Checklist

These five changes deliver most of the security improvement with minimal breakage:

1. Define a non-root USER in every Dockerfile or use user: in Compose: a compromised process no longer starts out as root, which makes escapes and host compromise much harder
2. Drop ALL capabilities and add back only what the app needs: removes most privilege escalation vectors
3. Set read_only: true with tmpfs for runtime dirs: prevents attackers from modifying binaries or writing malware
4. Enable no-new-privileges:true: blocks setuid-based privilege escalation
5. Set pids_limit and memory limits: prevents a compromised container from DoS-ing the host

Apply these to every new container you deploy. Retrofit them onto existing containers one at a time, testing each change. The first time you catch a container escape exploit failing because of these settings, you’ll wonder why you ever ran containers as root.
