Your homelab generates paperwork — invoices, ISP contracts, device manuals, tax documents, warranty cards. Paperless-ngx turns that pile of PDFs and scans into a fully searchable, auto-tagged digital archive with OCR, machine learning classification, and multi-user access. Deploying it with Docker Compose keeps the stack isolated, reproducible, and easy to upgrade.
This guide walks through a production-ready Paperless-ngx deployment with PostgreSQL, Redis, Gotenberg, and Apache Tika, plus consumption templates, automated email ingestion, and Traefik reverse proxy configuration.
Prerequisites
- Docker and Docker Compose v2 installed on your host
- PostgreSQL 16 — can run as a container within the same Compose file (this guide uses external for production reliability)
- A reverse proxy — Traefik, nginx, or Caddy for HTTPS access
- Storage directories — separate volumes for consumption (document import), media (archive), and database persistence
Docker Compose Configuration for Paperless-ngx
The full Paperless-ngx stack consists of five services: the main application, PostgreSQL, Redis for the task queue, Gotenberg for document conversion, and Apache Tika for metadata extraction.
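A minimal sketch of such a stack, written to disk with a shell heredoc, might look like this. The service names (`paperless` matches the log command used later in this guide; `db`, `broker`, `gotenberg`, and `tika` are assumptions), the image tags, the published port, and the host directories are illustrative choices. The environment variable names come from the Paperless-ngx configuration reference, and `${PAPERLESS_SECRET_KEY}` and `${PAPERLESS_DB_PASSWORD}` are read from the .env file.

```shell
#!/bin/sh
# Sketch: write a five-service docker-compose.yml for Paperless-ngx.
# Assumptions: service names, image tags, and the ./data, ./media, ./export,
# ./consume and ./pgdata host paths are illustrative, not required names.
set -eu

STACK_DIR="${STACK_DIR:-./paperless}"
mkdir -p "$STACK_DIR"

# The quoted 'YAML' delimiter keeps ${...} literal, so Docker Compose
# (not this shell) substitutes the values from .env.
cat > "$STACK_DIR/docker-compose.yml" <<'YAML'
services:
  broker:
    image: docker.io/library/redis:7
    restart: unless-stopped

  db:
    image: docker.io/library/postgres:16
    restart: unless-stopped
    volumes:
      - ./pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: ${PAPERLESS_DB_PASSWORD}

  gotenberg:
    image: docker.io/gotenberg/gotenberg:8
    restart: unless-stopped

  tika:
    image: docker.io/apache/tika:latest
    restart: unless-stopped

  paperless:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on: [db, broker, gotenberg, tika]
    ports:
      - "8000:8000"
    volumes:
      - ./data:/usr/src/paperless/data
      - ./media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBPASS: ${PAPERLESS_DB_PASSWORD}
      PAPERLESS_SECRET_KEY: ${PAPERLESS_SECRET_KEY}
      PAPERLESS_URL: https://docs.example.com
      PAPERLESS_OCR_LANGUAGE: eng+spa
      PAPERLESS_OCR_MODE: redo
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
YAML

echo "wrote $STACK_DIR/docker-compose.yml"
```

Running `docker compose config` in that directory is a quick way to check that the file parses and that the variables resolve.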
Create a .env file in the same directory to store secrets:
PAPERLESS_SECRET_KEY=<run: openssl rand -base64 48>
PAPERLESS_DB_PASSWORD=<run: openssl rand -base64 32>
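One way to fill in those two placeholders is sketched below. It uses the `openssl rand` commands shown above; the `gen_secret` helper, the fallback to /dev/urandom, and the refusal to overwrite an existing file are assumptions of this sketch rather than anything Paperless-ngx requires.

```shell
#!/bin/sh
# Sketch: create .env with freshly generated secrets.
set -eu

ENV_FILE="${ENV_FILE:-.env}"

# Print N random bytes as a single base64 line (hypothetical helper).
# Falls back to /dev/urandom when openssl is not installed.
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -base64 "$1" | tr -d '\n'
  else
    head -c "$1" /dev/urandom | base64 | tr -d '\n'
  fi
}

if [ -e "$ENV_FILE" ]; then
  # Never overwrite secrets that a running stack may already depend on.
  echo "$ENV_FILE already exists, leaving it untouched" >&2
else
  (
    umask 077   # the file should be readable by its owner only
    {
      printf 'PAPERLESS_SECRET_KEY=%s\n' "$(gen_secret 48)"
      printf 'PAPERLESS_DB_PASSWORD=%s\n' "$(gen_secret 32)"
    } > "$ENV_FILE"
  )
  echo "wrote $ENV_FILE"
fi
```

Docker Compose reads a .env file from the project directory automatically, so no extra flag is needed as long as the file sits next to the Compose file.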
First-Run Setup
Create the required directory structure, generate the secret key, and start the stack:
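These steps can be sketched as a few shell functions. The directory names and the `paperless` service name are assumptions that need to match your Compose file, and the `createsuperuser` call follows the form given in the upstream setup instructions with the service name swapped in. The secret key itself comes from the `openssl rand` command shown for the .env file.

```shell
#!/bin/sh
# Sketch of the first-run steps. Directory names and the `paperless`
# service name are assumptions; adjust them to your Compose file.
set -eu

STACK_DIR="${STACK_DIR:-./paperless}"

# Create the bind-mount directories up front so Docker does not
# create them as root on first start.
prepare_dirs() {
  for d in consume media data export pgdata; do
    mkdir -p "$STACK_DIR/$d"
  done
}

# Start every service in the background, then list their status.
start_stack() {
  (cd "$STACK_DIR" && docker compose up -d && docker compose ps)
}

# Interactive: prompts for a username, email address, and password.
create_superuser() {
  (cd "$STACK_DIR" && docker compose run --rm paperless createsuperuser)
}

prepare_dirs
```

After sourcing the script, call `start_stack`, wait for the migrations to finish, and then call `create_superuser`.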
The first startup runs database migrations automatically. Watch the logs with docker compose logs -f paperless to confirm completion before creating the superuser.
OCR and Machine Learning Configuration
Paperless-ngx runs OCR on every ingested document using Tesseract. The configuration above uses English and Spanish language packs (eng+spa) — add more languages by separating with + (e.g., deu+eng+fra+spa).
The PAPERLESS_OCR_MODE: redo setting runs OCR on every page and replaces any text layer the file already contains, which is useful when documents come from various sources with inconsistent quality. For production use where you control the scanner quality, switch to skip (the default), which only runs OCR on pages that have no text layer and so avoids unnecessary processing.
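Machine learning classification is enabled per item rather than through a global setting: each tag, correspondent, document type, and storage path has a matching algorithm, and choosing Auto hands the assignment to Paperless-ngx's built-in classifier.
The classifier learns from the documents you have already filed and from your manual corrections. Over time, it automatically assigns correspondents, document types, and tags based on your patterns. Rule-based algorithms such as exact match and fuzzy match need no training data, so they supplement the classifier and give reliable results from day one.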
Consumption Templates for Automatic Processing
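Consumption templates let you define rules that automatically assign tags, a correspondent, a document type, and other metadata to incoming documents, based on criteria such as their source, filename, or path. They are created in the web UI (recent releases rename the feature to Workflows) rather than as files in the consume directory.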
Document types, correspondents, and tags must exist in Paperless-ngx before the template can reference them. Create them through the web UI first, then templates apply automatically on document ingestion.
Automated Ingestion Pipeline
Paperless-ngx watches the consume/ directory for new files. Drop a PDF in there and it gets OCR’d, classified, and archived within seconds. Set up automated ingestion sources:
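Email ingestion: Paperless can pull documents from IMAP mailboxes. Mail accounts, and the mail rules that decide which messages and attachments get consumed, are configured in the web UI.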
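Scanner integration: Many network scanners and MFPs support scan-to-folder or scan-to-email. For scan-to-folder, point the scanner at an SMB or NFS share that maps to the consume/ directory. Filesystem change notifications often do not work across network shares, so set a polling interval with PAPERLESS_CONSUMER_POLLING if new scans are not picked up. Scan-to-email avoids the share entirely and is usually the more reliable pipeline.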
Mobile app upload: Community-maintained mobile apps for Android and iOS upload documents directly through the REST API, authenticating with your Paperless-ngx account or an API token. You can also configure a share-to-Paperless shortcut on iOS.
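The API upload that such clients rely on can also be scripted. The sketch below posts one file to the `post_document` endpoint with `curl`; the `paperless_upload` function name and the default URL and token values are placeholders for illustration.

```shell
#!/bin/sh
# Sketch: upload one file through the Paperless-ngx REST API.
# PAPERLESS_URL and PAPERLESS_TOKEN are placeholders you must supply.
set -eu

PAPERLESS_URL="${PAPERLESS_URL:-https://docs.example.com}"
PAPERLESS_TOKEN="${PAPERLESS_TOKEN:-changeme}"

# $1 = path to the document to upload (hypothetical helper).
paperless_upload() {
  curl --fail --silent --show-error \
    -H "Authorization: Token ${PAPERLESS_TOKEN}" \
    -F "document=@$1" \
    "${PAPERLESS_URL%/}/api/documents/post_document/"
}

# Example: paperless_upload ./scan.pdf
```

Uploaded files go through the same consumption pipeline as files dropped into consume/, so OCR and classification apply to them as well.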
Reverse Proxy Integration with Traefik
Expose Paperless-ngx through your reverse proxy for HTTPS access. Traefik labels for the Paperless service:
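A minimal sketch of those labels, added through a Compose override file so the main file stays untouched. It assumes the service is named `paperless`, that Traefik defines an entry point called `websecure` and a certificate resolver called `letsencrypt`, and that both containers share an external Docker network called `proxy`; none of these names are fixed, so substitute your own.

```shell
#!/bin/sh
# Sketch: attach Traefik routing labels via docker-compose.override.yml.
# Assumed names: service `paperless`, entry point `websecure`,
# certificate resolver `letsencrypt`, external network `proxy`.
set -eu

STACK_DIR="${STACK_DIR:-./paperless}"
mkdir -p "$STACK_DIR"

# Quoted delimiter: the backticks in the Host() rule must stay literal.
cat > "$STACK_DIR/docker-compose.override.yml" <<'YAML'
services:
  paperless:
    networks: [default, proxy]
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=proxy"
      - "traefik.http.routers.paperless.rule=Host(`docs.example.com`)"
      - "traefik.http.routers.paperless.entrypoints=websecure"
      - "traefik.http.routers.paperless.tls.certresolver=letsencrypt"
      - "traefik.http.services.paperless.loadbalancer.server.port=8000"

networks:
  proxy:
    external: true
YAML

echo "wrote $STACK_DIR/docker-compose.override.yml"
```

Docker Compose merges docker-compose.override.yml into the main file automatically. Once Traefik handles all traffic, the published port on the `paperless` service is no longer needed.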
For nginx proxy:
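A hedged sketch of a matching nginx server block follows, again written out with a heredoc. The certificate paths, the upload size limit, and the upstream address 127.0.0.1:8000 are assumptions; the WebSocket headers are included because the web UI uses a WebSocket connection for live status updates.

```shell
#!/bin/sh
# Sketch: generate an nginx server block for Paperless-ngx.
# Certificate paths, the upload limit, and the upstream address are
# assumptions; copy the result into your nginx configuration directory.
set -eu

NGINX_CONF="${NGINX_CONF:-./paperless.conf}"

# Quoted delimiter: nginx variables such as $host must stay literal.
cat > "$NGINX_CONF" <<'NGINX'
server {
    listen 443 ssl;
    server_name docs.example.com;

    ssl_certificate     /etc/ssl/certs/docs.example.com.pem;
    ssl_certificate_key /etc/ssl/private/docs.example.com.key;

    # Allow large scans to be uploaded through the web UI.
    client_max_body_size 100M;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        # WebSocket upgrade, used for live status updates.
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
NGINX

echo "wrote $NGINX_CONF"
```

Validate the result with `nginx -t` before reloading the server.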
Set PAPERLESS_URL to match your public or internal URL (e.g., https://docs.example.com). Without this, Paperless generates incorrect links in emails and API responses.
Backup Strategy
Two components must be backed up: the media directory containing archived documents and the PostgreSQL database for metadata and tags.
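A minimal backup sketch covering both components is shown below. It assumes the database service is called `db`, that the database and its user are both named `paperless`, and that media lives in a `media` directory next to the Compose file; the `backup_paperless` function and the file naming scheme are illustrative.

```shell
#!/bin/sh
# Sketch of a daily backup: one database dump plus one media archive.
# Assumed names: service `db`, database and user `paperless`, ./media.
set -eu

STACK_DIR="${STACK_DIR:-./paperless}"
BACKUP_DIR="${BACKUP_DIR:-./backups}"

backup_paperless() {
  stamp="$(date +%Y-%m-%d)"
  mkdir -p "$BACKUP_DIR"
  dump="$BACKUP_DIR/paperless-db-$stamp.sql"

  # Logical dump of the database; -T disables the TTY so the output
  # can be redirected. Dumping to a file first lets a pg_dump failure
  # stop the script instead of silently producing an empty archive.
  (cd "$STACK_DIR" && docker compose exec -T db pg_dump -U paperless paperless) > "$dump"
  gzip -f "$dump"

  # Archive the media directory with the stored documents.
  tar -czf "$BACKUP_DIR/paperless-media-$stamp.tar.gz" -C "$STACK_DIR" media
}

# Example: backup_paperless
```

Restoring is the reverse: unpack the media archive and feed the decompressed dump to `psql` inside the `db` container.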
Run this daily via cron or a systemd timer. For off-site protection, pipe the backups to restic or borg against a remote repository.
Conclusion
Paperless-ngx transforms document management in the homelab from a mess of random PDFs into a structured, searchable, and automated archive. Docker Compose makes the full stack — OCR, ML classification, document conversion, and metadata extraction — deployable with a single docker compose up -d. Start with the consumption directory workflow, add email ingestion, and let the machine learning model learn your document patterns over time.
For more details, visit the Paperless-ngx documentation and the GitHub repository.