Paperless-NGX
Overview
What is Paperless-NGX? (Live-Demo)
Introduction
Paperless-NGX is an open-source document management system designed to help users automate the process of digitizing, organizing, and archiving documents. By offering user-friendly interfaces, robust indexing/search capabilities, and smooth integration with popular container solutions, Paperless-NGX is a flexible choice for both personal and small-to-medium business use.
Architecture
Paperless-NGX typically runs as a set of containerized services that work together to ingest documents, perform Optical Character Recognition (OCR), and store data efficiently.
- Core Service: Manages the document database, user authentication, and metadata storage.
- OCR Backend: Processes and extracts text from images/PDFs for powerful search capabilities.
- Web Interface: Provides a clean dashboard to upload, review, categorize, and search documents.
Features
- ✨ Automated Document Ingestion: Upload files via drag-and-drop, email import, or direct folder scanning.
- 🔍 OCR & Search: Full-text search across documents thanks to integrated OCR in multiple languages.
- 🔖 Tagging & Labeling: Organize documents using tags, labels, or other metadata fields.
- 📂 PDF & Image Processing: Convert, merge, and enhance files for easy retrieval.
- 📊 Statistics & Analytics: Review usage patterns or track import history for better organization.
- 🚀 Docker Support: Easy deployment via Docker containers with minimal configuration.
- 👥 User Management: Granular permissions and multi-user access for teams and organizations.
- 🔐 Secure & Private: Access control, password policies, and TLS support for data protection.
Getting Started
Quick Paperless Stack Setup Guide
Paperless NGX is an open-source document management solution that allows you to digitize and efficiently manage your paperwork. In this guide, we will deploy Paperless NGX on a Docker Swarm cluster using a shared storage volume provided by GlusterFS (or a similar NAS-mounted setup) to ensure all nodes share the same data. If you intend to expose Paperless NGX to the internet, you can use Traefik as a reverse proxy for SSL termination.
Prerequisites
- Docker Swarm
- GlusterFS (or a similar NAS mount) so that all nodes have access to the same directories
- (Optional) Traefik, if you plan to make Paperless NGX accessible externally (recommended for TLS/SSL)
Step 1: Set Up Directory Structure
Create the directories for Paperless NGX data, ensuring they reside on your GlusterFS (or equivalent) mount so that data is shared among all Swarm nodes. For example:
mkdir -p /mnt/glustermount/data/paperless/
mkdir -p /mnt/glustermount/data/paperless/redisdata
mkdir -p /mnt/glustermount/data/paperless/data
mkdir -p /mnt/glustermount/data/paperless/media
mkdir -p /mnt/glustermount/data/paperless/export
mkdir -p /mnt/glustermount/data/paperless/consume
mkdir -p /mnt/glustermount/data/paperless/postgresqldata
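The same layout can also be created with a small loop, which makes it easier to change the base path later. This is a minimal sketch; the function name create_paperless_dirs is hypothetical and not part of Paperless NGX or Docker.

```shell
# Hypothetical helper: create the Paperless NGX directory layout under a base path.
# $1 is the base directory, e.g. /mnt/glustermount/data/paperless
create_paperless_dirs() {
    base="$1"
    for sub in redisdata data media export consume postgresqldata; do
        # -p creates missing parent directories and succeeds if the path already exists
        mkdir -p "$base/$sub" || return 1
    done
}
```

Calling create_paperless_dirs /mnt/glustermount/data/paperless on one node is enough, since the mount is shared by all nodes.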
Step 2: Create Your Docker Compose File
Important: In all configurations and code snippets below, replace YOUR-DOMAIN.com with your actual domain wherever applicable.
Below is an example docker-compose.yml that sets up Paperless NGX alongside Redis, PostgreSQL, Gotenberg, and Apache Tika. This file is intended for Docker Swarm with a GlusterFS-backed volume. You can adapt paths and replicas to your needs.
version: "3.7"
services:
  broker:
    image: docker.io/library/redis:7
    restart: unless-stopped
    volumes:
      - /mnt/glustermount/data/paperless/redisdata:/data
    deploy:
      mode: replicated
      replicas: 1
    networks:
      - internal
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - broker
      - gotenberg
      - tika
    ports:
      - "8000:8000"
    volumes:
      - /mnt/glustermount/data/paperless/data:/usr/src/paperless/data
      - /mnt/glustermount/data/paperless/media:/usr/src/paperless/media
      - /mnt/glustermount/data/paperless/export:/usr/src/paperless/export
      - /mnt/glustermount/data/paperless/consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: "redis://broker:6379"
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: "http://gotenberg:3000"
      PAPERLESS_TIKA_ENDPOINT: "http://tika:9998"
      PAPERLESS_URL: "https://paperless.YOUR-DOMAIN.com"
      PAPERLESS_OCR_LANGUAGE: "eng"
      PAPERLESS_TIME_ZONE: "Europe/Zurich"
      PAPERLESS_ADMIN_USER: "${PAPERLESS_ADMIN_USER}"
      PAPERLESS_ADMIN_PASSWORD: "${PAPERLESS_ADMIN_PW}"
      PAPERLESS_ADMIN_MAIL: "${PAPERLESS_ADMIN_EMAIL}"
      PAPERLESS_SECRET_KEY: "${PAPERLESS_SECRET_KEY}"
      PAPERLESS_DBHOST: "db"
      PAPERLESS_DBNAME: "${PAPERLESS_POSTGRES_DB}"
      PAPERLESS_DBUSER: "${PAPERLESS_POSTGRES_USER}"
      PAPERLESS_DBPASS: "${PAPERLESS_POSTGRES_PASSWORD}"
    deploy:
      mode: replicated
      replicas: 1
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.webserver.rule=Host(`paperless.YOUR-DOMAIN.com`)"
        - "traefik.http.routers.webserver.entrypoints=websecure"
        - "traefik.http.services.webserver.loadbalancer.server.port=8000"
        - "traefik.docker.network=management_net"
    networks:
      - management_net
      - internal
  db:
    image: docker.io/library/postgres:16
    restart: unless-stopped
    volumes:
      - /mnt/glustermount/data/paperless/postgresqldata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: "${PAPERLESS_POSTGRES_DB}"
      POSTGRES_USER: "${PAPERLESS_POSTGRES_USER}"
      POSTGRES_PASSWORD: "${PAPERLESS_POSTGRES_PASSWORD}"
    deploy:
      mode: replicated
      replicas: 1
    networks:
      - internal
  gotenberg:
    image: docker.io/gotenberg/gotenberg:8.7
    restart: unless-stopped
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"
    deploy:
      mode: replicated
      replicas: 1
    networks:
      - internal
  tika:
    image: docker.io/apache/tika:latest
    restart: unless-stopped
    deploy:
      mode: replicated
      replicas: 1
    networks:
      - internal
networks:
  management_net:
    external: true
  internal:
    driver: overlay
    ipam:
      config:
        - subnet: 172.16.58.0/24
Why We Define a Custom Subnet for the internal Network
The internal network is an overlay network dedicated to internal communication between Paperless NGX services (like Redis, Gotenberg, Tika, and PostgreSQL). By assigning a specific subnet (172.16.58.0/24), you ensure:
- Isolation: Only these containers communicate on this internal overlay, reducing exposure to the outside world.
- Predictability: Having a known subnet range helps avoid IP conflicts with other networks.
Defining Environment Variables
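If you are using Portainer, you can define the environment variables (which Portainer exposes as stack.env) directly in the Portainer web UI when deploying the stack.
If you are not using Portainer, create a .env file in the same directory as your docker-compose.yml and reference it from the services that need it:
services:
  webserver:
    ...
    env_file:
      - .env
  db:
    ...
    env_file:
      - .env
Note that env_file only passes variables into the containers. The ${...} placeholders in the Compose file are substituted from the shell environment at deploy time, and docker stack deploy does not read .env automatically, so export the variables in your shell before deploying.
Then add the following variables in your .env file: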
PAPERLESS_POSTGRES_DB=
PAPERLESS_POSTGRES_USER=
PAPERLESS_POSTGRES_PASSWORD=
PAPERLESS_ADMIN_USER=
PAPERLESS_ADMIN_PW=
PAPERLESS_ADMIN_EMAIL=
PAPERLESS_SECRET_KEY=
You can also check out the official sample environment file for Paperless NGX to see additional variables you may configure.
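For the password and secret-key values, random strings are preferable to hand-picked ones. The following is a minimal sketch of one way to generate them; the function name gen_secret and the default length of 50 are assumptions for illustration, not something Paperless NGX prescribes.

```shell
# Hypothetical helper: print a random alphanumeric string.
# $1 is the desired length (defaults to 50 characters).
gen_secret() {
    len="${1:-50}"
    # Keep only letters and digits from the kernel's random source,
    # then stop after $len characters.
    LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$len"
}

# Example: print lines that can be pasted into the .env file.
printf 'PAPERLESS_POSTGRES_PASSWORD=%s\n' "$(gen_secret 32)"
printf 'PAPERLESS_SECRET_KEY=%s\n' "$(gen_secret 50)"
```

Restricting the output to letters and digits avoids characters that would need quoting in a .env file or a shell.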
Environment Variables Used in the Compose File
Below is a brief explanation of some key environment variables in the docker-compose.yml file. For a full list of available variables and their usage, refer to the Paperless NGX Configuration Documentation.
- PAPERLESS_URL: The base URL where Paperless NGX is accessible (e.g., https://paperless.your-domain.com).
- PAPERLESS_REDIS: The Redis connection URL (e.g., redis://broker:6379) used for caching and background tasks.
- PAPERLESS_TIKA_ENABLED: Enables Apache Tika integration for advanced document parsing.
- PAPERLESS_TIKA_GOTENBERG_ENDPOINT / PAPERLESS_TIKA_ENDPOINT: Endpoints for the Gotenberg and Tika services, respectively, used to convert and parse documents.
- PAPERLESS_TIME_ZONE: Sets the time zone used by Paperless NGX (e.g., Europe/Zurich).
- PAPERLESS_ADMIN_USER / PAPERLESS_ADMIN_PASSWORD / PAPERLESS_ADMIN_MAIL: Credentials for the default Paperless NGX admin account.
- PAPERLESS_SECRET_KEY: A secret key used by the Django framework within Paperless NGX for cryptographic functions.
- PAPERLESS_DBHOST / PAPERLESS_DBNAME / PAPERLESS_DBUSER / PAPERLESS_DBPASS: Connection details for PostgreSQL. These point to the db service and use the credentials defined in the environment variables.
Step 3: Deploy the Stack
docker stack deploy -c docker-compose.yml paperless
Alternatively, you can deploy the stack via Portainer or any other Docker Swarm management tool.
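When deploying from the command line, the ${...} placeholders in the Compose file are filled from the shell environment, so the variables from .env need to be exported first. The following is a minimal sketch; the function name load_env_file is hypothetical, and it assumes the file contains only simple KEY=value lines that are safe for the shell to source.

```shell
# Hypothetical helper: export every variable defined in an env file
# into the current shell so that ${VAR} substitution can see it.
load_env_file() {
    [ -f "$1" ] || { echo "missing env file: $1" >&2; return 1; }
    set -a      # auto-export every variable assigned from here on
    . "$1"      # source the file (use a path containing a slash, e.g. ./.env)
    set +a      # stop auto-exporting
}

# Usage (run from the directory containing docker-compose.yml):
#   load_env_file ./.env && docker stack deploy -c docker-compose.yml paperless
```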
Step 4: Accessing Paperless NGX
- Once deployed, Paperless NGX will be available on the published port (8000 in the example above) on any Swarm node, since Swarm's routing mesh publishes the port cluster-wide.
- If you configured Traefik and DNS correctly, you should be able to access your Paperless NGX instance at https://paperless.your-domain.com.
- Log in with the admin credentials you set in the environment variables.
Additional Notes
- Multi-Node Persistence: Because GlusterFS (or an equivalent) is used, your Paperless NGX data and database files are stored on shared volumes accessible by all nodes in the Swarm.
- Security & SSL: If you’re exposing Paperless NGX externally, ensure you have valid SSL certificates set up (e.g., via Traefik with Let’s Encrypt).
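- Scaling Services: You can increase the replicas value for stateless services such as Gotenberg and Tika if you want to distribute the load across multiple nodes. Keep db and broker at a single replica, because each writes to one data directory.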
Conclusion
By deploying Paperless NGX on a Docker Swarm, you gain the benefits of high availability and scalability, especially when backed by a distributed storage solution like GlusterFS. Whether you use Portainer for an easier management interface or rely on .env files for more traditional Docker workflows, the key is consistent environment configuration and ensuring all nodes share the necessary data volumes. With this setup, your document management solution is primed for production use: secure, resilient, and easy to extend.
Configuration
Tags, Document Types, Correspondent & more
Paperless-ngx is a wonderful tool to scan, classify, and organize your documents. In this article, we’ll discuss three important organizational elements: Document Types, Correspondent, and Tags. Along the way, we’ll ask guiding questions to help you figure out how best to categorize any piece of paperwork you might want to store in Paperless-ngx.
Document Types
Document Types refer to the broad category of the document in question. Is it a letter, a receipt, or a bill? You don’t need to overthink this category; just assign the document to a generalized type. For example, you might have a Receipts doctype for all the receipts you scan in, or even confirmations you receive after paying certain bills.
- Are you dealing with a financial record, such as a bill or a receipt? This helps you quickly decide if it goes under “Receipts,” “Invoices,” or a more generic “Bills” type.
- Does the document represent correspondence or general information? If so, you might use a “Letters” or “General Correspondence” document type.
- Do you plan to reuse this broad category for similar documents in the future? If yes, naming it broadly (e.g., “Medical Documents” or “Insurance Papers”) could be helpful.
Correspondent
The Correspondent is the person or organization associated with the document. A credit card bill from Capital One would have “Capital One” as the correspondent. A tax notice might have the IRS as the correspondent. Defining your correspondents broadly is key so you don’t complicate future searches with overly specific labels.
- Which entity sent or provided this document? Typically, it’s the name you see on the letterhead or the company from which you received the bill or notice.
- Is it important to narrow down the specific department or branch? Most of the time, you can stick to the main organization or sender name, unless you have a strong need to differentiate them.
- Will you have many documents from the same company or individual? If yes, consistent naming (e.g., always “IRS” rather than a mix of “IRS” and “Internal Revenue Service”) will help you find them more easily later.
Tags
Tags let you categorize documents by answering basic questions like who, what, and when the document references. They can also be used for special categories or important groups of documents.
- Who is this document referring to? You might have tags for yourself, your spouse, your children, or pets: anything that quickly identifies whom the document is about.
- What is it referring to? Is it related to a car loan, home maintenance, or health records? Use a separate color or naming convention to mark these.
- When is this information relevant? You might create tags for each year (e.g., “2022,” “2023”) or even by month, if needed, so that you can later filter by time periods.
- Does it belong to a special or critical category? If it's crucial for annual taxes or contains personal legal information, you can tag it “Taxes” or “Important” so you can quickly filter for it.
OCR Considerations
Optical Character Recognition (OCR) is undoubtedly helpful for searching within the text of scanned documents. However, it shouldn’t be your only search strategy. Combining OCR with at least 1–2 well-chosen metadata fields (like Document Type or Correspondent) plus relevant Tags can make finding a specific document much easier—especially when you have years and years of paperwork.
Garbage In, Garbage Out
Like with any data system, the quality of your searches in Paperless-ngx is only as good as the data you choose to include. Spend a little extra time specifying at least one metadata field and adding a couple of relevant tags. This way, when you need to find an important document, you can rely on your carefully curated system to do the work for you.
In summary, Document Types, Correspondent, and Tags form a powerful trifecta in Paperless-ngx to keep your records neat and easily searchable. Leverage OCR, but don’t depend on it alone. And remember: the small effort to add good data up front will pay big dividends when you need to retrieve those documents later.
SMTP Setup
Configuring SMTP settings for the Paperless-ngx backend allows the system to send emails directly, most commonly for password resets. The environment variables closely mirror the corresponding Django email settings, which keeps configuration straightforward.
Environment Variables
- PAPERLESS_EMAIL_HOST (default: localhost)
- PAPERLESS_EMAIL_PORT (default: 25)
- PAPERLESS_EMAIL_HOST_USER (default: '')
- PAPERLESS_EMAIL_FROM (default: same as PAPERLESS_EMAIL_HOST_USER if not set)
- PAPERLESS_EMAIL_HOST_PASSWORD (default: '')
- PAPERLESS_EMAIL_USE_TLS (default: false)
- PAPERLESS_EMAIL_USE_SSL (default: false)
To configure these in a Docker environment, add them to your docker-compose.yml under the environment section of the paperless-ngx service. For example:
services:
  paperless-ngx:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    environment:
      - PAPERLESS_EMAIL_HOST=smtp.yourprovider.com
      - PAPERLESS_EMAIL_PORT=587
      - PAPERLESS_EMAIL_HOST_USER=youremail@provider.com
      - PAPERLESS_EMAIL_HOST_PASSWORD=supersecretpassword
      - PAPERLESS_EMAIL_FROM=youremail@provider.com
      - PAPERLESS_EMAIL_USE_TLS=true
Once set, Paperless-ngx will use these SMTP settings to send necessary notifications, such as password reset emails. Adjust values as needed based on your email provider’s requirements.
It’s generally best practice to use TLS or SSL for secure email communication. Make sure you enable the correct protocol flag (PAPERLESS_EMAIL_USE_TLS or PAPERLESS_EMAIL_USE_SSL) for your provider.
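Because these variables map onto Django's email settings, and Django treats its TLS and SSL options as mutually exclusive, it can help to check the two flags before deploying. The following is a minimal sketch of such a pre-flight check; the function names are hypothetical, and the set of values treated as "enabled" (true, 1, yes) is an assumption rather than the exact list Paperless-ngx accepts.

```shell
# Hypothetical helper: succeed if the given value looks like an enabled flag.
# Assumption: "true", "1", and "yes" (any letter case) count as enabled.
is_enabled() {
    case "$(printf '%s' "$1" | tr 'A-Z' 'a-z')" in
        true|1|yes) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical pre-flight check: fail if both TLS and SSL are enabled.
check_email_flags() {
    if is_enabled "${PAPERLESS_EMAIL_USE_TLS:-false}" &&
       is_enabled "${PAPERLESS_EMAIL_USE_SSL:-false}"; then
        echo "Set only one of PAPERLESS_EMAIL_USE_TLS / PAPERLESS_EMAIL_USE_SSL" >&2
        return 1
    fi
    return 0
}
```

As a rule of thumb, port 587 usually pairs with PAPERLESS_EMAIL_USE_TLS (STARTTLS) and port 465 with PAPERLESS_EMAIL_USE_SSL, but your provider's documentation is the authority here.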