A Simplified Data Stack: PostgreSQL, DuckDB, MongoDB, and Minio
After a decade of exploring data at home and at work, I've settled on a local data stack that handles structured, semi-structured, and unstructured data quite well. In this post we'll focus on deploying PostgreSQL, MongoDB, Minio, and other services in a Linux environment using Docker. I have these services deployed on a local Ubuntu server. In a future post I'll share more about the machine these services run on, but for now let's focus on the software stack.
Everything mentioned is deployed via Docker and Docker Compose YAML files. All compose files and Python scripts mentioned in this post can be found at my local-de-stack GitHub repo. At the repo you'll also find some handy commands and links for getting Docker and Docker Compose set up.
Structured Data
PostgreSQL is my go-to SQL database. It's fast, free, and easy to use. I reach for PostgreSQL on larger projects where the data needs structure and I know what's required. While I use PostgreSQL mainly for structured data, I've also used it to store images and binary data for a few projects, and I've started using it with pgvector to store vector embeddings for RAG applications. Use postgres-compose.yaml to deploy PostgreSQL 16 locally.
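Here's a minimal sketch of what that compose file might look like; the actual file in the repo may differ, and the credentials (read from a .env file), port, and volume name are placeholders:

```yaml
services:
  postgres:
    # Official PostgreSQL 16 image; swap in pgvector/pgvector:pg16 if you want the pgvector extension baked in
    image: postgres:16
    container_name: postgres
    restart: unless-stopped
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PW}
      POSTGRES_DB: ${POSTGRES_DB}
    ports:
      - "5432:5432"
    volumes:
      # Persist data outside the container so it survives restarts
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:
```

Bring it up with `docker compose -f postgres-compose.yaml up -d` and connect on port 5432.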
Another lightweight option for structured data is DuckDB. It's an in-process relational database that punches WAY above its weight. It excels at fast analytical workloads on local datasets. Many times I'll start a project with DuckDB and then transition to Postgres as needs and users grow. Check out the DuckDB website and its extensive documentation to learn more. Here's duckdb_starter.py showing DuckDB in action.
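A hypothetical sketch of that kind of script (the repo's version may differ; the table, file name, and query below are made up, and the `.df()` call assumes pandas is installed):

```python
# duckdb_starter.py -- a small DuckDB demo
import duckdb

# Connect to a file-backed database (created if it doesn't exist);
# call duckdb.connect() with no argument for an in-memory database.
con = duckdb.connect("demo.duckdb")

# Create a table and load a few rows
con.execute("""
    CREATE TABLE IF NOT EXISTS trips (
        trip_id INTEGER,
        city VARCHAR,
        distance_km DOUBLE
    )
""")
con.execute("INSERT INTO trips VALUES (1, 'Austin', 5.2), (2, 'Denver', 12.7), (3, 'Austin', 3.1)")

# Run an analytical query and pull the result back as a pandas DataFrame
df = con.execute("""
    SELECT city, COUNT(*) AS trips, ROUND(AVG(distance_km), 1) AS avg_km
    FROM trips
    GROUP BY city
    ORDER BY trips DESC
""").df()
print(df)

# DuckDB can also query files directly, e.g. Parquet or CSV:
# con.execute("SELECT * FROM 'data/*.parquet' LIMIT 5").df()

con.close()
```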
Semi-Structured Data
When handling semi-structured data like JSON, MongoDB is excellent. The real draw of MongoDB is the flexibility it allows when working with data on projects; being able to iterate quickly without a rigid schema is very useful. In the compose file below I've also included mongo-express, a web-based admin interface for MongoDB. Use mongodb-express-compose.yaml to deploy MongoDB and mongo-express locally.
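A minimal sketch of what that file might contain; image tags, ports, volume, and credential variables are placeholders, and the repo version may differ:

```yaml
services:
  mongodb:
    image: mongo:7
    container_name: mongodb
    restart: unless-stopped
    environment:
      MONGO_INITDB_ROOT_USERNAME: ${MONGO_USER}
      MONGO_INITDB_ROOT_PASSWORD: ${MONGO_PW}
    ports:
      - "27017:27017"
    volumes:
      - mongo_data:/data/db

  mongo-express:
    image: mongo-express:latest
    container_name: mongo-express
    restart: unless-stopped
    depends_on:
      - mongodb
    environment:
      # Point the admin UI at the mongodb service defined above
      ME_CONFIG_MONGODB_URL: mongodb://${MONGO_USER}:${MONGO_PW}@mongodb:27017/
    ports:
      - "8081:8081"

volumes:
  mongo_data:
```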
Unstructured Data
For everything else, I use Minio object storage. Minio is a self-hosted, S3-compatible object store that makes storing and retrieving files simple and scalable. Its API is straightforward, making it easy to work with files of any size. Use minio_starter.py to get started.
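A hypothetical sketch using the minio Python client (`pip install minio`); the endpoint, API port, bucket name, and file names are placeholders, and the credentials are read from the same environment variables used for the container:

```python
# minio_starter.py -- upload, fetch, and list files with the MinIO Python client
import os
from minio import Minio

# Connect to the local Minio instance (plain HTTP on the LAN, so secure=False).
# 9000 is the default S3 API port; the web console runs on a separate port.
client = Minio(
    "192.168.8.115:9000",
    access_key=os.environ["MINIO_USER"],
    secret_key=os.environ["MINIO_PW"],
    secure=False,
)

# Create a bucket if it doesn't already exist
bucket = "demo-bucket"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Upload a local file as an object
client.fput_object(bucket, "hello.txt", "hello.txt")

# Download it back to a new path
client.fget_object(bucket, "hello.txt", "hello_copy.txt")

# List the objects in the bucket
for obj in client.list_objects(bucket):
    print(obj.object_name, obj.size)
```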
Files are organized into "buckets" and objects are immutable, meaning that once a file is stored in Minio its contents can't be modified in place, which helps ensure data integrity and auditability. One pattern I find useful is combining Minio with PostgreSQL: file metadata lives in the database and the file itself lives in Minio, keeping metadata and files in sync while leveraging the strengths of both storage solutions. Deploy Minio locally using minio-compose.yaml.
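A minimal sketch of such a file; the repo version may differ. The console port matches the one referenced later in this post, and the credentials come from the $MINIO_USER and $MINIO_PW environment variables:

```yaml
services:
  minio:
    image: minio/minio:latest
    container_name: minio
    restart: unless-stopped
    # Serve objects from /data and expose the web console on 9091
    command: server /data --console-address ":9091"
    environment:
      MINIO_ROOT_USER: ${MINIO_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_PW}
    ports:
      - "9000:9000"   # S3 API
      - "9091:9091"   # web console
    volumes:
      - minio_data:/data

volumes:
  minio_data:
```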
Data Access
To access SQL data I use DBeaver. It connects to all popular SQL databases, which helps keep my database admin applications to a minimum. There's even a CloudBeaver variant that you can host via Docker and access as a web app. DBeaver doesn't just let you execute SQL against a database; it also helps with exporting/importing data and generating DDL, and it has a very customizable UI. Best of all, there are installers for Windows, Mac, and Linux! Check out the free Community Edition here, or use cloudbeaver-compose.yaml to deploy a web app version.
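Something along these lines should work, though the repo's file may differ; the volume name is a placeholder:

```yaml
services:
  cloudbeaver:
    image: dbeaver/cloudbeaver:latest
    container_name: cloudbeaver
    restart: unless-stopped
    ports:
      # CloudBeaver's web UI listens on 8978 by default
      - "8978:8978"
    volumes:
      # Persist connections and settings across container restarts
      - cloudbeaver_data:/opt/cloudbeaver/workspace

volumes:
  cloudbeaver_data:
```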
To access MongoDB I use mongo-express. It has a really simple and straightforward UI for managing your documents. Use mongo-express-compose.yaml below if you already have a MongoDB instance running; otherwise, use mongodb-express-compose.yaml to deploy both MongoDB and Mongo Express together.
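A sketch of what the standalone file might look like, assuming MongoDB is already reachable on your network (the host IP and credential variables below are placeholders):

```yaml
services:
  mongo-express:
    image: mongo-express:latest
    container_name: mongo-express
    restart: unless-stopped
    environment:
      # Point at the already-running MongoDB instance
      ME_CONFIG_MONGODB_URL: mongodb://${MONGO_USER}:${MONGO_PW}@192.168.8.115:27017/
    ports:
      - "8081:8081"
```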
To access Minio, simply use your local IP and the console port from minio-compose.yaml. For my local network, that's http://192.168.8.115:9091. Follow this URL and log in with the $MINIO_USER and $MINIO_PW values you used when setting up the Docker container.
Conclusion
This local data engineering stack has served me well over the years. By deploying these services locally with Docker, I've achieved a high degree of flexibility, scalability, and control over my data stack. At any time I can stop, start, or migrate these services to another computer or to the cloud. I'm constantly prototyping new ideas, and knowing that I have local, private, and fast data storage makes my projects move that much faster. If you're looking to improve your data management workflows, I encourage you to explore the solutions and tools outlined in this post.
KEN