October 11, 2024

A Simplified Data Stack: PostgreSQL, DuckDB, MongoDB, and Minio

After a decade of exploring data at home and at work, I've settled on a local data stack that handles structured, semi-structured, and unstructured data quite well. In this post we'll focus on deploying PostgreSQL, MongoDB, Minio, and other services in a Linux environment using Docker. I have these services deployed on a local Ubuntu server. In a future post I'll share more about the machine these services run on, but for now let's focus on the software stack.

Everything mentioned here is deployed via Docker and Docker Compose YAML files. All of the compose files and Python scripts mentioned in this post can be found in my local-de-stack GitHub repo, along with some handy commands and links for getting Docker and Docker Compose set up.

Structured Data

PostgreSQL is my go-to SQL database. It's fast, free, and easy to use. I use PostgreSQL for larger projects where the data needs structure and I know what's required. While I use it mainly for structured data, I've also used it to store images and binary data for a few projects, and I've even started using it with pgvector to store vector embeddings for RAG applications. Use postgres-compose.yaml to deploy PostgreSQL 16 locally:

services:
  postgres:
    image: postgres:16
    container_name: postgres
    ports:
      - "5432:5432"
    restart: always
    environment:
      POSTGRES_PASSWORD: ${PG_PASSWORD}
      POSTGRES_USER: ${PG_USER}
      PGDATA: /var/lib/postgresql/data/pgdata
    volumes:
      - ${PG_LOCAL_DIR}:/var/lib/postgresql/data

postgres-compose.yaml
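
Once the container is up, connecting from Python takes just a few lines. Here's a minimal sketch using psycopg2 (assuming pip install psycopg2-binary, and that the same credentials from the compose file are exported as environment variables):

import os

import psycopg2  # pip install psycopg2-binary

# Reuse the credentials from the compose file's environment variables
conn = psycopg2.connect(
    host='localhost',
    port=5432,
    user=os.environ['PG_USER'],
    password=os.environ['PG_PASSWORD'],
    dbname='postgres',  # the default database
)

with conn, conn.cursor() as cur:
    cur.execute('SELECT version();')
    print(cur.fetchone()[0])

conn.close()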

Another lightweight option for structured data is DuckDB. It's an in-process relational database that punches WAY above its weight. It excels at fast analytical workloads on local datasets. Many times I'll start a project with DuckDB and then transition to PostgreSQL as needs and users grow. Check out the DuckDB website and extensive documentation to learn more. Here's duckdb_starter.py showing DuckDB in action:

import duckdb

con = duckdb.connect(':memory:')  # or duckdb.connect('my_database.duckdb') to persist to disk

# Create
con.execute('''
    CREATE TABLE people (
        id INTEGER,
        name VARCHAR,
        age INTEGER
    );
''')

# Insert
con.execute("INSERT INTO people VALUES (1, 'Alice', 29), (2, 'Bob', 35), (3, 'Charlie', 40);")

# Fetch
result = con.execute("SELECT * FROM people").fetchall()

for row in result:
    print(row)

duckdb_starter.py
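
Where DuckDB really shines is querying files in place. As a quick sketch (events.csv and its user_id column are hypothetical, just for illustration), you can run SQL directly over CSV or Parquet files with no load step:

import duckdb

# Query a CSV file directly -- DuckDB scans it in place
# ('events.csv' is a hypothetical local file)
rows = duckdb.sql("""
    SELECT user_id, COUNT(*) AS event_count
    FROM 'events.csv'
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
""").fetchall()

for user_id, event_count in rows:
    print(user_id, event_count)

Swap 'events.csv' for 'events.parquet' and the same query works unchanged.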

Semi-Structured Data

When handling semi-structured data like JSON, MongoDB is excellent. The real draw of MongoDB is the flexibility it allows when working with data on projects: being able to iterate quickly without a rigid schema is very useful. In the compose file below I've also included mongo-express, a web-based admin interface for MongoDB. Use mongodb-express-compose.yaml to deploy MongoDB and mongo-express locally:

services:
  mongodb:
    image: mongo:latest
    container_name: mongodb
    ports:
      - "27017:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: ${MONGODB_USER}
      MONGO_INITDB_ROOT_PASSWORD: ${MONGODB_PW}
    volumes:
      - ${MONGODB_DIR}:/data/db

  mongo-express:
    image: mongo-express
    ports:
      - "8081:8081"
    environment:
      ME_CONFIG_MONGODB_URL: mongodb://${MONGODB_USER}:${MONGODB_PW}@mongodb:27017
      ME_CONFIG_BASICAUTH_USERNAME: admin
      ME_CONFIG_BASICAUTH_PASSWORD: admin
      ME_CONFIG_MONGODB_ENABLE_ADMIN: "false"
    depends_on:
      - mongodb

mongodb-express-compose.yaml
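
Working with MongoDB from Python is just as easy via pymongo. Here's a minimal sketch, assuming pip install pymongo and the root credentials from the compose file above:

import os

from pymongo import MongoClient  # pip install pymongo

# Connect using the root credentials from the compose file
user, pw = os.environ['MONGODB_USER'], os.environ['MONGODB_PW']
client = MongoClient(f'mongodb://{user}:{pw}@localhost:27017')

# Databases and collections are created lazily on first write
people = client['demo']['people']

# Insert a document -- no schema required
people.insert_one({'name': 'Alice', 'age': 29, 'tags': ['engineer', 'cyclist']})

# Query it back
for doc in people.find({'age': {'$gte': 18}}):
    print(doc)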

Unstructured Data

For everything else, I use Minio object storage. Minio is a self-hosted, S3-compatible object storage solution that allows for simple and scalable storage and retrieval of files. Minio's API is incredibly simple, making it easy to work with files of any size. Use minio_starter.py to get started:

from minio import Minio
import io

# Minio connection details
MINIO_ENDPOINT = 'your-minio-endpoint'
MINIO_ACCESS_KEY = 'your-minio-access-key'
MINIO_SECRET_KEY = 'your-minio-secret-key'

# Connect to Minio
minio = Minio(MINIO_ENDPOINT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, secure=False)

# Create a bucket
bucket_name = 'your-bucket-name'
if not minio.bucket_exists(bucket_name):
    minio.make_bucket(bucket_name)

# Insert a file
filename = 'example.txt'
file_content = b'Hello World!'
minio.put_object(bucket_name, filename, io.BytesIO(file_content), len(file_content))

# Retrieve the file
file = minio.get_object(bucket_name, filename)
print(file.read().decode('utf-8'))
file.close()
file.release_conn()  # return the connection to the pool

minio_starter.py

Files are organized into "buckets", and objects are immutable in the sense that they can't be modified in place: updating a file means replacing the whole object, which helps with data integrity and auditability. One useful pattern for me is combining Minio with PostgreSQL: store the file's metadata in the database and the file itself in Minio, keeping the two in sync while leveraging the strengths of both storage solutions. Deploy Minio locally using minio-compose.yaml:

services:
  minio:
    image: quay.io/minio/minio
    container_name: minio
    restart: always
    ports:
      - "9000:9000"  # Service Port
      - "9091:9091"  # UI Port
    environment:
      MINIO_ROOT_USER: ${MINIO_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_PW}
    volumes:
      - ${MINIO_DIR}:/data
    command: server /data --console-address ":9091"  # serve the console on the mapped UI port

minio-compose.yaml
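
To make the metadata pattern above concrete, here's a hedged sketch that stores a file in Minio and records its metadata in PostgreSQL (the documents bucket and the files table are hypothetical, and the connection details assume the compose files in this post):

import io
import os

import psycopg2
from minio import Minio

# Connections reuse the credentials from the compose files
minio = Minio('localhost:9000', os.environ['MINIO_USER'], os.environ['MINIO_PW'], secure=False)
pg = psycopg2.connect(host='localhost', user=os.environ['PG_USER'],
                      password=os.environ['PG_PASSWORD'], dbname='postgres')

bucket, filename = 'documents', 'report.txt'  # hypothetical bucket and object
content = b'Quarterly report...'

if not minio.bucket_exists(bucket):
    minio.make_bucket(bucket)

# Store the file itself in Minio...
minio.put_object(bucket, filename, io.BytesIO(content), len(content))

# ...and its metadata in PostgreSQL (hypothetical files table)
with pg, pg.cursor() as cur:
    cur.execute('''
        CREATE TABLE IF NOT EXISTS files (
            bucket TEXT, object_name TEXT, size_bytes INTEGER,
            uploaded_at TIMESTAMPTZ DEFAULT now()
        );
    ''')
    cur.execute('INSERT INTO files (bucket, object_name, size_bytes) VALUES (%s, %s, %s)',
                (bucket, filename, len(content)))

pg.close()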

Data Access

To access SQL data I use DBeaver. It connects to all popular SQL databases, helping to keep my database admin applications to a minimum. There's even a CloudBeaver variant that you can host via Docker and access as a web app. DBeaver doesn't just let you execute SQL against a database; it also helps with exporting/importing data and generating DDL, and it has a very customizable UI. Best of all, there are installers for Windows, Mac, and Linux! Check out the free Community Edition, or use cloudbeaver-compose.yaml to deploy the web app version:

services:
  cloudbeaver:
    image: dbeaver/cloudbeaver:latest
    container_name: cloudbeaver
    ports:
      - "8978:8978"
    volumes:
      - ${CLOUDBEAVER_DIR}:/opt/cloudbeaver/workspace

cloudbeaver-compose.yaml

To access MongoDB I use mongo-express. It has a really simple and straightforward UI for managing your documents. Use mongo-express-compose.yaml below if you already have a MongoDB instance running; otherwise use mongodb-express-compose.yaml above to deploy both MongoDB and Mongo Express together.

services:
  mongo-express:
    image: mongo-express
    ports:
      - "8081:8081"
    environment:
      ME_CONFIG_MONGODB_URL: mongodb://${MONGODB_USER}:${MONGODB_PW}@${MONGODB_HOST}:27017
      ME_CONFIG_BASICAUTH_USERNAME: admin
      ME_CONFIG_BASICAUTH_PASSWORD: admin
      ME_CONFIG_MONGODB_ENABLE_ADMIN: "false"

mongo-express-compose.yaml

To access Minio, simply use your local IP and the UI port from minio-compose.yaml. For my local network, that's http://192.168.8.115:9091. Follow this URL and log in with the $MINIO_USER and $MINIO_PW you used when setting up the Docker container.

Conclusion

This local data engineering stack has served me well over the years. By deploying these services locally with Docker, I've achieved a high degree of flexibility, scalability, and control over my data stack. At any time I can stop, start, or migrate these services to another computer or to the cloud. I'm constantly prototyping new ideas, and knowing that I have local, private, and fast data storage makes my projects move that much faster. If you're looking to improve your data management workflows, I encourage you to explore the solutions and tools outlined in this post.

KEN
