
Flask Gunicorn Performance Tuning Guide

If you're trying to improve Flask performance in production or stabilize Gunicorn under load, this guide shows you how to tune Gunicorn step-by-step. The goal is to set safe defaults, match worker settings to your app profile, and validate performance with logs and runtime checks.

Quick Setup

Use this as a safe baseline for a small to medium Flask app behind Nginx:

```bash
gunicorn wsgi:app \
  --bind 127.0.0.1:8000 \
  --workers 3 \
  --worker-class gthread \
  --threads 4 \
  --timeout 30 \
  --graceful-timeout 30 \
  --keep-alive 2 \
  --max-requests 1000 \
  --max-requests-jitter 100 \
  --access-logfile - \
  --error-logfile -
```

Example systemd ExecStart:

```ini
ExecStart=/path/venv/bin/gunicorn wsgi:app \
  --bind unix:/run/gunicorn.sock \
  --workers 3 \
  --worker-class gthread \
  --threads 4 \
  --timeout 30 \
  --graceful-timeout 30 \
  --keep-alive 2 \
  --max-requests 1000 \
  --max-requests-jitter 100
```

Use this baseline first, then increase workers gradually. Prefer gthread for apps that wait on database or external I/O. For CPU-heavy workloads, test sync workers first.

What’s Happening

Gunicorn performance depends on worker model, worker count, threads, timeouts, and process recycling. Too little concurrency causes request queueing and higher latency. Too much concurrency causes CPU contention, memory pressure, and unstable behavior. Reliable tuning is a balance between throughput, latency, and resource usage.

Step-by-Step Guide

  1. Measure the server before changing settings.
    Check CPU cores, memory, and current load:
    ```bash
    nproc
    free -m
    uptime
    ```
  2. Identify whether the app is CPU-bound or I/O-bound.
    • Use sync if requests spend most of their time executing Python code or heavy computation.
    • Use gthread if requests often wait on PostgreSQL, external APIs, storage, or other I/O.
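A rough way to check is to compare wall-clock time with CPU time for a representative handler. The `classify_workload` helper and its 50% threshold below are illustrative assumptions for demonstration, not a Flask or Gunicorn API:

```python
import time

def classify_workload(handler):
    """Run a representative handler and compare wall time to CPU time.

    A high CPU share suggests sync workers; a low CPU share means the
    handler mostly waits, where gthread usually helps.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    handler()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return "cpu-bound" if cpu / wall > 0.5 else "io-bound"

# Simulated handlers: a busy loop vs. a blocking wait.
busy = lambda: sum(i * i for i in range(2_000_000))
waiting = lambda: time.sleep(0.2)
```

Run this against real handlers in a development shell, not in production; profiling real traffic gives the final answer.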
  3. Start with a conservative worker count.
    For a small VPS, begin with 2 to 4 workers. For a larger host, test this baseline formula:
    ```text
    workers = (2 x CPU cores) + 1
    ```

    This is only a starting point. Final values must be validated under load.
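Because the Gunicorn config file is plain Python, the formula can be computed at startup instead of hard-coded. A minimal `gunicorn.conf.py` fragment (treat the result as a starting point, as above):

```python
# gunicorn.conf.py fragment: derive the baseline worker count at boot.
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1
```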
  4. Add threads only for I/O-bound apps.
    Example baseline:
    ```bash
    gunicorn wsgi:app \
      --worker-class gthread \
      --workers 3 \
      --threads 4
    ```

    Keep thread counts modest, usually 2 to 8.
  5. Set conservative timeouts.
    Start with:
    ```bash
    --timeout 30 --graceful-timeout 30 --keep-alive 2
    ```

    Do not raise timeouts to hide slow application code; increase them only when a request is legitimately slow and its work cannot be moved out of the request path.
  6. Enable worker recycling.
    This helps reduce long-running memory growth and stale worker processes:
    ```bash
    --max-requests 1000 --max-requests-jitter 100
    ```
  7. Bind Gunicorn privately and let Nginx handle public traffic.
    Use either a localhost port or Unix socket:
    ```bash
    --bind 127.0.0.1:8000
    ```

    or

    ```bash
    --bind unix:/run/gunicorn.sock
    ```
  8. Move flags into a Gunicorn config file.
    Create gunicorn.conf.py:
    ```python
    bind = "unix:/run/gunicorn.sock"
    workers = 3
    worker_class = "gthread"
    threads = 4
    timeout = 30
    graceful_timeout = 30
    keepalive = 2
    max_requests = 1000
    max_requests_jitter = 100
    accesslog = "-"
    errorlog = "-"
    ```
  9. Update systemd to use the config file.
    Example unit fragment:
    ```ini
    [Service]
    User=www-data
    Group=www-data
    WorkingDirectory=/path/app
    Environment="PATH=/path/venv/bin"
    ExecStart=/path/venv/bin/gunicorn -c /path/app/gunicorn.conf.py wsgi:app
    Restart=always
    RestartSec=5
    ```

    Reload and restart:
    ```bash
    sudo systemctl daemon-reload
    sudo systemctl restart gunicorn
    sudo systemctl status gunicorn
    ```
  10. Load test after each change.

Test one change at a time:

```bash
hey -n 1000 -c 20 https://your-domain/
```

or

```bash
wrk -t2 -c20 -d30s https://your-domain/
```

Compare:

  • average latency
  • p95/p99 latency
  • error rate
  • CPU usage
  • memory usage
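If your load tool only prints raw latencies, p95/p99 can be computed in a few lines of Python using the nearest-rank method. The `percentile` helper below is illustrative, not part of any load-testing tool:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (any unit)."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = list(range(1, 101))   # stand-in for measured values
p95 = percentile(latencies_ms, 95)   # 95
p99 = percentile(latencies_ms, 99)   # 99
```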
  11. Watch for signs of over-tuning.
  • If CPU is saturated continuously, reduce workers or threads.
  • If latency is high but CPU stays low, increase concurrency carefully.
  • If memory grows until the kernel kills processes, reduce concurrency and review recycling.
  12. Move slow work out of request handling.

Do not solve long-running requests by only increasing timeouts. Move tasks like:

  • email sending
  • report generation
  • file processing
  • multi-API fan-out

into a background job system.
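In production this usually means Celery, RQ, or a similar queue backed by a broker. Purely as an in-process illustration of the pattern (not a substitute for a real job system), the stdlib can sketch the handoff:

```python
import queue
import threading

jobs = queue.Queue()

def worker():
    """Drain the queue so request handlers can return immediately."""
    while True:
        task = jobs.get()
        if task is None:      # shutdown sentinel
            break
        try:
            task()            # e.g. send the email, build the report
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# Inside a request handler: enqueue and respond without waiting.
results = []
jobs.put(lambda: results.append("report generated"))
jobs.join()  # only for this demo; real handlers never block on the queue
```

A real job system adds persistence, retries, and worker processes outside Gunicorn, which is exactly why slow work belongs there.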

  13. Validate Nginx and Gunicorn together.

Check that Nginx upstream settings, socket path, and timeout values do not override or hide Gunicorn tuning issues. If needed, review Flask Nginx Performance Tuning Guide and Fix Flask 502 Bad Gateway (Step-by-Step Guide).

  14. Keep a known-good rollback profile.

Save the previous working gunicorn.conf.py and systemd unit values before each tuning round.

Common Causes

  • Too many workers → excessive context switching, high memory use, CPU thrash → reduce worker count and retest.
  • Too few workers → requests queue during spikes, high latency with low CPU utilization → increase workers or threads gradually.
  • Wrong worker class → poor performance for the workload pattern → use sync for CPU-heavy handling and gthread for moderate I/O-bound traffic.
  • Too many threads per worker → increased contention or memory use with little gain → lower thread count and test again.
  • Timeout too low → workers killed during valid slow requests → optimize the request path or increase timeout only when justified.
  • Timeout too high → stuck requests consume capacity too long → reduce timeout and move long work to background jobs.
  • No worker recycling → memory growth over time from leaks or fragmentation → add max-requests and max-requests-jitter.
  • Nginx timeout mismatch → Gunicorn appears slow or broken when the proxy is the actual limit → align proxy and upstream settings.
  • Database pool too small → workers block waiting for DB connections → tune SQLAlchemy or database pool settings.
  • Small VPS RAM limits → OOM kills or swap thrashing after concurrency increases → reduce workers and threads, then retest.
  • Application code is the bottleneck → Gunicorn tuning has little effect → profile database queries, external calls, rendering, and caching.

Debugging Section

Check service state and logs:

```bash
sudo systemctl status gunicorn
sudo journalctl -u gunicorn -n 200 --no-pager
```

Look for:

  • worker timeouts
  • boot failures
  • repeated restarts
  • import errors
  • signal exits

Check Gunicorn processes, CPU, and memory:

```bash
ps -o pid,ppid,%cpu,%mem,rss,cmd -C gunicorn
ps -eLf | grep gunicorn | grep -v grep
top -H -p $(pgrep -d',' -f gunicorn)
```

Look for:

  • more processes or threads than expected
  • runaway memory usage
  • constant CPU saturation
  • workers restarting repeatedly

Validate listening sockets:

```bash
ss -ltnp | grep 8000
ss -lx | grep gunicorn.sock
```

Confirm Gunicorn is listening on the same address or socket Nginx uses.

Check Nginx for upstream-related errors:

```bash
sudo journalctl -u nginx -n 100 --no-pager
sudo tail -n 100 /var/log/nginx/error.log
```

Look for:

  • connect() failed
  • upstream timed out
  • no live upstreams
  • socket permission errors

Check for OOM events:

```bash
dmesg -T | grep -i -E 'killed process|out of memory|oom'
```

If OOM events exist, lower concurrency or add memory.

Run a baseline load test:

```bash
hey -n 1000 -c 20 https://your-domain/
wrk -t2 -c20 -d30s https://your-domain/
```

If using SQLAlchemy, compare Gunicorn concurrency with the application DB pool size. If concurrency exceeds pool capacity, requests may block even when Gunicorn itself is healthy.
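A quick sanity check: each Gunicorn worker is a separate process with its own engine and pool, so compare per-worker thread concurrency against per-process pool capacity, and total worker count against what the database server allows. The sketch below assumes the baseline settings from this guide and SQLAlchemy's default `pool_size=5` and `max_overflow=10`; substitute your own values:

```python
workers, threads = 3, 4            # Gunicorn baseline from this guide
pool_size, max_overflow = 5, 10    # SQLAlchemy create_engine defaults

per_process_demand = threads                  # threads in one worker share one pool
per_process_capacity = pool_size + max_overflow

total_concurrency = workers * threads                        # 12 requests in flight
worst_case_db_connections = workers * per_process_capacity   # 45 server-side connections

# If demand exceeds capacity, requests block on connection checkout
# even though Gunicorn itself looks healthy.
assert per_process_demand <= per_process_capacity
```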

Checklist

  • Gunicorn worker class matches the app workload.
  • Worker count is based on CPU and RAM, not guesswork.
  • Threads are used only when they improve I/O-bound concurrency.
  • timeout and graceful-timeout are set deliberately.
  • max-requests and max-requests-jitter are enabled for long-running stability.
  • systemd uses a persistent Gunicorn config file or explicit flags.
  • Nginx points to the correct Gunicorn socket or host:port.
  • CPU, memory, latency, and error rate were checked after each change.
  • Long-running work is removed from request/response paths where possible.
  • A rollback configuration is available if tuning degrades performance.

FAQ

Q: How many Gunicorn workers should I start with?
A: Start with 2 to 4 workers on a small server, then measure CPU, memory, and latency. Increase gradually only if the system has headroom.

Q: When should I use gthread?
A: Use gthread when requests spend time waiting on a database, API, or filesystem and your app does not require a fully async stack.

Q: Should I use sync or gthread for CPU-heavy endpoints?
A: Start with sync for CPU-heavy workloads. Threads usually help less when Python execution is the bottleneck.

Q: Does increasing timeout improve performance?
A: No. It only allows slow requests to run longer. If the root cause is not fixed, it can reduce available capacity.

Q: What does max-requests-jitter do?
A: It randomizes worker restarts so all workers do not recycle at the same time.
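Conceptually, each worker draws its own restart threshold once at startup, roughly like this simplified sketch of Gunicorn's behavior:

```python
import random

max_requests, max_requests_jitter = 1000, 100

# Each worker picks its own limit at boot, so recycling spreads out
# instead of all workers restarting on the same request count.
limits = [max_requests + random.randint(0, max_requests_jitter) for _ in range(3)]
```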

Q: Why is Gunicorn still slow after tuning?
A: The bottleneck is often outside Gunicorn: slow SQL queries, external APIs, filesystem latency, missing caching, or proxy configuration. Review Flask Nginx Performance Tuning Guide and Flask Production Checklist (Everything You Must Do).

Final Takeaway

Gunicorn tuning is controlled capacity planning, not random flag changes. Start with a conservative baseline, match the worker model to the workload, measure under load, and adjust one variable at a time. If tuning does not improve results, the bottleneck is usually in application code, the database, or reverse proxy configuration rather than Gunicorn itself.