
2025-03-05
12 min read
AI · LLM · Ollama · Docker

Deploying LLMs Locally Using Ollama & Docker: A Deep Dive

Running large language models (LLMs) locally is a game-changer for privacy, cost, and customization. In this in-depth guide, we'll explore not just the basics, but also advanced deployment, optimization, and integration techniques for technical practitioners.


Table of Contents

  1. Why Run LLMs Locally?
  2. Ollama & Docker: Overview
  3. System Requirements & Hardware Considerations
  4. Step-by-Step Deployment
  5. Advanced Configuration & Optimization
  6. Integrating with Web Apps (Next.js Example)
  7. Monitoring, Scaling, and Security
  8. Troubleshooting & Performance Tuning

Why Run LLMs Locally?

  • Privacy: Data never leaves your infrastructure.
  • Cost Control: No per-token or per-query fees.
  • Customization: Full control over model parameters, fine-tuning, and environment.
  • Offline Access: Use AI capabilities without internet dependency.
  • Latency: No network round trips; response time depends only on your local hardware.

Ollama & Docker: Overview

  • Ollama: Open-source tool for running LLMs locally, with a simple API and model management.
  • Docker: Containerizes Ollama for reproducible, isolated deployments.

Architecture Diagram:

graph TD;
  User[User/Client] -->|HTTP API| Ollama[Ollama Container]
  Ollama -->|Model Inference| LLM["LLM Model (Llama2, Mistral, etc.)"]
  WebApp[Next.js/React App] -->|HTTP API| Ollama
  Ollama -->|Logs/Stats| Monitoring[Prometheus/Grafana]

System Requirements & Hardware Considerations

  • RAM:
    • 7B models: 8 GB minimum, 16 GB recommended
    • 13B models: 16 GB minimum, 32 GB recommended
  • CPU: Modern multi-core (AVX2 support ideal)
  • GPU: NVIDIA CUDA for acceleration (optional, but highly recommended)
  • Disk: SSD for fast model loading
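
A quick way to take stock of your hardware on Linux (these are standard system utilities; nvidia-smi is only present if an NVIDIA driver is installed):

# Quick hardware inventory (Linux)
free -h                            # total and available RAM
nproc                              # CPU core count
grep -m1 -o avx2 /proc/cpuinfo     # prints "avx2" if the CPU supports it
df -h .                            # free disk space
nvidia-smi                         # GPU model and VRAM, if available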

Step-by-Step Deployment

1. Install Docker

# Verify Docker and the Compose plugin are installed (see https://docs.docker.com/get-docker/ if not)
docker --version
docker compose version

2. Create a Docker Compose File

version: '3'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped
volumes:
  ollama_data:

3. Start Ollama

docker compose up -d   # or "docker-compose up -d" if you use the older standalone binary
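
Once the container is up, confirm that the API is reachable (GET /api/tags lists the models currently available locally):

# Verify the container is running and the API responds
docker ps --filter name=ollama
curl http://localhost:11434/api/tags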

4. Pull and Run Your First Model

# Pull the Llama2 model (7B parameter version)
curl -X POST http://localhost:11434/api/pull -d '{"name": "llama2"}'

# Run an inference ("stream": false returns a single JSON object instead of a token stream)
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Explain the concept of recursive neural networks in simple terms.",
  "stream": false
}'
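
Equivalently, you can drive everything through the ollama CLI that ships inside the container:

# Pull and query the model via the CLI instead of the HTTP API
docker exec -it ollama ollama pull llama2
docker exec -it ollama ollama run llama2 "Explain recursion in one paragraph."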

Advanced Configuration & Optimization

Model Quantization

  • Use quantized models (e.g., 4-bit, 8-bit) for lower memory and faster inference.
  • Ollama supports quantized model variants out of the box.
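
For example, the Ollama model library publishes quantization-specific tags alongside the default one. The tag below is illustrative; check each model's library page for the variants that actually exist:

# Pull a 4-bit quantized variant instead of the default tag (tag name is illustrative)
curl -X POST http://localhost:11434/api/pull -d '{"name": "llama2:7b-q4_0"}'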

GPU Acceleration

  • If using an NVIDIA GPU, install the NVIDIA Container Toolkit and give Docker access to the GPU:
docker run --gpus all ...
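
If you are using the Compose file above, the equivalent is a GPU device reservation. This sketch assumes the NVIDIA Container Toolkit is installed on the host:

services:
  ollama:
    # ...same image, volumes, and ports as above...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]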

Custom Model Paths & Fine-Tuning

  • Mount custom model directories as Docker volumes.
  • Use Ollama's model import for fine-tuned weights.
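
A minimal sketch of importing fine-tuned weights with a Modelfile; the GGUF filename, mount point, and model name are placeholders, and it assumes you add an extra volume (e.g. - ./models:/models) to the Compose file:

# Modelfile (placed in the mounted ./models directory)
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.7
SYSTEM "You are a domain-specific assistant."

Then build and run it with the CLI inside the container:

docker exec -it ollama ollama create my-model -f /models/Modelfile
docker exec -it ollama ollama run my-model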

Integrating with Web Apps (Next.js Example)

Create a Next.js API route to proxy requests to your local Ollama instance:

// pages/api/generate.ts
import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  if (req.method !== 'POST') {
    return res.status(405).json({ error: 'Method not allowed' });
  }
  const { prompt } = req.body;
  if (!prompt) {
    return res.status(400).json({ error: 'Missing prompt' });
  }
  try {
    const response = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // "stream": false makes Ollama return one JSON object
      // instead of a stream of newline-delimited chunks.
      body: JSON.stringify({ model: 'llama2', prompt, stream: false }),
    });
    if (!response.ok) {
      throw new Error(`Ollama responded with status ${response.status}`);
    }
    const data = await response.json();
    return res.status(200).json(data);
  } catch (error) {
    console.error('Error calling Ollama:', error);
    return res.status(500).json({ error: 'Failed to generate text' });
  }
}

Frontend Integration:

  • Use fetch('/api/generate', { method: 'POST', ... }) from your React components (see the sketch below).
  • Stream responses for chat UIs.
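
For the non-streaming case, a minimal client-side helper could look like this (names other than the Ollama response fields are illustrative):

// Call the Next.js API route defined above from the browser
async function generate(prompt: string): Promise<string> {
  const res = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  const data = await res.json();
  // With "stream": false, Ollama returns the full completion in the "response" field
  return data.response;
}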

Monitoring, Scaling, and Security

Monitoring

  • Ship container and host metrics to Prometheus/Grafana (e.g., via cAdvisor or node-exporter) for usage and performance dashboards.
  • Use Docker health checks for container status.
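
A Compose health-check sketch; it assumes the ollama CLI inside the container can reach its own server (ollama list fails if the API is down):

services:
  ollama:
    # ...rest of the service definition as above...
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3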

Scaling

  • Run multiple Ollama containers behind a load balancer for high availability.
  • Use NGINX or Traefik as a reverse proxy.
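
A bare-bones NGINX sketch that round-robins across two Ollama containers (container names and ports are assumptions, TLS termination is omitted, and remember that each backend keeps its own copy of the models):

# nginx.conf (sketch)
upstream ollama_backend {
    server ollama-1:11434;
    server ollama-2:11434;
}

server {
    listen 80;

    location / {
        proxy_pass http://ollama_backend;
        proxy_read_timeout 300s;   # generation can take a while
    }
}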

Security

  • Restrict API access to trusted networks (see the loopback binding example below).
  • Use HTTPS for all endpoints.
  • Regularly update Docker images and dependencies.
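
For example, if Ollama only needs to be reachable from the host itself (e.g., behind the Next.js proxy above), bind the published port to loopback in the Compose file:

    ports:
      - "127.0.0.1:11434:11434"   # only reachable from the local machine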

Troubleshooting & Performance Tuning

Common Issues

  • Out of Memory: Use smaller/quantized models, increase swap, or upgrade hardware.
  • Slow Inference: Enable GPU, optimize Docker resource limits, use faster models.
  • API Errors: Check container logs (docker logs ollama), verify network/firewall settings.
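
Useful first-line diagnostics:

# Tail recent logs and watch live resource usage
docker logs --tail 100 -f ollama
docker stats ollama

# Check that the API itself is responding
curl http://localhost:11434/api/tags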

Best Practices

  • Pin Docker image versions for reproducibility.
  • Automate deployment with CI/CD (GitHub Actions, etc.).
  • Regularly benchmark and monitor model performance.

Have questions or want to see more advanced LLM deployment guides? Leave a comment or reach out!