Deploying Micro Apps on Raspberry Pi 5: A Practical Guide Using the AI HAT+ 2

2026-01-22
10 min read

Step-by-step guide to host private micro apps on Raspberry Pi 5 using the AI HAT+ 2 for offline LLM inference and low-latency on-device AI.

If you’re a developer or IT pro frustrated by cloud costs, data privacy concerns, or latency when calling remote LLM APIs, hosting micro apps locally on a Raspberry Pi 5 with the new AI HAT+ 2 gives you a fast, private, and cost-effective edge platform for offline, on-device LLM inference.

This guide (2026 update) walks you step-by-step from hardware and OS setup to containerized deployment, on-device LLM inference, and production-ready concerns such as security and benchmarking. It’s written for practitioners who want practical, repeatable results—not marketing fluff.

Why this matters now (2026 context)

Late 2025 and early 2026 saw two important trends converge for edge AI: first, widespread availability of hardware accelerators like the AI HAT+ 2 that deliver efficient on-device generative inference; second, production-ready runtimes and quantized model formats (GGUF/ggml) that make running capable LLMs locally feasible. Micro apps — small, purpose-built web services designed for one or a few users — are becoming the dominant way engineers and power users deploy functionality at the edge. If you need low-latency inference, reduced cloud bills, or data residency, this stack is now practical.

What you’ll build and the architecture

We’ll create a small micro app that offers a private text-completion endpoint over your LAN. Architecture (simple, robust):

  • Raspberry Pi 5 running 64-bit OS
  • AI HAT+ 2 attached and configured (provides NPU/accelerator)
  • Dockerized inference runtime (llama.cpp/ggml or vendor runtime with NPU support)
  • Dockerized micro app (FastAPI) that calls the runtime via localhost REST or Unix socket
  • Optional reverse proxy (Nginx) with local TLS, firewall rules, and systemd/docker restart policies

Prerequisites

  • Raspberry Pi 5 with a 64-bit OS image (Raspberry Pi OS 64-bit or Debian/Ubuntu 64-bit, 2026-optimized)
  • AI HAT+ 2 (firmware and SDK from vendor — late-2025 release)
  • USB SSD or NVMe drive in an enclosure, recommended for model storage (models run to several GB)
  • Basic Linux and Docker familiarity

Step 1 — Hardware & OS setup

  1. Flash a 64-bit image: Raspberry Pi OS 64-bit (or Ubuntu 24.04/26.04 arm64). Use Raspberry Pi Imager and enable SSH if headless.
  2. Update the system:
    sudo apt update && sudo apt upgrade -y
  3. Create a user and enable your preferred SSH keys. Configure static IP or DHCP reservation for predictable device access.
  4. Attach the AI HAT+ 2 following vendor instructions. In late 2025 vendors shipped a Linux SDK with systemd helpers; in 2026 that SDK is mature and commonly maintained.

Quick checklist

  • OS: arm64 64-bit kernel
  • Storage: SSD for models, mounted at /var/models or /mnt/ssd
  • Power: reliable supply (Pi 5 + HAT + SSD need stable power)

Step 2 — Install Docker, Docker Compose, and essentials

Docker is still the fastest way to package micro apps for reproducible deployment on Pi (arm64 images). On Pi 5 (2026) use the official Docker packages and the Compose plugin:

sudo apt install -y ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update && sudo apt install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
sudo usermod -aG docker $USER

Log out and back in (or reboot) so docker group changes apply.

Step 3 — Install AI HAT+ 2 SDK & drivers

Follow the vendor's 2025/2026 SDK instructions. Typical steps:

  1. Download SDK and firmware package from vendor.
  2. Run the installer or use provided Debian packages. Example (vendor-specific):
    sudo dpkg -i ai-hat2-sdk_*.deb
    sudo systemctl enable --now ai-hat2-daemon
  3. Confirm the device is visible and the runtime is accessible (vendor tools usually expose /dev/aihat* or an HTTP socket).

Tip: the HAT often includes a sample inference server. Start that first to ensure the hardware path works before integrating Docker.

Step 4 — Choose and prepare an LLM runtime and quantized model

Edge inference commonly uses llama.cpp/ggml and the GGUF quantized model format. In 2026 you'll also find vendor-provided runtimes that offload to the HAT+2 accelerator with driver support. Two practical routes:

  1. Use a lightweight runtime like llama.cpp (ggml/gguf) compiled for arm64. Works well for 3B–7B quantized models.
  2. Use the vendor's runtime that leverages the HAT+2 accelerator. This often gives significantly lower latency and power usage.

Downloading models: use community GGUF builds, or convert and quantize models yourself with llama.cpp's tooling (convert_hf_to_gguf.py followed by the llama-quantize binary). Place models on your SSD under /var/models. Good starting points are 3B–7B GGUF variants, which balance quality against speed and memory on this hardware.
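Since models are multi-GB downloads that you’ll later want to pin to checksums (see Step 8), it’s worth hashing them as soon as they land on the SSD. A stdlib sketch (function names are my own):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so multi-GB models never load into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_model(path: Path, expected_hex: str) -> bool:
    """Compare a model file against a previously recorded digest."""
    return sha256_of(path) == expected_hex.lower()
```

Record the digest right after the first download; re-run verify_model after any copy or restore.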

Step 5 — Containerize the inference runtime

We’ll create a simple Dockerfile for an arm64 runtime using llama.cpp. If you use the vendor SDK, adapt the image to install the SDK and enable device mounts.

<!-- Dockerfile: inference/Dockerfile -->
FROM --platform=linux/arm64 debian:stable-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential cmake git ca-certificates libgomp1 \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /opt/llama
RUN git clone --depth 1 https://github.com/ggerganov/llama.cpp.git . \
    && cmake -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=OFF \
    && cmake --build build -j --target llama-server
# Models are volume-mounted read-only at runtime (see Step 7), not baked into the image.
EXPOSE 5001
CMD ["/opt/llama/build/bin/llama-server", "-m", "/models/your_model.gguf", "--host", "0.0.0.0", "--port", "5001"]

Notes:

  • If you use the HAT SDK, replace the CMD with the vendor runtime invocation. llama.cpp build targets and binary paths also shift between releases, so check the repo README for your checkout.
  • Expose only the internal port and restrict access via the Docker network and a reverse proxy.

Step 6 — Containerize the micro app (FastAPI)

A tiny micro app implements a POST /generate endpoint which proxies requests to the local inference server. Keep the app minimal and stateless; stateful data (conversation history) can live in a small Redis container if you need it.

<!-- app/main.py -->
from fastapi import FastAPI, HTTPException
import requests

app = FastAPI()

INFER_HOST = "inference:5001"  # docker-compose service name

@app.post("/generate")
def generate(body: dict):
    payload = {"prompt": body.get("text", "")}
    try:
        # The /v1/generate path depends on your inference runtime's API.
        r = requests.post(f"http://{INFER_HOST}/v1/generate", json=payload, timeout=30)
    except requests.RequestException as exc:
        raise HTTPException(status_code=502, detail=str(exc))
    if r.status_code != 200:
        raise HTTPException(status_code=502, detail=r.text)
    return r.json()

<!-- Dockerfile: app/Dockerfile -->
FROM --platform=linux/arm64 python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
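The proxy call in main.py will fail transiently whenever the inference container restarts or is still loading a model. A minimal retry-with-backoff helper you could wrap around it (a sketch; call_with_retries is my own name, and in real use you’d narrow the except clause to requests.RequestException):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(fn: Callable[[], T], attempts: int = 3,
                      base_delay: float = 0.5) -> T:
    """Call fn, retrying with exponential backoff; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_exc: Exception = RuntimeError("unreachable")
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if i < attempts - 1:
                time.sleep(base_delay * (2 ** i))  # 0.5s, 1s, 2s, ...
    raise last_exc
```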

Step 7 — Compose and network

Use docker-compose to wire inference and app together. Example compose file:

services:
  inference:
    build: ./inference
    volumes:
      - /var/models:/models:ro
    devices:
      - "/dev/aihat0:/dev/aihat0"  # vendor device mapping if required
    restart: unless-stopped

  app:
    build: ./app
    ports:
      - "8000:8000"
    depends_on:
      - inference
    restart: unless-stopped

Run:

docker compose up --build -d
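Model load can take tens of seconds, and depends_on only waits for container start, not readiness. A small stdlib readiness poll, assuming the app is published on port 8000 as in the compose file above:

```python
import time
import urllib.error
import urllib.request

def wait_for_http(url: str, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll url until the server answers; any HTTP status counts as 'up'."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            return True
        except urllib.error.HTTPError:
            return True  # server responded, just with an error status
        except (urllib.error.URLError, OSError):
            time.sleep(interval_s)
    return False
```

Run it after compose comes up, e.g. wait_for_http("http://localhost:8000/docs") to wait for FastAPI's auto-generated docs page.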

Step 8 — Secure the micro app and device

Security is often overlooked in local deployments. Follow these minimum precautions:

  • Run the inference container without exposing it externally — keep it on an internal Docker network.
  • Put the public-facing API behind a reverse proxy (Nginx) with TLS. For LAN-only projects, use mkcert to create local-trusted certs for development.
  • Configure UFW (or iptables) to allow only necessary ports (SSH, HTTPS for the micro app) and block everything else.
  • Use HTTP basic auth or JWT tokens for your micro app. Don’t leave an open text completion endpoint on your home network without protection.
  • Regularly update OS and Docker images and pin model files to checksums to avoid silent tampering.
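For the API-key check suggested above, compare secrets in constant time so response timing doesn’t leak a correct prefix. A stdlib sketch (check_api_key is my own helper):

```python
import secrets

def check_api_key(provided: str, expected: str) -> bool:
    """Constant-time comparison; a naive == can leak key prefixes via timing."""
    return secrets.compare_digest(provided.encode(), expected.encode())
```

In the FastAPI app, load the expected key from an environment variable and call this from a dependency that reads an X-API-Key (or Authorization) header before requests reach /generate.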

Step 9 — Benchmarking & performance checks

Measure latency and throughput before declaring the micro app ready. Tools you should use:

  • curl — quick single-request tests
  • hyperfine — for repeated latency measurements
  • ab/wrk — load testing at small scales

Example latency test (single request):

time curl -sS -X POST http://localhost:8000/generate -d '{"text":"Hello world"}' -H "Content-Type: application/json"

In 2026, expect the following practical outcomes depending on model size & quantization:

  • 3B quantized models: sub-second to a few seconds per short prompt on HAT+2-accelerated stacks.
  • 7B quantized models: several seconds latency; still acceptable for many micro apps.
  • Without HAT acceleration: larger models will be slow and CPU-bound on the Pi 5.

Benchmarks vary by model, quantization, and the vendor runtime. Always measure for your specific workload.
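To produce latency numbers yourself, time repeated requests and summarize with percentiles. A sketch using nearest-rank percentiles (all names are my own; measure times any zero-argument callable, e.g. a lambda wrapping the requests.post call):

```python
import time
from typing import Callable

def measure(fn: Callable[[], object], runs: int = 20) -> list[float]:
    """Time fn over several runs and return the per-run durations in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, p in [0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

def summarize(samples: list[float]) -> dict[str, float]:
    return {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}
```

Track P95 rather than the mean: a single slow model load or GC pause skews averages but shows up honestly in the tail.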

Step 10 — Operational tips and production hardening

Logging & monitoring

  • Forward logs to a local file or lightweight aggregator (Prometheus + Grafana on Pi is feasible for small labs).
  • Collect inference metrics: tokens/sec, memory usage, latency P95.

Automatic updates and backups

  • Use an update window and image-based backups. Consider periodic model checksum verification.
  • Store critical model backups on a separate device or encrypted cloud bucket to recover quickly.

Resource limits

Set Docker resource limits for the inference container to avoid OOM kills. Docker Compose v2 honors the deploy block outside Swarm; on the older standalone docker-compose, use mem_limit: 6g instead:

deploy:
  resources:
    limits:
      memory: 6G

Common pitfalls and troubleshooting

  • Out of memory: Use smaller quantized models or increase swap (careful with SSD wear).
  • Driver mismatch: Ensure the AI HAT+ 2 SDK and your kernel drivers match; vendor SDK updates are common in late 2025/2026.
  • Slow inference: Validate that the runtime is using the NPU (vendor runtime logs or /proc entries typically indicate hardware offload).
  • Networking issues: Keep inference reachable only by the micro app; use docker networks, not host mode unless necessary.

Use cases and micro app ideas

Micro apps are small, focused, and personal. Here are practical examples you can deploy on your Pi 5 + HAT+2:

  • Private code assistant for local repos (offers completions without pushing code to the cloud)
  • Home automation natural language bridge (turn prompts into MQTT commands)
  • Personal knowledge base/query assistant (store documents locally and perform retrieval-augmented generation)
  • Customer demo kiosk that needs no internet access for demos

Trade-offs: local edge vs cloud

Local inference gives lower latency and stronger data control but comes with maintenance overhead, hardware costs, and limits on model size. For many micro apps, a 3B–7B quantized model running on an AI HAT+ 2 strikes the best balance. Hybrid modes (local inference for sensitive tasks, cloud for heavy generation) are a pragmatic middle ground.

Future predictions (2026 and beyond)

Expect continued improvements through 2026 in three areas:

  • Model optimization: Quantization formats (8-bit and especially 4-bit) keep improving, making 7B–13B models practical on-device.
  • Standardized runtimes: Common acceleration APIs (ONNX + vendor backends) will make switching hardware simpler.
  • Micro app ecosystems: Tooling and templates for micro apps (privacy-first, offline-first) will expand, lowering the barrier for non-expert creators on devices like the Raspberry Pi 5.

Actionable checklist (get started in under an hour)

  1. Flash 64-bit OS and attach AI HAT+ 2.
  2. Install Docker and the HAT SDK; run the vendor sample server.
  3. Download a small quantized GGUF model (3B) to /var/models.
  4. Deploy the inference and FastAPI micro app with docker-compose.
  5. Test locally with curl; add TLS and basic auth before exposing to the LAN.

Final recommendations

For developers and IT admins building micro apps in 2026, a Raspberry Pi 5 paired with an AI HAT+ 2 is a practical edge platform. Start small (3B quantized models), keep the inference service internal, and iterate — optimizing models, runtime, and security as you gather real usage data.

Experience tip: I’ve deployed several micro apps for internal teams; the biggest gains came from using an accelerated runtime and moving state (conversation history) out of the inference container. That alone improved stability and recoverability.

Call to action

Ready to build one? Clone our reference repo, which includes the Dockerfiles, docker-compose, and a ready-made FastAPI micro app for the Raspberry Pi 5 + AI HAT+ 2. If you want, share your use case and hardware details and I’ll suggest model sizes and runtime flags tuned for your workload.
