Raspberry Pi 5 + AI HAT+ 2: Build an On-Device Recommender for Small Teams

programa
2026-01-24
12 min read

Hands-on guide to build a private on-device recommender on Raspberry Pi 5 + AI HAT+ 2 for team microapps.

Stop relying on the cloud alone: build a private, speedy recommender for your small team

Decision fatigue in a team chat. A founder who needs a quick list of nearby lunch spots. A local kiosk that must recommend items without sending data to a cloud service. If that sounds familiar, this guide is for you. In 2026 the rise of microapps and improved NPUs on single-board computers make it realistic to run a lightweight recommender entirely at the edge. This walkthrough shows how to use a Raspberry Pi 5 paired with the new AI HAT+ 2 (released late 2025) to deploy an on-device recommender for small teams and local microapps.

What you'll build (in five minutes of reading, hours of fun)

By the end of this article you'll have a working on-device recommender microservice that:

  • Runs on a Raspberry Pi 5 with AI HAT+ 2 for optional NPU acceleration
  • Serves restaurant recommendations over a local network using FastAPI
  • Matches users by simple preferences (tags and short text) using embeddings + nearest neighbour search
  • Can be extended to do on-device scoring with a tiny ONNX model or run hybrid (embeddings precomputed off-device)

Why Raspberry Pi 5 + AI HAT+ 2 is the right tool in 2026

Two trends made this practical in 2025–2026:

  • Microapps and vibe-coding: more non-developers and small teams are building single-purpose apps (think Where2Eat). Running inference on-device keeps data private and removes cloud costs.
  • Edge NPUs matured: devices like the AI HAT+ 2 (announced late 2025) provide efficient on-device acceleration for quantized models and smaller transformer variants. That enables local embedding inference or ONNX model acceleration without burning CPU cycles.
“Vibe-coding and microapps are accelerating — light, private, and local AI is what small teams need.” — observed trendlines in late 2025

Assumptions, trade-offs and where this approach shines

This walkthrough targets small teams and microapps. We prioritize:

  • Privacy — user preferences never leave the local device
  • Responsiveness — sub-200ms local responses for simple queries
  • Practicality — lightweight models and libraries that run on Pi 5

We trade off high-scale personalization and training large models on-device. For most team use-cases (restaurant suggestions, meeting-location voting, internal docs surfacing), a small embedding + nearest-neighbor or a tiny ranking model is more than enough.

Prerequisites — hardware, OS, and software

  • Raspberry Pi 5 (4–16GB RAM variants will all work; 8GB recommended)
  • AI HAT+ 2 attached to the 40-pin header (released late 2025)
  • Official 27W USB-C power supply (5V/5A) to handle Pi 5 plus HAT peak loads
  • microSD card (32GB+), or NVMe/USB boot for speed
  • Raspberry Pi OS 64-bit (latest 2026 build) or Ubuntu 22.04/24.04 for ARM64
  • Python 3.11+, pip, build-essential

Note: AI HAT+ 2 typically ships with vendor runtimes (NPU drivers + SDK). If you want NPU acceleration, install the vendor runtime per the HAT+ 2 documentation. The rest of this guide works on CPU-only as well.

Quick setup (commands you can copy)

SSH into the Pi and run these steps to prepare a working environment. These commands assume a Debian/Ubuntu-based image.

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv build-essential libatlas-base-dev git
# Create and activate a virtual environment (recent Raspberry Pi OS and Ubuntu
# images block system-wide pip installs)
python3 -m venv venv && source venv/bin/activate
python3 -m pip install --upgrade pip
pip install fastapi uvicorn numpy hnswlib sentence-transformers python-multipart

If you're going to run an on-device ONNX model with ONNX Runtime (its quantization utilities, used later in this guide, ship inside the onnxruntime package):

pip install onnx onnxruntime

For vendor NPU acceleration: follow the AI HAT+ 2 SDK install instructions. Many vendor runtimes provide ONNX delegates or a dedicated runtime you can call from Python.
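
If your vendor runtime plugs into ONNX Runtime as an execution provider, a quick sanity check on the Pi is to list what your installed build actually exposes. The vendor provider's exact name is SDK-specific; CPUExecutionProvider is always available as a fallback.

# Minimal check: list the ONNX Runtime execution providers available on this device
import onnxruntime as ort

print(ort.get_available_providers())
# Typically ['CPUExecutionProvider'] on a stock install; a vendor NPU provider,
# if installed, shows up here and can be passed to InferenceSession(providers=[...])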

Overview of the recommender design

We'll use a hybrid, practical design that supports two deployment modes:

  1. Embedding + NN (recommended) — Precompute item embeddings (restaurant descriptions and tags) once (on laptop or Pi). Store embeddings on the Pi and use hnswlib for fast approximate nearest neighbor searches. This is small, fast, and privacy-friendly.
  2. On-device scoring (optional) — Export a tiny ranking model (MLP) to ONNX and run it on the Pi (optional NPU). This lets you compute scores for items given a user's vector when you want custom learned ranking.

Why embeddings + hnswlib?

Embedding-based search is robust for short text and tag-based profiles. hnswlib is compact and fast on ARM. It uses memory-efficient graph indices and returns top-k neighbors quickly — ideal for microapps running on Pi.

Step 1 — Prepare a small dataset (restaurants example)

Create a CSV called restaurants.csv with these fields: id, name, cuisine, tags, short_description.

id,name,cuisine,tags,short_description
1,La Mesa,Mexican,"tacos,spicy,cheap","Great tacos and margaritas, friendly staff"
2,Green Bowl,Healthy,"vegan,bowls,gluten-free","Fresh bowls with local produce"
3,Blue Deli,Deli,"sandwiches,coffee,quick","Classic deli sandwiches and small desserts"
...

For production you can include coordinates, price range, and ratings; for this walkthrough we keep it small.

Step 2 — Create or obtain embeddings

Option A (fast): compute embeddings on your laptop with a small model (e.g., sentence-transformers all-MiniLM) and copy the numpy arrays to the Pi. Option B (on-device): run a compact embedding model on the Pi and compute embeddings locally (requires NPU for speed).

Example offline embedding script (run on your laptop):

from sentence_transformers import SentenceTransformer
import csv
import json
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Load the restaurant rows
rows = []
with open('restaurants.csv') as f:
    reader = csv.DictReader(f)
    for r in reader:
        rows.append(r)

# Embed name + tags + description as one short text per restaurant
texts = [(r['name'] + ' ' + r['tags'] + ' ' + r['short_description']).strip() for r in rows]
emb = model.encode(texts, convert_to_numpy=True)
np.save('restaurant_embeddings.npy', emb)

# Save metadata alongside the embeddings
with open('restaurant_meta.json', 'w') as wf:
    json.dump(rows, wf)

Copy restaurant_embeddings.npy and restaurant_meta.json to the Pi (scp or rsync).

Step 3 — Build an hnswlib index on the Pi

Install hnswlib (we included it earlier). Then run this Python snippet on the Pi to build an index. This code loads float32 embeddings and persists the index for fast startup.

import hnswlib
import numpy as np
import json

emb = np.load('restaurant_embeddings.npy').astype('float32')
with open('restaurant_meta.json') as f:
    meta = json.load(f)

dim = emb.shape[1]
num_elements = emb.shape[0]

p = hnswlib.Index(space='cosine', dim=dim)
p.init_index(max_elements=num_elements, ef_construction=200, M=16)
p.add_items(emb, np.arange(num_elements))
# Save index
p.save_index('restaurants_hnsw.bin')
# Persist meta
with open('meta.json','w') as f:
    json.dump(meta, f)

Tune ef_construction and M for your dataset size (small sets: defaults are fine).
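
Before wiring the index into the API, you can sanity-check the persisted file and tune the query-time ef parameter (higher ef improves recall at a small latency cost). This is a minimal sketch; the random vector stands in for a real query embedding.

import hnswlib
import numpy as np

dim = 384  # must match the embedding dimension used when the index was built
p = hnswlib.Index(space='cosine', dim=dim)
p.load_index('restaurants_hnsw.bin')
p.set_ef(50)  # query-time accuracy/speed knob; 50-200 is plenty for small datasets

query_vec = np.random.rand(1, dim).astype('float32')  # stand-in for a real embedding
labels, distances = p.knn_query(query_vec, k=3)
print(labels, distances)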

Step 4 — FastAPI microservice: serve recommendations

Create a simple FastAPI app that loads the index and returns top-k neighbors for a short user query (e.g., "I want spicy tacos"). This keeps everything on-device and reachable from your team over LAN.

from fastapi import FastAPI, Query
import uvicorn
import numpy as np
import hnswlib
import json
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Load index and metadata
p = hnswlib.Index(space='cosine', dim=384)  # 384 = all-MiniLM-L6-v2 output dim; adjust for other encoders
p.load_index('restaurants_hnsw.bin')
with open('meta.json') as f:
    meta = json.load(f)

# Encode queries with the same model (or model family) used to build the item embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.get('/recommend')
def recommend(q: str = Query(...), k: int = 5):
    vec = model.encode([q]).astype('float32')
    labels, distances = p.knn_query(vec, k=k)
    results = []
    for idx, dist in zip(labels[0], distances[0]):
        item = meta[int(idx)]
        results.append({'id': item['id'], 'name': item['name'], 'score': float(1 - dist)})
    return {'query': q, 'results': results}

if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=8000)

Save the file as app.py and start the service: python3 app.py. From any device on your LAN you can then call: http://raspberrypi.local:8000/recommend?q=spicy%20tacos
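
To check the response shape from a teammate's machine, a standard-library-only client looks like this (it assumes the Pi is reachable as raspberrypi.local):

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({'q': 'spicy tacos', 'k': 3})
with urllib.request.urlopen(f'http://raspberrypi.local:8000/recommend?{params}') as resp:
    data = json.load(resp)

for item in data['results']:
    print(item['name'], round(item['score'], 3))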

Step 5 — Optional: accelerate embedding inference with AI HAT+ 2

If you want the Pi to compute embeddings locally (no precompute step), you'll need to run the tiny encoder on the Pi. That is where AI HAT+ 2 shines. Vendor SDKs commonly provide an ONNX delegate or runtime — here are the general steps:

  1. Export your tiny embedding model to ONNX on your dev machine (use PyTorch/TensorFlow -> ONNX).
  2. Apply quantization (static or QAT -> int8) to reduce size and enable NPU acceleration.
  3. Copy the ONNX model to the Pi and use the AI HAT+ 2 runtime or ONNX Runtime improvements with the vendor delegate to run inference.

Example export (PyTorch to ONNX):

# Sketch: export a tiny PyTorch encoder to ONNX (TinyEncoder is your own model class)
import torch

model = TinyEncoder()
model.eval()
dummy = torch.randn(1, 128)  # dummy input; shape depends on your tokenizer/feature size
torch.onnx.export(model, dummy, 'tiny_encoder.onnx', opset_version=14)

Use your vendor's documentation to enable the NPU delegate. Many HAT runtimes expose a Python binding that mirrors ONNX Runtime APIs.
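
As a concrete example of step 2, ONNX Runtime's post-training dynamic quantization shrinks the exported encoder without retraining. This is only a sketch: many NPU toolchains prefer statically quantized int8 models, so check what your vendor's compiler expects.

# Dynamic quantization: weights stored as int8, activations computed in float
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input='tiny_encoder.onnx',
    model_output='tiny_encoder_int8.onnx',
    weight_type=QuantType.QInt8,
)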

Measuring and improving quality

For small recommenders keep these metrics in your toolkit:

  • Recall@k — does the system return relevant items within top-k?
  • NDCG@k — positions matter; NDCG measures ranking quality
  • Latency — aim for <200ms end-to-end on the Pi for user queries
  • Memory — keep the hnswlib index plus the embedding arrays comfortably within RAM; memory-mapping the embeddings helps on smaller Pi variants

Quick experiments: subset your dataset and measure recall@5 for short queries (3–6 words). If recall is low, try enriching item metadata (more tags) or increasing embedding dimension/using a slightly better encoder.
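
A minimal Recall@5 check might look like the sketch below. The labeled queries are illustrative, and the expected ids are strings because csv.DictReader reads every field as text.

import json
import hnswlib
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
p = hnswlib.Index(space='cosine', dim=384)
p.load_index('restaurants_hnsw.bin')
with open('meta.json') as f:
    meta = json.load(f)

# Hand-labeled pairs: short query -> id of the restaurant a human would expect
labeled = {
    'spicy tacos': '1',
    'vegan lunch bowl': '2',
    'quick sandwich and coffee': '3',
}

hits = 0
for query, expected_id in labeled.items():
    vec = model.encode([query]).astype('float32')
    labels, _ = p.knn_query(vec, k=min(5, len(meta)))
    top_ids = {meta[int(i)]['id'] for i in labels[0]}
    hits += expected_id in top_ids

print(f'Recall@5 = {hits / len(labeled):.2f}')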

Performance tips and production-readiness

  • Enable swap only if necessary. Relying on swap harms inference latency.
  • Memory-map large indices and embeddings on startup to reduce load time: hnswlib supports save/load, and embeddings saved as .npy can be loaded with numpy's mmap_mode (see the sketch after this list).
  • Quantize embeddings to float16 if memory is tight: it halves storage, but watch cosine-distance precision and test recall before deploying.
  • For extremely small latency budgets, precompute top-10 suggestions per user bucket and serve those instead of running a nearest-neighbor search for every request.
  • Use container images (balena / Docker for ARM64) for easier upgrades and rollback on the devices in the field.
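
A minimal sketch of the float16 and memory-mapping ideas above (cast vectors back to float32 before handing them to hnswlib, and re-check recall after converting):

import numpy as np

# 1) Halve on-disk and in-RAM size by storing embeddings as float16
emb = np.load('restaurant_embeddings.npy')
np.save('restaurant_embeddings_fp16.npy', emb.astype(np.float16))

# 2) Memory-map the saved array so startup only pages in what is actually touched
emb_mm = np.load('restaurant_embeddings_fp16.npy', mmap_mode='r')
print(emb_mm.dtype, emb_mm.shape)

# Cast back to float32 when you need the vectors, e.g. to rebuild the index
chunk = np.asarray(emb_mm[:100], dtype=np.float32)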

Security, privacy and team workflows

Running recommendations on-device gives you a great privacy story. Keep these best practices in mind:

  • Serve only on the local network, or use mDNS plus token-based auth for peer sharing (see the sketch after this list).
  • If the app stores personal preferences, encrypt the local store and use OS-level user permissions.
  • Implement versioned backup/sync (optional) if teams want to share a dataset across a few trusted devices.
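
One way to implement the token-based auth mentioned above is a shared-secret header check on the endpoint. This is a sketch: X-Team-Token and the TEAM_TOKEN environment variable are names made up for the example.

import os
from fastapi import FastAPI, Header, HTTPException, Query

app = FastAPI()
TEAM_TOKEN = os.environ.get('TEAM_TOKEN', 'change-me')  # set a real secret in the environment

@app.get('/recommend')
def recommend(q: str = Query(...), k: int = 5,
              x_team_token: str = Header(default='')):
    # FastAPI maps the x_team_token parameter to the X-Team-Token request header
    if x_team_token != TEAM_TOKEN:
        raise HTTPException(status_code=401, detail='invalid or missing token')
    # ... same index lookup as the service above ...
    return {'query': q, 'results': []}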

Evaluation and A/B testing for small teams

Small teams can still run A/B tests. Use simple logs (local-only) to capture which recommendation was accepted. Compare two index versions (A vs B) and measure accept rate. If you need aggregation across devices, use privacy-preserving reporting: send only aggregated counts or use differential privacy techniques.
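
A local-only acceptance log can be as simple as an append-only JSON-lines file; the file name and event shape below are illustrative.

import json
from collections import defaultdict

LOG_PATH = 'accept_log.jsonl'

def log_event(variant: str, accepted: bool) -> None:
    # Append one event per recommendation shown: which index version, and whether it was picked
    with open(LOG_PATH, 'a') as f:
        f.write(json.dumps({'variant': variant, 'accepted': accepted}) + '\n')

def accept_rates() -> dict:
    shown, accepted = defaultdict(int), defaultdict(int)
    with open(LOG_PATH) as f:
        for line in f:
            event = json.loads(line)
            shown[event['variant']] += 1
            accepted[event['variant']] += event['accepted']
    return {v: accepted[v] / shown[v] for v in shown}

log_event('A', True)
log_event('B', False)
print(accept_rates())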

Scaling and extensibility

When your microapp grows, you have clear upgrade paths:

  • Keep the same API and replace the index with a larger one (or multi-device sync)
  • Swap in a learned ranker: deploy a tiny ONNX model to the HAT+ 2 and call it as the final scoring stage (see the sketch after this list)
  • Federated updates: collect gradients or lightweight updates on-device and aggregate them on a trusted host for periodic model refreshes
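
A sketch of the learned-ranker stage from the list above: retrieve candidates with hnswlib as before, then score them with a tiny ONNX model via ONNX Runtime. The file tiny_ranker.onnx and its single [user ; item] input are assumptions for illustration.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('tiny_ranker.onnx', providers=['CPUExecutionProvider'])
input_name = session.get_inputs()[0].name

def rerank(user_vec: np.ndarray, candidate_vecs: np.ndarray) -> np.ndarray:
    # user_vec: shape (1, dim); candidate_vecs: shape (n, dim)
    # Build [user ; item] pairs, score them, return candidate indices sorted best-first
    pairs = np.hstack([np.repeat(user_vec, len(candidate_vecs), axis=0), candidate_vecs])
    scores = session.run(None, {input_name: pairs.astype(np.float32)})[0].ravel()
    return np.argsort(-scores)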

What changed in late 2025 and early 2026

Late 2025 to early 2026 brought three important changes relevant to on-device recommenders:

  • Smaller, better embeddings: new family of sub-100M parameter encoders optimized for NPUs, enabling local semantic search with lower memory.
  • General on-device tooling: ONNX Runtime improvements and vendor delegates for NPUs simplified deployment of quantized models.
  • Microapp ecosystems: more approachable tooling and templates for local-first apps mean teams can deliver internal tools faster than ever.

Watch for expanding vendor support for edge vector-DB integrations and lightweight, LoRA-style fine-tuning tools adapted to small NPUs in 2026.

Real-world example: Where2Eat for your team (30–90 minutes)

Turn the code above into a tiny product:

  1. Populate restaurants.csv with your city or office favorites (30 mins)
  2. Compute embeddings on your laptop and copy them to the Pi (10 mins)
  3. Run the index build script on the Pi (10 mins)
  4. Deploy the FastAPI microservice and share the URL in your team chat (10 mins)

Within an hour you’ll have a local microapp that settles group lunch debates, and all recommendations remain on-device.

Troubleshooting quick list

  • No response from the Pi: check firewall and that uvicorn is listening on 0.0.0.0.
  • Embeddings mismatch errors: ensure the encoder dimension used to create the hnswlib index matches your runtime encoder.
  • Slow queries: reduce k, use float16 embeddings, or increase ef parameter (p.set_ef(50)).
  • NPU not detected: verify AI HAT+ 2 runtime is installed and that you have the correct ONNX delegate per vendor docs.

Actionable takeaways

  • Start with embedding + hnswlib — it's fast, accurate, and simple to deploy on Pi 5.
  • Precompute embeddings when possible; use the HAT+ 2 NPU when you need on-device encoding or on-device ranking.
  • Keep models tiny: 2026 favors sub-100M parameter encoders optimized for edge NPUs.
  • Use FastAPI for a quick local microapp — it's easy for non-developers to call from a browser or mobile shortcut.

Further reading and reference notes

Recent coverage in late 2025 highlighted both the rise of microapps and the arrival of HAT devices that enable on-device generative and embedding workloads. If you’re following this space, look for updated vendor SDKs for AI HAT+ 2 and edge-focused ONNX optimizers in 2026.

Your next steps (call to action)

Try this starter kit today: clone the example repo (link in the community post), flash a Pi OS image, and follow the steps above. Share your Where2Eat instance in our developer community to get feedback and zero-cost collaboration. If you want, post your improvements (quantized model exports, UI wrappers, or federated sync ideas) — we’ll spotlight great community projects and help you iterate.

Ready to build a private microapp recommender with Raspberry Pi 5 and AI HAT+ 2? Set up your device, run the sample, and drop a note in the programa.club community to show what you made.


Related Topics

#edge #IoT #tutorial

programa

Contributor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
