Mama AI — Building an Offline Voice Assistant for Nigerian Midwives

Samuel Mariwa — Tue, 26 May 2026 09:00:00 GMT

The Stakes

Nigeria loses roughly 512 mothers per 100,000 live births — one of the highest maternal mortality rates in the world. A significant share of those deaths happen during delivery, in communities hours from the nearest hospital, attended by midwives who may be formally trained, traditionally trained, or in the worst cases an untrained family member who simply happened to be present.

The knowledge gap is real. So is the connectivity gap. And so is the language gap — the tools that do exist assume English fluency, stable internet, and expensive hardware. None of those assumptions hold in the field.

Mama AI is our answer to that gap: a fully on-device voice assistant that a midwife can speak to in Nigerian Pidgin and receive accurate, protocol-based clinical guidance — spoken back in the same language. No internet. No cloud fee. No English required. Just the phone in her pocket.

What We Built

Mama AI is a Flutter app that runs a complete AI pipeline on a mid-range Android phone:

The midwife speaks in Nigerian Pidgin
A fine-tuned Whisper model transcribes the speech to written Pidgin text
A fine-tuned Gemma 4 E2B model reasons over the query and responds in Pidgin
The phone's built-in TTS speaks the response aloud

Every component runs locally. The app works in a village with no cell signal. Sessions are logged to a local SQLite database so a qualified clinician can audit what happened after the fact.

┌─────────────────────────────────────────────────────────────┐
│                          MAMA AI                            │
│                                                             │
│   🎙️  Midwife speaks Nigerian Pidgin                        │
│         │                                                   │
│         ▼                                                   │
│   ┌─────────────────────────────────────────┐               │
│   │   Whisper-small (Pidgin fine-tuned)     │               │
│   │   Speech → written Pidgin text          │               │
│   │   Runs on-device via Cactus STT         │               │
│   └─────────────────────────────────────────┘               │
│         │  Pidgin text                                      │
│         ▼                                                   │
│   ┌─────────────────────────────────────────┐               │
│   │   Gemma 4 E2B  (via Cactus SDK)         │               │
│   │   Fine-tuned: midwifery knowledge       │               │
│   │   Pidgin understanding via system prompt│               │
│   │   Context: 2048 tokens + rolling summary│               │
│   └─────────────────────────────────────────┘               │
│         │  Pidgin text response                             │
│         ▼                                                   │
│   📱  Phone built-in TTS                                    │
│         │                                                   │
│         ▼                                                   │
│   🔊  Mama AI speaks back                                   │
└─────────────────────────────────────────────────────────────┘

Why the Pipeline Looks Like This

The original architecture was simpler: Gemma 4 has a native audio encoder, so the first plan was to pipe raw audio directly into the model and let it handle speech understanding and clinical reasoning in a single forward pass.

Testing killed that idea quickly. Gemma 4's audio encoder had no exposure to Nigerian Pidgin. WER (word error rate) was unusable for a medical application — roughly 1 in 2 words wrong means you cannot trust the transcription, and in a clinical setting that is dangerous, not inconvenient.

Separating the pipeline into distinct components let us optimise each independently. A fine-tunable speech model handles Pidgin transcription; Gemma 4 focuses on what it does best — clinical reasoning over text.

The TTS decision followed similar logic. We originally planned to deploy a third model (Kokoro-82M via ONNX Runtime) for speech synthesis. Memory profiling showed that three on-device models pushed total footprint above the iOS app memory limit. The phone's built-in voice-over gives us natural speech output for zero additional RAM cost.

The Pidgin Challenge

Nigerian Pidgin is not a dialect of English — it is a distinct creole with its own grammar, vocabulary, and tense system. Fine-tuning Gemma 4 on English midwifery data (which is all the quality clinical data that exists) gives you a model that reasons well but responds in English. That is not good enough.

We solved this in two layers.

Layer 1 — ASR fine-tuning: We fine-tuned Whisper-small on Nigerian Pidgin speech data. Out of the box, Whisper-small on Pidgin was sitting at roughly 48% WER. Fine-tuning on Pidgin audio/transcript pairs brought this into a range where the transcriptions are clinically reliable. The fine-tuned model runs on the Cactus STT engine.

Layer 2 — Gemma 4 system prompt (few-shot Pidgin grammar): We couldn't just tell the model "respond in Pidgin." We had to actually teach it Pidgin grammar rules in the system prompt. The prompt encodes the full grammar Gemma 4 needs to both parse incoming Pidgin and generate fluent Pidgin responses:

PIDGIN TENSE: dey+V=ongoing. don+V=completed. go+V=future. wan+V=about to. fit+V=can. no+V=negation.
PIDGIN COPULAS: na=X is Y. dey=X is at place. zero copula=X is ADJ.
  NEVER write is/are/was in Pidgin. Right: "Your BP high." Wrong: "Your BP is high."
PRONOUNS: am=him/her/it. una=you plural. dem=they/them.
SERIAL VERBS: carry come=bring. carry go=take away. small small=gradually. now now=immediately.
KEY VOCAB: belle=belly/pregnancy. pikin=baby. born=give birth. pain dey come and go=contractions.
  water don break=membranes ruptured. body hot=fever. blood dey comot=bleeding.

The model also understands clinical mode-switching. "BP", "PPH", "G3P2", "CTG" in a message signals a trained clinician who wants concise, clinical shorthand. Pidgin phrasing about personal symptoms signals a patient or untrained attendant who needs warm, step-by-step guidance. The system prompt handles both modes and locks the response language to whatever language the first message used.

Keeping the Model on the Phone

Gemma 4 E2B defaults to a 128K token context window. At that size the KV cache alone would be around 3 GB — more than a mid-range iPhone can give a single app. We patch config.txt at runtime before loading the model, clamping context_length to 2048. That brings the KV cache down to about 47 MB.

The same patch zeroes out the vision and audio encoder fields in config, forcing the runtime to instantiate the lightweight text-only model path instead of the full multimodal Gemma4MmModel. Without this, even though we never send audio or images, the binary memory-maps ~4–5 GB of vision and audio encoder weights. After the patch, total model footprint is roughly 2 GB — within budget for the app.

// In cactus_service.dart — applied before model load
final visionAudioPatches = {
  RegExp(r'vision_num_layers=(?!0\b)\d+'): 'vision_num_layers=0',
  RegExp(r'audio_num_layers=(?!0\b)\d+'): 'audio_num_layers=0',
  RegExp(r'use_image_tokens=true'): 'use_image_tokens=false',
  // ...
};

Context Management Across a Long Delivery

A delivery can last hours. With a 2048-token ceiling, a long consultation will eventually overflow the context window. We handle this with a rolling summary strategy:

Every 6 message turns, the older portion of the conversation is compressed by calling the model itself with a clinical summary prompt. The summary — 2 to 3 sentences of third-person clinical narrative — is prepended to the system prompt on the next turn. Only the most recent exchanges are passed as full chat history. The model always has a clinical picture of what happened earlier without the full token cost.

System prompt + rolling clinical summary
  + last N messages as full history
  + current user message

How the model was trained and quantized

The Gemma 4 E2B model went through two distinct stages before it could run on a phone:

Stage 1 — Fine-tuning on Kaggle (Unsloth): We used Unsloth to train a LoRA adapter on top of google/gemma-4-e4b-it using 536 midwifery Q&A pairs. Unsloth requires an NVIDIA GPU — Kaggle's free T4 sessions provided that. Training used LoRA rank 16, lora_alpha=16, lora_dropout=0, and use_rslora=True for rank-stabilised scaling. After training, the adapter was merged back into the base model weights inside Kaggle (using Unsloth's push_to_hub_merged) and pushed to HuggingFace Hub as a standard safetensors model.

Kaggle (NVIDIA T4)
  ├─ Load google/gemma-4-e4b-it in 4-bit via Unsloth
  ├─ Train LoRA adapter on 536 midwifery pairs, 3 epochs
  ├─ Merge adapter into base model weights (Unsloth handles Gemma4ClippableLinear internally)
  └─ Push merged model → samariwa/gemma4-mama-ai-merged on HuggingFace Hub

Stage 2 — Quantization and conversion (Cactus CLI): The merged HuggingFace model cannot run on a phone as-is. The Cactus CLI (cactus convert) downloads it, quantizes it to INT8, and converts it into Cactus's on-device .weights format — the only format the Cactus SDK can load. This step runs locally on a Mac.

# One-time setup
git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup

# Download, quantize, and convert to Cactus on-device format
cactus convert samariwa/gemma4-mama-ai-merged ./mama-ai-model

The output is a folder of .weights files (one per layer) plus config.txt and a chat template — everything the Flutter app needs to load and run the model offline.

For the full fine-tuning workstream — including the PEFT blocker we hit with Gemma4ClippableLinear and why the local adapter method failed — see the companion post: Fine-tuning Gemma 4 with Unsloth and Cactus.

The Cloud Handoff

The offline-first design is non-negotiable for areas without connectivity. But in urban communities or clinics with internet access, we can do better. When the on-device model hedges — phrases like "i no sure", "consult a doctor", "beyond my knowledge" — the app detects the uncertainty and, if the user has enabled cloud mode and has internet, hands the query to GPT-4o-mini via the OpenAI API.

The cloud path uses an identical Pidgin-aware system prompt so responses are consistent in voice and language regardless of which backend answered. The user configures their own API key in the settings screen — nothing is baked into the app.

This design keeps the offline experience intact for the majority use case while providing a better answer in the minority of cases where the on-device model is genuinely uncertain and connectivity exists.

The App

The screenshot above is a real consultation. A newborn has just been delivered in Abuja. The midwife reports back in Pidgin: "di baby is now breathing well she is crying and i think its all good di mother is now not in pain she feels much better." Mama AI responds in Patient Mode — warm, reassuring, clinical: keep watching the baby, let the mother rest, monitor closely. The animated waveform at the bottom shows the response being spoken aloud through the phone's TTS as the text streams in.

Notice the Ask cloud button below the conversation — this surfaces when the on-device model detects its own uncertainty, giving the midwife one tap to escalate to GPT-4o-mini while staying in the same conversation thread.

Audit Trail

Every consultation is stored in a local SQLite database. Sessions are identified by patient name and location, entered by the midwife at the start of each encounter. Every turn — the midwife's spoken query and Mama AI's response — is written to the database with a timestamp.

If anything goes wrong during or after a delivery, a qualified clinician can open the app and trace exactly what was asked and what guidance was given. The log is the full conversation, not a summary. This matters for both clinical accountability and for training data collection as the app matures.

What We're Targeting

Mama AI was built for the Kaggle Gemma 4 Good Hackathon.

Repo

The full source — Flutter app, Kaggle fine-tuning notebook, conversion scripts, and process logs — is on GitHub:

github.com/LowUp/Gemma4_hackathon

Mama AI — because every mother deserves a midwife who speaks her language.

TechNotes by Sam (Posts about midwifery)