<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TechNotes by Sam (Posts about midwifery)</title><link>https://samariwa.github.io</link><description></description><atom:link href="https://samariwa.github.io/categories/midwifery.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 &lt;a href="mailto:samuelmariwa@gmail.com"&gt;Samuel Mariwa&lt;/a&gt; </copyright><lastBuildDate>Tue, 26 May 2026 14:02:42 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Mama AI — Building an Offline Voice Assistant for Nigerian Midwives</title><link>https://samariwa.github.io/posts/mama-ai-offline-midwifery-assistant/</link><dc:creator>Samuel Mariwa</dc:creator><description>&lt;h3&gt;The Stakes&lt;/h3&gt;
&lt;p&gt;Nigeria loses roughly &lt;strong&gt;512 mothers per 100,000 live births&lt;/strong&gt; — one of the highest maternal mortality rates in the world. A significant share of those deaths happen during delivery, in communities hours from the nearest hospital, attended by midwives who may be formally trained, traditionally trained, or in the worst cases an untrained family member who simply happened to be present.&lt;/p&gt;
&lt;p&gt;The knowledge gap is real. So is the connectivity gap. And so is the language gap — the tools that do exist assume English fluency, stable internet, and expensive hardware. None of those assumptions hold in the field.&lt;/p&gt;
&lt;p&gt;Mama AI is our answer to all three gaps: a fully on-device voice assistant that a midwife can speak to in &lt;strong&gt;Nigerian Pidgin&lt;/strong&gt; and that returns accurate, protocol-based clinical guidance, spoken back in the same language. No internet. No cloud fee. No English required. Just the phone in her pocket.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;What We Built&lt;/h3&gt;
&lt;p&gt;Mama AI is a Flutter app that runs a complete AI pipeline on a mid-range Android phone:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The midwife speaks in Nigerian Pidgin&lt;/li&gt;
&lt;li&gt;A fine-tuned Whisper model transcribes the speech to written Pidgin text&lt;/li&gt;
&lt;li&gt;A fine-tuned Gemma 4 E2B model reasons over the query and responds in Pidgin&lt;/li&gt;
&lt;li&gt;The phone's built-in TTS speaks the response aloud&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Every component runs locally. The app works in a village with no cell signal. Sessions are logged to a local SQLite database so a qualified clinician can audit what happened after the fact.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;┌─────────────────────────────────────────────────────────────┐
│                          MAMA AI                            │
│                                                             │
│   🎙️  Midwife speaks Nigerian Pidgin                        │
│         │                                                   │
│         ▼                                                   │
│   ┌─────────────────────────────────────────┐               │
│   │   Whisper-small (Pidgin fine-tuned)     │               │
│   │   Speech → written Pidgin text          │               │
│   │   Runs on-device via Cactus STT         │               │
│   └─────────────────────────────────────────┘               │
│         │  Pidgin text                                      │
│         ▼                                                   │
│   ┌─────────────────────────────────────────┐               │
│   │   Gemma 4 E2B  (via Cactus SDK)         │               │
│   │   Fine-tuned: midwifery knowledge       │               │
│   │   Pidgin understanding via system prompt│               │
│   │   Context: 2048 tokens + rolling summary│               │
│   └─────────────────────────────────────────┘               │
│         │  Pidgin text response                             │
│         ▼                                                   │
│   📱  Phone built-in TTS                                    │
│         │                                                   │
│         ▼                                                   │
│   🔊  Mama AI speaks back                                   │
└─────────────────────────────────────────────────────────────┘
&lt;/pre&gt;&lt;/div&gt;

&lt;hr&gt;
&lt;h3&gt;Why the Pipeline Looks Like This&lt;/h3&gt;
&lt;p&gt;The original architecture was simpler: Gemma 4 has a native audio encoder, so the first plan was to pipe raw audio directly into the model and let it handle speech understanding and clinical reasoning in a single forward pass.&lt;/p&gt;
&lt;p&gt;Testing killed that idea quickly. &lt;strong&gt;Gemma 4's audio encoder had no exposure to Nigerian Pidgin&lt;/strong&gt;. Its word error rate (WER) was unusable for a medical application: with roughly 1 in 2 words wrong, the transcription cannot be trusted, and in a clinical setting that is dangerous, not merely inconvenient.&lt;/p&gt;
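&lt;p&gt;For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration in Python, not code from the Mama AI repo; the function name and the Pidgin test sentence are made up for this example.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (di -> the) and one deletion (don) over four reference words.
print(word_error_rate("di pikin don born", "the pikin born"))  # 0.5
```
&lt;/pre&gt;&lt;/div&gt;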
&lt;p&gt;Separating the pipeline into distinct components let us optimise each independently. A fine-tunable speech model handles Pidgin transcription; Gemma 4 focuses on what it does best — clinical reasoning over text.&lt;/p&gt;
&lt;p&gt;The TTS decision followed similar logic. We originally planned to deploy a third model (Kokoro-82M via ONNX Runtime) for speech synthesis. Memory profiling showed that three on-device models pushed the total footprint above the iOS app memory limit. The phone's built-in TTS engine gives us natural speech output at zero additional RAM cost.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;The Pidgin Challenge&lt;/h3&gt;
&lt;p&gt;Nigerian Pidgin is not a dialect of English — it is a distinct creole with its own grammar, vocabulary, and tense system. Fine-tuning Gemma 4 on English midwifery data (which is all the quality clinical data that exists) gives you a model that reasons well but responds in English. That is not good enough.&lt;/p&gt;
&lt;p&gt;We solved this in two layers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 1: ASR fine-tuning.&lt;/strong&gt; We fine-tuned Whisper-small on Nigerian Pidgin speech data. Out of the box, Whisper-small had a WER of roughly 48% on Pidgin. Fine-tuning on Pidgin audio/transcript pairs brought this into a range where the transcriptions are clinically reliable. The fine-tuned model runs on the Cactus STT engine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Layer 2: Gemma 4 system prompt (few-shot Pidgin grammar).&lt;/strong&gt; We couldn't just tell the model "respond in Pidgin." We had to teach it Pidgin grammar rules in the system prompt. The prompt encodes the core grammar rules Gemma 4 needs to parse incoming Pidgin and generate fluent Pidgin responses:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;PIDGIN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;TENSE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dey&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ongoing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;don&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;go&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;wan&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;about&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;can&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;no&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;negation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;PIDGIN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;COPULAS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;na&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;at&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;zero&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;copula&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ADJ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;NEVER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;write&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;are&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;was&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Pidgin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;Right&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;"Your BP high."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;Wrong&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;"Your BP is high."&lt;/span&gt;
&lt;span class="nl"&gt;PRONOUNS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;am&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;him&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;her&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;una&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;you&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;plural&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;they&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;them&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;SERIAL&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;VERBS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;carry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;come&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bring&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;carry&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;go&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;take&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;away&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;small&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;small&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gradually&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;immediately&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;VOCAB&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;belle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;belly&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pregnancy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pikin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;baby&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;born&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;give&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;birth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pain&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dey&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;come&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;and&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;go&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contractions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;water&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;don&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;membranes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ruptured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;hot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;blood&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dey&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;comot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bleeding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The model also understands clinical mode-switching. Abbreviations such as &lt;code&gt;"BP", "PPH", "G3P2", "CTG"&lt;/code&gt; in a message signal a trained clinician who wants concise clinical shorthand. Pidgin phrasing about personal symptoms signals a patient or untrained attendant who needs warm, step-by-step guidance. The system prompt handles both modes and locks the response language to the language of the first message.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Keeping the Model on the Phone&lt;/h3&gt;
&lt;p&gt;Gemma 4 E2B defaults to a 128K token context window. At that size the KV cache alone would be around 3 GB — more than a mid-range iPhone can give a single app. We patch &lt;code&gt;config.txt&lt;/code&gt; at runtime before loading the model, clamping &lt;code&gt;context_length&lt;/code&gt; to 2048. That brings the KV cache down to about 47 MB.&lt;/p&gt;
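&lt;p&gt;The two cache figures are consistent with each other if the KV cache is assumed to grow linearly with context length. A quick check in Python, reading "128K" as 131,072 tokens and "around 3 GB" as 3,000 MB (both are assumptions made for the arithmetic, not measured values):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
FULL_CONTEXT = 128 * 1024   # assumption: "128K" read as 131,072 tokens
CLAMPED_CONTEXT = 2048      # value written to context_length
FULL_CACHE_MB = 3000        # assumption: "around 3 GB" taken as 3,000 MB


def scaled_cache_mb(full_cache_mb, full_context, new_context):
    """Scale a KV cache estimate linearly with context length."""
    return full_cache_mb * new_context / full_context


# The clamp shrinks the context window by a factor of 64 ...
print(FULL_CONTEXT // CLAMPED_CONTEXT)  # 64
# ... and a linearly scaled cache lands close to the 47 MB quoted above.
print(scaled_cache_mb(FULL_CACHE_MB, FULL_CONTEXT, CLAMPED_CONTEXT))  # 46.875
```
&lt;/pre&gt;&lt;/div&gt;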
&lt;p&gt;The same patch zeroes out the vision and audio encoder fields in config, forcing the runtime to instantiate the lightweight text-only model path instead of the full multimodal &lt;code&gt;Gemma4MmModel&lt;/code&gt;. Without this, even though we never send audio or images, the binary memory-maps ~4–5 GB of vision and audio encoder weights. After the patch, total model footprint is roughly 2 GB — within budget for the app.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;// In cactus_service.dart — applied before model load&lt;/span&gt;
&lt;span class="kd"&gt;final&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;visionAudioPatches&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;r'vision_num_layers=(?!0\b)\d+'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'vision_num_layers=0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;r'audio_num_layers=(?!0\b)\d+'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'audio_num_layers=0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;r'use_image_tokens=true'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'use_image_tokens=false'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
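&lt;p&gt;To show how a patch table like this could be applied, here is a minimal sketch in Python rather than the app's Dart. It assumes &lt;code&gt;config.txt&lt;/code&gt; is plain &lt;code&gt;key=value&lt;/code&gt; text, which the patterns above suggest; the sample values and the names &lt;code&gt;patch_config&lt;/code&gt; and &lt;code&gt;clamp_context&lt;/code&gt; are hypothetical.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
import re

MAX_CONTEXT = 2048

# Same patterns as the Dart snippet: only non-zero layer counts are rewritten.
PATCHES = {
    r"vision_num_layers=(?!0\b)\d+": "vision_num_layers=0",
    r"audio_num_layers=(?!0\b)\d+": "audio_num_layers=0",
    r"use_image_tokens=true": "use_image_tokens=false",
}


def clamp_context(match):
    """Keep the configured context length unless it exceeds MAX_CONTEXT."""
    value = min(int(match.group(1)), MAX_CONTEXT)
    return f"context_length={value}"


def patch_config(text):
    """Return config text forced onto the text-only, short-context path."""
    for pattern, replacement in PATCHES.items():
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\bcontext_length=(\d+)", clamp_context, text)


# Made-up sample values, for illustration only.
sample = "\n".join([
    "context_length=131072",
    "vision_num_layers=16",
    "audio_num_layers=12",
    "use_image_tokens=true",
])
print(patch_config(sample))
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Because the layer patterns skip values that are already zero, running the patch a second time leaves the text unchanged.&lt;/p&gt;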

&lt;h4&gt;Context Management Across a Long Delivery&lt;/h4&gt;
&lt;p&gt;A delivery can last hours. With a 2048-token ceiling, a long consultation will eventually overflow the context window. We handle this with a rolling summary strategy:&lt;/p&gt;
&lt;p&gt;Every 6 message turns, the older portion of the conversation is compressed by calling the model itself with a clinical summary prompt. The summary — 2 to 3 sentences of third-person clinical narrative — is prepended to the system prompt on the next turn. Only the most recent exchanges are passed as full chat history. The model always has a clinical picture of what happened earlier without the full token cost.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;System prompt + rolling clinical summary
  + last N messages as full history
  + current user message
&lt;/pre&gt;&lt;/div&gt;
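&lt;p&gt;The strategy above can be sketched as a small bookkeeping class. This is an illustration in Python under stated assumptions, not the app's implementation: &lt;code&gt;summarise&lt;/code&gt; is a stand-in for the call to the model with the clinical summary prompt, &lt;code&gt;KEEP_LAST = 4&lt;/code&gt; is a made-up value for the unspecified N, and the summary is placed after the system prompt to match the layout shown above.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

SUMMARY_EVERY = 6  # from the article: compress every 6 message turns
KEEP_LAST = 4      # assumption: the article only says "last N messages"

Turn = Tuple[str, str]  # (role, text)


@dataclass
class Conversation:
    system_prompt: str
    summary: str = ""
    history: List[Turn] = field(default_factory=list)
    turns_since_summary: int = 0

    def add_turn(self, role: str, text: str, summarise: Callable) -> None:
        """Record a turn and fold older turns into the summary when due."""
        self.history.append((role, text))
        self.turns_since_summary += 1
        if self.turns_since_summary >= SUMMARY_EVERY:
            older = self.history[:-KEEP_LAST]
            self.history = self.history[-KEEP_LAST:]
            # The previous summary is passed in so earlier context is not lost.
            self.summary = summarise(self.summary, older)
            self.turns_since_summary = 0

    def build_messages(self, user_message: str) -> List[Turn]:
        """System prompt plus summary, recent history, then the new message."""
        system = self.system_prompt
        if self.summary:
            system += "\n\nEarlier in this consultation: " + self.summary
        return [("system", system), *self.history, ("user", user_message)]
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;With these values the prompt never carries more than the system prompt, one short summary, and a handful of recent turns, which is what keeps a long consultation inside the 2048-token window.&lt;/p&gt;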

&lt;h4&gt;How the Model Was Trained and Quantized&lt;/h4&gt;
&lt;p&gt;The Gemma 4 E2B model went through two distinct stages before it could run on a phone:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stage 1 — Fine-tuning on Kaggle (Unsloth):&lt;/strong&gt; We used &lt;a href="https://github.com/unslothai/unsloth"&gt;Unsloth&lt;/a&gt; to train a LoRA adapter on top of &lt;code&gt;google/gemma-4-e4b-it&lt;/code&gt; using 536 midwifery Q&amp;amp;A pairs. Unsloth requires an NVIDIA GPU — Kaggle's free T4 sessions provided that. Training used LoRA rank 16, &lt;code&gt;lora_alpha=16&lt;/code&gt;, &lt;code&gt;lora_dropout=0&lt;/code&gt;, and &lt;code&gt;use_rslora=True&lt;/code&gt; for rank-stabilised scaling. After training, the adapter was merged back into the base model weights inside Kaggle (using Unsloth's &lt;code&gt;push_to_hub_merged&lt;/code&gt;) and pushed to HuggingFace Hub as a standard safetensors model.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;Kaggle (NVIDIA T4)
  ├─ Load google/gemma-4-e4b-it in 4-bit via Unsloth
  ├─ Train LoRA adapter on 536 midwifery pairs, 3 epochs
  ├─ Merge adapter into base model weights (Unsloth handles Gemma4ClippableLinear internally)
  └─ Push merged model → samariwa/gemma4-mama-ai-merged on HuggingFace Hub
&lt;/pre&gt;&lt;/div&gt;
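&lt;p&gt;With rank 16 and &lt;code&gt;lora_alpha=16&lt;/code&gt;, the &lt;code&gt;use_rslora&lt;/code&gt; flag changes how strongly the adapter update is scaled. Standard LoRA multiplies the update by alpha / r, while rank-stabilised LoRA uses alpha / sqrt(r). A small Python check of the two factors for the settings above (the helper name is made up for illustration):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
import math


def lora_scale(alpha, rank, use_rslora):
    """Scaling factor applied to the low-rank update."""
    if use_rslora:
        return alpha / math.sqrt(rank)  # rank-stabilised scaling
    return alpha / rank                 # standard LoRA scaling


print(lora_scale(16, 16, use_rslora=False))  # 1.0
print(lora_scale(16, 16, use_rslora=True))   # 4.0
```
&lt;/pre&gt;&lt;/div&gt;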

&lt;p&gt;&lt;strong&gt;Stage 2 — Quantization and conversion (Cactus CLI):&lt;/strong&gt; The merged HuggingFace model cannot run on a phone as-is. The Cactus CLI (&lt;code&gt;cactus convert&lt;/code&gt;) downloads it, quantizes it to INT8, and converts it into Cactus's on-device &lt;code&gt;.weights&lt;/code&gt; format — the only format the Cactus SDK can load. This step runs locally on a Mac.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# One-time setup&lt;/span&gt;
git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;https://github.com/cactus-compute/cactus&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;cactus&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./setup

&lt;span class="c1"&gt;# Download, quantize, and convert to Cactus on-device format&lt;/span&gt;
cactus&lt;span class="w"&gt; &lt;/span&gt;convert&lt;span class="w"&gt; &lt;/span&gt;samariwa/gemma4-mama-ai-merged&lt;span class="w"&gt; &lt;/span&gt;./mama-ai-model
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The output is a folder of &lt;code&gt;.weights&lt;/code&gt; files (one per layer) plus &lt;code&gt;config.txt&lt;/code&gt; and a chat template — everything the Flutter app needs to load and run the model offline.&lt;/p&gt;
&lt;p&gt;For the full fine-tuning workstream — including the PEFT blocker we hit with &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; and why the local adapter method failed — see the companion post: &lt;a href="https://samariwa.github.io/posts/finetuning-gemma4-unsloth-cactus/"&gt;Fine-tuning Gemma 4 with Unsloth and Cactus&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;The Cloud Handoff&lt;/h3&gt;
&lt;p&gt;The offline-first design is non-negotiable for areas without connectivity. But in urban communities or clinics with internet access, we can do better. When the on-device model hedges — phrases like &lt;code&gt;"i no sure"&lt;/code&gt;, &lt;code&gt;"consult a doctor"&lt;/code&gt;, &lt;code&gt;"beyond my knowledge"&lt;/code&gt; — the app detects the uncertainty and, if the user has enabled cloud mode and has internet, hands the query to &lt;strong&gt;GPT-4o-mini&lt;/strong&gt; via the OpenAI API.&lt;/p&gt;
&lt;p&gt;The cloud path uses an identical Pidgin-aware system prompt so responses are consistent in voice and language regardless of which backend answered. The user configures their own API key in the settings screen — nothing is baked into the app.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://samariwa.github.io/images/mama-ai-settings.jpg" alt="Mama AI settings screen — cloud handoff configuration" style="max-width: 300px; display: block; margin: 1rem auto;"&gt;&lt;/p&gt;
&lt;p&gt;This design keeps the offline experience intact for the majority use case while providing a better answer in the minority of cases where the on-device model is genuinely uncertain and connectivity exists.&lt;/p&gt;
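&lt;p&gt;The handoff rule described in this section can be sketched in a few lines. This is an illustration in Python, not the app's Dart code: the phrase list comes from the examples above, while the function name, the plain substring match, and the sample replies are assumptions.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
HEDGE_PHRASES = ("i no sure", "consult a doctor", "beyond my knowledge")


def should_offer_cloud(reply, cloud_enabled, online):
    """True when the local reply hedges and the cloud path is usable."""
    hedged = any(phrase in reply.lower() for phrase in HEDGE_PHRASES)
    return hedged and cloud_enabled and online


# Made-up replies, for illustration only.
print(should_offer_cloud("I no sure about dis one.", True, True))   # True
print(should_offer_cloud("I no sure about dis one.", True, False))  # False
print(should_offer_cloud("Make she rest small.", True, True))       # False
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Checking the user's setting and connectivity in the same predicate means a hedged reply never triggers a network call the user has not opted into.&lt;/p&gt;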
&lt;hr&gt;
&lt;h3&gt;The App&lt;/h3&gt;
&lt;p&gt;&lt;img src="https://samariwa.github.io/images/mama-ai-conversation.png" alt="Mama AI — live Pidgin consultation during a delivery in Abuja" style="max-width: 300px; display: block; margin: 1rem auto;"&gt;&lt;/p&gt;
&lt;p&gt;The screenshot above is a real consultation. A newborn has just been delivered in Abuja. The midwife reports back in Pidgin: &lt;em&gt;"di baby is now breathing well she is crying and i think its all good di mother is now not in pain she feels much better."&lt;/em&gt; Mama AI responds in Patient Mode — warm, reassuring, clinical: keep watching the baby, let the mother rest, monitor closely. The animated waveform at the bottom shows the response being spoken aloud through the phone's TTS as the text streams in.&lt;/p&gt;
&lt;p&gt;Notice the &lt;strong&gt;Ask cloud&lt;/strong&gt; button below the conversation. It appears when the app detects hedging in the on-device model's reply, giving the midwife one tap to escalate to GPT-4o-mini while staying in the same conversation thread.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Audit Trail&lt;/h3&gt;
&lt;p&gt;Every consultation is stored in a local SQLite database. Sessions are identified by &lt;strong&gt;patient name and location&lt;/strong&gt;, entered by the midwife at the start of each encounter. Every turn — the midwife's spoken query and Mama AI's response — is written to the database with a timestamp.&lt;/p&gt;
&lt;p&gt;If anything goes wrong during or after a delivery, a qualified clinician can open the app and trace exactly what was asked and what guidance was given. The log is the full conversation, not a summary. This matters for both clinical accountability and for training data collection as the app matures.&lt;/p&gt;
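&lt;p&gt;A log with these properties needs only two tables. The sketch below uses Python's standard &lt;code&gt;sqlite3&lt;/code&gt; module to illustrate the idea; the table names, column names, and helper functions are hypothetical and are not taken from the app's actual schema.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per encounter, one row per spoken turn.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    id INTEGER PRIMARY KEY,
    patient_name TEXT NOT NULL,
    location TEXT NOT NULL,
    started_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS turns (
    id INTEGER PRIMARY KEY,
    session_id INTEGER NOT NULL REFERENCES sessions(id),
    speaker TEXT NOT NULL,
    text TEXT NOT NULL,
    created_at TEXT NOT NULL
);
"""


def now_utc():
    """ISO 8601 timestamp in UTC."""
    return datetime.now(timezone.utc).isoformat()


def open_log(path=":memory:"):
    """Open the log database and create the tables if they are missing."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def start_session(db, patient_name, location):
    """Create a session identified by patient name and location."""
    cur = db.execute(
        "INSERT INTO sessions (patient_name, location, started_at) "
        "VALUES (?, ?, ?)",
        (patient_name, location, now_utc()),
    )
    db.commit()
    return cur.lastrowid


def log_turn(db, session_id, speaker, text):
    """Append one timestamped turn to a session."""
    db.execute(
        "INSERT INTO turns (session_id, speaker, text, created_at) "
        "VALUES (?, ?, ?, ?)",
        (session_id, speaker, text, now_utc()),
    )
    db.commit()


def transcript(db, session_id):
    """Return every turn of a session in the order it was logged."""
    rows = db.execute(
        "SELECT speaker, text, created_at FROM turns "
        "WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return rows.fetchall()
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Storing each turn as its own row, rather than a rolled-up summary, is what lets a reviewer replay the consultation exactly as it happened.&lt;/p&gt;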
&lt;hr&gt;
&lt;h3&gt;What We're Targeting&lt;/h3&gt;
&lt;p&gt;Mama AI was built for the &lt;a href="https://www.kaggle.com/competitions/gemma-4-good-hackathon"&gt;Kaggle Gemma 4 Good Hackathon&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Repo&lt;/h3&gt;
&lt;p&gt;The full source — Flutter app, Kaggle fine-tuning notebook, conversion scripts, and process logs — is on GitHub:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/LowUp/Gemma4_hackathon/tree/cactus_setup"&gt;github.com/LowUp/Gemma4_hackathon&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Mama AI — because every mother deserves a midwife who speaks her language.&lt;/em&gt;&lt;/p&gt;</description><category>AI</category><category>Flutter</category><category>Gemma4</category><category>healthcare</category><category>midwifery</category><category>mobile</category><category>NLP</category><category>on-device</category><category>Pidgin</category><guid>https://samariwa.github.io/posts/mama-ai-offline-midwifery-assistant/</guid><pubDate>Tue, 26 May 2026 09:00:00 GMT</pubDate></item></channel></rss>