<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TechNotes by Sam</title><link>https://samariwa.github.io</link><description>Welcome to TechNotes by Sam, the digital journal of a tech enthusiast passionate about turning real-world challenges into clean, efficient solutions. This site is a curated showcase of my work in systems engineering, backend development, and data science—from hands-on infrastructure projects to intelligent automation and AI applications.</description><atom:link href="https://samariwa.github.io/rss.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 &lt;a href="mailto:samuelmariwa@gmail.com"&gt;Samuel Mariwa&lt;/a&gt; </copyright><lastBuildDate>Wed, 29 Apr 2026 07:50:47 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Fine-tuning Gemma 4 with Unsloth and Cactus</title><link>https://samariwa.github.io/posts/finetuning-gemma4-unsloth-cactus/</link><dc:creator>Samuel Mariwa</dc:creator><description>&lt;h3&gt;Context&lt;/h3&gt;
&lt;p&gt;This post covers the fine-tuning workstream for a project that uses &lt;strong&gt;Google's Gemma 4 E4B&lt;/strong&gt; (&lt;code&gt;google/gemma-4-e4b-it&lt;/code&gt;) as its reasoning core, running entirely on-device via the &lt;strong&gt;Cactus SDK&lt;/strong&gt;. The goal was to equip the base model with clinical domain knowledge using &lt;strong&gt;Unsloth&lt;/strong&gt; and &lt;strong&gt;LoRA&lt;/strong&gt;, then convert the result into a format the mobile app can load. This is a record of how that pipeline was built, and the two blockers we hit along the way.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Why Fine-tune at All?&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;google/gemma-4-e4b-it&lt;/code&gt; is a strong instruction-following model, but out of the box it has no grounding in clinical data. Without fine-tuning, it produces plausible-sounding answers that may be factually wrong in a specialised domain — not acceptable for the use case.&lt;/p&gt;
&lt;p&gt;We used &lt;strong&gt;LoRA (Low-Rank Adaptation)&lt;/strong&gt; rather than full fine-tuning. LoRA freezes the base model and trains a small set of additional weight matrices (the "adapter") that steer the model's outputs without touching the original weights. At our dataset size — 536 medical instruction pairs — this is both compute-efficient and safer from an overfitting standpoint.&lt;/p&gt;
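&lt;p&gt;To make the mechanism concrete, here is a minimal, illustrative sketch of what a LoRA-wrapped linear layer does. This is not Unsloth's or PEFT's implementation, just the core idea: the base weight stays frozen and only the two small matrices &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt; are trained.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: a frozen base layer plus a trainable low-rank delta."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # base model stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r                   # classic LoRA scaling

    def forward(self, x):
        # frozen path + scaled low-rank adapter path
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
&lt;/pre&gt;&lt;/div&gt;
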
&lt;hr&gt;
&lt;h3&gt;Why Kaggle?&lt;/h3&gt;
&lt;p&gt;Unsloth — the library we used to train the adapter — currently requires an &lt;strong&gt;NVIDIA GPU&lt;/strong&gt;. It does not yet run on Apple Silicon (MLX support is on their roadmap). Our local machines are Apple Silicon. Kaggle provides free T4/P100 GPU sessions with persistent storage and secret management, making it the practical choice for running Unsloth without a cloud GPU subscription. The full training notebook lives in the project repo.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;The Finetuning Pipeline&lt;/h3&gt;
&lt;h4&gt;Step 1 — Install dependencies&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;deps&lt;/span&gt; &lt;span class="s2"&gt;"trl==0.22.0"&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="n"&gt;accelerate&lt;/span&gt; &lt;span class="n"&gt;bitsandbytes&lt;/span&gt; &lt;span class="n"&gt;sentencepiece&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"datasets==3.5.0"&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We pin &lt;code&gt;trl==0.22.0&lt;/code&gt; and &lt;code&gt;datasets==3.5.0&lt;/code&gt; to avoid upstream API breakage — both libraries were changing quickly at the time of training.&lt;/p&gt;
&lt;h4&gt;Step 2 — Configuration&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;huggingface_hub&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;kaggle_secrets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;UserSecretsClient&lt;/span&gt;

&lt;span class="n"&gt;user_secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UserSecretsClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;HF_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_secrets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_secret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"HF_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# stored in Kaggle secrets, never hardcoded&lt;/span&gt;
&lt;span class="n"&gt;login&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;DATASET_PATH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'/kaggle/input/datasets/samuelmariwa/gemma4-clinical-finetuning/pass1_medical_training.jsonl'&lt;/span&gt;
&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'/kaggle/working/adapter'&lt;/span&gt;

&lt;span class="n"&gt;BASE_MODEL&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'google/gemma-4-e4b-it'&lt;/span&gt;
&lt;span class="n"&gt;MAX_SEQ_LENGTH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;
&lt;span class="n"&gt;LOAD_IN_4BIT&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;   &lt;span class="c1"&gt;# fits comfortably on T4 VRAM&lt;/span&gt;

&lt;span class="n"&gt;LORA_R&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;LORA_ALPHA&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;   &lt;span class="c1"&gt;# equal to r — see Challenge 1 below&lt;/span&gt;
&lt;span class="n"&gt;LORA_DROPOUT&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;    &lt;span class="c1"&gt;# see Challenge 1 below&lt;/span&gt;
&lt;span class="n"&gt;NUM_EPOCHS&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;    &lt;span class="c1"&gt;# see Challenge 1 below&lt;/span&gt;
&lt;span class="n"&gt;LEARNING_RATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;2e-4&lt;/span&gt;
&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;GRAD_ACCUM&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;    &lt;span class="c1"&gt;# effective batch size = 8&lt;/span&gt;
&lt;span class="n"&gt;SEED&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The HuggingFace token lives in Kaggle's secret manager. It is never written to code or files.&lt;/p&gt;
&lt;h4&gt;Step 3 — Load the base model and attach the LoRA adapter&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;unsloth&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_bfloat16_supported&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BASE_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_seq_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_SEQ_LENGTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LOAD_IN_4BIT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;                          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LORA_R&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;             &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'q_proj'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'k_proj'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'v_proj'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'o_proj'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="s1"&gt;'gate_proj'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'up_proj'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'down_proj'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;                 &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LORA_ALPHA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;               &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LORA_DROPOUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bias&lt;/span&gt;                       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'none'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_gradient_checkpointing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'unsloth'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;               &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_rslora&lt;/span&gt;                 &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;use_rslora=True&lt;/code&gt; enables rank-stabilised LoRA scaling, which normalises gradient magnitude across ranks. Unsloth recommends this for most use cases and it doubles as implicit regularisation — one of the reasons &lt;code&gt;lora_dropout=0&lt;/code&gt; is correct here (more on that below).&lt;/p&gt;
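&lt;p&gt;The practical difference is only the scaling factor applied to the adapter's output. A rough sketch, assuming the standard formulas for the two variants:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import math

r, alpha = 16, 16

standard_scale = alpha / r              # classic LoRA scaling:    16 / 16       = 1.0
rslora_scale   = alpha / math.sqrt(r)   # rank-stabilised scaling: 16 / sqrt(16) = 4.0

# With rsLoRA the adapter's contribution no longer shrinks linearly as r grows,
# so update magnitudes stay comparable across different ranks.
print(standard_scale, rslora_scale)
&lt;/pre&gt;&lt;/div&gt;
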
&lt;p&gt;&lt;code&gt;use_gradient_checkpointing='unsloth'&lt;/code&gt; is Unsloth's custom implementation that trades some compute for significantly lower VRAM usage. On a T4 (16 GB), loading Gemma 4 E4B in 4-bit and training with this flag keeps memory usage manageable.&lt;/p&gt;
&lt;h4&gt;Step 4 — Prepare the dataset&lt;/h4&gt;
&lt;p&gt;The training data is a JSONL file of &lt;code&gt;{"instruction": "...", "output": "..."}&lt;/code&gt; records: 536 clinical question-answer pairs curated for the target domain.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;datasets&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATASET_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;

&lt;span class="n"&gt;EOS&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eos_token&lt;/span&gt;
&lt;span class="n"&gt;PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'Below is an instruction that describes a task. '&lt;/span&gt;
    &lt;span class="s1"&gt;'Write a response that appropriately completes the request.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;
    &lt;span class="s1"&gt;'### Instruction:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{instruction}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;### Response:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{output}&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;PROMPT&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;EOS&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'instruction'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'output'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MAX_SEQ_LENGTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'max_length'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'labels'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'input_ids'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;

&lt;span class="n"&gt;raw&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remove_columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Alpaca-style prompt format (&lt;code&gt;### Instruction&lt;/code&gt; / &lt;code&gt;### Response&lt;/code&gt;) works well with &lt;code&gt;gemma-4-e4b-it&lt;/code&gt;'s instruction-following behaviour, and because each record is rendered into a single &lt;code&gt;text&lt;/code&gt; field and tokenised up front, TRL's &lt;code&gt;SFTTrainer&lt;/code&gt; consumes it without any extra chat-template configuration.&lt;/p&gt;
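&lt;p&gt;For reference, a single formatted record looks like this before tokenisation (placeholder text rather than a real example from the dataset; the trailing token is the tokenizer's EOS):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
&lt;clinical question goes here&gt;

### Response:
&lt;curated answer goes here&gt;&lt;EOS&gt;
&lt;/pre&gt;&lt;/div&gt;
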
&lt;h4&gt;Step 5 — Train&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;trl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SFTConfig&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;         &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SFTConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_seq_length&lt;/span&gt;              &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_SEQ_LENGTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;packing&lt;/span&gt;                     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;output_dir&lt;/span&gt;                  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;            &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NUM_EPOCHS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GRAD_ACCUM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;optim&lt;/span&gt;                       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'adamw_8bit'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;learning_rate&lt;/span&gt;               &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LEARNING_RATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;warmup_ratio&lt;/span&gt;                &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;lr_scheduler_type&lt;/span&gt;           &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'cosine'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;weight_decay&lt;/span&gt;                &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_grad_norm&lt;/span&gt;               &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fp16&lt;/span&gt;                        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;is_bfloat16_supported&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;bf16&lt;/span&gt;                        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;is_bfloat16_supported&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;logging_steps&lt;/span&gt;               &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;save_strategy&lt;/span&gt;               &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'epoch'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;report_to&lt;/span&gt;                   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'none'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;seed&lt;/span&gt;                        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SEED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dataloader_num_workers&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;t0&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s1"&gt;'Done!  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.1f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt; min  |  loss: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training_loss&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s1"&gt;.4f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;adamw_8bit&lt;/code&gt; selects the 8-bit AdamW optimiser from bitsandbytes, which stores the optimiser states in quantised form and significantly cuts their memory footprint compared to standard AdamW. That matters when the base model is already eating most of the VRAM.&lt;/p&gt;
&lt;h4&gt;Step 6 — Save the adapter&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The output directory (&lt;code&gt;adapter/&lt;/code&gt;) contains:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;adapter_model.safetensors&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The trained LoRA weights (~50–100 MB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;adapter_config.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LoRA hyperparameters and target layer config&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tokenizer.json&lt;/code&gt; / &lt;code&gt;tokenizer_config.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tokenizer files needed for inference&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The adapter &lt;strong&gt;cannot run standalone&lt;/strong&gt;. It is a delta that must be merged into or loaded on top of &lt;code&gt;google/gemma-4-e4b-it&lt;/code&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Challenge 1 — Hyperparameter Alignment with Cactus&lt;/h3&gt;
&lt;p&gt;Before the first training run, three hyperparameters in the initial notebook were misaligned with what Cactus expects:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Initial value&lt;/th&gt;
&lt;th&gt;Corrected value&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LORA_ALPHA&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;Cactus docs use &lt;code&gt;lora_alpha = r&lt;/code&gt;. A 2× alpha applies a higher effective learning rate to the adapter and risks overfitting on 536 pairs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LORA_DROPOUT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;use_rslora=True&lt;/code&gt; already provides implicit regularisation. Non-zero dropout adds unvalidated stochasticity; Cactus's reference notebooks all use 0.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NUM_EPOCHS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;At effective batch size 8, 10 epochs over 536 samples is ~670 gradient steps, enough for the model to memorise the training set. 3 epochs (~200 steps) is the standard LoRA starting point for datasets of this size (see the quick check after this table).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
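&lt;p&gt;A quick back-of-the-envelope check of the step counts referenced in the table (assuming no remainder batches are dropped):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import math

samples         = 536
effective_batch = 2 * 4   # BATCH_SIZE * GRAD_ACCUM

steps_per_epoch = math.ceil(samples / effective_batch)   # 67
print(steps_per_epoch * 10)   # 670 optimiser steps at 10 epochs
print(steps_per_epoch * 3)    # 201 optimiser steps at 3 epochs
&lt;/pre&gt;&lt;/div&gt;
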
&lt;p&gt;The &lt;a href="https://docs.cactuscompute.com/latest/docs/finetuning/"&gt;Cactus fine-tuning documentation&lt;/a&gt; uses &lt;code&gt;r=16&lt;/code&gt;, &lt;code&gt;lora_alpha=16&lt;/code&gt;, &lt;code&gt;lora_dropout=0&lt;/code&gt; throughout. Matching these values ensures the adapter behaves as tested by the Cactus team and is the correct prior for a small dataset.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Challenge 2 — PEFT Does Not Support &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;After training, the plan was to download the adapter from Kaggle's Output tab and run &lt;code&gt;cactus convert&lt;/code&gt; locally with the &lt;code&gt;--lora&lt;/code&gt; flag:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;cactus&lt;span class="w"&gt; &lt;/span&gt;convert&lt;span class="w"&gt; &lt;/span&gt;google/gemma-4-e4b-it&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;/path/to/on-device-model&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--lora&lt;span class="w"&gt; &lt;/span&gt;/path/to/adapter
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This failed. The error:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;ValueError: target_modules ... contains 'Gemma4ClippableLinear' which is not supported by PEFT.
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; is a new linear layer type introduced in Gemma 4. The standard PEFT library does not know how to load LoRA adapters whose config references this module type. When &lt;code&gt;cactus convert&lt;/code&gt; tried to load the adapter via PEFT, it hit this wall immediately.&lt;/p&gt;
&lt;h4&gt;The Fix — Merge Inside Kaggle Using Unsloth&lt;/h4&gt;
&lt;p&gt;Unsloth handles &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt; internally. Since the adapter was produced by Unsloth, the fix was to &lt;strong&gt;merge the adapter into the base model while still inside the Kaggle notebook&lt;/strong&gt; (where Unsloth is installed), produce a complete merged model, and push that to HuggingFace Hub. &lt;code&gt;cactus convert&lt;/code&gt; then works against the merged model — no &lt;code&gt;--lora&lt;/code&gt; flag, no PEFT, no incompatible layer types.&lt;/p&gt;
&lt;p&gt;The merge cell added to the end of the notebook:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# Merge LoRA into the base model and push the complete model to HuggingFace Hub&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# load the just-trained adapter&lt;/span&gt;
    &lt;span class="n"&gt;max_seq_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_SEQ_LENGTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LOAD_IN_4BIT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;push_to_hub_merged&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;"samariwa/gemma4-merged"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# HuggingFace repo to push to&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"merged_16bit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;HF_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;save_method="merged_16bit"&lt;/code&gt; fully merges the LoRA adapter into the base model weights and saves the result as standard float16 HuggingFace safetensors. The output on HuggingFace Hub is a complete, standalone model — no adapter config, no PEFT dependency.&lt;/p&gt;
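&lt;p&gt;If you also want a copy of the merged weights in the Kaggle working directory, Unsloth provides a local counterpart to &lt;code&gt;push_to_hub_merged&lt;/code&gt;. This was not part of the pipeline above; treat it as a sketch, with the output path chosen purely for illustration:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Sketch only: write the merged model to local disk instead of the Hub.
model.save_pretrained_merged(
    "/kaggle/working/gemma4-merged",   # illustrative output path
    tokenizer,
    save_method = "merged_16bit",
)
&lt;/pre&gt;&lt;/div&gt;
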
&lt;hr&gt;
&lt;h3&gt;Step 7 — Cactus Conversion (the Working Method)&lt;/h3&gt;
&lt;p&gt;Once the merged model is on HuggingFace Hub, &lt;code&gt;cactus convert&lt;/code&gt; can download and quantise it in one command:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# One-time: clone and set up the Cactus CLI&lt;/span&gt;
git&lt;span class="w"&gt; &lt;/span&gt;clone&lt;span class="w"&gt; &lt;/span&gt;--depth&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;https://github.com/cactus-compute/cactus
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;cactus&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./setup
&lt;span class="c1"&gt;# source ./setup must be re-run in every new terminal session&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# Convert merged model to Cactus on-device format&lt;/span&gt;
cactus&lt;span class="w"&gt; &lt;/span&gt;convert&lt;span class="w"&gt; &lt;/span&gt;samariwa/gemma4-merged&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;/path/to/assets/models/on-device-model
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;No &lt;code&gt;--lora&lt;/code&gt; flag — the weights are already merged. Cactus downloads the model from HuggingFace, quantises it to INT8, and writes the on-device format to the output directory. That directory is what the Flutter app loads via &lt;code&gt;cactusInit()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For comparison, the &lt;strong&gt;original (failed) local adapter method&lt;/strong&gt; would have been:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# This does NOT work for Gemma 4 — Gemma4ClippableLinear incompatibility&lt;/span&gt;
cactus&lt;span class="w"&gt; &lt;/span&gt;convert&lt;span class="w"&gt; &lt;/span&gt;google/gemma-4-e4b-it&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;/path/to/on-device-model&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--lora&lt;span class="w"&gt; &lt;/span&gt;/path/to/adapter
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If and when PEFT adds support for &lt;code&gt;Gemma4ClippableLinear&lt;/code&gt;, the local adapter method will become viable and will remove the extra HuggingFace Hub push step.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;Summary of the Working Pipeline&lt;/h3&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;Kaggle&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NVIDIA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;GPU&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Unsloth&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;installed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;├─&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gemma&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e4b&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;bit&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;├─&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Attach&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;LoRA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;rslora&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;├─&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Train&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;536&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;medical&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;├─&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Save&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;└─&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Merge&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;into&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;HuggingFace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Hub&lt;/span&gt;

&lt;span class="n"&gt;Local&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;machine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Mac&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Linux&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Cactus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CLI&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;installed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;└─&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cactus&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;convert&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;samariwa&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gemma4&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;./&lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Downloads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Quantises&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;INT8&lt;/span&gt;
&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Writes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;

&lt;span class="n"&gt;Flutter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;└─&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cactus_service&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dart&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;./&lt;/span&gt;&lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;startup&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;fully&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;offline&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;hr&gt;
&lt;h3&gt;Project Repo&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/LowUp/Gemma4_hackathon/tree/cactus_setup"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;</description><guid>https://samariwa.github.io/posts/finetuning-gemma4-unsloth-cactus/</guid><pubDate>Tue, 28 Apr 2026 11:00:00 GMT</pubDate></item><item><title>Building suspETHious — An AI Agent for Wallet Security</title><link>https://samariwa.github.io/posts/building-suspethious-ai-agent-wallet-security/</link><dc:creator>Samuel Mariwa</dc:creator><description>&lt;h3&gt;The Problem&lt;/h3&gt;
&lt;p&gt;Decentralized wallets today are powerful, but they lack built-in mechanisms to help users &lt;strong&gt;evaluate the trustworthiness&lt;/strong&gt; of past transactions or counterparties. Many users fall victim to fraudulent transactions or scams without even realizing the red flags.&lt;/p&gt;
&lt;h3&gt;The Idea: &lt;strong&gt;suspETHious&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;suspETHious&lt;/strong&gt; is a smart Ethereum-based crypto wallet enhanced with an &lt;strong&gt;AI-powered audit agent&lt;/strong&gt;. Before executing any outgoing transaction, the agent scans and analyzes the &lt;strong&gt;historical transaction patterns&lt;/strong&gt; of the target wallet.&lt;/p&gt;
&lt;p&gt;The AI agent determines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether the wallet has a history of &lt;strong&gt;interacting with known scam addresses&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Whether transaction frequencies and amounts look &lt;strong&gt;anomalous&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Whether gas fees or token behavior &lt;strong&gt;deviate from norms&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;How It Works&lt;/h4&gt;
&lt;p&gt;&lt;img alt="Crypto Wallet Auditing Diagram" src="https://samariwa.github.io/images/blockchain-crypto.jpg"&gt;&lt;/p&gt;
&lt;p&gt;Whenever you enter a recipient address, the wallet fetches all historical transactions for that address using the Etherscan API:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;BASE_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://api.etherscan.io/api"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;fetch_wallet_transactions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;?module=account&amp;amp;action=txlist"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&amp;amp;address=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;startblock=0&amp;amp;endblock=99999999"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&amp;amp;sort=asc&amp;amp;apikey=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'ETHERSCAN_API_KEY'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"no transactions"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="ne"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"Error from Etherscan: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'message'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The transaction data is then transformed into features that capture behavioral patterns—like average time between transactions, unique counterparties, and value statistics. These features are fed into a machine learning model trained to spot scam-like activity:&lt;/p&gt;
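&lt;p&gt;The exact feature set lives in the repo, but a simplified sketch of the idea looks roughly like this. Field names follow Etherscan's &lt;code&gt;txlist&lt;/code&gt; response; the specific features shown are illustrative rather than the trained model's exact inputs:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import statistics

def extract_features(txs, address):
    """Illustrative behavioral features built from a list of Etherscan transactions."""
    address = address.lower()
    timestamps = sorted(int(tx["timeStamp"]) for tx in txs)
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    values_eth = [int(tx["value"]) / 1e18 for tx in txs]   # wei to ETH
    counterparties = {
        tx["to"] if tx["from"].lower() == address else tx["from"]
        for tx in txs
    }
    return {
        "tx_count": len(txs),
        "avg_time_between_tx": statistics.mean(gaps) if gaps else 0.0,
        "unique_counterparties": len(counterparties),
        "mean_value_eth": statistics.mean(values_eth) if values_eth else 0.0,
        "max_value_eth": max(values_eth, default=0.0),
    }
&lt;/pre&gt;&lt;/div&gt;
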
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.model_selection&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sklearn.preprocessing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PowerTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;imblearn.over_sampling&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomOverSampler&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;xgboost&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;xgb&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pickle&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;train_model&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'transaction_dataset.csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index_col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ...data cleaning and feature selection...&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;norm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PowerTransformer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;norm_train_f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;oversample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomOverSampler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;x_tr_resample&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_tr_resample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oversample&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm_train_f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;xgb_c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;XGBClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;xgb_c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_tr_resample&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_tr_resample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'scam_model.pkl'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'wb'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xgb_c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'scam_normalizer.pkl'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'wb'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When a transaction is about to be sent, the model predicts whether the recipient is likely to be a scam address. If flagged, the system uses SHAP to explain which features contributed most to the verdict, and then passes this to a conversational AI agent for a human-friendly summary:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;shap&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;langchain.agents&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_openai_functions_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;langchain_openai&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;explain_scam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;explainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shap&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TreeExplainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;shap_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;explainer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shap_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;feature_contributions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shap_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;top_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_contributions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;top_features&lt;/span&gt;

&lt;span class="c1"&gt;# LangChain agent setup&lt;/span&gt;
&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create_openai_functions_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Explain why this address is suspicious."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage:&lt;/span&gt;
&lt;span class="n"&gt;explanation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s2"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Explain why this address is flagged as a scam."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"features"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;top_features&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
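
&lt;p&gt;For completeness, here is the prediction step that gates this explanation: load the pickled classifier and the fitted normaliser (the reason both artifacts were saved above), apply the same transform to the recipient's feature row, and only run SHAP when the classifier flags the address. This is a minimal sketch; the &lt;code&gt;check_recipient&lt;/code&gt; helper and the single-row DataFrame shape are illustrative rather than the repository's exact code:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import pickle
import pandas as pd

def check_recipient(features: pd.DataFrame):
    """Return (is_scam, top_features) for one recipient's feature row."""
    with open('scam_model.pkl', 'rb') as f:
        model = pickle.load(f)
    with open('scam_normalizer.pkl', 'rb') as f:
        norm = pickle.load(f)
    # apply the PowerTransformer fitted during training, keeping column names for SHAP
    scaled = pd.DataFrame(norm.transform(features), columns=features.columns)
    is_scam = bool(model.predict(scaled)[0])
    # only run the SHAP explanation when the address is actually flagged
    top_features = explain_scam(scaled, model) if is_scam else []
    return is_scam, top_features
&lt;/pre&gt;&lt;/div&gt;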

&lt;p&gt;The result? Before any ETH leaves your wallet, you get a clear, AI-generated warning if the recipient looks suspicious—plus a plain-English explanation of the red flags.
&lt;img alt="User Interface" src="https://samariwa.github.io/images/UI.png"&gt;
This seamless blend of blockchain, machine learning, and LLMs helps users make safer decisions, without ever needing to understand the technical details under the hood.&lt;/p&gt;
&lt;h3&gt;The Tech Stack&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Machine Learning&lt;/strong&gt;: I trained a fraud-detection model (an XGBoost classifier) on a dataset of real Ethereum transactions labelled with scam flags.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explainability&lt;/strong&gt;: I used &lt;strong&gt;SHAP (SHapley Additive exPlanations)&lt;/strong&gt; to make the model decisions interpretable. Users can understand &lt;em&gt;why&lt;/em&gt; a transaction was flagged.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LLM Integration&lt;/strong&gt;: The system integrates with OpenAI's GPT-4o-mini via LangChain, transforming ML insights into human-readable alerts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frontend&lt;/strong&gt;: Built with &lt;strong&gt;React.js&lt;/strong&gt;, providing users with an intuitive interface and dynamic transaction alerts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backend&lt;/strong&gt;: A &lt;strong&gt;Flask&lt;/strong&gt; server connects the React frontend, the ML model, and the Ethereum APIs (a minimal endpoint sketch follows this list).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Blockchain&lt;/strong&gt;: I used Ethereum testnets to simulate transactions and trigger wallet audits.&lt;/li&gt;
&lt;/ul&gt;
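&lt;p&gt;To show how these pieces connect, here is a hedged sketch of the kind of Flask endpoint the React frontend could call before a transfer. The route name, request shape, and the &lt;code&gt;build_feature_row&lt;/code&gt; helper (which would query the Ethereum APIs for the address's transaction statistics) are assumptions for illustration, not the repository's exact API; &lt;code&gt;check_recipient&lt;/code&gt; and &lt;code&gt;executor&lt;/code&gt; are the pieces sketched earlier in this post:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/api/audit', methods=['POST'])
def audit_address():
    # illustrative request body: {"address": "0x..."}
    address = request.get_json()['address']
    # hypothetical helper: fetch on-chain stats for the address and build a one-row DataFrame
    features = build_feature_row(address)
    is_scam, top_features = check_recipient(features)
    explanation = None
    if is_scam:
        # turn the top SHAP contributions into a plain-English warning via the LLM agent
        explanation = executor.invoke({
            "input": "Explain why this address is flagged as a scam.",
            "features": top_features,
        })["output"]
    return jsonify({"address": address, "flagged": is_scam, "explanation": explanation})
&lt;/pre&gt;&lt;/div&gt;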
&lt;h3&gt;Project Repo&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/samariwa/etherium-wallet-auditing"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;In the News&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://www.fintechwest.co.uk/news/exploring-the-future-of-blockchain-bournemouth-university-hackathon-week-2025"&gt;FinTechWest: Exploring the Future of Blockchain — Bournemouth University Hackathon Week 2025&lt;/a&gt;&lt;/p&gt;</description><guid>https://samariwa.github.io/posts/building-suspethious-ai-agent-wallet-security/</guid><pubDate>Tue, 17 Jun 2025 14:22:07 GMT</pubDate></item></channel></rss>