Motivation
In many parts of the world, access to reliable medical information can mean the difference between life and death. Yet billions of people face challenges like unreliable internet connections, remote locations without connectivity, or situations where privacy concerns prevent sharing medical information online. What if we could put a comprehensive medical assistant directly in everyone’s pocket, working anytime, anywhere, completely offline?
This vision is now possible thanks to Gemma 3n, Google’s mobile-first AI model engineered to run efficiently on everyday devices with a small memory footprint. However, even advanced small language models face a fundamental challenge: they can’t contain all specialized, up-to-date medical knowledge within their weights. In healthcare, accuracy isn’t just important; it’s critical.
That’s why we developed LOMA, the Local Offline Medical Assistant. LOMA combines Gemma 3n’s efficient on-device reasoning with an advanced Retrieval-Augmented Generation (RAG) system containing over 5 million medical questions and answers, running entirely on your device.
Executive Summary
LOMA (Local Offline Medical Assistant) solves a hard problem: creating a medical AI assistant that works seamlessly across iOS and Android while maintaining medical accuracy. We combined Google’s Gemma 3n language model with a retrieval system that searches 5 million medical documents in real time. The result is a mobile medical assistant that runs entirely offline on the device, always citing its sources.
Architecture & Backend Strategy
LOMA’s architecture centers on privacy-first mobile AI processing using React Native for cross-platform compatibility. The system runs entirely on-device, ensuring privacy for sensitive medical conversations.
On-Device Processing
LOMA uses llama.rn for local inference with a 4.79GB GGUF model that downloads directly to the device. We chose llama.rn over alternatives such as react-native-transformers (which is no longer actively maintained) because we needed a library that works on both iOS and Android. GPU acceleration offloads up to 99 layers to the device's graphics chip, keeping response times under one minute and allowing the system to run on a broad range of devices, with no data ever sent to external servers.
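For illustration, here is a minimal sketch of how the model could be loaded and queried through llama.rn; the file path, context size, and generation parameters are assumptions rather than LOMA's exact configuration.

```typescript
// Minimal sketch: loading the GGUF model through llama.rn and offloading layers to the GPU.
// The model path, context size, and generation parameters below are illustrative assumptions.
import { initLlama } from 'llama.rn';

export async function loadMedicalModel(modelPath: string) {
  const context = await initLlama({
    model: modelPath,   // e.g. the 4.79GB Gemma 3n GGUF file downloaded to the device
    n_ctx: 4096,        // room for the prompt plus retrieved medical passages
    n_gpu_layers: 99,   // push up to 99 layers to the graphics chip when available
  });
  return context;
}

export async function answerQuestion(
  context: Awaited<ReturnType<typeof loadMedicalModel>>,
  prompt: string
): Promise<string> {
  const result = await context.completion(
    {
      messages: [{ role: 'user', content: prompt }],
      n_predict: 512,   // cap output length to keep end-to-end latency under a minute
    },
    (data) => {
      // Streaming callback: tokens arrive incrementally, useful for progress feedback
      console.log(data.token);
    }
  );
  return result.text;
}
```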
Gemma 3n Integration
Integrating Google’s Gemma 3n into a medical application required solving several challenges:
Conversation Format: Gemma 3n requires strict alternating user-assistant patterns. We built a message processing pipeline that reorganizes conversations to maintain this structure, ensuring consistent response quality even with irregular user interactions.
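A minimal sketch of one way to enforce that structure, assuming the simplest policy of merging consecutive same-role messages (LOMA's actual pipeline may apply additional rules):

```typescript
// Sketch of one way to enforce Gemma 3n's strict user/assistant alternation:
// consecutive messages from the same role are merged, and the history must open
// with a user turn.
type Role = 'user' | 'assistant';

interface ChatMessage {
  role: Role;
  content: string;
}

function normalizeConversation(messages: ChatMessage[]): ChatMessage[] {
  const normalized: ChatMessage[] = [];
  for (const message of messages) {
    const last = normalized[normalized.length - 1];
    if (last && last.role === message.role) {
      // Merge back-to-back same-role messages instead of breaking the alternation
      last.content += '\n\n' + message.content;
    } else {
      normalized.push({ ...message });
    }
  }
  // Drop a leading assistant turn so the conversation starts with the user
  if (normalized.length > 0 && normalized[0].role === 'assistant') {
    normalized.shift();
  }
  return normalized;
}
```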
The integration with llama.rn required converting Gemma 3n to GGUF format and optimizing it for on-device inference. The 4.79GB model runs efficiently on mobile hardware, leveraging GPU acceleration where available to achieve response times under one minute on a wide range of devices.
RAG & Vector Database
LOMA bridges AI intelligence with verified medical knowledge through Retrieval-Augmented Generation (RAG). Since medical information changes rapidly and every response needs traceable sources, we couldn’t rely solely on Gemma 3n’s training data.
We’ve implemented a doc2query system that generates question embeddings for each document in our knowledge base. This approach significantly enhances retrieval accuracy by allowing the system to understand not just the content of medical documents but the types of questions they can answer. When you query LOMA, it matches your question against these pre-generated question embeddings, finding the most relevant information with remarkable precision.
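As a sketch of the idea (not LOMA's actual code), the offline indexing step could look roughly like this, where generateQuestions() and embedText() are hypothetical stand-ins for the doc2query model and the on-device embedding model:

```typescript
// Sketch of the doc2query indexing step: each document is expanded offline into the
// questions it can answer, and those questions are embedded and stored with the
// document id. generateQuestions() and embedText() are hypothetical stand-ins.
declare function generateQuestions(documentText: string): Promise<string[]>;
declare function embedText(text: string): Promise<number[]>; // 384-dimensional vector

interface QuestionRow {
  docId: string;
  question: string;
  embedding: number[];
}

async function buildQuestionIndex(docId: string, documentText: string): Promise<QuestionRow[]> {
  const questions = await generateQuestions(documentText); // e.g. "What are the symptoms of ...?"
  const rows: QuestionRow[] = [];
  for (const question of questions) {
    rows.push({ docId, question, embedding: await embedText(question) });
  }
  return rows; // later bulk-inserted into the vector database for query-time matching
}
```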
The Search System
We implemented semantic search using the all-MiniLM-L6-v2 embedding model via ExecuTorch, chosen for its compact size and 384-dimensional vectors. We initially considered MedEmbed-small-v0.1, a medical-specific embedding model, but it isn’t supported by ExecuTorch, so we used the general-purpose model instead. The system generates embeddings in 53-78ms on mobile devices using around 150-190MB of memory.
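A sketch of the query-time embedding step under the same assumptions; embedText() again stands in for the ExecuTorch-backed model, and the L2 normalization keeps cosine comparisons consistent downstream:

```typescript
// Sketch of the query-time embedding step. embedText() is a hypothetical wrapper
// around the ExecuTorch-backed all-MiniLM-L6-v2 model.
declare function embedText(text: string): Promise<number[]>; // 384-dimensional vector

function l2Normalize(vector: number[]): number[] {
  const norm = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  return vector.map((v) => v / norm);
}

async function embedQuery(query: string): Promise<number[]> {
  const start = Date.now();
  const raw = await embedText(query); // measured at roughly 53-78ms on-device
  const embedding = l2Normalize(raw);
  console.log(`query embedded in ${Date.now() - start}ms`);
  return embedding;
}
```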
Database Architecture
We use Turso as the only free vector database we found viable for React Native. Turso does offer a built-in offline mode with sync, but it’s a paid feature, so we host our database on Cloudflare R2 and download it to the device instead; not ideal, but practically free.
Turso supports vector_top_k indexing backed by disk-based ANN (Approximate Nearest Neighbor) algorithms, which store vector indexes on disk rather than in memory and enable fast similarity searches even over large datasets. However, indexing roughly triples the database size; ours would grow to ~100GB, so we chose unindexed cosine-distance search instead. In a benchmark on 50k vectors, indexed vector_top_k (f8) answered in 11ms but required 1.09GB of storage, while unindexed cosine distance (f32) took 94ms at 250MB. For our 5 million documents, a full search takes ~45 seconds, which we consider acceptable given the storage savings.
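A sketch of what the unindexed search could look like; db.execute(), the table, and the column names are illustrative assumptions, while vector32() and vector_distance_cos() are libSQL's built-in vector functions:

```typescript
// Sketch of the unindexed cosine-distance search against the local libSQL/Turso file.
// db.execute() stands in for whichever libSQL client the app uses; table and column
// names are illustrative assumptions.
declare const db: {
  execute(sql: string, args: unknown[]): Promise<{ rows: { sourceId: string; text: string }[] }>;
};

async function searchMedicalDocuments(queryEmbedding: number[], k = 10) {
  const result = await db.execute(
    `SELECT doc_id AS sourceId,
            content AS text,
            vector_distance_cos(embedding, vector32(?)) AS distance
       FROM medical_chunks
      ORDER BY distance ASC
      LIMIT ?`,
    [JSON.stringify(queryEmbedding), k] // vector32() parses the JSON array into an F32 blob
  );
  return result.rows; // smallest cosine distance = closest match
}
```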
Complete Workflow
When you ask a medical question: your query is converted into an embedding → the system searches documents and Q&A pairs simultaneously → results are ranked by relevance → the most relevant information is assembled into a structured context with citations → the enhanced prompt goes to Gemma 3n → the response returns with traceable medical sources.
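Put together, the workflow could be expressed roughly like this, using simplified versions of the hypothetical helpers from the earlier sketches (declared here so the snippet stands alone):

```typescript
// Sketch of the end-to-end flow described above; helper signatures are simplified
// assumptions, not LOMA's actual interfaces.
declare function embedQuery(question: string): Promise<number[]>;
declare function searchMedicalDocuments(embedding: number[]): Promise<{ text: string; sourceId: string }[]>;
declare function generateAnswer(prompt: string): Promise<string>; // wraps the llama.rn completion call

async function askLoma(question: string): Promise<string> {
  const queryEmbedding = await embedQuery(question);             // query -> embedding
  const passages = await searchMedicalDocuments(queryEmbedding); // search + rank by relevance
  const context = passages
    .map((p, i) => `[${i + 1}] ${p.text} (source: ${p.sourceId})`) // structured context with citations
    .join('\n');
  const prompt =
    `Answer using only the numbered sources below and cite them.\n\n${context}\n\nQuestion: ${question}`;
  return generateAnswer(prompt);                                 // enhanced prompt -> Gemma 3n
}
```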
Cross-Platform Engineering
Building one application across iOS and Android required solving several challenges:
File System Differences: iOS and Android have different storage limitations and permissions. Our abstraction layer handles these differences transparently with fallback mechanisms.
Memory Management: Running AI models on mobile devices required lazy loading, careful cleanup, and queue-based processing to handle concurrent requests without crashes (a simplified sketch of the queueing idea follows this list).
Performance Optimization: We implemented lazy loading throughout the system, progress feedback, retry mechanisms, and model caching. Each platform gets optimized generation parameters while maintaining consistent experiences.
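As a simplified illustration of the queue-based processing mentioned under Memory Management (a sketch under assumptions, not LOMA's implementation):

```typescript
// Simplified sketch of the queueing idea: concurrent requests are chained so only one
// inference runs at a time, protecting memory on low-end devices.
class InferenceQueue {
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Each task starts only after the previous one settles, successfully or not
    const run = this.tail.then(task, task);
    this.tail = run.catch(() => undefined); // keep the chain alive after failures
    return run;
  }
}

// Usage: queue.enqueue(() => generateAnswer(prompt)) serializes model calls from the UI
const queue = new InferenceQueue();
```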
Results & Impact
Performance Metrics
| Metric | Value | Notes |
|---|---|---|
| Response Time | Up to 1 minute | Enables broader device compatibility |
| Model Size | 4.79GB | GGUF format with Q8_0 quantization |
| Embedding Memory | 150-190MB | all-MiniLM-L6-v2 via ExecuTorch |
| Document Search | ~45 seconds | Full unindexed scan of 5 million medical documents |
| Vector Search | 94ms | Unindexed cosine similarity (50k-vector benchmark, f32) |
| Database Size | 250MB | Unindexed f32 (50k-vector benchmark) vs 1.09GB indexed f8 |
Key Achievements
- RAG-enhanced responses with traceable medical source citations
- Privacy-first architecture enabling on-device inference for medical questions
- Offline functionality with 5 million medical documents