MobiRAG (Mobile Retrieval-Augmented Generation) is a lightweight, privacy-first Android app that lets users chat with any PDF stored on their phone, entirely offline. With on-device embedding generation, vector compression, and small language model (SLM) inference, MobiRAG brings AI-powered search and summarization directly to your pocket.
No internet, no cloud servers, and no telemetry — everything runs natively on your phone, ensuring complete data privacy and zero leakage. Whether you’re reviewing research papers, legal documents, or ebooks, MobiRAG offers a seamless way to search, ask questions, and summarize content using optimized RAG for mobile devices.
Demo video: `MobiRAG_finalcut_cropped.mp4` (🔺️ YT Video)
| Feature | Description |
|---|---|
| 🔐 100% On-Device | No cloud calls. No telemetry. Your data never leaves your phone. |
| 🧠 Embeddings via ONNX | Runs the `all-MiniLM-L6-v2` model for fast, good-quality sentence embeddings on the phone. |
| 📚 PDF Discovery & Parsing | Detects and processes all PDFs on device using PDFBox. |
| 🔎 Semantic Search with FAISS | PQ-compressed embeddings enable scalable vector search on-device. |
| 💬 SLM Chat with Context | An on-device small language model such as Qwen 0.5B generates answers grounded in PDF context. |
| 🔁 Hybrid RAG | Combines FAISS vector similarity with TF-IDF keyword overlap. |
| 🖼️ Lightweight UI | Responsive and optimized for phones with minimalistic design. |
- Extracts text from each page using PDFBox
- Splits into clean sentence-based units
- Combines sentences into chunks of ~1024 characters with a 1-sentence overlap (see the sketch after this list)
- Encodes all chunks with the ONNX `all-MiniLM-L6-v2` model
- Compresses embeddings using FAISS Product Quantization (PQ)
- Stores only metadata (PDF URI, page, offset), not the chunk text itself
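To make the chunking step concrete, here is a minimal Kotlin sketch, assuming the sentences of a page have already been extracted (e.g., with PDFBox). The names, the metadata fields, and the exact overlap handling are illustrative and not MobiRAG's actual implementation.

```kotlin
// Minimal sketch: group sentences into ~1024-character chunks with a
// 1-sentence overlap. Only lightweight metadata (URI, page, sentence offset)
// is kept alongside each chunk for later on-demand re-extraction.
data class ChunkMeta(val pdfUri: String, val page: Int, val sentenceOffset: Int)

fun chunkSentences(
    sentences: List<String>,
    pdfUri: String,
    page: Int,
    maxChars: Int = 1024                      // target chunk size from the list above
): List<Pair<String, ChunkMeta>> {
    val chunks = mutableListOf<Pair<String, ChunkMeta>>()
    var start = 0
    while (start < sentences.size) {
        val sb = StringBuilder()
        var end = start
        // Grow the chunk sentence by sentence until the character budget is hit.
        while (end < sentences.size &&
            (sb.isEmpty() || sb.length + sentences[end].length <= maxChars)
        ) {
            sb.append(sentences[end]).append(' ')
            end++
        }
        chunks += sb.toString().trim() to ChunkMeta(pdfUri, page, start)
        if (end >= sentences.size) break
        // Start the next chunk one sentence back (1-sentence overlap),
        // while guaranteeing forward progress.
        start = maxOf(end - 1, start + 1)
    }
    return chunks
}
```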
- The user query is embedded using the ONNX `all-MiniLM-L6-v2` model
- FAISS returns the top-k chunk IDs based on PQ-compressed vector similarity (nearest-neighbour search)
- The matching metadata is used to extract the chunk text on the fly
- TF-IDF keyword overlap further refines relevance (see the re-ranking sketch below)
- The combined context is used to prompt a local SLM (Qwen) to generate the answer
This reduces memory footprint and ensures that deleted PDFs cannot be queried later (privacy by design).
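As an illustration of the hybrid step, the sketch below re-ranks FAISS candidates with a simple keyword-overlap score and a weighted sum. The `Candidate` type, the tokenization, and the 0.7/0.3 weighting are assumptions for the sketch, not MobiRAG's actual scoring (the roadmap notes that this combination is still naive).

```kotlin
// Illustrative re-ranking of FAISS candidates using keyword overlap with the
// query; assumes chunk text has already been re-extracted via the metadata.
data class Candidate(val chunkId: Int, val text: String, val vectorScore: Float)

fun keywordOverlap(query: String, chunk: String): Float {
    val tokenize = { s: String -> s.lowercase().split(Regex("\\W+")).filter { it.isNotBlank() }.toSet() }
    val q = tokenize(query)
    if (q.isEmpty()) return 0f
    val c = tokenize(chunk)
    return q.count { it in c }.toFloat() / q.size    // fraction of query terms found in the chunk
}

fun rerank(query: String, candidates: List<Candidate>, alpha: Float = 0.7f): List<Candidate> =
    candidates.sortedByDescending { alpha * it.vectorScore + (1 - alpha) * keywordOverlap(query, it.text) }
```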
- ✨ Efficient Compression: FAISS PQ compresses the 384-dim vectors by 32x, letting the index for 1000 PDFs fit in ~2.4 MB; up to 97x compression is possible at some cost in retrieval quality (see the storage estimate after this list).
- ✨ FAISS Tradeoff: Slight drop in search quality from PQ vs flat index — but acceptable for mobile efficiency.
- ✨ Metadata-Only Storage: Chunks are not stored — only chunkID + PDF location + offset are. Text is extracted on-demand.
- ✨ Privacy by Design: If a PDF is deleted from storage, its chunks become inaccessible.
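For a rough sense of where the ~2.4 MB figure comes from, the back-of-the-envelope arithmetic looks like this; the chunks-per-PDF count is an assumed average used only to make the numbers concrete.

```kotlin
// Rough index-size estimate for the 32x PQ setting described above.
fun main() {
    val dim = 384                                   // all-MiniLM-L6-v2 embedding dimension
    val rawBytesPerVector = dim * 4                 // float32 vector = 1536 bytes
    val pqBytesPerVector = rawBytesPerVector / 32   // 32x compression -> 48-byte PQ code
    val pdfs = 1000
    val chunksPerPdf = 50                           // assumed average; varies with document length
    val totalBytes = pdfs * chunksPerPdf * pqBytesPerVector
    println("PQ codes: %.1f MB".format(totalBytes / (1024.0 * 1024.0)))   // ≈ 2.3 MB
}
```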
- Android 8.0+
- ARM64 device (>= 4 GB RAM recommended)
- LLM file: `qwen2.5-0.5b-instruct-q4_k_m.gguf`
- `git clone https://github.com/nishchaljs/MobiRAG.git`
- `cd MobiRAG`
- `git submodule update --init --recursive`
- Open the project in Android Studio
- Add your embedding ONNX model + tokenizer to `assets/all-minilm-l6-v2/`
- Place your `.gguf` LLM file inside `Android/data/com.mobirag/files/`
- **Refactor to MVC architecture**: Reorganize the app code into a clean Model-View-Controller separation for maintainability and testing
- **Improve SLM inference speed**: `llama.cpp` currently runs on the Android CPU and tends to be very slow for long prompts; explore alternatives such as MLC LLM to use the GPU for on-device inference
- **Improve hybrid RAG scoring**: Replace the naive combination of vector similarity + keyword overlap with a unified scoring function (e.g., z-score normalization, weighted distances); see the sketch after this list
- **Optimize FAISS PQ & SLM inference**: Experiment with different PQ training sizes, nprobe settings, and SLM decoding strategies for the best quality-performance tradeoff
- **Improve the system prompt**: Design a robust, guardrailed prompt template that guides the SLM to avoid hallucinations and respect query constraints
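As a rough illustration of the unified-scoring idea from the roadmap, the sketch below z-score normalizes each score family before combining them, so neither dominates purely because of its scale. The function names and the equal weighting are hypothetical, not a committed design.

```kotlin
import kotlin.math.sqrt

// Normalize a score list to zero mean / unit variance.
fun zScores(xs: List<Float>): List<Float> {
    val mean = xs.average().toFloat()
    val sd = sqrt(xs.map { (it - mean) * (it - mean) }.average()).toFloat()
    return xs.map { if (sd > 0f) (it - mean) / sd else 0f }
}

// Combine vector-similarity and keyword scores on a common scale.
// Assumes both lists describe the same candidates in the same order.
fun unifiedScores(vectorScores: List<Float>, keywordScores: List<Float>, wVec: Float = 0.5f): List<Float> {
    val zv = zScores(vectorScores)
    val zk = zScores(keywordScores)
    return zv.indices.map { wVec * zv[it] + (1 - wVec) * zk[it] }
}
```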
This project is licensed under the MIT License.