iamrahulreddy/readme.md

Muskula Rahul

Profile Views Blog LinkedIn Hugging Face

I build LLM systems that sit close to the metal: MoE architectures, attention kernels, speculative decoding. The kind of work where a misaligned memory access costs you a day.

Note

The workflow: read the paper, implement it, fix what broke. The CUDA race conditions weren't in the abstract.

Projects

Keiro: Sparse MoE on Qwen2.5-3B

Retrofitted Sparse Mixture-of-Experts into Qwen2.5-3B. A Top-2 router activates 2 of 8 LoRA experts per transformer block, leaving the frozen FFN untouched and routing through Rank-16 adapters instead. Active compute stays identical to the dense baseline. The model adds 19.46M trainable parameters (0.63% of total) and retains 95.4% of GSM8K performance.
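The routing step described above can be sketched in plain Python. This is a minimal illustration and not Keiro's code: the function names `top2_route` and `moe_output` are hypothetical, and both the renormalisation of the two selected weights and the way adapter outputs are added to the frozen FFN output are assumptions.

```python
import math


def top2_route(logits):
    """Pick the two highest-scoring experts and renormalise their weights.

    Assumption: weights come from a softmax over all expert logits and are
    rescaled so the two selected weights sum to 1.
    """
    if len(logits) < 2:
        raise ValueError("need at least two experts")
    # Numerically stable softmax over the expert logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the two largest probabilities (ties keep the lower index).
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[top[0]] + probs[top[1]]
    return [(i, probs[i] / norm) for i in top]


def moe_output(frozen_out, expert_outs, routes):
    """Add the weighted outputs of the selected experts to the frozen output.

    Assumption: each expert contributes an additive delta; all values are
    plain lists of floats standing in for one token's hidden vector.
    """
    out = list(frozen_out)
    for idx, weight in routes:
        for d, value in enumerate(expert_outs[idx]):
            out[d] += weight * value
    return out
```

With 8 logits, `top2_route` returns two `(expert_index, weight)` pairs, so only 2 of the 8 adapters are evaluated for that token.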

What actually needed fixing:
├── CUDA race condition in index_add_ with duplicate Top-K indices
├── BFloat16 cumsum upcast mismatch in the coalesce path
└── 4.7× autoregressive inference bottleneck, resolved by bypassing
    capacity buffers during single-token generation
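As a generic illustration of why duplicate indices are hazardous in scatter-style updates, the sketch below contrasts a read-modify-write that works from a stale snapshot with a true accumulation. It does not reproduce the actual bug or its fix, which are not described here. Both function names are hypothetical, and the stale snapshot is only a sequential stand-in for concurrent writers.

```python
def scatter_stale(size, indices, values):
    """Simulate concurrent writers that all read the original buffer.

    Every update computes snapshot[i] + v, so when an index appears twice
    the later write clobbers the earlier one instead of adding to it.
    """
    out = [0.0] * size
    snapshot = list(out)  # what each simulated writer read before writing
    for i, v in zip(indices, values):
        out[i] = snapshot[i] + v
    return out


def scatter_accumulate(size, indices, values):
    """Accumulate every contribution, including those at repeated indices."""
    out = [0.0] * size
    for i, v in zip(indices, values):
        out[i] += v
    return out


indices = [0, 2, 2, 1]
values = [1.0, 2.0, 3.0, 4.0]
stale = scatter_stale(3, indices, values)        # index 2 keeps only 3.0
exact = scatter_accumulate(3, indices, values)   # index 2 holds 2.0 + 3.0
```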

lm-evaluation-harness results vs. base model:
├── HellaSwag     −0.13%
├── ARC-Challenge −0.17%
└── GSM8K         −3.19%

Prolepsis: Speculative Decoding

A Qwen 1.7B draft model generates candidate tokens; a Qwen 8B target verifies them in a single parallel pass. A rejection sampling pipeline ensures the output distribution is mathematically identical to running the target model alone.

| Metric | Result |
| --- | --- |
| Speedup on A100 | 1.30× |
| Acceptance Rate | ~56.5% across mixed-domain prompts |
| Output Distribution | Identical to target |
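The "identical to target" property follows from the standard speculative sampling rule: accept a drafted token with probability min(1, p/q), otherwise resample from the normalised positive part of p − q. The sketch below shows that rule for a single token position. It is not Prolepsis code; the function names are hypothetical, and `p` and `q` are assumed to be the target and draft probability vectors over the same vocabulary.

```python
import random


def residual_distribution(p, q):
    """Normalised max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        return list(p)  # p and q coincide, so fall back to the target
    return [r / total for r in raw]


def verify_token(token, p, q, rng):
    """Accept or replace one drafted token; returns (accepted, token).

    `token` must have been sampled from q, so q[token] > 0.
    """
    accept_prob = min(1.0, p[token] / q[token])
    if rng.random() < accept_prob:
        return True, token
    residual = residual_distribution(p, q)
    replacement = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, replacement


def output_distribution(p, q):
    """Exact distribution of the token emitted by draft-then-verify."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accepted)
    residual = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, residual)]
```

For any valid `p` and `q`, `output_distribution(p, q)` matches `p` up to floating-point error, which is the sense in which the emitted tokens follow the target distribution.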

FlashTile: Flash Attention V1/V2

Implements block-wise tiling, online softmax, and recomputation-based backward passes to cut attention storage from O(N²) to O(N). Covers GQA and MQA variants, with a forward-only Triton kernel included for benchmarking.
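The online softmax idea can be shown for a single query row in plain Python: keys and values are consumed one block at a time while a running maximum and denominator are rescaled, so the full score row is never stored. This is a minimal sketch and not FlashTile's implementation; the function names are hypothetical, and the 1/sqrt(d) score scaling is the usual convention, assumed here.

```python
import math


def attention_row_naive(q, keys, values):
    """Reference: materialise all scores, then softmax and weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]


def attention_row_tiled(q, keys, values, block_size):
    """Same result, visiting keys/values in blocks with online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

Per row, the tiled version keeps only `running_max`, `denom`, and `acc`, which is where the O(N) storage comes from.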

Substrata9: Linux Introspection Toolkit

Pure Bash. No compilation, no dependencies. Reads /proc to surface memory maps, file descriptors, process hierarchies, and runtime anomalies. Outputs JSON, so it slots into observability, debugging, and forensics pipelines without modification.
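To show the kind of /proc-to-JSON transformation involved, here is a small Python sketch that parses one line of `/proc/<pid>/maps`. Substrata9 itself is Bash, so this is an illustration in a different language rather than its code. The function name and the JSON field names are hypothetical; the column layout (address range, permissions, offset, device, inode, optional path) is the documented format of the maps file.

```python
import json


def parse_maps_line(line):
    """Turn one /proc/<pid>/maps line into a JSON-serialisable dict."""
    parts = line.split(None, 5)
    if len(parts) < 5:
        raise ValueError("not a maps line: %r" % line)
    addr, perms, offset, dev, inode = parts[:5]
    # The path column is absent for anonymous mappings.
    path = parts[5].strip() if len(parts) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return {
        "start": start,
        "end": end,
        "size": end - start,
        "perms": perms,
        "offset": int(offset, 16),
        "dev": dev,
        "inode": int(inode),
        "path": path,
    }


sample = "00400000-00452000 r-xp 00000000 08:02 173521      /usr/bin/dbus-daemon"
record = parse_maps_line(sample)
as_json = json.dumps(record)
```

On a Linux host the same function can be applied to each line of `/proc/self/maps`.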

Mission Cipher: GraphRAG App

Combines cosine-similarity search over semantic embeddings with a live knowledge graph (NetworkX) to answer questions with richer contextual grounding than plain RAG. Deployed on GCE behind NGINX, with Flask and Gunicorn communicating over a Unix socket.
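The retrieval idea, ranking nodes by cosine similarity and then pulling in their graph neighbours, can be sketched as follows. This is not the app's code: a plain adjacency dict stands in for the NetworkX graph, the embeddings are made-up two-dimensional vectors, and the names `cosine` and `retrieve` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, embeddings, graph, top_k=1):
    """Return the top_k most similar nodes followed by their graph neighbours."""
    ranked = sorted(
        embeddings,
        key=lambda node: cosine(query_vec, embeddings[node]),
        reverse=True,
    )
    seeds = ranked[:top_k]
    context = list(seeds)
    for seed in seeds:
        for neighbour in graph.get(seed, []):
            if neighbour not in context:
                context.append(neighbour)
    return context


# Hypothetical toy data: node -> embedding, node -> adjacent nodes.
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
graph = {"a": ["b"], "b": ["a"]}
context = retrieve([1.0, 0.1], embeddings, graph, top_k=1)
```

Here node `b` scores lowest on similarity alone but still reaches the context through its edge to `a`, which is the extra grounding the graph adds over plain vector search.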

Writing

Tip

neuralnets.dev: LLM architecture, inference, GPU programming, and occasionally the math underneath all of it. The goal is precision over vibe: the writeups get into what the papers skip and what the code alone won't tell you.

Pinned

  1. Prolepsis Prolepsis Public

    Prolepsis is a speculative decoding implementation that accelerates LLM inference by 1.30x on an A100. By pairing a small draft model (Qwen 1.7B) with a larger target (Qwen 8B), it shifts generatio…

    Python 1

  2. FlashTile FlashTile Public

    Reference Flash Attention implementation in PyTorch with V1/V2, GQA/MQA, Triton kernels, benchmark and docs.

    Python 1

  3. Substrata9 Substrata9 Public

    Deep process visibility for Linux: inspect memory, file descriptors, and process hierarchies via /proc

    Shell 3 2

  4. cipher cipher Public

    This Graph RAG Application is a web-based tool that allows users to ask questions about the Mission Impossible film franchise and receive detailed, contextually relevant answers. By combining retri…

    JavaScript 4

  5. graphil graphil Public

    This repository powers graphil.neuralnets.dev, a platform I designed to democratize technical education through sophisticated, interactive visualizations.

    HTML 3