Relational Time Engine (RTE): runtime density regulation for compute-efficient transformer inference. Demonstrates up to 75% layer reduction with improved latency and throughput.
Frozen KV Context for Mixture-of-Recursions on a Modernized BERT
End-to-end model compression pipeline using architecture reduction, knowledge distillation, pruning, and INT8 quantization. Achieves 83.43% accuracy, 3.31 ms latency, and 0.220 MB size, optimized for efficient edge inference.
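One stage of such a compression pipeline is unstructured magnitude pruning. A minimal NumPy sketch (function name and shapes are illustrative, not taken from the repo):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights,
    the unstructured pruning step of a typical compression pipeline."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))
w_pruned = magnitude_prune(w, 0.5)   # at least half the entries are now zero
```

In practice the pruned model is then fine-tuned (or distilled) to recover accuracy before quantization.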
Task-Aware Dynamic Model Optimization for Multi-Task Learning (IEEE Access 2023)
This repo explains quantization: the process of reducing the precision of a model’s parameters and/or activations (e.g., from 32-bit floating point to 8-bit integers) to make neural networks smaller, faster, and more energy-efficient with minimal accuracy loss.
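The core of that precision reduction can be sketched in a few lines of NumPy — affine (asymmetric) uint8 quantization with a scale and zero-point, the scheme most 8-bit post-training tools use. Function names here are illustrative:

```python
import numpy as np

def quantize_uint8(x: np.ndarray):
    """Map the range [x.min(), x.max()] onto the integers [0, 255]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0            # avoid div-by-zero for constant tensors
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximate float tensor from the uint8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s, zp = quantize_uint8(w)
max_err = np.abs(w - dequantize(q, s, zp)).max()   # bounded by about one step `s`
```

Storage drops 4x (float32 to uint8), and integer kernels can exploit the low-precision representation directly at inference time.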
Symbolic Transformers: 2.2MB models for logical reasoning. Achieves 47% accuracy with 566K parameters—220× smaller than GPT-2. Proves data quality > model size for symbolic AI. 🔬 Novel base-625 symbolic encoding | 🚀 Edge-deployable | 📊 Open research
Code for paper "Dynamic Deep Neural Network Inference via Adaptive Channel Skipping"
Transformer (GPT) implemented from scratch in C++. Runs on modest hardware with complete mathematical derivations and optimized tensor operations.
A non-Transformer hierarchical recurrent network with differentiable Gumbel-Softmax routing and bounded memory slots. Runs 7B+ parameter models layer-by-layer on low-budget GPUs.
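The differentiable routing mentioned above typically relies on the Gumbel-Softmax relaxation: add Gumbel noise to the routing logits, then apply a temperature-controlled softmax so routing stays differentiable while approaching a hard one-hot choice as the temperature drops. A minimal sketch (not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Differentiable relaxation of sampling from a categorical router.

    As tau -> 0 the output approaches a one-hot routing decision;
    larger tau gives a softer, easier-to-train mixture.
    """
    # Gumbel(0, 1) noise via inverse transform sampling
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

route = gumbel_softmax(np.array([2.0, 0.5, 0.1]), tau=0.5)  # weights over 3 memory slots
```

During training the soft weights flow gradients to the router; at inference one can take the argmax for a hard, cheap routing decision.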
An open and practical guide to Edge Language
🔬 Curiosity-Driven Quantized Mixture of Experts
MOCA-Net: Novel neural architecture with sparse MoE, external memory, and budget-aware computation. Real Stanford SST-2 integration, O(L) complexity, 96.40% accuracy. Built for efficient sequence modeling.
⚡ Fast, concise, LLM-first Generative UI language
"TRM (Tiny Recursive Model) integration architecture for Symbion.space ecosystem"
Mixture-of-Recursions on a Modernized BERT (Prototype)
QuantLab-8bit is a reproducible benchmark of 8-bit quantization on compact vision backbones. It includes FP32 baselines, PTQ (dynamic & static), QAT, ONNX exports, parity checks, ORT CPU latency, and visual diagnostics.
A deep learning framework that implements Early Exit strategies in Convolutional Neural Networks (CNNs) using Deep Q-Learning (DQN). This project enhances computational efficiency by dynamically determining the optimal exit point in a neural network for image classification tasks on CIFAR-10.
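The early-exit mechanic itself is simple; the repo's contribution is learning the exit policy with DQN. In this sketch a fixed confidence threshold stands in for that learned policy, and the stages/heads are toy stand-ins:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, stages, exit_heads, threshold=0.9):
    """Run network stages in order; after each, an exit head produces class
    probabilities. If the top probability clears `threshold`, stop early
    and skip the remaining (more expensive) stages."""
    for i, (stage, head) in enumerate(zip(stages, exit_heads)):
        x = stage(x)
        probs = softmax(head(x))
        if probs.max() >= threshold:
            return int(probs.argmax()), i        # exited at stage i
    return int(probs.argmax()), len(stages) - 1  # fell through to the final exit

# toy example: two "stages" and random linear exit heads over 10 classes
rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((10, 8)), rng.standard_normal((10, 8))
stages = [np.tanh, lambda x: np.tanh(2 * x)]
exit_heads = [lambda x: W1 @ x, lambda x: W2 @ x]
pred, exit_idx = early_exit_forward(rng.standard_normal(8), stages, exit_heads)
```

Replacing the threshold test with a Q-value comparison ("exit now" vs. "continue") recovers the DQN-controlled variant.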
An awesome list of papers, datasets, and tools for efficient sensor-based Human Activity Recognition (HAR), with a focus on lightweight and edge-friendly deep learning.
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead, without fine-tuning.
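One simple way to build an adaptive sparse mask is per-query top-k selection over the attention scores — keep only the strongest keys for each query and mask the rest. This is a generic sketch of that idea, not DAM's actual mask-generation rule:

```python
import numpy as np

def topk_sparse_mask(scores: np.ndarray, keep: int) -> np.ndarray:
    """For each query row, keep only the `keep` highest-scoring keys.

    The resulting boolean mask lets attention skip the masked entries,
    cutting the effective cost of the L x L score matrix.
    """
    L = scores.shape[0]
    mask = np.zeros_like(scores, dtype=bool)
    top = np.argsort(scores, axis=-1)[:, -keep:]   # top-k key indices per query
    mask[np.arange(L)[:, None], top] = True
    return mask

rng = np.random.default_rng(0)
scores = rng.standard_normal((16, 16))             # toy attention scores
mask = topk_sparse_mask(scores, keep=4)            # 75% of entries masked out
```

Applying the mask means setting masked scores to -inf before the softmax, so masked keys receive zero attention weight.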
🌟 Build efficient models with Transformer Hierarchical Layers for powerful text processing and enhanced performance in natural language tasks.