* Decode tok/s, versus a (cluster of) H100 GPUs with 8-bit quantisation and TensorRT-LLM, on Llama2 70B
This means running the biggest LLMs in the world faster than you can read, and a universe of completely new capabilities and ways of working unlocked by near-instant inference of models with superhuman intelligence.
When a trained language model is run for a user, over 99% of the total compute time is spent not on arithmetic but on moving model weights from memory to the processor.
* Llama2 70B with 8-bit quantisation, on an 80GB A100 GPU
The time taken to run the arithmetic for generating a single word
The time taken moving parameters from memory to the processor for each word
The total time a Fractile processor takes to generate the same word
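The memory-bandwidth claim above can be checked with back-of-envelope roofline arithmetic. The sketch below uses publicly quoted A100 80GB figures (~2 TB/s HBM bandwidth, ~624 TOPS dense INT8 tensor throughput) and the common ~2 operations-per-parameter-per-token approximation for decode; these are illustrative assumptions, not measured results.

```python
# Back-of-envelope roofline estimate: single-batch decode of Llama2 70B
# at 8-bit quantisation on one A100 80GB GPU (illustrative figures).

PARAMS = 70e9            # model parameters
BYTES_PER_PARAM = 1      # 8-bit quantisation: one byte per weight
MEM_BW = 2.0e12          # A100 80GB HBM bandwidth, ~2 TB/s
PEAK_OPS = 624e12        # A100 dense INT8 tensor throughput, ops/s

weight_bytes = PARAMS * BYTES_PER_PARAM
ops_per_token = 2 * PARAMS            # ~2 ops per parameter per token

t_mem = weight_bytes / MEM_BW         # time to stream every weight once
t_math = ops_per_token / PEAK_OPS     # time for the arithmetic alone

print(f"moving weights: {t_mem * 1e3:6.2f} ms per token")
print(f"arithmetic:     {t_math * 1e3:6.2f} ms per token")
print(f"memory share:   {t_mem / (t_mem + t_math):.1%}")
```

Under these assumptions, streaming 70 GB of weights takes roughly 35 ms per token while the arithmetic itself takes a fraction of a millisecond, so well over 99% of each decode step is spent on data movement rather than computation.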
We are a team of scientists, engineers and hardware designers who are committed to building the solutions that the AI revolution requires to keep scaling. We believe that the most important breakthroughs will come from trying solutions that others are not, to serious problems we actually face.