kekse1/llama3pure

llama3pure

Three inference engines for Llama 3: pure C for desktop systems, pure JavaScript for Node.js, and pure JavaScript for Web environments. Supports both Llama and Gemma architectures. Try the Web engine here.

Building the engine (macOS / Linux)

make llama3pure

Building the engine (Windows)

Use the x64 Native Tools Command Prompt for VS.

cl /O2 llama3pure-c-engine.c /Fe:llama3pure.exe

Running the engine (macOS / Linux / Windows)

# On macOS / Linux
./llama3pure -model Llama3.gguf -prompt "Tell me in 1 line what is Microsoft."
./llama3pure -model Llama3.gguf -chathistory chat.txt

# On Windows
llama3pure.exe -model Llama3.gguf -prompt "Tell me in 1 line what is Microsoft."
llama3pure.exe -model Llama3.gguf -chathistory chat.txt

| Argument | Required | Description | Default Value |
| --- | --- | --- | --- |
| -model | Yes | Path to a GGUF model file. | - |
| -prompt | No | Input prompt text (single-turn, alternative to -chathistory). | - |
| -chathistory | No | Path to a .txt file containing a JSON chat history (multi-turn, alternative to -prompt). | - |
| -system_prompt | No | System prompt prepended to every conversation. | You are a helpful assistant. |
| -max_tokens | No | Maximum number of tokens to generate per response. | -1 (unlimited) |
| -context_size | No | Context window size (capped by the model's own limit). | Model's max. |
| -temperature | No | Sampling temperature. Higher values produce more varied output. | 0.9 |
| -top_p | No | Nucleus sampling threshold. Only tokens whose cumulative probability reaches this value are considered. | 0.9 |
| -top_k | No | Top-K sampling. Only the K most probable tokens are considered at each step. | 40 |
| -debug | No | Show detailed model loading and performance logs (including tok/s). | disabled |
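
To make the interaction between the temperature, top-K and top-P settings more concrete, here is a minimal JavaScript sketch of one common way to combine them. It is not taken from the engine's source: the function name sampleToken, the order in which the filters are applied, the greedy behaviour at temperature 0, and the injectable rng parameter are all assumptions made for illustration.

```javascript
// Illustrative only: one common way to combine temperature, top-k and top-p.
// sampleToken and its options are hypothetical names, not the engine's API.
function sampleToken(logits, { temperature = 0.9, topK = 40, topP = 0.9, rng = Math.random } = {}) {
  // Assumption: a temperature of 0 (or less) means greedy decoding.
  if (temperature <= 0) {
    let best = 0
    for (let i = 1; i < logits.length; i++) if (logits[i] > logits[best]) best = i
    return best
  }

  // Softmax over temperature-scaled logits (subtract the max for numerical stability).
  const scaled = Array.from(logits, (l) => l / temperature)
  let maxLogit = -Infinity
  for (const l of scaled) if (l > maxLogit) maxLogit = l
  const exps = scaled.map((l) => Math.exp(l - maxLogit))
  const sum = exps.reduce((a, b) => a + b, 0)

  // Sort candidates by probability, highest first, and keep the K most probable.
  let candidates = exps
    .map((e, index) => ({ index, prob: e / sum }))
    .sort((a, b) => b.prob - a.prob)
    .slice(0, topK > 0 ? topK : exps.length)

  // Top-p: keep the shortest prefix whose cumulative probability reaches topP.
  let cumulative = 0
  let cutoff = candidates.length
  for (let i = 0; i < candidates.length; i++) {
    cumulative += candidates[i].prob
    if (cumulative >= topP) {
      cutoff = i + 1
      break
    }
  }
  candidates = candidates.slice(0, cutoff)

  // Draw from the surviving candidates, renormalised so their weights sum to 1.
  const total = candidates.reduce((a, c) => a + c.prob, 0)
  let r = rng() * total
  for (const c of candidates) {
    r -= c.prob
    if (r <= 0) return c.index
  }
  return candidates[candidates.length - 1].index
}
```

In this sketch, lowering topK or topP shrinks the candidate set before the random draw, while lowering temperature sharpens the distribution so that the most probable token wins more often.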

A sample chat history is available in tests.txt.
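
As a rough illustration of what a chat history file might contain, the sketch below serializes a conversation to JSON. It assumes the file holds the same array of { role, content } objects that the JavaScript engines accept as chatHistory; that is an assumption, and tests.txt in the repository remains the authoritative reference. The name serializeChatHistory is hypothetical.

```javascript
// Assumption: the -chathistory file contains the same array of
// { role, content } objects used by the JavaScript engines.
// Check tests.txt in the repository for the authoritative format.

// serializeChatHistory is a hypothetical helper name used for this example only.
function serializeChatHistory(history) {
  return JSON.stringify(history, null, 2)
}

const chatFileContents = serializeChatHistory([
  { role: "user", content: "Tell me in 1 line what is Microsoft." },
  {
    role: "assistant",
    content:
      "Microsoft is a global technology leader known for its innovative products and services.",
  },
  { role: "user", content: "Tell me in 1 line the names of the founders." },
])
```

Saving chatFileContents as chat.txt (for instance with fs.writeFileSync) would produce a file that can be passed with -chathistory chat.txt, provided the format assumption above holds.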

Running in Node.js

  • Step 1: Load a model

Read the GGUF file into an ArrayBuffer and pass it to llama3pure with type: "load".

import llama3pure from "./llama3pure-js-engine.js"
import fs from "fs"

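const readFileAsArrayBuffer = (filePath) => {
  const fd = fs.openSync(filePath, "r")
  try {
    const fileSize = fs.fstatSync(fd).size
    const arrayBuffer = new ArrayBuffer(fileSize)
    const fileUint8 = new Uint8Array(arrayBuffer)
    // Read in 256 MiB chunks so that no single read request is too large
    const chunkSize = 256 * 1024 * 1024
    let pos = 0
    while (pos < fileSize) {
      const toRead = Math.min(chunkSize, fileSize - pos)
      // readSync may return fewer bytes than requested, so advance by the actual count
      const bytesRead = fs.readSync(fd, fileUint8, pos, toRead, pos)
      if (bytesRead === 0) {
        throw new Error(`Unexpected end of file while reading ${filePath}`)
      }
      pos += bytesRead
    }
    return arrayBuffer
  } finally {
    // Close the descriptor even if a read fails
    fs.closeSync(fd)
  }
}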

llama3pure({
  type: "load",
  model: readFileAsArrayBuffer("/path/to/your-model.gguf"),
  cbRender: (token) => {
    process.stdout.write(token)
  },
  systemPrompt: "You are a helpful assistant.",
  maxTokens: 256,
  contextSize: 2048,
  temperature: 0.9,
  topP: 0.9,
  topK: 40,
})

| Parameter | Type | Required | Description | Default Value |
| --- | --- | --- | --- | --- |
| type | string | Yes | Must be load | - |
| model | ArrayBuffer | Yes | The GGUF model file contents. | - |
| cbRender | function | Yes | Callback invoked with each generated token as a string. | - |
| systemPrompt | string | No | System prompt prepended to every conversation. | You are a helpful assistant. |
| maxTokens | number | No | Maximum number of tokens to generate per response. | -1 (unlimited) |
| contextSize | number | No | Context window size (capped by the model's own limit). | Model's max. |
| temperature | number | No | Sampling temperature. Higher values produce more varied output. | 0.9 |
| topP | number | No | Nucleus sampling threshold. Only tokens whose cumulative probability reaches this value are considered. | 0.9 |
| topK | number | No | Top-K sampling. Only the K most probable tokens are considered at each step. | 40 |

  • Step 2: Generate a response

Call llama3pure with type: "generate" and a chatHistory array. The engine uses the cbRender callback provided during load to stream tokens. The last message in chatHistory should have role: "user" - that is the message the model will respond to. Previous messages provide conversation context, enabling multi-turn conversations.

llama3pure({
  type: "generate",
  chatHistory: [
    { role: "user", content: "Tell me in 1 line what is Microsoft." },
    {
      role: "assistant",
      content:
        "Microsoft is a global technology leader known for its innovative products and services.",
    },
    { role: "user", content: "Tell me in 1 line the names of the founders." },
  ],
})

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be generate. |
| chatHistory | array | Yes | Array of message objects representing the conversation. |

A full example is available in llama3pure-nodejs-demo.js.
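
Because the last message in chatHistory should come from the user, it can help to build each new turn through a small helper. The sketch below is one possible approach; assertReadyToGenerate and nextTurn are hypothetical names that only manipulate plain objects and never call the engine.

```javascript
// Hypothetical helpers for maintaining a multi-turn chatHistory array.
// They only manipulate plain objects and do not call the engine.
function assertReadyToGenerate(chatHistory) {
  if (!Array.isArray(chatHistory) || chatHistory.length === 0) {
    throw new Error("chatHistory must be a non-empty array")
  }
  const last = chatHistory[chatHistory.length - 1]
  if (last.role !== "user") {
    throw new Error('the last message must have role "user"')
  }
  return chatHistory
}

// Returns a new history with the assistant's reply and the next user message
// appended. The input array is left unchanged.
function nextTurn(chatHistory, assistantReply, userMessage) {
  return assertReadyToGenerate([
    ...chatHistory,
    { role: "assistant", content: assistantReply },
    { role: "user", content: userMessage },
  ])
}

const firstTurn = [{ role: "user", content: "Tell me in 1 line what is Microsoft." }]
const secondTurn = nextTurn(
  firstTurn,
  "Microsoft is a global technology leader known for its innovative products and services.",
  "Tell me in 1 line the names of the founders."
)
```

Here secondTurn has the same shape as the chatHistory array in the generate example above and could be passed to the next generate call.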

Running in Web Environments

  • Step 1: Load a model

Read the GGUF file as an ArrayBuffer and send it to the worker with type: "load". The ArrayBuffer is transferred (not copied) for performance.

const reader = new FileReader()

reader.onload = (event) => {
  const arrayBuffer = event.target.result
  worker.postMessage(
    {
      type: "load",
      model: arrayBuffer,
      systemPrompt: "You are a helpful assistant.",
      maxTokens: 256,
      contextSize: 2048,
      temperature: 0.9,
      topP: 0.9,
      topK: 40,
    },
    [arrayBuffer]
  )
}

reader.readAsArrayBuffer(file)

| Parameter | Type | Required | Description | Default Value |
| --- | --- | --- | --- | --- |
| type | string | Yes | Must be load | - |
| model | ArrayBuffer | Yes | The GGUF model file contents. | - |
| systemPrompt | string | No | System prompt prepended to every conversation. | You are a helpful assistant. |
| maxTokens | number | No | Maximum number of tokens to generate per response. | -1 (unlimited) |
| contextSize | number | No | Context window size (capped by the model's own limit). | Model's max. |
| temperature | number | No | Sampling temperature. Higher values produce more varied output. | 0.9 |
| topP | number | No | Nucleus sampling threshold. Only tokens whose cumulative probability reaches this value are considered. | 0.9 |
| topK | number | No | Top-K sampling. Only the K most probable tokens are considered at each step. | 40 |

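
One practical consequence of transferring the ArrayBuffer in Step 1 is that the main thread can no longer use it after postMessage returns. The sketch below shows the same effect without a Worker by using structuredClone with a transfer list, which is available in modern browsers and in Node.js 17 or later. The variable names are chosen for illustration.

```javascript
// Demonstrates transfer semantics without a Worker: structuredClone with a
// transfer list moves the buffer in the same way as postMessage(msg, [buffer]).
const originalBuffer = new ArrayBuffer(1024)
const loadMessage = { type: "load", model: originalBuffer }

const receivedMessage = structuredClone(loadMessage, { transfer: [originalBuffer] })

console.log(receivedMessage.model.byteLength) // 1024: the receiver now owns the data
console.log(originalBuffer.byteLength) // 0: the sender's buffer has been detached
```

If the page needs the model bytes again later (for example to reload the worker), it would have to read the file again or keep its own copy before transferring.
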
  • Step 2: Generate a response
worker.postMessage({
  type: "generate",
  chatHistory: [
    { role: "user", content: "Tell me in 1 line what is Microsoft." },
    {
      role: "assistant",
      content:
        "Microsoft is a global technology leader known for its innovative products and services.",
    },
    { role: "user", content: "Tell me in 1 line the names of the founders." },
  ],
})

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be generate. |
| chatHistory | array | Yes | Array of message objects representing the conversation. |

  • Step 3: Receiving messages from the Worker
worker.onmessage = function (e) {
  var data = e.data
  switch (data.type) {
    case "progress":
      // Fired during model loading
      break

    case "loaded":
      // Fired once the model is fully loaded and ready
      break

    case "token":
      // Fired for each generated token during inference
      console.log(data.token)
      break

    case "complete":
      // Fired when generation is finished
      console.log(data.output)
      break
  }
}

| Event | Fields | Description |
| --- | --- | --- |
| progress | - | Emitted during model loading to indicate progress. |
| loaded | - | Emitted once when the model has been fully loaded and is ready for inference. |
| token | token (string) | Emitted for each token as it is generated, enabling real-time streaming of the response. |
| complete | output (string) | Emitted when generation finishes. Contains the full generated text. |
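
The token and complete events lend themselves to a small Promise wrapper, so that a caller can stream tokens and await the final text. The following is a sketch under stated assumptions: generateOnce is a hypothetical helper, it temporarily replaces worker.onmessage, and it never rejects because the events listed above do not include an error event.

```javascript
// Hypothetical helper: wraps one "generate" request in a Promise.
// worker is anything with postMessage and an onmessage property, such as a
// Web Worker that emits the token and complete events described above.
function generateOnce(worker, chatHistory, onToken = () => {}) {
  return new Promise((resolve) => {
    const previousHandler = worker.onmessage
    worker.onmessage = (e) => {
      const data = e.data
      if (data.type === "token") {
        // Stream each token to the caller as it arrives
        onToken(data.token)
      } else if (data.type === "complete") {
        // Restore whatever handler was installed before this request
        worker.onmessage = previousHandler
        resolve(data.output)
      }
    }
    worker.postMessage({ type: "generate", chatHistory })
  })
}
```

A caller could then append each token to the page inside onToken and push the resolved text into chatHistory as the assistant message before asking the next question.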

Try the Web engine here or with custom maxTokens, contextSize, topP and topK here.

A standalone version is available here; it offers the same functionality as the standard version but uses a base64-embedded Worker, allowing you to run it as a local file without a web server.

Suggested Models and Engines

MODEL C NODE.JS WEB
Gemma-3-1B-it-Q8_0.gguf
Llama-3.2-1B-Instruct-Q8_0.gguf
Llama-3.2-3B-Instruct-Q8_0.gguf
Gemma-3-4b-it-Q8_0.gguf

Tested Models

MODEL C NODE.JS WEB
Gemma-3-270M-it-Q2_K_L.gguf
Gemma-3-270M-it-Q3_K_M.gguf
Gemma-3-270M-it-Q4_K_M.gguf
Gemma-3-270M-it-Q5_K_M.gguf
Gemma-3-270M-it-Q6_K.gguf
Gemma-3-270M-it-Q8_0.gguf
Gemma-3-270M-it-F16.gguf
Gemma-3-1B-it-Q2_K_L.gguf
Gemma-3-1B-it-Q3_K_M.gguf
Gemma-3-1B-it-Q4_K_M.gguf
Gemma-3-1B-it-Q5_K_M.gguf
Gemma-3-1B-it-Q6_K.gguf
Gemma-3-1B-it-Q8_0.gguf
Gemma-3-1B-it-BF16.gguf
Llama-3.2-1B-Instruct-Q3_K_L.gguf
Llama-3.2-1B-Instruct-Q4_K_L.gguf
Llama-3.2-1B-Instruct-Q5_K_L.gguf
Llama-3.2-1B-Instruct-Q6_K_L.gguf
Llama-3.2-1B-Instruct-Q8_0.gguf
Llama-3.2-1B-Instruct-f16.gguf
Llama-3.2-3B-Instruct-Q3_K_L.gguf
Llama-3.2-3B-Instruct-Q4_K_L.gguf
Llama-3.2-3B-Instruct-Q5_K_L.gguf
Llama-3.2-3B-Instruct-Q6_K_L.gguf
Llama-3.2-3B-Instruct-Q8_0.gguf
Llama-3.2-3B-Instruct-f16.gguf
Gemma-3-4b-it-Q2_K_L.gguf
Gemma-3-4b-it-Q3_K_M.gguf
Gemma-3-4b-it-Q4_K_M.gguf
Gemma-3-4b-it-Q5_K_M.gguf
Gemma-3-4b-it-Q6_K.gguf
Gemma-3-4b-it-Q8_0.gguf
Gemma-3-4b-it-BF16.gguf
Llama-3-8B-Instruct-Q2_K.gguf
Llama-3-8B-Instruct-Q3_K_M.gguf
Llama-3-8B-Instruct-Q4_K_M.gguf
Llama-3-8B-Instruct-Q5_K_M.gguf
Llama-3-8B-Instruct-Q6_K.gguf
Llama-3-8B-Instruct-Q8_0.gguf
Llama-3-8B-Instruct-fp16.gguf

Author's Notes

  • Using quantizations below Q4 is generally discouraged because the loss in logic and coherence makes them nearly unusable for most tasks.

  • Because browsers limit the size of an ArrayBuffer, the Web engine can only read GGUF files up to 2 GB.

  • There is no Python engine because a pure-Python port would be very slow. Using NumPy would defeat the purpose, since NumPy runs C under the hood and a C engine already exists.
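
Given the 2 GB limit noted above for the Web engine, a page could check the file size before reading a model. This is a minimal sketch: fitsWebEngine and WEB_ENGINE_MAX_BYTES are hypothetical names, and reading "2 GB" as 2 GiB is an assumption, since the exact limit depends on the browser.

```javascript
// Assumption: "2 GB" is read here as 2 GiB (2 * 1024^3 bytes). The real limit
// depends on the browser, so treat this as a rough pre-check only.
const WEB_ENGINE_MAX_BYTES = 2 * 1024 ** 3

// fitsWebEngine is a hypothetical helper; pass it file.size from a File object.
function fitsWebEngine(sizeInBytes) {
  return Number.isFinite(sizeInBytes) && sizeInBytes >= 0 && sizeInBytes <= WEB_ENGINE_MAX_BYTES
}
```

Calling fitsWebEngine(file.size) before reader.readAsArrayBuffer(file) would let the page show a clear message instead of failing partway through loading.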

Based on the work of

https://github.com/karpathy/llama2.c
