This package provides the command line interface and development kit for use with the chutes.ai platform.
The miner code is available here, and validator/API code here.
Before getting into the weeds, it might be useful to understand the terminology.
Images are simply docker images that all chutes (applications) will run on within the platform.
Images must meet a few requirements:
- Contain a CUDA installation, preferably version 12.2-12.6
- Contain clinfo, OpenCL dev libraries, clblast, openmi, etc.
- Contain a Python 3.10+ installation, where `python` and `pip` are on the executable `PATH`
We HIGHLY, HIGHLY recommend you start with our base image, `parachutes/python:3.12`, to avoid dependency hell.
A chute is essentially an application that runs on top of an image, within the platform. Think of a chute as a single FastAPI application.
A cord is a single function within the chute. In the FastAPI analogy, this would be a single route & method.
GraVal is the graphics card validation library used to help ensure the GPUs that miners claim to be running are authentic/correct. The library performs VRAM capacity checks, matrix multiplications seeded by device information, etc.
You don't really need to know anything about graval, except that it runs as middleware within the chute to decrypt traffic from the validator and perform additional validation steps (filesystem checks, device info challenges, pings, etc.)
Currently, to become a user on the chutes platform, you must have a Bittensor wallet and hotkey, as authentication is performed via Bittensor hotkey signatures. Once you are registered, you can create API keys that can be used with a simple "Authorization" header in your requests.
If you don't already have a wallet, you can create one by installing `bittensor<8`, e.g. `pip install 'bittensor<8'`
Note: you can use the newer bittensor-wallet package instead, but it requires rust, which is absurd.
Then, create a coldkey and hotkey according to the library you installed, e.g.:
btcli wallet new_coldkey --n_words 24 --wallet.name chutes-user
btcli wallet new_hotkey --wallet.name chutes-user --n_words 24 --wallet.hotkey chutes-user-hotkey
Once you have your hotkey, just run:
chutes register
Don't override CHUTES_API_URL unless you are developing chutes itself; otherwise, you can just stop here!
To use a development environment, simply set the `CHUTES_API_URL` environment variable to whatever your dev environment endpoint is, e.g.:
CHUTES_API_URL=https://api.chutes.dev chutes register
Once you've completed the registration process, you'll have a file in `~/.chutes/config.ini` which contains the configuration for using chutes.
You can create API keys, optionally limiting the scope of each key, with the `chutes keys` subcommand, e.g.:
Full admin access:
chutes keys create --name admin-key --admin
Access to images:
chutes keys create --name image-key --images
Access to a single chute:
chutes keys create --name foo-key --chute-ids 5eda1993-9f4b-5426-972c-61c33dbaf541
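Once created, a key can be passed directly in the Authorization header of your requests. Here is a minimal sketch, using httpx as an arbitrary HTTP client, the pricing endpoint referenced later in this doc as an example target, and `cpk_...` as a placeholder key:

```python
# Minimal sketch of authenticating with an API key via the Authorization
# header. The pricing endpoint (referenced later in this doc) is used purely
# as an example target, and cpk_... is a placeholder for a real key.
import httpx

response = httpx.get(
    "https://api.chutes.ai/pricing",
    headers={"Authorization": "cpk_..."},
)
print(response.status_code, response.json())
```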
As of 2025-10-02, the developer deposit is no longer required! You must have a balance of >= $50 to build images, and there is a deployment fee (described below) to deploy chutes.
To get your deposit back, perform a POST to the `/return_developer_deposit` endpoint, e.g.:
curl -XPOST https://api.chutes.ai/return_developer_deposit \
-H 'content-type: application/json' \
-H 'authorization: cpk_...' \
-d '{"address": "5EcZsewZSTxUaX8gwyHzkKsqT3NwLP1n2faZPyjttCeaPdYe"}'
The first step in getting an application onto the chutes platform is to build an image.
This SDK includes an image creation helper library as well, and we have a recommended base image which includes python 3.12 and all necessary cuda packages: parachutes/python:3.12
Here is an entire chutes application, with an image that includes vLLM -- let's store it in `llama1b.py`:
from chutes.chute import NodeSelector
from chutes.chute.template.vllm import build_vllm_chute
from chutes.image import Image
image = (
Image(username="chutes", name="vllm", tag="0.6.3", readme="## vLLM - fast, flexible llm inference")
.from_base("parachutes/python:3.12")
.run_command("pip install 'vllm<0.6.4' wheel packaging")
.run_command("pip install flash-attn")
.run_command("pip uninstall -y xformers")
)
chute = build_vllm_chute(
username="chutes",
readme="## Meta Llama 3.2 1B Instruct\n### Hello.",
model_name="unsloth/Llama-3.2-1B-Instruct",
image=image,
node_selector=NodeSelector(
gpu_count=1,
),
)
The `chutes.image.Image` class includes many helper directives for environment variables, adding files, installing python from source, etc.
To build this image, you can use the chutes CLI:
chutes build llama1b:chute --public --wait --debug
Explanation of the flags:
- `--public` means we want this image to be public/available for ANY user to use -- use with care, but we do like public/open source things!
- `--wait` means we want to stream the docker build logs back to the command line. All image builds occur remotely on our platform, so without the `--wait` flag you just have to wait for the image to become available, whereas with this flag you can see real-time logs/status.
- `--debug` enables additional debug logging.
Once you have an image that is built, pushed, and ready for use (see above), you can deploy applications on top of it.
To use the same example `llama1b.py` file outlined in the image building section above, we can deploy the llama-3.2-1b-instruct model with:
chutes deploy llama1b:chute
Note: this will ERROR and show you the deployment fee, as a safety mechanism, so you can confirm you want to accept that fee.
To acknowledge and accept the fee, you must pass `--accept-fee`, e.g. `chutes deploy llama1b:chute --accept-fee`
You are charged a one-time deployment fee per chute, equivalent to 3 times the hourly rate based on the node selector (meaning, `gpu_count` * cheapest compatible GPU type hourly rate). There is no deployment fee for any updates to existing chutes.
For example, if the `node_selector` has `gpu_count=1` and nothing else, the cheapest compatible GPU is $0.1/hr, so your deployment fee is $0.30.
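As a quick sanity check of the fee formula above, using the illustrative $0.1/hr rate from the example (real per-GPU rates come from the pricing endpoint referenced below):

```python
# Illustrative sanity check of the deployment fee formula above. The $0.10/hr
# rate is just the example value; real per-GPU hourly rates come from
# https://api.chutes.ai/pricing
gpu_count = 1
cheapest_compatible_hourly_rate = 0.10  # USD per GPU per hour (example value)

deployment_fee = 3 * gpu_count * cheapest_compatible_hourly_rate
print(f"one-time deployment fee: ${deployment_fee:.2f}")  # -> $0.30
```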
Be sure to carefully craft the `node_selector` option within the chute, to ensure the code runs on GPUs appropriate to the task.
node_selector=NodeSelector(
gpu_count=1,
# All options.
# gpu_count: int = Field(1, ge=1, le=8)
# min_vram_gb_per_gpu: int = Field(16, ge=16, le=80)
# include: Optional[List[str]] = None
# exclude: Optional[List[str]] = None
),
The most important fields are `gpu_count` and `min_vram_gb_per_gpu`. If you wish to include (or exclude) specific GPUs, you can do so, where the `include` (or `exclude`) fields take the short identifier per model, e.g. "a6000", "a100", etc. All supported GPUs and their short identifiers can be found via the pricing endpoint: https://api.chutes.ai/pricing
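For example, a selector that requires two higher-VRAM cards and only allows a couple of specific models might look like the sketch below (the GPU choices and VRAM floor are purely illustrative):

```python
from chutes.chute import NodeSelector

# Purely illustrative: require two GPUs with at least 24GB VRAM each, and
# only allow the a6000/a100 short identifiers mentioned above.
node_selector = NodeSelector(
    gpu_count=2,
    min_vram_gb_per_gpu=24,
    include=["a6000", "a100"],
)
```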
All user-created chutes are charged at the standard hourly rate (per GPU, based on your `gpu_count` value in the node selector), based on the cheapest compatible GPU type in the `node_selector` definition: https://api.chutes.ai/pricing
For example, if your chute can run on either a100 or h100, you are charged as though all instances are a100, even if it happens to deploy on h100s.
You can configure how much the chute will scale up, how quickly it scales up, and how quickly to spin down with the following flags:
chute = Chute(
...,
concurrency=10,
max_instances=3,
scaling_threshold=0.5,
shutdown_after_seconds=300
)
- `concurrency`: the maximum number of requests each instance can handle concurrently, which depends entirely on your code. For vLLM and SGLang template chutes, this value can be fairly high, e.g. 32+.
- `max_instances`: the maximum number of instances that can be active at a time.
- `scaling_threshold`: the ratio of average requests in flight per instance that will trigger creation of another instance, when the number of instances is lower than the configured `max_instances` value. For example, if your `concurrency` is set to 10, your `scaling_threshold` is 0.5, `max_instances` is 2, and you have one instance now, you will trigger a scale-up of another instance once the platform observes you have 5 or more requests in flight on average, consistently (i.e., you are using 50% of the concurrency supported by your chute).
- `shutdown_after_seconds`: the number of seconds to wait after the last request (per instance) before shutting down the instance, to avoid incurring any additional charges.
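A quick way to reason about when a scale-up will be triggered, re-deriving the numbers from the example above:

```python
# Re-deriving the scale-up trigger from the example above: with concurrency=10
# and scaling_threshold=0.5, a second instance is requested once the average
# number of in-flight requests per instance reaches concurrency * threshold.
concurrency = 10
scaling_threshold = 0.5

scale_up_trigger = concurrency * scaling_threshold
print(f"scale up once avg in-flight requests per instance >= {scale_up_trigger}")  # -> 5.0
```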
Deployment fee: you are charged a one-time deployment fee per chute, equivalent to 3 times the hourly rate based on the node selector (meaning, `gpu_count` * cheapest compatible GPU type hourly rate). There is no deployment fee for any updates to existing chutes.
You are charged the standard hourly rate while any instance is hot, based on the criteria specified above, up through the last request timestamp + `shutdown_after_seconds`.
You are not charged for "cold start" time (e.g., downloading the model, downloading the chute image, etc.). You are, however, charged for the `shutdown_after_seconds` window of compute while the instance is hot but not actively being called, since that window keeps the instance hot.
For example:
- deploy a chute at 12:00:00 (new chute, one-time node-selector based deployment fee; let's say a single 3090 at $0.12/hr = $0.36 total fee)
  - `max_instances` set to 1, `shutdown_after_seconds` set to 300
- send requests to the chute and/or call the warmup endpoint: 12:00:01 (no charge)
- first instance becomes hot and ready for use: 12:00:30 (billing at $0.12/hr starts here)
- continuously send requests to the instance (no per-request inference charges)
- stop sending requests at 12:05:00
  - triggers the instance shutdown timer based on `shutdown_after_seconds` for 5 minutes...
- instance shuts down at 12:10:00 (billing stops here)
Total charges are: $0.36 deployment fee + 5 minutes of active compute at $0.12/hr + 5 minutes of `shutdown_after_seconds` = $0.38
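The same arithmetic, spelled out with the rates and durations from the example above:

```python
# Re-computing the example above: a single 3090 at $0.12/hr.
hourly_rate = 0.12                        # USD/hr for the example GPU
deployment_fee = 3 * hourly_rate          # one-time fee = 3x hourly rate = $0.36

active_minutes = 5                        # time spent actively serving requests
idle_hot_minutes = 5                      # shutdown_after_seconds window (300s)
compute_cost = (active_minutes + idle_hot_minutes) / 60 * hourly_rate  # = $0.02

total = deployment_fee + compute_cost
print(f"total: ${total:.2f}")             # -> $0.38
```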
Now, suppose you want to use that chute again:
- start requests at 13:00:00
- instance becomes hot at 13:00:30 (billing starts at $0.12/hr here)
- stop requests at 13:05:30
- instance stays hot due to `shutdown_after_seconds` for 5 minutes
Total additional charges = 5 minutes of active compute + 5 minutes of shutdown delay = 10 minutes @ $0.12/hr = $0.02
If you share a chute with another user, they also pay standard rates for usage on the chute!
For any user-deployed chutes, the chutes are private, but they can be shared. You can either use the chutes share
entrypoint, or call the API endpoint directly.
chutes share --chute-id unsloth/Llama-3.2-1B-Instruct --user-id anotheruser
The `--chute-id` parameter can either be the chute name or the UUID. Likewise, `--user-id` can be either the username or the user's UUID.
When you share a chute with another user, you authorize that user to trigger the chute to scale up, and you as the chute owner are charged the hourly rate while it's running.
When the user you shared the chute with calls the chute, they are charged the standard rate (dependent on chute type, e.g. per million token for llms, per step on diffusion models, per second otherwise).
Chutes are in fact completely arbitrary, so you can customize to your heart's content.
Here's an example chute showing some of this functionality:
import asyncio
from typing import Optional
from pydantic import BaseModel, Field
from fastapi.responses import FileResponse
from chutes.image import Image
from chutes.chute import Chute, NodeSelector
image = (
Image(username="chutes", name="foo", tag="0.1", readme="## Base python+cuda image for chutes")
.from_base("parachutes/python:3.12")
)
chute = Chute(
username="test",
name="example",
readme="## Example Chute\n\n### Foo.\n\n```python\nprint('foo')```",
image=image,
concurrency=4,
node_selector=NodeSelector(
gpu_count=1,
# All options.
# gpu_count: int = Field(1, ge=1, le=8)
# min_vram_gb_per_gpu: int = Field(16, ge=16, le=80)
# include: Optional[List[str]] = None
# exclude: Optional[List[str]] = None
),
)
class MicroArgs(BaseModel):
foo: str = Field(..., max_length=100)
bar: int = Field(0, ge=0, le=100)
baz: bool = False
class FullArgs(MicroArgs):
bunny: Optional[str] = None
giraffe: Optional[bool] = False
zebra: Optional[int] = None
class ExampleOutput(BaseModel):
foo: str
bar: str
baz: Optional[str]
@chute.on_startup()
async def initialize(self):
self.billygoat = "billy"
print("Inside the startup function!")
@chute.cord(minimal_input_schema=MicroArgs)
async def echo(self, input_args: FullArgs) -> str:
return f"{self.billygoat} says: {input_args}"
@chute.cord()
async def complex(self, input_args: MicroArgs) -> ExampleOutput:
return ExampleOutput(foo=input_args.foo, bar=str(input_args.bar), baz=str(input_args.baz))
@chute.cord(
output_content_type="image/png",
public_api_path="/image",
public_api_method="GET",
)
async def image(self) -> FileResponse:
return FileResponse("parachute.png", media_type="image/png")
async def main():
print(await echo(FullArgs(foo="bar")))
if __name__ == "__main__":
asyncio.run(main())
The main things to notice here are the various `@chute.cord(...)` decorators and the `@chute.on_startup()` decorator.
Any code within the `@chute.on_startup()` decorated function(s) is executed when the application starts on the miner; it does not run in the local/client context.
Any function you decorate with `@chute.cord()` becomes a function that runs within the chute, i.e. not locally -- it's executed on the miners' hardware.
It is very important to give type hints to the functions, because the system automatically generates OpenAPI schemas for each function, for use with the public/hostname-based API via API keys, without requiring the chutes SDK to execute calls.
For a cord to be available from the public, subdomain-based API, you need to specify `public_api_path` and `public_api_method`, and if the return content type is anything other than `application/json`, you'll want to specify that as well (via `output_content_type`).
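For instance, the `image` cord defined above could then be fetched with a plain HTTP GET plus an API key. This is only a sketch: the hostname below is a placeholder, since the actual subdomain for your chute is assigned by the platform.

```python
# Hypothetical example of calling the public, subdomain-based API for the
# `image` cord defined above. The hostname is a placeholder -- use the actual
# subdomain assigned to your chute -- and cpk_... is a placeholder API key.
import httpx

response = httpx.get(
    "https://<your-chute-subdomain>.chutes.ai/image",  # placeholder hostname
    headers={"Authorization": "cpk_..."},
)
response.raise_for_status()
with open("parachute.png", "wb") as outfile:
    outfile.write(response.content)
```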
You can also spin up completely arbitrary webservers and do "passthrough" cords which pass along the request to the underlying webserver. This would be useful for things like using a webserver written in a different programming language, for example.
To see an example of passthrough functions and more complex functionality, see the vllm template chute/helper
It is also very important to specify `concurrency=N` in your `Chute(...)` constructor. In many cases, e.g. vLLM, this can be fairly high (based on max sequences), while in other cases without data parallelism, or with other sources of contention, you may wish to leave it at the default of 1.
If you'd like to test your image/chute before actually deploying onto the platform, you can build the image with `--local`, then run in dev mode:
chutes build llama1b:chute --local
Then, you can start a container with that image:
docker run --rm -it -e CHUTES_EXECUTION_CONTEXT=REMOTE -p 8000:8000 vllm:0.6.3 chutes run llama1b:chute --port 8000 --dev
Then, you can simply perform HTTP requests to your instance:
curl -XPOST http://127.0.0.1:8000/chat_stream -H 'content-type: application/json' -d '{
  "model": "unsloth/Llama-3.2-1B-Instruct",
  "messages": [{"role": "user", "content": "Give me a spicy mayo recipe."}],
  "temperature": 0.7,
  "seed": 42,
  "max_tokens": 3,
  "stream": true,
  "logprobs": true
}'
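If you'd rather test from Python, a roughly equivalent request against the same local dev server might look like this (httpx is just one choice of HTTP client):

```python
# Roughly equivalent local test using httpx (any HTTP client works); streams
# the response body from the dev server started above.
import httpx

payload = {
    "model": "unsloth/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Give me a spicy mayo recipe."}],
    "temperature": 0.7,
    "seed": 42,
    "max_tokens": 3,
    "stream": True,
    "logprobs": True,
}

with httpx.stream("POST", "http://127.0.0.1:8000/chat_stream", json=payload, timeout=60) as response:
    for chunk in response.iter_text():
        print(chunk, end="")
```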