Description
I'm interested in implementing code for automatically determining the optimal runtime parameters given some model and memory constraints. I imagine the implementation would use something like a "dummy" parameter which, when set, does not result in any actual memory allocations but enables the creation of llama_model and llama_context dummies that can be used to determine how much memory would be used for some choice of llama_model_params and llama_context_params. By comparing the amount of memory that was used for the dummies with the amount of memory that is actually available, the implementation could then iteratively optimize parameters such as context size or the number of GPU layers.
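To illustrate what I have in mind from the caller's side, here is a minimal sketch. The `dummy` field and `llama_get_dummy_memory_used()` are hypothetical placeholders for the proposal and do not exist in the current API (so this will not compile against today's headers); the other calls are the existing llama.h API.

```cpp
#include "llama.h"

#include <cstddef>

// hypothetical accessor proposed here, NOT part of llama.h today:
size_t llama_get_dummy_memory_used(const llama_context * ctx);

static size_t estimate_memory_use(const char * path, int n_gpu_layers, int n_ctx) {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = n_gpu_layers;
    mparams.dummy        = true; // hypothetical flag: suppress all real allocations

    llama_model * model = llama_model_load_from_file(path, mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = n_ctx;

    llama_context * ctx = llama_init_from_model(model, cparams);

    // hypothetical: total bytes that *would* have been allocated for this
    // model/context combination, summed over all devices
    size_t needed = llama_get_dummy_memory_used(ctx);

    llama_free(ctx);
    llama_model_free(model);
    return needed;
}
```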
One roadblock that I have run into is how to make this implementation minimally invasive for the rest of the code. Right now I think the way to do it would be:
- Extend `ggml_backend_device` to track the amount of memory that has been allocated to this device by the current process.
- Add a function like `ggml_backend_dev_get_device_dummy` that returns a dummy instead of the actual device.
- In llama.cpp, conditionally fetch the dummy devices. Some additional logic in `llama-model-load.cpp` will still be needed to avoid temporarily loading data from disk to RAM.
- Extend the logic of `llama_decode` a bit to allow for determining the allocated size of the worst-case graph.
- In the runtime parameter optimization code, simply iterate over the dummy devices and retrieve the amount of memory that was allocated (see the sketch after this list).
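A rough sketch of how that last step could look, assuming the above is in place. Only `ggml_backend_dev_count/get/memory` are existing ggml API; `ggml_backend_dev_get_device_dummy`, `ggml_backend_dev_dummy_allocated` and `measure_with_dummies` are declared here only as hypothetical placeholders for the proposed dummy-device support.

```cpp
#include "ggml-backend.h"

#include <cstddef>

// proposed/hypothetical, not part of ggml today:
ggml_backend_dev_t ggml_backend_dev_get_device_dummy(ggml_backend_dev_t dev);
size_t             ggml_backend_dev_dummy_allocated (ggml_backend_dev_t dummy);
void               measure_with_dummies(int n_gpu_layers, int n_ctx); // model load + worst-case llama_decode on dummies

static bool fits_in_memory(int n_gpu_layers, int n_ctx) {
    // record how much memory this configuration would allocate per device
    measure_with_dummies(n_gpu_layers, n_ctx);

    // compare the recorded allocations with the memory that is actually free
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev   = ggml_backend_dev_get(i);
        ggml_backend_dev_t dummy = ggml_backend_dev_get_device_dummy(dev);

        size_t free_mem = 0, total_mem = 0;
        ggml_backend_dev_memory(dev, &free_mem, &total_mem);

        if (ggml_backend_dev_dummy_allocated(dummy) > free_mem) {
            return false;
        }
    }
    return true;
}

// simple search: offload as many layers as possible for a fixed context size
static int pick_n_gpu_layers(int n_layers_total, int n_ctx) {
    for (int ngl = n_layers_total; ngl >= 0; --ngl) {
        if (fits_in_memory(ngl, n_ctx)) {
            return ngl;
        }
    }
    return 0;
}
```

A binary search over the number of GPU layers (or over context size) would converge faster than the linear scan; the loop above just keeps the sketch simple.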
I'm very much open to suggestions, particularly from @slaren .