Description
Is your feature request related to a problem? Please describe.
Currently, InstructLab has no documented limits on the cumulative size of documents its SDG engine can process (and, more broadly, no documented limits or resource requirements for taxonomies in general). In practice, however, such limits do exist and need to be established. I will provide an example taxonomy and the resource spike it caused in SDG as a reference.
The taxonomy is listed in a comment below this issue. It contains one knowledge qna.yaml file with 8 knowledge seed examples, referencing 8 PDFs that are cumulatively ~64 MB in size and roughly 10,000 pages. Run this through ilab data generate with the agentic SDG pipeline (exact parameters below) against a full-scale Mixtral teacher model with the appropriate LoRA adapters for knowledge and skills:
ilab data generate --taxonomy-path /var/mnt/inststg1/instructlab/job/taxonomy/ --taxonomy-base empty --output-dir /var/mnt/inststg1/instructlab/job/job_out/artifacts --endpoint-url http://localhost:8081/v1 --server-ctx-size 32768 --model /instructlab/models/mixtral-8x7b-instruct-v0-1 --sdg-scale-factor 30
The individual node datasets and MMLU files ultimately generate, but when they go through data mixing with the replay buffer to form the final skills and knowledge payload that goes into training, the machine spikes past 80 GB of memory usage and the process is OOM-killed. It stops at 80 GB only because that is all the memory the machine has; I believe it would spike higher. I am providing logs of the run for reference, along with journalctl logs showing the memory the process was using before it was killed.
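To help reproduce the measurement on other machines, here is a minimal sketch of how the peak memory of the `ilab data generate` process tree could be tracked while data mixing runs. This is not part of InstructLab; it assumes psutil is installed and that the PID of the running process is passed on the command line.

```python
# Sketch: poll a running SDG process and record its peak RSS so the
# memory spike during data mixing can be quantified.
# Assumes psutil is installed; the PID of `ilab data generate` is argv[1].
import sys
import time

import psutil


def track_peak_rss(pid: int, interval: float = 1.0) -> int:
    proc = psutil.Process(pid)
    peak = 0
    try:
        while proc.is_running():
            rss = proc.memory_info().rss
            # Include any child workers spawned for generation/mixing.
            for child in proc.children(recursive=True):
                try:
                    rss += child.memory_info().rss
                except psutil.NoSuchProcess:
                    pass
            peak = max(peak, rss)
            time.sleep(interval)
    except psutil.NoSuchProcess:
        pass  # process exited (or was OOM-killed)
    return peak


if __name__ == "__main__":
    peak_bytes = track_peak_rss(int(sys.argv[1]))
    print(f"peak RSS (process tree): {peak_bytes / 1024**3:.1f} GiB")
```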
Describe the solution you'd like
This raises a larger point: InstructLab should, from an SDG perspective, publish the limits or resource requirements of the machines needed to process a given payload. In the longer term we may also be able to improve data mixing so it does not spike resources this severely; ultimately, however, InstructLab will have limits on the amount of information it can process depending on the hardware it is running on. These limits should be documented and ideally also validated (potentially through ilab taxonomy diff) against the machine ilab is being run on, to ensure the process will not result in an out-of-memory event.
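As a rough illustration of what such a pre-flight validation could look like, here is a sketch that sums the size of knowledge documents under a taxonomy tree and warns when the total exceeds a limit. The 64 MiB threshold is only the size that triggered the OOM in this report, not an official InstructLab limit, and the sketch assumes the referenced PDFs/markdown files have already been fetched locally under the taxonomy path (it does not resolve document references inside qna.yaml).

```python
# Sketch of a hypothetical pre-flight payload check; the limit below is an
# assumption for illustration, not a documented InstructLab value.
from pathlib import Path

MAX_KNOWLEDGE_DOC_BYTES = 64 * 1024 * 1024  # assumed tested limit
DOC_SUFFIXES = {".pdf", ".md"}


def knowledge_doc_bytes(taxonomy_root: str) -> int:
    """Sum the size of knowledge documents stored under the taxonomy tree."""
    total = 0
    for path in Path(taxonomy_root).rglob("*"):
        if path.is_file() and path.suffix.lower() in DOC_SUFFIXES:
            total += path.stat().st_size
    return total


def check_payload(taxonomy_root: str) -> None:
    total = knowledge_doc_bytes(taxonomy_root)
    if total > MAX_KNOWLEDGE_DOC_BYTES:
        print(
            f"warning: {total / 1024**2:.0f} MiB of knowledge documents exceeds "
            f"the tested limit of {MAX_KNOWLEDGE_DOC_BYTES / 1024**2:.0f} MiB; "
            "data mixing may exhaust memory on this machine"
        )
    else:
        print(f"ok: {total / 1024**2:.0f} MiB of knowledge documents")


if __name__ == "__main__":
    check_payload("/var/mnt/inststg1/instructlab/job/taxonomy")
```

A check along these lines could run as part of ilab taxonomy diff, so the user is warned before generation starts rather than after the machine runs out of memory.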
Acceptance Criteria:
- What are the profiled (or tested) limits for InstructLab when it comes to the size of a taxonomy? Specifically, in a few dimensions:
  - Cumulative size of PDF knowledge documents associated with a taxonomy
  - Cumulative size of md files associated with a taxonomy
  - Cumulative size of knowledge leaf node seed examples associated with a taxonomy (knowledge is special here as there is a multiplicative factor with the size of the documents associated with a knowledge leaf node)
  - Cumulative size of skills leaf node seed examples associated with a taxonomy
- Document the tested limits and provide a way (potentially through ilab taxonomy diff) to tell the user when the knowledge document and/or taxonomy size of the payload they are generating is outside the stress-tested bounds of InstructLab and could lead to unstable results. Without this, users can execute the job and it will simply end in an out-of-memory event on the machine, as in the reproducible example above.
Additional context