Description
This PR shows how to use DeepSpeed with deferred model loading to serve a large model like opt-30b.
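For context, here is a minimal sketch of the deferred-loading pattern this refers to; the model name, dtype, `tp_size` source, and `checkpoints.json` path are illustrative assumptions, not necessarily the exact handler code in this PR:

```python
import os

import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

# Deferred loading: build the module structure on the meta device so no
# real weights are allocated or read from disk yet.
config = AutoConfig.from_pretrained("facebook/opt-30b")  # assumed model
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# init_inference then shards the model across GPUs and streams in the
# actual weights; "checkpoints.json" is an illustrative checkpoint index.
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": int(os.getenv("WORLD_SIZE", "1"))},
    dtype=torch.float16,
    checkpoint="checkpoints.json",
    replace_with_kernel_inject=True,
)
```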
Fixes #(issue)
Type of change
Please delete options that are not relevant.
Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
New feature (non-breaking change which adds functionality)
This change requires a documentation update
Feature/Issue validation/testing
Please describe the unit or integration tests that you ran to verify your changes, summarize the relevant results, and provide instructions so they can be reproduced.
Please also list any relevant details of your test configuration.
Client
curl -v "http://localhost:8080/predictions/opt" -T sample_text.txt
* Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> PUT /predictions/opt HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.87.0
> Accept: */*
> Content-Length: 54
> Expect: 100-continue
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200
< x-request-id: 70c9a4ff-6c57-4d65-99bc-d42adf0f179c
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 59
< connection: keep-alive
<
Today the weather is really nice and I am planning on
* Connection #0 to host localhost left intact
going
Server
2023-06-21T23:16:57,280 [INFO ] W-29500-opt_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1687389417280
2023-06-21T23:16:57,281 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Backend received inference at: 1687389417
2023-06-21T23:16:57,281 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Received text: 'Today the weather is really nice and I am planning on
2023-06-21T23:16:57,281 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - '
2023-06-21T23:16:57,282 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Backend received inference at: 1687389417
2023-06-21T23:16:57,282 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Received text: 'Today the weather is really nice and I am planning on
2023-06-21T23:16:57,282 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - '
2023-06-21T23:16:57,282 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Backend received inference at: 1687389417
2023-06-21T23:16:57,282 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Received text: 'Today the weather is really nice and I am planning on
2023-06-21T23:16:57,283 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - '
2023-06-21T23:16:57,283 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Backend received inference at: 1687389417
2023-06-21T23:16:57,283 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Received text: 'Today the weather is really nice and I am planning on
2023-06-21T23:16:57,283 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - '
2023-06-21T23:16:57,289 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Input length of input_ids is 13, but `max_length` is set to 10. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
2023-06-21T23:16:57,289 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Input length of input_ids is 13, but `max_length` is set to 10. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
2023-06-21T23:16:57,290 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Input length of input_ids is 13, but `max_length` is set to 10. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
2023-06-21T23:16:57,290 [WARN ] W-29500-opt_1.0-stderr MODEL_LOG - Input length of input_ids is 13, but `max_length` is set to 10. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
2023-06-21T23:16:58,088 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - ------------------------------------------------------
2023-06-21T23:16:58,088 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Free memory : 6.170227 (GigaBytes)
2023-06-21T23:16:58,088 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Total memory: 22.056641 (GigaBytes)
2023-06-21T23:16:58,088 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Requested memory: 0.984375 (GigaBytes)
2023-06-21T23:16:58,088 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Setting maximum total tokens (input + output) to 1024
2023-06-21T23:16:58,088 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - WorkSpace: 0x7f59f8000000
2023-06-21T23:16:58,088 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - ------------------------------------------------------
2023-06-21T23:16:58,256 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Generated text: ['Today the weather is really nice and I am planning on\ngoing']
2023-06-21T23:16:58,256 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Generated text: ['Today the weather is really nice and I am planning on\ngoing']
2023-06-21T23:16:58,256 [INFO ] W-29500-opt_1.0-stdout MODEL_LOG - Generated text: ['Today the weather is really nice and I am planning on\ngoing']
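A note on the `max_length` warnings in the log above: they come from Hugging Face `generate()`, because the 13-token prompt already exceeds `max_length=10`. A minimal sketch of the suggested fix, assuming the handler calls `generate()` directly and reusing the `model` from the sketch above (the tokenizer name and the value 30 are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")  # assumed tokenizer
inputs = tokenizer(
    "Today the weather is really nice and I am planning on",
    return_tensors="pt",
)

# max_new_tokens bounds only the generated continuation, whereas
# max_length bounds prompt + continuation together, which is what
# triggered the warning for a 13-token prompt with max_length=10.
outputs = model.generate(inputs.input_ids, max_new_tokens=30)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```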
Checklist:
Did you have fun?
Have you added tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?
Merging #2419 (858bd83) into master (a77a150) will decrease coverage by 0.12%.
The diff coverage is 0.00%.
❗ Current head 858bd83 differs from the pull request's most recent head 1d967d9. Consider uploading reports for commit 1d967d9 to get more accurate results.
@ankithagunapal can you please move the handler, README, requirements.txt, and sample_text.txt from the opt folder to the parent folder? Let's keep only the model_config file in the opt folder. These files seem to be general regardless of the model, and we can always add other models to the README if needed; otherwise it would be repetitive.
@agunapal, I noticed there are three 'Generated text' lines in the serve output. I think it's because you used three GPUs. However, is it correct that each GPU generates one output?
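If the duplication is indeed one line per GPU, that matches how tensor parallelism works: every rank runs the same handler code and computes the same output. A common pattern, sketched here with illustrative names rather than the handler's actual code, is to log and return the result only from rank 0:

```python
import logging
import os

logger = logging.getLogger(__name__)

# All tensor-parallel ranks produce identical generations, so emit the
# result from local rank 0 only to avoid duplicate "Generated text" lines.
if int(os.getenv("LOCAL_RANK", "0")) == 0:
    logger.info("Generated text: %s", generated_text)  # generated_text assumed from generate()
```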