Bug description
The issue
The exact same model trains considerably slower if the last layer is interpreted as multiple outputs rather than a single output. A benchmark I've built uses a model with about 72 million weights across several linear layers, where the last layer has 4 weights. I compare the case where these weights are treated as a single output, with one metric and one loss, against cases with two losses + metrics and with four losses + metrics. I've found what appears to be a performance degradation that grows with each additional metric and loss, accumulating to as much as 40-50% in some runs. The same does not appear to replicate in equivalent training loops I've set up with vanilla PyTorch.
Given that each step of the training loop involves hundreds of millions of operations (forward + backward pass), a performance drop of tens of percent just to support another loss + metric is quite a hefty price to pay, and at least part of the issue appears to be introduced by Lightning and isn't present in vanilla PyTorch.
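Roughly, the difference between the variants looks like the sketch below. This is a simplified stand-in, not the actual benchmark code: the layer sizes, the MSE loss/metric choice, the 2-way split, and the SingleHead/TwoHeads names are placeholders I'm using here for illustration.

```python
import torch
import torch.nn.functional as F
import lightning as L
from torchmetrics import MeanSquaredError


class SingleHead(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(8192, 8192)    # stand-in for the ~72M-weight stack
        self.neck = torch.nn.Linear(8192, 1)
        self.head = torch.nn.Linear(1, 4, bias=False)  # last layer with 4 weights
        self.metric = MeanSquaredError()

    def training_step(self, batch, batch_idx):
        x, y = batch
        out = self.head(self.neck(self.backbone(x)))
        loss = F.mse_loss(out, y)          # one loss over all 4 values
        self.metric(out, y)
        self.log("mse", self.metric)       # one metric
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


class TwoHeads(SingleHead):
    def __init__(self):
        super().__init__()
        # metrics held as flat attributes, no lists/dicts
        self.metric_a = MeanSquaredError()
        self.metric_b = MeanSquaredError()

    def training_step(self, batch, batch_idx):
        x, y = batch
        out = self.head(self.neck(self.backbone(x)))
        loss_a = F.mse_loss(out[:, :2], y[:, :2])   # loss + metric for "head" A
        loss_b = F.mse_loss(out[:, 2:], y[:, 2:])   # loss + metric for "head" B
        self.metric_a(out[:, :2], y[:, :2])
        self.metric_b(out[:, 2:], y[:, 2:])
        self.log("mse_a", self.metric_a)
        self.log("mse_b", self.metric_b)
        return loss_a + loss_b
```

The compute done by the model itself is identical in both variants; only the number of loss/metric objects and self.log calls changes.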
The benchmark
The Lightning benchmark runs with barebones=True, on a machine with 2 CPUs and one GPU (a container on an A100). The data is a set of in-memory random vectors with random labels. I've written the models and metrics as "flat" as I can, without any nested structures to hold them (no lists/dicts, no nn.Sequential, etc.). The training loop runs for 10_000 steps in a single training epoch, with no validation epoch. I use seed_everything to make the experiment deterministic across runs, in both the vanilla PyTorch and Lightning cases. In the vanilla case I made sure all computations happen on-device.
(I've also tried changing the order in which I call each training loop, thinking maybe there's some global system state that changes from call to call, but the results replicate regardless of order.)
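The harness itself looks roughly like this. Again a simplified sketch, not the benchmark code: the dataset size, batch size, and dimensions are placeholders, and TwoHeads/SingleHead refer to the stand-in modules sketched above.

```python
import torch
import lightning as L
from torch.utils.data import DataLoader, TensorDataset

L.seed_everything(42)  # deterministic across runs

# in-memory random vectors with random labels (sizes here are placeholders; the
# real benchmark's dataset covers 10_000 steps in a single epoch)
x = torch.randn(1_024, 8192)
y = torch.randn(1_024, 4)
loader = DataLoader(TensorDataset(x, y), batch_size=32)

model = TwoHeads()  # or SingleHead(), from the sketch above
trainer = L.Trainer(
    accelerator="gpu",
    devices=1,
    max_steps=10_000,
    barebones=True,  # strips logging/checkpointing/progress bar to reduce noise
)
trainer.fit(model, loader)  # no validation dataloader, so no validation epoch
```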
Further info
I built this benchmark after encountering the performance degradation in my actual work. I used the Lightning profiler to investigate and found that the entire difference in performance between having one head and two is that there were twice as many function calls associated with metrics and losses. I've disabled profiling in the benchmark in favor of the barebones approach to reduce noise as much as possible, so the results are fairly black-box, but this could definitely be a direction to investigate.
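For reference, the profiling run that pointed me in this direction looked roughly like this, i.e. without barebones and with one of Lightning's profilers attached (again a sketch; model and loader are the placeholders from above):

```python
import lightning as L
from lightning.pytorch.profilers import SimpleProfiler

# Same model/loader as in the sketches above; a shorter run is enough to see the pattern.
trainer = L.Trainer(
    accelerator="gpu",
    devices=1,
    max_steps=1_000,
    profiler=SimpleProfiler(filename="perf_report"),
)
trainer.fit(model, loader)
# Compare the counts/durations of the metric- and loss-related calls in the report
# between the single-head and multi-head variants; everything else stays roughly flat.
```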
What version are you seeing the problem on?
v2.4
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response