NaN error #12

@zaocan666

Hi, excellent work here.
I encountered a NaN error when training with the config configs/train_mf_gan64d_april.config:
Traceback (most recent call last):
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/train.py", line 592, in <module>
    learning_curves = train_model(args, dataset, model, simulator)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/train.py", line 504, in train_model
    learning_curves = train_manifold_flow_sequential(args, dataset, model, simulator)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/train.py", line 276, in train_manifold_flow_sequential
    learning_curves = trainer1.train(
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 307, in train
    loss_train, loss_val, loss_contributions_train, loss_contributions_val = self.epoch(
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 380, in epoch
    batch_loss, batch_loss_contributions = self.batch_train(
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 513, in batch_train
    loss_contributions = self.forward_pass(batch_data, loss_functions, forward_kwargs=forward_kwargs, custom_kwargs=custom_kwargs)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 633, in forward_pass
    self._check_for_nans("Reconstructed data", x_reco)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 122, in _check_for_nans
    raise NanException
training.trainer.NanException

I am using 5 GPUs and PyTorch 1.7.1.
Have you ever encountered this problem?
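
For context, the traceback ends in a guard that raises as soon as the reconstructed data contains a NaN. The following is a minimal, framework-agnostic sketch of that kind of check, extended to report how many entries are affected and where the first one sits. It is not the repository's actual `_check_for_nans`: the names `check_for_nans` and `_flatten` and the message format are assumptions made for illustration, and it works on nested Python lists rather than tensors (in PyTorch, `torch.isfinite` plays the role of `math.isfinite` here).

```python
import math


class NanException(Exception):
    """Raised when a checked quantity contains NaN or infinite values."""


def _flatten(values):
    """Yield every scalar in an arbitrarily nested list or tuple."""
    if isinstance(values, (list, tuple)):
        for item in values:
            yield from _flatten(item)
    else:
        yield values


def check_for_nans(label, values):
    """Hypothetical helper: raise NanException if any entry is NaN or +/-inf."""
    flat = list(_flatten(values))
    # Collect the flat indices of all non-finite entries.
    bad = [i for i, v in enumerate(flat) if not math.isfinite(v)]
    if bad:
        raise NanException(
            f"{label}: {len(bad)} of {len(flat)} entries are non-finite "
            f"(first at flat index {bad[0]})"
        )


# Finite data passes silently.
check_for_nans("Inputs", [[0.1, 0.2], [0.3, 0.4]])

# Data containing NaN or inf raises with a descriptive message.
try:
    check_for_nans("Reconstructed data", [[0.1, float("nan")], [0.3, float("inf")]])
except NanException as err:
    message = str(err)
```

Running the same kind of check on the inputs and on intermediate quantities, not only on the reconstruction, would show whether the NaNs are already present in the batch or first appear inside the model.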
