@brycejoh16 (Collaborator) commented May 29, 2025

TL;DR: This update adds the ability for METL fine-tuned models to use an ordinal-specific loss (either CORN or CORAL loss), originally implemented by Sebastian Raschka here, along with other small updates. I recommend looking at the Loss Arguments section of notebooks/finetuning.ipynb for a comprehensive view of how to run these models with an ordinal-specific loss.

This PR is a work in progress: CORAL still needs to be tested on a GPU (update May 31, 2025 - tested) and multilibrary CORAL loss needs to be implemented. However, given that the auxiliary inputs are available to the CORAL layer, these remaining updates should be minimal.
Additionally, inference.py needs to be tested with these functions.

compute_rosetta_standardization.py

  • allow --columns2ignore in code/compute_rosetta_standardization.py; these columns are not saved to the database. This is important for situations where columns after the chosen start column (set via --energies_start_col) should not be included. A sketch of the filtering logic follows.
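A minimal sketch of the idea, assuming the energies live in a pandas DataFrame; the function and variable names here are illustrative, not the PR's exact implementation:

```python
import pandas as pd

def select_energy_columns(df: pd.DataFrame, energies_start_col: str,
                          columns2ignore: list[str]) -> pd.DataFrame:
    # keep columns from the start column onward, minus the ones
    # named in --columns2ignore
    start = df.columns.get_loc(energies_start_col)
    return df[[c for c in df.columns[start:] if c not in columns2ignore]]
```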

datamodules.py

  • --num_classes is required for all ordinal loss functions.
  • allow for class-imbalance weighting through the flag --use_importance_weights, which derives weights from the training set class balance. You can set your own importance weights through set_importance_weights. This can be done for MSE, CORN, and CORAL loss; see the sketch after this list.
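A minimal sketch of inverse-frequency importance weights, assuming integer class labels 0..N-1; the PR's set_importance_weights may differ in detail:

```python
import torch

def inverse_frequency_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    # weight each class inversely to its frequency in the training set
    counts = torch.bincount(labels, minlength=num_classes).float()
    return counts.sum() / (num_classes * counts.clamp(min=1))
```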

models.py

  • allow for a CORAL-specific task layer (top_net_type), which includes class-specific bias terms; see the sketch after this list.
  • top_net_output_dim is specific to CORN: the task layer must output one probability per ordinal threshold, so it takes dimension N-1, where N is the number of classes.
  • preinit_bias is CORAL-specific and initializes the custom bias terms for better convergence in practice (although this has not been tested with DMS data).
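A minimal sketch of a CORAL-style task layer: one shared linear unit plus N-1 class-specific bias terms (after Cao, Mirjalili, and Raschka). The PR's top_net_type="coral" layer may differ in detail:

```python
import torch
import torch.nn as nn

class CoralHead(nn.Module):
    def __init__(self, in_features: int, num_classes: int, preinit_bias: bool = True):
        super().__init__()
        self.fc = nn.Linear(in_features, 1, bias=False)  # weights shared across thresholds
        if preinit_bias:
            # ordered bias initialization, which tends to help convergence in practice
            init = torch.arange(num_classes - 1, 0, -1).float() / (num_classes - 1)
        else:
            init = torch.zeros(num_classes - 1)
        self.coral_bias = nn.Parameter(init)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output shape (batch, num_classes - 1): one logit per ordinal threshold
        return self.fc(x) + self.coral_bias
```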

parse_rosetta_data.py

  • the parameter int_cols allows the user to specify which columns of the Rosetta energy terms are integers; these columns must be explicitly defined due to an HDF saving error. A sketch of the cast follows.
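A minimal sketch of the explicit cast before writing to HDF; names are illustrative:

```python
import pandas as pd

def cast_int_cols(df: pd.DataFrame, int_cols: list[str]) -> pd.DataFrame:
    # explicitly cast integer energy columns so mixed-type columns
    # don't break the HDF save
    for col in int_cols:
        df[col] = df[col].astype("int64")
    return df
```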

tasks.py

The majority of the updates are here.

  • --loss_func allows the user to choose a loss from 'mse', 'corn', and 'coral'; a selection sketch follows this list.
  • --corn_coral_log_feature only affects the logging process. CORN and CORAL save all N-1 probability tasks (where N is the number of classes) along with the true predictions, but parity plots and metrics such as Spearman and Pearson correlation are calculated on a single column. This flag tells the logging process which column to use.
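A minimal sketch of the loss selection, using Raschka's coral_pytorch package; the actual wiring in tasks.py may differ in detail:

```python
import torch.nn.functional as F
from coral_pytorch.dataset import levels_from_labelbatch
from coral_pytorch.losses import coral_loss, corn_loss

def compute_loss(loss_func, logits, labels, num_classes):
    if loss_func == "mse":
        return F.mse_loss(logits.squeeze(-1), labels.float())
    if loss_func == "coral":
        # expand integer labels into N-1 binary threshold levels
        levels = levels_from_labelbatch(labels, num_classes).to(logits.device)
        return coral_loss(logits, levels)
    if loss_func == "corn":
        return corn_loss(logits, labels, num_classes)
    raise ValueError(f"unknown loss_func: {loss_func}")
```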

train_target_model.py & training_utils.py

  • Updates to the logging function maintain legacy functionality while allowing for matrix outputs from models, as in the case of CORN and CORAL models; see the sketch below.
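A minimal sketch of the column selection for legacy logging; names are illustrative:

```python
import torch

def select_log_column(outputs: torch.Tensor, log_feature: int) -> torch.Tensor:
    # legacy models emit a single prediction column; CORN/CORAL emit a
    # (batch, N-1) matrix, so pick the column chosen for metric logging
    if outputs.ndim == 1:
        return outputs
    return outputs[:, log_feature]
```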

finetuning.ipynb

  • The Loss Arguments section contains the details for running with the ordinal losses described above.

brycejoh16 added 23 commits May 28, 2025 09:55
…luded beyond a single key specifying the start of the Rosetta energy terms. This is problematic when many Rosetta energy functions are combined into one, or the ordering of columns is not guaranteed.
…s to a warning telling the user that they are proceeding at their own risk
…les.py object does not need to be passed into the transfer_model, which is unserializable in hparams
@brycejoh16 requested a review from samgelman May 29, 2025 20:06
```python
                    type=str, default="")
parser.add_argument("--aux_input_num",
                    help="number of auxiliary inputs",
                    type=int, default=1)
```

We should be able to get the number of auxiliary inputs from the given aux_input_names, which would avoid needing a separate argument.
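For example, assuming aux_input_names is a comma-separated string (consistent with the type=str argument above), a minimal sketch of the suggestion:

```python
# derive the count from the names rather than a separate argument;
# the example value is illustrative, not from the PR
aux_input_names = "rosetta_energy,hydrophobicity"
aux_input_num = len(aux_input_names.split(",")) if aux_input_names else 0
```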

@samgelman (Collaborator)

I reviewed the diff but didn't run any code. Overall, it looks good to me so far. There's a comment above about getting the number of auxiliary inputs from the given auxiliary input names instead of providing it as a separate argument. Another minor suggestion would be to follow the same coding style as the existing code / PEP8. The new code does this for the most part, but there are some places where an additional space needs to be inserted like choices=["linear", "nonlinear", "sklearn","coral"]. If you're using PyCharm, there's built-in functionality to auto format your code.

@brycejoh16 (Collaborator, Author)

Ask Arnav to do a quick skim once inference and multilibrary CORAL are done.
