Need help understanding validation discrepancies when training completes #22095

mwelliott · 2025-09-16T00:56:39Z

mwelliott
Sep 16, 2025

Howdy! I just had a 200 epoch object detection training session complete - this consisted of around 55k images and 130k~ bounding boxes.

As you know, once training completes we get some validation metrics. Here are the last few lines of the training, showing the last 2 epochs and then the validation output:

     Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     199/200      8.34G     0.3436      0.174     0.8592         53        640: 100%|██████████| 1369/1369 [19:27<00:00,  1.17it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 172/172 [02:17<00:00,  1.25it/s]
                   all      10951      26254      0.988      0.982      0.992      0.894

     Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     200/200      7.99G     0.3422      0.174     0.8585         53        640: 100%|██████████| 1369/1369 [19:15<00:00,  1.19it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 172/172 [02:14<00:00,  1.28it/s]
                   all      10951      26254      0.988      0.982      0.992      0.894


Validating runs\detect\15_55k_cleaned_upper_body_garments_train\weights\best.pt...
Ultralytics 8.3.179  Python-3.13.1 torch-2.8.0+cu128 CUDA:0 (NVIDIA GeForce RTX 3080 Ti Laptop GPU, 16384MiB)
Model summary (fused): 112 layers, 68,131,272 parameters, 0 gradients, 257.4 GFLOPs
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 172/172 [04:27<00:00,  1.55s/it]
                   all      10951      26254      0.763      0.935      0.872      0.791
                 front       4339       4339       0.97      0.996      0.993      0.958
                  back       4066       4066      0.987      0.997      0.994      0.961
            left_chest       6694       6698      0.584      0.893      0.843      0.745
           right_chest       6749       6750      0.588      0.882      0.839      0.739
   bottom_front_pocket        815        815      0.995      0.984      0.994      0.895
        on_left_pocket        328        328      0.813      0.939      0.915      0.803
       on_right_pocket         92         92      0.252      0.859      0.425      0.356
            upper_back       3166       3166      0.916      0.933      0.976      0.871
Speed: 0.1ms preprocess, 20.2ms inference, 0.0ms loss, 1.2ms postprocess per image
Results saved to runs\detect\15_55k_cleaned_upper_body_garments_train

I noticed that some of the classes had very low precision and map50-95 in the validation results, which doesn't seem to line up with the overall metrics we see via the individual epochs output. It also doesn't line up with real world usage, as these classes are very accurate (like left_chest and right_chest). So, I then ran the validation again manually using the same config, dataset, and environment immediately after the training completed. Essentially just immediately thought - hey, something is strange, let me run validation again. Here are the results of that:

val: Fast image access  (ping: 0.10.0 ms, read: 571.2136.5 MB/s, size: 339.9 KB)
val: Scanning C:\Users\uname\OneDrive\Desktop\vs code repos\hotspot_detection_trainer\data\val\labels.cache... 10951 images, 1 backgrounds, 0 corrupt: 100%|██████████| 10951/10951 [00:00<?, ?it/s]
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 685/685 [03:39<00:00,  3.13it/s]
                   all      10951      26254      0.988      0.982      0.992      0.896
                 front       4339       4339      0.998      0.998      0.995      0.963
                  back       4066       4066      0.997      0.998      0.994      0.967
            left_chest       6694       6698       0.98      0.976      0.992      0.875
           right_chest       6749       6750      0.981      0.977      0.992      0.872
   bottom_front_pocket        815        815      0.996       0.99      0.995      0.906
        on_left_pocket        328        328      0.991       0.99      0.995      0.878
       on_right_pocket         92         92      0.992      0.967      0.988      0.811
            upper_back       3166       3166      0.969      0.958      0.989      0.892
Speed: 0.1ms preprocess, 15.9ms inference, 0.0ms loss, 1.0ms postprocess per image
Results saved to runs\detect\val
mAP50-95: 0.8956561333633672
mAP50: 0.9924571289793388
mAP75: 0.9614263105193042

Interestingly- now the precision, recall, and mAP numbers are a lot higher. And they seem to align much better with what the individual epoch runs were showing in their rolled up metrics. Is this a bug, or am I just not understanding how these validation metrics work after a training run completes? I'd love to know - thank you so much for any knowledge or insights! :)

Quick edit - I am using Ultralytics 8.3.179 and this is a yolo11x model. Python version is 3.11.9

UltralyticsAssistant · 2025-09-16T00:57:20Z

UltralyticsAssistant
Sep 16, 2025
Maintainer

👋 Hello @mwelliott, thanks for sharing these detailed results and for using Ultralytics 🚀. This is an automated response to help you get support quickly — an Ultralytics engineer will also assist you here soon.

We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report regarding validation discrepancies, please provide a minimum reproducible example (MRE) to help us debug it. For this case, please include:

Exact ultralytics version shown in logs and how installed
Full training and validation commands (CLI) or code (Python), including seeds and any flags that affect evaluation
The specific checkpoint you validated (e.g., weights/best.pt vs weights/last.pt) and where it came from
Dataset path/config used for both train-time and post-train validation, and whether .cache files existed/changed
A short log snippet showing the end-of-epoch metrics and the subsequent standalone val output
System info: OS, Python, PyTorch, CUDA, GPU
Whether any augmentations/eval settings differ between the two validations

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the Ultralytics community where it suits you best. For real-time chat, head to Discord 🎧. Prefer in-depth discussions? Check out Discourse. Or dive into threads on our Subreddit to share knowledge with the community.

Upgrade

Upgrade to the latest ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8 to verify your issue is not already resolved in the latest version:

pip install -U ultralytics

Environments

YOLO may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLO Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

0 replies

Y-T-G · 2025-09-17T06:57:15Z

Y-T-G
Sep 17, 2025
Maintainer

Can you post your training command and the exact name of the model you trained?

6 replies

Y-T-G Sep 17, 2025
Maintainer

It's possible some results got corrupted during validation, so it resulted in lower mAP.

mwelliott Sep 17, 2025
Author

Thank you I really appreciate it - would you say that it is generally good practice to always re-run validation manually at the end of training to ensure we get the right the proper validation results?

glenn-jocher Sep 17, 2025
Maintainer

Good question—no, you shouldn’t need to re-run validation; the end-of-training val and a standalone val() use the same validator, and results should match when the same checkpoint, dataset, and args run in the same environment. Based on your logs, the automated run processed 172 batches while your manual run used 685 (different batch), and the first log shows Python 3.13.1 while you mention 3.11.9, suggesting two environments; to compare apples-to-apples, validate the saved checkpoint with the model’s stored overrides in the same env:

from ultralytics import YOLO

model = YOLO("runs/detect/16_55k_cleaned_upper_body_garments_with_sleeves_train/weights/best.pt")
metrics = model.val()  # reuses dataset/imgsz/rect from training
print(metrics.box.map, metrics.box.map50, metrics.box.map75)

If numbers still diverge, please share the output of yolo checks and the exact standalone val command, and try deleting any dataset labels/*.cache files before re-validating. For how validation works and what each metric means, see the Val mode guide and metrics deep dive in the docs: Model Validation with Ultralytics YOLO and Performance Metrics Deep Dive.

mwelliott Sep 17, 2025
Author

Thank you gentlemen! I have a new training run in progress, so once this completes I will give your recommendations a shot. Also I had the python environment wrong, it was indeed 3.13 and not 3.11 so the environments were the same (literally ran the validation on the same box/environment after training completed).

For my validation command I did not specify the batch size either, I just used:
model = YOLO(args.model)
metrics = model.val(plots=True, data=args.config_file)

So I will simplify the validation command to what you recommended, ensure the environment is indeed the same, and we'll see how it goes! stay tuned, might be a couple days :)
cheers

glenn-jocher Sep 18, 2025
Maintainer

Sounds good—one extra check while you’re at it: printing the checkpoint’s stored overrides will confirm you’re truly using the same data/imgsz/batch/rect/conf/iou at eval; if you pass data=... you replace what was saved with the model, so for a strict apples-to-apples comparison omit it and rely on the stored overrides (the validator merges overrides → kwargs). See the YOLO.val() reference for details.

from ultralytics import YOLO

m = YOLO("runs/detect/16_55k_cleaned_upper_body_garments_with_sleeves_train/weights/best.pt")
print({k: m.overrides.get(k) for k in ("data", "imgsz", "batch", "rect", "conf", "iou")})
metrics = m.val()  # uses the saved training overrides; no extra kwargs
print(metrics.box.map, metrics.box.map50, metrics.box.map75)

If results still diverge after matching those, share the new logs and we’ll dig deeper. For how overrides are applied and the available args, see the YOLO.val() reference and validation settings: YOLO.val() method reference and Validation settings.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ultralytics

Need help understanding validation discrepancies when training completes #22095

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 6 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Ultralytics

Need help understanding validation discrepancies when training completes #22095

Uh oh!

Uh oh!

mwelliott Sep 16, 2025

Replies: 2 comments · 6 replies

Uh oh!

UltralyticsAssistant Sep 16, 2025 Maintainer

Upgrade

Environments

Status

Uh oh!

Y-T-G Sep 17, 2025 Maintainer

Uh oh!

Y-T-G Sep 17, 2025 Maintainer

Uh oh!

mwelliott Sep 17, 2025 Author

Uh oh!

glenn-jocher Sep 17, 2025 Maintainer

Uh oh!

mwelliott Sep 17, 2025 Author

Uh oh!

glenn-jocher Sep 18, 2025 Maintainer

mwelliott
Sep 16, 2025

Replies: 2 comments 6 replies

UltralyticsAssistant
Sep 16, 2025
Maintainer

Y-T-G
Sep 17, 2025
Maintainer

Y-T-G Sep 17, 2025
Maintainer

mwelliott Sep 17, 2025
Author

glenn-jocher Sep 17, 2025
Maintainer

mwelliott Sep 17, 2025
Author

glenn-jocher Sep 18, 2025
Maintainer