Inspect and compare self-supervised vision model representations for DINOv2, I-JEPA, V-JEPA 2, and EUPE.
- How AI Models See the World -- presentation deck for latent-inspector: project narrative, model comparison setup, corrected EUPE and V-JEPA 2 interpretations, and the representation-geometry thesis.
- Four-Model Elephant Compare Report -- self-contained sample report with per-model metrics, pairwise CKA and k-NN overlap, PCA projections, and exported artifact metadata.
Each pixel block below is a patch token projected onto the top three principal components (mapped to RGB). Contiguous color = the model groups those patches into a similar representation neighborhood. Same image, same patches, four different answers about what matters.
```bash
git clone https://github.com/AbdelStark/latent-inspector.git
cd latent-inspector
cargo build --release

# List models and cache state
./target/release/latent-inspector models

# Compare three models on a single image (models auto-download on first use)
./target/release/latent-inspector compare photo.jpg \
  --models dinov2-vit-l14,ijepa-vit-h14,vjepa2-vitl-img16-256

# Deep-dive into one model
./target/release/latent-inspector inspect photo.jpg --model dinov2-vit-l14

# Interactive TUI (real analysis when an image is provided)
./target/release/latent-inspector tui photo.jpg -m dinov2-vit-l14,ijepa-vit-h14

# Profile a model over a dataset (isotropy, uniformity, intrinsic dimensionality)
./target/release/latent-inspector profile --model dinov2-vit-l14 --dataset images/

# Stub backend for development (no model downloads, validation downgraded to unverified)
LATENT_INSPECTOR_MODEL_BACKEND=stub \
  ./target/release/latent-inspector compare photo.jpg \
  --models dinov2-vit-l14,clip-vit-l14
```

Different SSL objectives produce different internal representations of the same image. You can read the papers and get intuitions about what should differ. But intuitions are wrong often enough that you should measure.
- DINOv2 -- self-distillation across augmented views. Patches in similar semantic regions get pushed together. The result looks like unsupervised segmentation.
- I-JEPA -- predict masked patches in latent space (not pixel space). Each patch must encode enough context to reconstruct its neighbors abstractly. On the elephant sample below, its patch entropy is higher than DINOv2's (2.89 vs 2.52), which is what this objective would lead you to expect.
- V-JEPA 2 -- JEPA on video. Learns spatiotemporal structure from internet-scale video. Even on a still image, the encoder carries a prior about how the world moves.
- EUPE -- proxy distillation from a large universal teacher into a compact encoder. The representation is a learned compromise across perception tasks, not a direct small-student multi-teacher baseline.
- MAE -- reconstruct masked pixels. Must encode enough detail to literally redraw what was hidden.
- CLIP -- align images with text. The representation is shaped by language, not just visual similarity.
latent-inspector makes these differences concrete. Numbers, not vibes.
| Model | Architecture | Params | Method | Status |
|---|---|---|---|---|
| DINOv2 | ViT-L/14 | 304M | Self-distillation + centering | Ready |
| I-JEPA | ViT-H/14 | 632M | Joint embedding predictive | Ready |
| V-JEPA 2 | ViT-L/16 | 304M | Video joint embedding predictive | Ready |
| EUPE | ViT-B/16 | 86M | Proxy distillation | Ready |
| DINOv3 | ViT-L/14 | 304M | Self-distillation + Gram anchoring | Planned |
| MAE | ViT-L/16 | 304M | Masked autoencoder | Planned |
| CLIP | ViT-L/14 | 304M | Contrastive image-text | Planned |
| SigLIP | ViT-SO400M/14 | 400M | Sigmoid contrastive image-text | Planned |
Models download on first use (~1-2 GB each) and are SHA-256 verified. Downloads retry on transient HTTP failures. Override cache location with LATENT_INSPECTOR_CACHE_DIR.
Model provenance and ONNX artifacts
Everything runs through ONNX Runtime. Sources:
| CLI name | Original checkpoint | ONNX source | Paper |
|---|---|---|---|
| dinov2-vit-l14 | facebook/dinov2-large | onnx-community/dinov2-large -- community export | Oquab et al. 2024 |
| ijepa-vit-h14 | facebook/ijepa_vith14_1k | onnx-community/ijepa_vith14_1k -- community export | Assran et al. 2023 |
| vjepa2-vitl-img16-256 | facebook/vjepa2-vitl-fpc64-256 | abdelstark/vjepa2-vitl-img16-256-onnx -- custom export | Bardes et al. 2024 |
| eupe-vit-b16 | facebook/EUPE-ViT-B | abdelstark/eupe-vit-b16-onnx -- custom export | Zhu et al. 2026 |
V-JEPA 2 export notes. The earlier public vjepa2-vitl-fpc2-256 artifact was not a broken ONNX export, but it was the wrong still-image adapter. Duplicating a photo to only 2 frames produces a materially different representation from the stable repeated-frame image manifold used by Meta's own image-evaluation path. The corrected export uses scripts/export_vjepa2_onnx.py; the reproducible procedure and source links are documented in docs/vjepa2-onnx-export.md. In short: repeat a single image to 16 frames, run the official encoder-only video trunk, reshape the output into 8 x 256 x 1024, and average over time back to 256 x 1024 patch tokens. Published artifact: abdelstark/vjepa2-vitl-img16-256-onnx. On 5 sample images the ONNX matches PyTorch with cosine > 0.999999, worst mean abs diff 1.37e-4, worst max abs diff 0.00643, and input-independence cosine 0.317. The canonical CLI name is vjepa2-vitl-img16-256; the older vjepa2-vitl-fpc2-256 name remains as a backward-compatible alias.
V-JEPA 2 correction. The corrected image path changes the interpretation materially:
- The old 2-frame surrogate understated V-JEPA 2's image coherence and made the PCA look artificially odd.
- On the elephant sample, corrected V-JEPA 2 lands at effective rank 51/1024, patch entropy 2.89, isotropy 0.417, and spatial coherence 0.809.
- Alignment rises sharply relative to the retired adapter: CKA vs DINOv2 0.495, CKA vs I-JEPA 0.381, k-NN overlap vs DINOv2 0.366, k-NN overlap vs I-JEPA 0.311.
- The corrected story is not "V-JEPA 2 is the weird outlier on still images." The surviving story is that it stays distinctly video-shaped while remaining much closer to the SSL image encoders than the 2-frame surrogate implied.
EUPE export notes. Use the reproducible script at scripts/export_eupe_onnx.py and procedure doc docs/eupe-onnx-export.md. The upstream Hugging Face release is a .pt checkpoint, so the export loads EUPE through the official facebookresearch/eupe torch.hub entrypoint, concatenates [x_norm_clstoken, x_norm_patchtokens] -> [1,197,768], exports with the legacy TorchScript ONNX path (dynamo=False), rewrites the bundle as model.onnx + model.onnx_data, and gates publication on cosine/diff parity plus an input-independence check (cos(zeros, random) < 0.85).
EUPE correction. The earlier public EUPE report was based on a broken ONNX export and should not be trusted. The corrected export still shows that EUPE is the most compressed of the four reference models, but the surviving story is narrower:
- EUPE is still more top-heavy than DINOv2, I-JEPA, and V-JEPA 2: effective rank 22/768, top-10 variance 87.0%, components@90% 13.
- EUPE is still less isotropic and more locally coherent than the SSL-only models: patch isotropy 0.375, spatial coherence 0.913.
- EUPE is weaker-aligned to the SSL-only cluster, but not remotely near-zero CKA: DINOv2 0.150, I-JEPA 0.115, V-JEPA 2 0.103.
- The invalid thesis was that EUPE was effectively off-manifold because of artifact-driven near-zero CKA and isotropy 0.026. The corrected thesis is that EUPE is a compact, top-heavy outlier with sharper local agreement.
- These numbers are geometry comparisons against DINOv2, I-JEPA, and V-JEPA 2. They are not the paper's ImageNet k-NN classification metric, and they do not use the paper's main peer set.
- The paper's actual training story is proxy distillation through a merged 1.9B teacher. The earlier repo wording incorrectly described direct multi-teacher distillation into the 86M student.
The refreshed single-image compare artifacts live in demo/reports/20260408-123006/report.html and demo/reports/20260408-123006/compare.json. demo/reports/eupe-vs-ssl-reference.html mirrors the same corrected bundle at a stable root path.
PyTorch parity for the published export lives in the accompanying export.validation.json artifact on Hugging Face; the checked-in fixture is now explicit ONNX regression evidence rather than a fake PyTorch proof.
For other Hugging Face models, use the ONNX Community Converter.
A real example. Same elephant photograph, two models, different training objectives.
```bash
latent-inspector compare docs/assets/img/samples/elephant_sample_image.jpg \
  --models dinov2-vit-l14,ijepa-vit-h14
```

```
Model Comparison
================================================================================
Metric               dinov2-vit-l14    ijepa-vit-h14
--------------------------------------------------------------------------------
Repr. rank           60/1024           44/1280
Dead dimensions      0                 0
Patch entropy        2.52              2.89
CLS L2 norm          46.3              N/A
Top-10 var%          66.8%             72.7%
Components@90%       31                22
Patch isotropy       0.712             0.834
Patch uniformity     -2.891            -3.247
================================================================================
```
Reading these numbers
Representation rank (60 vs 44). How many dimensions the model actually uses. DINOv2 spreads across 60 effective dimensions out of 1024. I-JEPA uses 44 out of 1280. Zero dead dimensions in both -- no wasted capacity, just different concentrations.
Patch entropy (2.52 vs 2.89). How differentiated the patch representations are. I-JEPA's prediction objective forces fine-grained spatial encoding, so each patch carries more unique information. DINOv2's self-distillation favors globally consistent features -- patches on the same object tend to look alike.
CLS L2 norm (46.3 vs N/A). DINOv2 has a CLS token (one vector summarizing the whole image). I-JEPA doesn't -- it was never designed with one. The tool reports N/A rather than silently dropping the metric.
Top-10 variance / Components@90%. I-JEPA packs 72.7% of variance into 10 components and needs only 22 for 90%. DINOv2 is more spread (66.8% / 31). I-JEPA's representation is lower-dimensional in practice despite having a wider embedding space. Worth thinking about if you're choosing a backbone for a downstream task with limited data.
Isotropy (0.712 vs 0.834). How directionally diverse the patch embeddings are (1 = perfectly isotropic, 0 = all patches point the same way). I-JEPA patches are more directionally diverse -- each patch represents something more distinct.
Uniformity (-2.891 vs -3.247). Wang & Isola (2020) metric for how evenly patches spread on the unit hypersphere. More negative = better spread. I-JEPA distributes patches more uniformly, consistent with its latent-prediction objective that naturally prevents representational collapse.
Linear CKA: 0.329 (representation geometry overlap)
k-NN overlap: 0.278 (fraction of shared nearest neighbors)
CKA of 0.329 means some structural overlap but substantially different organization. k-NN overlap of 27.8% means when DINOv2 considers two patches "similar," I-JEPA often disagrees. The trunk patches might cluster with body patches in one model but with boundary patches in the other.
These are genuinely different representations of the same image. Not just rotations of each other. Different training objectives, different geometry.
| Property | DINOv2 | I-JEPA | What it means |
|---|---|---|---|
| Effective rank | 60/1024 | 44/1280 | DINOv2 uses more dimensions |
| Variance concentration | 66.8% in top 10 | 72.7% in top 10 | I-JEPA is more concentrated |
| Patch entropy | 2.52 | 2.89 | I-JEPA differentiates patches more |
| Patch isotropy | 0.712 | 0.834 | I-JEPA spreads more uniformly |
| CLS token | Yes (46.3 norm) | No | Different architectures |
| CKA | -- | 0.329 | Different internal geometry |
```bash
latent-inspector compare <image> --models <model1>,<model2>[,...]
    [--format terminal|json|html|png] [--output <dir>] [--pca-components <n>]
```

Per-model metrics plus pairwise cross-model similarity. Handles mismatched architectures: dimension-agnostic metrics (CKA, k-NN) work when patch counts match; dimension-dependent and CLS-dependent metrics report N/A with an explanation.
```bash
latent-inspector inspect <image> --model <model>
    [--format terminal|json|html|png] [--output <dir>] [--pca-components <n>]
```

Full representation analysis: rank, entropy, variance spectrum, patch norm statistics, isotropy, uniformity, spatial coherence, attention concentration (when available), and PCA projection. PNG/HTML output includes a spatial coherence heatmap.
```bash
latent-inspector neighbors <image> --model <model> --dataset <dir>
    [--k <n>] [--format terminal|json|html|png] [--output <dir>]
```

Find the k most similar images according to the model. Shows what a model considers "similar." Falls back to mean-patch embeddings when no CLS token is available.
```bash
latent-inspector similarity --model-a <model> --model-b <model> --dataset <dir>
    [--format terminal|json|html|png] [--output <dir>]
```

Dataset-level CKA, k-NN overlap, and (when both models expose CLS) mean CLS cosine similarity. Parallel inference across the dataset.
```bash
latent-inspector profile --model <model> --dataset <dir>
    [--format terminal|json|html|png] [--output <dir>]
```

Dataset-level representation fingerprint: isotropy (cosine + partition function), uniformity (Wang & Isola 2020), intrinsic dimensionality (Levina & Bickel 2004 MLE), plus per-image metric aggregates.
```bash
latent-inspector drift --model <model> --checkpoints <dir> --dataset <dir>
    [--format terminal|json|html|png] [--output <dir>]
```

Loads .onnx checkpoints from different training stages and computes CKA between consecutive checkpoints. Shows when representations materially shift during training. Checkpoints are ordered with natural numeric ordering (step-2.onnx before step-10.onnx).
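Natural numeric ordering is easy to get wrong with a plain string sort, which would put step-10.onnx before step-2.onnx. A minimal Python sketch of the idea follows; `natural_key` is a hypothetical helper name for illustration, not the tool's Rust implementation.

```python
import re

def natural_key(name):
    """Split a file name into text and integer chunks so digit runs compare as numbers."""
    # The capture group keeps the digit runs: "step-10.onnx" -> ["step-", "10", ".onnx"]
    parts = re.split(r"(\d+)", name)
    return [int(p) if p.isdecimal() else p for p in parts]

checkpoints = ["step-10.onnx", "step-2.onnx", "step-1.onnx"]
print(sorted(checkpoints, key=natural_key))
# ['step-1.onnx', 'step-2.onnx', 'step-10.onnx']
```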
```bash
latent-inspector embed <image-or-dir> --model <model>
    [--level global|patches|full] [--output <file.jsonl>]
```

Export model embeddings for downstream use (Python, JS, etc). Outputs one JSON object per line (JSONL). Three levels: global (CLS/mean-patch vector), patches (full patch matrix), full (both). Writes to stdout by default; use --output for file output. Handles single images and directories (recursive scan).
```bash
latent-inspector models [--verbose] [--download <model>]
    [--format terminal|json|html] [--output <dir>]
```

Model registry with status, readiness, cache state, evidence status, artifact inventory. Use --download <model> to pre-cache.
```bash
latent-inspector validate --model <model>
    [--format terminal|json|html] [--output <dir>] [--refresh-goldens]
```

Validates integration against checked-in contract and reference artifacts. Use --refresh-goldens after a verified ONNX update.
```bash
latent-inspector tui [<image>] [-m <model1>,<model2>,...]
```

Interactive views: dashboard, inspector, compare, spectrum, file browser, help. Arrow keys to navigate, number keys to switch views.
Every analysis command supports four output formats:
| Format | Flag | Output | Use case |
|---|---|---|---|
| Terminal | --format terminal (default) | Rich Unicode, ASCII fallback | Interactive use |
| JSON | --format json | Structured metrics | Scripting, pipelines |
| HTML | --format html | Self-contained report bundle | Sharing |
| PNG | --format png | PCA projections, heatmaps, charts | Papers, slides |
With --output <dir>, all formats also emit artifacts.json -- a manifest of generated files with byte sizes and SHA-256 digests. HTML bundles include companion JSON. Stable file names and JSON keys are documented in docs/REPORT-SCHEMA.md.
Force ASCII output: LATENT_INSPECTOR_FORCE_ASCII=1.
| Metric | What it measures | Range | Intuition |
|---|---|---|---|
| Effective rank | Significant singular values | 1 to embed_dim | Higher = uses more capacity |
| Dead dimensions | Zero-valued embedding dims | 0 to embed_dim | Should be 0 |
| Patch entropy | Diversity of patch features (k-means) | 0 to log2(k) | Higher = more differentiated |
| Attention Gini | Attention weight concentration | 0 to 1 | Higher = more focused |
| CLS L2 norm | Global image vector magnitude | 0+ | Cross-image comparison |
| Patch norm mean/std | Patch vector magnitude distribution | 0+ | Low std = uniform activation |
| Top-10 variance % | Info in first 10 PCA components | 0-100% | Higher = more concentrated |
| Components@90% | PCA components for 90% variance | 1 to embed_dim | Lower = more compressible |
| Linear CKA | Representation geometry similarity | 0 to 1 | 1 = identical geometry |
| k-NN overlap | Neighborhood agreement | 0 to 1 | 1 = same neighbors |
| Patch correspondence | Hungarian-matched patch similarity | 0 to 1 | Optimal alignment quality |
| Isotropy (cosine) | Embedding directional spread | 0 to 1 | Higher = more uniform |
| Isotropy (partition) | Singular value uniformity | 0 to 1 | Higher = less top-heavy |
| Uniformity | Hypersphere spread (Wang & Isola 2020) | -inf to 0 | More negative = better |
| Intrinsic dim | Manifold dimension (Levina & Bickel 2004) | 1+ | Lower than ambient = compressed |
| Spatial coherence | Similarity of adjacent patches on grid | -1 to 1 | Higher = smoother/segmented |
| RankMe | Smooth effective rank (Garrido et al. 2023) | 1 to k | Higher = richer representation |
| Spectral decay (β) | Power-law eigenvalue decay exponent | 0+ | Lower = more uniform spread |
The full pipeline: what happens from image input to cross-model comparison. Read this if you want to understand what the metrics actually measure and why they differ between models.
Every vision transformer takes an image and produces patch embeddings: one high-dimensional vector per spatial region.
```
Image (e.g. 224x224 RGB)
  |
  +- Resize short edge to model's input size, center-crop to square
  |    (src/models/preprocess.rs -- standard ViT pipeline)
  |
  +- Normalize: (pixel / 255 - mean) / std per channel
  |    (model-specific mean/std from registry)
  |
  +- ONNX Runtime inference
  |    (src/models/loader.rs -> ort crate -> C++ ONNX Runtime backend)
  |
  +- Output: [1, seq_len, embed_dim] tensor
       |
       +- CLS token (index 0) if present -> global image representation
       +- Patch tokens (the rest)        -> per-region representations
```
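The resize, crop, and normalize steps above can be sketched in plain Python. This is an illustration only: the ImageNet mean/std values and the rounding rule are assumptions (the tool reads model-specific statistics from its registry), and both function names are hypothetical.

```python
# Assumed ImageNet statistics; the real per-model values live in the registry.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def resize_then_crop(width, height, size):
    """Scale so the short edge equals `size`, then return the centered square crop box."""
    scale = size / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left, top = (new_w - size) // 2, (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)

def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (pixel / 255 - mean) / std to each channel of one 8-bit RGB pixel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))

resized, box = resize_then_crop(640, 480, 224)
print(resized, box)
print(normalize_pixel((255, 255, 255)))
```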
The patch tokens are the representation. Each is a point in a high-dimensional space (1024-dim for DINOv2, 1280-dim for I-JEPA). The geometry of these points -- how they cluster, how they spread, how they relate to each other -- is what defines the model's internal model of the image.
Consider the elephant image:
DINOv2 (self-distillation). A student network matches a slowly-evolving teacher across augmented views. This creates consistency pressure: patches in similar semantic regions get pushed toward similar representations. Elephant body patches cluster together. Background patches cluster together. The result looks like unsupervised segmentation -- no labels needed.
I-JEPA (latent prediction). Given visible patches, predict the representation of masked patches. Unlike MAE (which predicts pixels), I-JEPA predicts in representation space, so it must learn abstract structure. Each patch must encode enough context about its neighborhood to predict what's missing. This is why patch entropy is higher (2.89 vs 2.52) -- each patch carries more unique information.
V-JEPA 2 (video prediction). Predict future frame representations from past frames. Even on a static image, the encoder carries a prior about how the visual world moves. It sees the elephant as something that could walk away, not just a static arrangement of pixels.
Two models, two different embedding spaces. DINOv2 lives in R^1024, I-JEPA in R^1280. You can't subtract them. Instead, compare structural properties.
CKA (Centered Kernel Alignment) -- src/analysis/cka.rs
Build a kernel matrix for each model: K[i,j] = dot(patch_i, patch_j). This captures pairwise similarity structure -- which patches are similar to which, regardless of coordinate system. Center both matrices, measure alignment via HSIC:
CKA(X, Y) = HSIC(K_X, K_Y) / sqrt(HSIC(K_X, K_X) * HSIC(K_Y, K_Y))
Invariant to orthogonal transforms and isotropic scaling. Compares geometric structure, not coordinates.
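To make the formula concrete, here is a small pure-Python sketch of linear CKA over two patch matrices with the same number of patches but different embedding widths. It is not the code in src/analysis/cka.rs; the function names are illustrative, and the HSIC normalization constant is dropped because it cancels in the ratio.

```python
import math

def gram(X):
    """K[i][j] = dot(patch_i, patch_j) for a list of patch vectors."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def center(K):
    """Double-center a square kernel matrix (subtract row and column means)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def hsic(Kc, Lc):
    """Unnormalized HSIC: elementwise product of two centered kernels, summed."""
    return sum(a * b for ra, rb in zip(Kc, Lc) for a, b in zip(ra, rb))

def linear_cka(X, Y):
    Kc, Lc = center(gram(X)), center(gram(Y))
    return hsic(Kc, Lc) / math.sqrt(hsic(Kc, Kc) * hsic(Lc, Lc))

X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0]]
# Y is X rotated by 90 degrees, scaled by 3, and padded into 3 dimensions.
Y = [[-3.0 * y, 3.0 * x, 0.0] for x, y in X]
print(linear_cka(X, Y))  # very close to 1: same geometry, different coordinates
```

Because only the patch-by-patch kernels are compared, the embedding widths never have to match; the patch counts do.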
k-NN overlap -- src/analysis/knn.rs
For each patch, find its 10 nearest neighbors in model A and model B. Count how many overlap. If DINOv2 thinks patches 3, 7, 12 are similar (all on the trunk), does I-JEPA agree? Overlap of 0.278 = 27.8% agreement. Substantial disagreement about what "similar" means.
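A rough Python sketch of the overlap computation, assuming cosine similarity for the neighbor search and ties broken by patch index. It is a sketch under those assumptions, not a copy of src/analysis/knn.rs.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_neighbors(X, k):
    """For each patch, the index set of its k most cosine-similar other patches."""
    result = []
    for i, xi in enumerate(X):
        sims = [(cosine(xi, xj), j) for j, xj in enumerate(X) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first
        result.append({j for _, j in sims[:k]})
    return result

def knn_overlap(X, Y, k=10):
    """Mean fraction of neighbors shared between the two models, patch by patch."""
    na, nb = top_k_neighbors(X, k), top_k_neighbors(Y, k)
    return sum(len(a & b) for a, b in zip(na, nb)) / (k * len(X))

# Four patches; the two toy "models" pair them up differently.
A = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
B = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(knn_overlap(A, A, k=1), knn_overlap(A, B, k=1))
```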
Patch correspondence -- src/analysis/correspondence.rs
When dimensions match (e.g., DINOv2 and V-JEPA 2, both 1024-dim), compute cosine similarity between every patch pair and find optimal assignment via the Hungarian algorithm. Tells you whether there's a clean mapping between the two representations, or whether they organized the space incompatibly.
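The sketch below illustrates the assignment problem itself, not the Hungarian algorithm: it brute-forces every permutation, which is only viable for a handful of patches, and it reports the mean cosine of matched pairs as the score (an assumption about how the metric is summarized). The tool uses the Hungarian algorithm to reach the same optimum in polynomial time.

```python
import itertools
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_assignment(A, B):
    """Exhaustive one-to-one matching of patches in A to patches in B (tiny inputs only)."""
    n = len(A)
    sim = [[cosine(a, b) for b in B] for a in A]
    best_perm, best_total = None, -math.inf
    for perm in itertools.permutations(range(n)):
        total = sum(sim[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total / n

A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [A[2], A[0], A[1]]  # the same patches in a shuffled order
perm, score = best_assignment(A, B)
print(perm, score)
```

A score near 1 means a clean patch-to-patch mapping exists; a low score means no relabeling of patches makes the two representations line up.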
Effective rank -- src/analysis/rank.rs
SVD on the patch matrix. Threshold singular values at 1% of max, count survivors. Rank 60/1024 means 60 effective directions; the other 964 carry negligible information. Not waste -- just concentration.
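The thresholding step is simple enough to show directly. This sketch assumes the singular values have already been computed elsewhere, and that a value exactly at the 1% threshold counts as a survivor (the tool may use a strict comparison).

```python
def effective_rank(singular_values, rel_threshold=0.01):
    """Count singular values at or above rel_threshold times the largest one."""
    cutoff = rel_threshold * max(singular_values)
    return sum(1 for s in singular_values if s >= cutoff)

# Toy spectrum: the cutoff is 0.1, so 10.0, 5.0 and 0.2 survive.
print(effective_rank([10.0, 5.0, 0.2, 0.05, 0.0]))  # 3
```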
PCA variance spectrum -- src/analysis/variance.rs, src/analysis/pca.rs
Power method PCA (no LAPACK dependency) on centered patch matrix. Eigenvalue ratios show how information distributes. Steep scree plot = compressible representation. Flat plot = information spread uniformly. Both can be useful depending on your downstream task.
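For readers unfamiliar with the power method, here is a minimal Python sketch: repeated multiplication by the covariance matrix converges to the top eigenvector, and deflation removes it so the next one can be found. The fixed iteration count and the deterministic start vector are simplifications chosen for this example, not the tool's settings.

```python
import math

def covariance(X):
    """Sample covariance of the rows of X (patches x dims), after centering."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in X]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def matvec(C, v):
    return [sum(cij * vj for cij, vj in zip(row, v)) for row in C]

def power_iteration(C, iters=200):
    """Dominant eigenpair of a symmetric PSD matrix by repeated multiplication."""
    v = [1.0 / (j + 1) for j in range(len(C))]  # a random start is the usual choice
    for _ in range(iters):
        w = matvec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return sum(vi * wi for vi, wi in zip(v, matvec(C, v))), v

def explained_variance_ratio(X, k):
    C = covariance(X)
    d = len(C)
    total = sum(C[j][j] for j in range(d))  # trace = total variance
    ratios = []
    for _ in range(k):
        lam, v = power_iteration(C)
        ratios.append(lam / total)
        # Deflation: subtract the component just found, then repeat.
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return ratios

patches = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(explained_variance_ratio(patches, 2))  # roughly [0.8, 0.2]
```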
Isotropy and uniformity -- src/analysis/isotropy.rs
Two views of the same question: is the representation using its space well?
- Isotropy (1 - mean pairwise cosine): are patches directionally diverse, or all clustered in a narrow cone?
- Uniformity (Wang & Isola 2020): log of average pairwise Gaussian kernel on the unit hypersphere. More negative = better coverage. Collapse to few modes pushes uniformity toward 0.
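Both quantities reduce to averages over patch pairs. The sketch below follows the definitions in the list above; the temperature t = 2 is the default from Wang & Isola and is an assumption about the tool's setting, and the function names are illustrative.

```python
import math
from itertools import combinations

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_isotropy(patches):
    """1 - mean pairwise cosine: 0 when every patch points the same way."""
    U = [unit(p) for p in patches]
    cosines = [sum(a * b for a, b in zip(u, v)) for u, v in combinations(U, 2)]
    return 1.0 - sum(cosines) / len(cosines)

def uniformity(patches, t=2.0):
    """Log of the mean Gaussian kernel over pairs of unit-normalized patches."""
    U = [unit(p) for p in patches]
    kernel = [math.exp(-t * sum((a - b) ** 2 for a, b in zip(u, v)))
              for u, v in combinations(U, 2)]
    return math.log(sum(kernel) / len(kernel))

collapsed = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]  # same direction, different norms
spread = [[1.0, 0.0], [0.0, 1.0]]                 # orthogonal directions
print(cosine_isotropy(collapsed), uniformity(collapsed))
print(cosine_isotropy(spread), uniformity(spread))
```

The collapsed set scores 0 on both metrics, while the orthogonal pair reaches isotropy 1 and a clearly negative uniformity, matching the "more negative = better coverage" reading.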
Patch entropy -- src/analysis/entropy.rs
k-means on patch tokens, then Shannon entropy of cluster assignments. High entropy = patches spread across many clusters. Low entropy = most patches land in the same cluster. Direct measure of how discriminative the representation is at the patch level.
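The entropy half of this metric is shown below, assuming the k-means step has already produced one cluster label per patch. Base-2 logarithms are used to match the 0 to log2(k) range in the metric table; `assignment_entropy` is a hypothetical name.

```python
import math
from collections import Counter

def assignment_entropy(labels):
    """Shannon entropy (bits) of the cluster-label distribution over patches."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

even = [0, 1, 2, 3] * 4  # 16 patches spread evenly over 4 clusters
lumped = [0] * 16        # every patch in the same cluster
print(assignment_entropy(even), assignment_entropy(lumped))
```

With four clusters the even split reaches the maximum of log2(4) = 2 bits, and the single-cluster case gives 0.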
RankMe (smooth effective rank) -- src/analysis/rankme.rs
Garrido et al. (ICML 2023). Computes exp(H(p)) where p is the normalized singular value distribution and H is Shannon entropy. Unlike threshold-based effective rank, RankMe is smooth and differentiable -- a value of 1 means total collapse (all variance in one dimension), while a value near k (the number of singular values) means variance is uniformly spread. Better at detecting subtle representation degradation.
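A short sketch of the exp(H(p)) computation, taking precomputed singular values as input. The small epsilon guards log(0) and follows the paper's formulation; its exact value here is an assumption.

```python
import math

def rankme(singular_values, eps=1e-12):
    """exp of the Shannon entropy of the normalized singular value distribution."""
    total = sum(singular_values)
    p = [s / total + eps for s in singular_values]
    entropy = -sum(pi * math.log(pi) for pi in p)
    return math.exp(entropy)

print(rankme([1.0, 1.0, 1.0, 1.0]))  # close to 4: variance spread evenly
print(rankme([1.0, 0.0, 0.0, 0.0]))  # close to 1: total collapse
```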
Spectral decay (β) -- src/analysis/spectral_decay.rs
Fits a power law λ_i ~ i^(-β) to the eigenvalue spectrum via least-squares in log-log space. Low β (< 1) means slow decay -- the representation distributes information broadly across dimensions. High β (> 2) means rapid decay -- a few dimensions dominate. Complements RankMe: RankMe tells you how many effective dimensions, spectral decay tells you how sharply the spectrum drops off.
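The fit is ordinary least squares on (log i, log λ_i), with β the negated slope. A minimal sketch, assuming eigenvalues arrive sorted in descending order and that non-positive values are simply skipped:

```python
import math

def spectral_decay(eigenvalues):
    """Estimate beta in lambda_i ~ i^(-beta) by a least-squares line in log-log space."""
    pts = [(math.log(i), math.log(lam))
           for i, lam in enumerate(eigenvalues, start=1) if lam > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx  # the slope is -beta

spectrum = [i ** -2.0 for i in range(1, 51)]  # synthetic spectrum with beta = 2
print(spectral_decay(spectrum))
```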
V-JEPA 2 is still a video model, but the current adapter no longer uses the retired 2-frame shortcut. For single-image analysis, the wrapper repeats the image to 16 frames, runs the official video encoder, reshapes the token sequence into 8 temporal groups by 256 spatial patches, then averages over time. That keeps the representation on the same repeated-frame manifold Meta uses for image evaluation while still producing a plain [1, 256, 1024] patch tensor for cross-model comparison.
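The final averaging step can be illustrated on a toy token sequence. The time-major ordering (all spatial patches of temporal group 0 first, then group 1, and so on) is an assumption made for this sketch; the authoritative procedure is the export script and docs/vjepa2-onnx-export.md.

```python
def average_over_time(tokens, groups, patches):
    """Collapse a [groups * patches, dim] token list into [patches, dim] by temporal mean."""
    assert len(tokens) == groups * patches
    dim = len(tokens[0])
    out = []
    for p in range(patches):
        acc = [0.0] * dim
        for t in range(groups):
            tok = tokens[t * patches + p]  # same spatial patch, temporal group t
            for j in range(dim):
                acc[j] += tok[j]
        out.append([a / groups for a in acc])
    return out

# Toy sizes: 2 temporal groups x 3 patches x 2 dims (the real export is 8 x 256 x 1024).
toy = [[float(10 * t + p), 1.0] for t in range(2) for p in range(3)]
print(average_over_time(toy, groups=2, patches=3))
```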
Every report embeds a validation summary (src/validation/). Before trusting metrics, the tool checks:
- Preprocessing contract: registered resize/crop/normalize matches checked-in golden artifact
- Tensor semantics: ONNX graph exposes expected input/output names and shapes
- Reference parity: output matches previously approved references within tolerance
Status levels: validated (passed all checks against ONNX Runtime), stale (reference artifacts from a different backend), unverified (no reference artifacts yet).
```
src/
  models/
    registry.rs        Model metadata: architecture, normalization, tensor contracts
    loader.rs          ONNX session, inference (image + video paths), stub backend
    preprocess.rs      Resize + center-crop + normalize -> [1, 3, H, W] tensor
    cache.rs           Download, SHA-256 verify, partial-resume, cache state
  extract/
    features.rs        ModelOutput -> CLS token + patch tokens + attention maps
  analysis/
    pca.rs             Power method PCA (no LAPACK)
    coherence.rs       Spatial coherence (adjacent patch similarity on grid)
    cka.rs             Linear CKA + CLS cosine similarity
    knn.rs             Cosine similarity matrix, top-k neighbors, overlap
    rank.rs            Effective rank via singular value thresholding
    variance.rs        PCA variance spectrum (scree plot data)
    entropy.rs         k-means + Shannon entropy, patch norm statistics
    isotropy.rs        Cosine isotropy, partition function isotropy, uniformity
    attention.rs       Gini coefficient on attention weights
    correspondence.rs  Hungarian-matched patch correspondence
    rankme.rs          Smooth effective rank via Shannon entropy (Garrido 2023)
    spectral_decay.rs  Power-law eigenvalue decay exponent
  viz/
    terminal.rs        Rich Unicode terminal output (ASCII fallback)
    json.rs            Structured JSON
    html.rs            Self-contained HTML report bundles
    png.rs             PCA RGB projections, heatmaps, variance charts
  validation/
    evidence.rs        Freshness checks against golden fixtures
    parity.rs          Output-level comparison against reference artifacts
```
```bash
cargo build --release

# Run without downloading models
LATENT_INSPECTOR_MODEL_BACKEND=stub cargo run -- models
LATENT_INSPECTOR_MODEL_BACKEND=stub cargo run -- compare docs/assets/img/samples/elephant_sample_image.jpg \
  --models dinov2-vit-l14,ijepa-vit-h14

cargo test
cargo fmt -- --check
cargo clippy --all-targets -- -D warnings

# Coverage (excludes TUI surface)
cargo llvm-cov --workspace \
  --ignore-filename-regex '(^|/)src/tui/|(^|/)src/cli/tui.rs$' \
  --fail-under-lines 85 \
  --fail-under-functions 80 \
  --summary-only

# Full CI
make all
```

The stub backend produces deterministic synthetic outputs for development and testing. Validation summaries downgrade stub-backed results to unverified. The TUI shows demo data without an image; with an image it runs the same live pipeline as the CLI.
MIT OR Apache-2.0