
AbdelStark/latent-inspector

latent-inspector

Inspect and compare self-supervised vision model representations for DINOv2, I-JEPA, V-JEPA 2, EUPE.

Crates.io License Rust 1.75+ Pages

How AI Models See the World
Presentation deck for latent-inspector: project narrative, model comparison setup, corrected EUPE and V-JEPA 2 interpretations, and the representation-geometry thesis.
Four-Model Elephant Compare Report
Self-contained sample report with per-model metrics, pairwise CKA and k-NN overlap, PCA projections, and exported artifact metadata.
Dashboard -- model registry, image preview, architecture comparison
Inspector -- representation health gauges and PCA variance spectrum
Compare -- cross-model metrics, CLS similarity matrix, CKA and k-NN overlap
Spectrum -- full PCA scree plot with 90%/99% thresholds

PCA projections across models

Each pixel block below is a patch token projected onto the top three principal components (mapped to RGB). Contiguous color = the model groups those patches into a similar representation neighborhood. Same image, same patches, four different answers about what matters.

DINOv2 -- large uniform regions. Self-distillation pushes semantically related patches together, so the model produces something close to unsupervised segmentation. Elephant = one color, background = another.
I-JEPA -- finer local variation. The latent-prediction objective forces each patch to encode its own context rather than collapsing into broad semantic zones.
V-JEPA 2 -- structured regions with strong local continuity once the image path is adapted correctly. Following Meta's repeated-frame image evaluation path, the encoder lands much closer to DINOv2 than the retired 2-frame surrogate suggested.
EUPE -- compact, sharper grouping. Proxy-distilled training produces a more top-heavy representation with stronger local agreement than the SSL-only references; the earlier broken export exaggerated that effect.
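A minimal sketch of how such a PCA-to-RGB image can be produced, assuming patch tokens arrive as `Vec<Vec<f64>>` in grid order. `pca_rgb` is a hypothetical helper, not the crate's API: it centers the patches, extracts the top three principal components by power iteration with deflation, and min-max scales each component into one 0-255 color channel.

```rust
/// Hypothetical helper: color each patch by its scores on the top three principal
/// components of the centered patch matrix (component 1 -> R, 2 -> G, 3 -> B).
fn pca_rgb(patches: &[Vec<f64>]) -> Vec<[u8; 3]> {
    let (n, d) = (patches.len(), patches[0].len());
    // Center every embedding dimension.
    let mut mean = vec![0.0; d];
    for p in patches {
        for (m, v) in mean.iter_mut().zip(p) {
            *m += v / n as f64;
        }
    }
    let centered: Vec<Vec<f64>> = patches
        .iter()
        .map(|p| p.iter().zip(&mean).map(|(v, m)| v - m).collect())
        .collect();
    // Covariance matrix (d x d) of the centered patches.
    let mut cov = vec![vec![0.0; d]; d];
    for x in &centered {
        for i in 0..d {
            for j in 0..d {
                cov[i][j] += x[i] * x[j] / n as f64;
            }
        }
    }
    let mut scores = vec![[0.0f64; 3]; n];
    for c in 0..3usize.min(d) {
        // Power iteration from a fixed, normalized start vector.
        let norm0 = (0..d).map(|i| (1.0 + 0.01 * i as f64).powi(2)).sum::<f64>().sqrt();
        let mut v: Vec<f64> = (0..d).map(|i| (1.0 + 0.01 * i as f64) / norm0).collect();
        let mut lambda = 0.0;
        for _ in 0..500 {
            let w: Vec<f64> = cov
                .iter()
                .map(|row| row.iter().zip(&v).map(|(a, b)| a * b).sum::<f64>())
                .collect();
            let norm = w.iter().map(|x| x * x).sum::<f64>().sqrt();
            if norm < 1e-12 {
                lambda = 0.0;
                break;
            }
            lambda = norm;
            v = w.iter().map(|x| x / norm).collect();
        }
        if lambda == 0.0 {
            continue; // no variance left: this channel stays at zero
        }
        // Deflate so the next pass converges to the next component.
        for i in 0..d {
            for j in 0..d {
                cov[i][j] -= lambda * v[i] * v[j];
            }
        }
        for (s, x) in scores.iter_mut().zip(&centered) {
            s[c] = x.iter().zip(&v).map(|(a, b)| a * b).sum::<f64>();
        }
    }
    // Min-max scale each component independently into one color channel.
    let mut rgb = vec![[0u8; 3]; n];
    for c in 0..3 {
        let lo = scores.iter().map(|s| s[c]).fold(f64::INFINITY, f64::min);
        let hi = scores.iter().map(|s| s[c]).fold(f64::NEG_INFINITY, f64::max);
        let span = (hi - lo).max(1e-12);
        for (px, s) in rgb.iter_mut().zip(&scores) {
            px[c] = ((s[c] - lo) / span * 255.0).round() as u8;
        }
    }
    rgb
}
```

Per-channel min-max scaling means colors are only comparable within one image; the sign of each component is arbitrary, so two runs may swap a color and its complement.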

Quick start

git clone https://github.com/AbdelStark/latent-inspector.git
cd latent-inspector
cargo build --release

# List models and cache state
./target/release/latent-inspector models

# Compare three models on a single image (models auto-download on first use)
./target/release/latent-inspector compare photo.jpg \
  --models dinov2-vit-l14,ijepa-vit-h14,vjepa2-vitl-img16-256

# Deep-dive into one model
./target/release/latent-inspector inspect photo.jpg --model dinov2-vit-l14

# Interactive TUI (real analysis when an image is provided)
./target/release/latent-inspector tui photo.jpg -m dinov2-vit-l14,ijepa-vit-h14

# Profile a model over a dataset (isotropy, uniformity, intrinsic dimensionality)
./target/release/latent-inspector profile --model dinov2-vit-l14 --dataset images/

# Stub backend for development (no model downloads, validation downgraded to unverified)
LATENT_INSPECTOR_MODEL_BACKEND=stub \
  ./target/release/latent-inspector compare photo.jpg \
  --models dinov2-vit-l14,clip-vit-l14

Why this exists

Different SSL objectives produce different internal representations of the same image. You can read the papers and get intuitions about what should differ. But intuitions are wrong often enough that you should measure.

  • DINOv2 -- self-distillation across augmented views. Patches in similar semantic regions get pushed together. The result looks like unsupervised segmentation.
  • I-JEPA -- predict masked patches in latent space (not pixel space). Each patch must encode enough context to predict its neighbors abstractly. On the elephant sample it shows higher patch entropy than DINOv2 (2.89 vs 2.52), consistent with what the objective demands.
  • V-JEPA 2 -- JEPA on video. Learns spatiotemporal structure from internet-scale video. Even on a still image, the encoder carries a prior about how the world moves.
  • EUPE -- proxy distillation from a large universal teacher into a compact encoder. The representation is a learned compromise across perception tasks, not a direct small-student multi-teacher baseline.
  • MAE -- reconstruct masked pixels. Must encode enough detail to literally redraw what was hidden.
  • CLIP -- align images with text. The representation is shaped by language, not just visual similarity.

latent-inspector makes these differences concrete. Numbers, not vibes.

Supported models

| Model | Architecture | Params | Method | Status |
|---|---|---|---|---|
| DINOv2 | ViT-L/14 | 304M | Self-distillation + centering | Ready |
| I-JEPA | ViT-H/14 | 632M | Joint embedding predictive | Ready |
| V-JEPA 2 | ViT-L/16 | 304M | Video joint embedding predictive | Ready |
| EUPE | ViT-B/16 | 86M | Proxy distillation | Ready |
| DINOv3 | ViT-L/16 | 304M | Self-distillation + Gram anchoring | Planned |
| MAE | ViT-L/16 | 304M | Masked autoencoder | Planned |
| CLIP | ViT-L/14 | 304M | Contrastive image-text | Planned |
| SigLIP | ViT-SO400M/14 | 400M | Sigmoid contrastive image-text | Planned |

Models download on first use (~1-2 GB each) and are SHA-256 verified. Downloads retry on transient HTTP failures. Override cache location with LATENT_INSPECTOR_CACHE_DIR.

Model provenance and ONNX artifacts

Everything runs through ONNX Runtime. Sources:

| CLI name | Original checkpoint | ONNX source | Paper |
|---|---|---|---|
| dinov2-vit-l14 | facebook/dinov2-large | onnx-community/dinov2-large (community export) | Oquab et al. 2024 |
| ijepa-vit-h14 | facebook/ijepa_vith14_1k | onnx-community/ijepa_vith14_1k (community export) | Assran et al. 2023 |
| vjepa2-vitl-img16-256 | facebook/vjepa2-vitl-fpc64-256 | abdelstark/vjepa2-vitl-img16-256-onnx (custom export) | Assran et al. 2025 |
| eupe-vit-b16 | facebook/EUPE-ViT-B | abdelstark/eupe-vit-b16-onnx (custom export) | Zhu et al. 2026 |

V-JEPA 2 export notes. The earlier public vjepa2-vitl-fpc2-256 artifact was not a broken ONNX export, but it used the wrong still-image adapter: duplicating a photo into only 2 frames produces a materially different representation from the stable repeated-frame image manifold used by Meta's own image-evaluation path.

The corrected export uses scripts/export_vjepa2_onnx.py; the reproducible procedure and source links are documented in docs/vjepa2-onnx-export.md. In short: repeat a single image to 16 frames, run the official encoder-only video trunk, reshape the output into 8 x 256 x 1024, and average over time back to 256 x 1024 patch tokens. Published artifact: abdelstark/vjepa2-vitl-img16-256-onnx.

On 5 sample images the ONNX export matches PyTorch with cosine > 0.999999, worst mean abs diff 1.37e-4, and worst max abs diff 0.00643; the input-independence cosine is 0.317 (lower means the output depends more strongly on the input). The canonical CLI name is vjepa2-vitl-img16-256; the older vjepa2-vitl-fpc2-256 name remains as a backward-compatible alias.

V-JEPA 2 correction. The corrected image path changes the interpretation materially:

  • The old 2-frame surrogate understated V-JEPA 2's image coherence and made the PCA look artificially odd.
  • On the elephant sample, corrected V-JEPA 2 lands at effective rank 51/1024, patch entropy 2.89, isotropy 0.417, and spatial coherence 0.809.
  • Alignment rises sharply relative to the retired adapter: CKA vs DINOv2 0.495, CKA vs I-JEPA 0.381, k-NN overlap vs DINOv2 0.366, k-NN overlap vs I-JEPA 0.311.
  • The corrected story is not "V-JEPA 2 is the weird outlier on still images." The surviving story is that it stays distinctly video-shaped while remaining much closer to the SSL image encoders than the 2-frame surrogate implied.

EUPE export notes. Use the reproducible script at scripts/export_eupe_onnx.py and procedure doc docs/eupe-onnx-export.md. The upstream Hugging Face release is a .pt checkpoint, so the export loads EUPE through the official facebookresearch/eupe torch.hub entrypoint, concatenates [x_norm_clstoken, x_norm_patchtokens] -> [1,197,768], exports with the legacy TorchScript ONNX path (dynamo=False), rewrites the bundle as model.onnx + model.onnx_data, and gates publication on cosine/diff parity plus an input-independence check (cos(zeros, random) < 0.85).
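The publication gate described above boils down to a few vector comparisons. This is an illustrative sketch, not the logic in scripts/export_eupe_onnx.py: `Parity`, `parity`, and `passes_gate` are hypothetical names, outputs are assumed to be flattened `f64` slices of equal length, and only the 0.85 input-independence bound comes from the text; the cosine and max-abs-diff thresholds are left to the caller.

```rust
/// Summary of one ONNX-vs-reference comparison on the same input.
#[derive(Debug)]
struct Parity {
    cosine: f64,
    mean_abs_diff: f64,
    max_abs_diff: f64,
}

/// Cosine similarity between two flattened output tensors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// Compare an ONNX output element-wise against a reference (e.g. PyTorch) output.
fn parity(onnx: &[f64], reference: &[f64]) -> Parity {
    assert_eq!(onnx.len(), reference.len(), "outputs must have the same shape");
    let diffs: Vec<f64> = onnx.iter().zip(reference).map(|(a, b)| (a - b).abs()).collect();
    Parity {
        cosine: cosine(onnx, reference),
        mean_abs_diff: diffs.iter().sum::<f64>() / diffs.len() as f64,
        max_abs_diff: diffs.iter().cloned().fold(0.0, f64::max),
    }
}

/// Publish only if the export matches the reference closely AND actually reacts to its
/// input: outputs for an all-zeros input and a random input must not be near-identical.
fn passes_gate(p: &Parity, cos_zeros_vs_random: f64, min_cosine: f64, max_abs: f64) -> bool {
    p.cosine > min_cosine && p.max_abs_diff <= max_abs && cos_zeros_vs_random < 0.85
}
```

The input-independence term catches the failure mode behind the earlier broken export: a graph whose output barely depends on its input can still produce plausible-looking tensors.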

EUPE correction. The earlier public EUPE report was based on a broken ONNX export and should not be trusted. The corrected export still shows that EUPE is the most compressed of the four reference models, but the surviving story is narrower:

  • EUPE is still more top-heavy than DINOv2, I-JEPA, and V-JEPA 2: effective rank 22/768, top-10 variance 87.0%, components@90% 13.
  • EUPE is still less isotropic and more locally coherent than the SSL-only models: patch isotropy 0.375, spatial coherence 0.913.
  • EUPE aligns more weakly with the SSL-only cluster, but its CKA is nowhere near zero: DINOv2 0.150, I-JEPA 0.115, V-JEPA 2 0.103.
  • The invalid thesis was that EUPE was effectively off-manifold because of artifact-driven near-zero CKA and isotropy 0.026. The corrected thesis is that EUPE is a compact, top-heavy outlier with sharper local agreement.
  • These numbers are geometry comparisons against DINOv2, I-JEPA, and V-JEPA 2. They are not the paper's ImageNet k-NN classification metric, and they do not use the paper's main peer set.
  • The paper's actual training story is proxy distillation through a merged 1.9B teacher. The earlier repo wording incorrectly described direct multi-teacher distillation into the 86M student.

The refreshed single-image compare artifacts live in demo/reports/20260408-123006/report.html and demo/reports/20260408-123006/compare.json. demo/reports/eupe-vs-ssl-reference.html mirrors the same corrected bundle at a stable root path. PyTorch parity for the published export lives in the accompanying export.validation.json artifact on Hugging Face; the checked-in fixture is now explicit ONNX regression evidence rather than a fake PyTorch proof.

For other Hugging Face models, use the ONNX Community Converter.


Case study: how DINOv2 and I-JEPA see an elephant

A real example. Same elephant photograph, two models, different training objectives.

Compare both models

latent-inspector compare docs/assets/img/samples/elephant_sample_image.jpg \
  --models dinov2-vit-l14,ijepa-vit-h14
Model Comparison
================================================================================
Metric                dinov2-vit-l14  ijepa-vit-h14
--------------------------------------------------------------------------------
Repr. rank            60/1024         44/1280
Dead dimensions       0               0
Patch entropy         2.52            2.89
CLS L2 norm           46.3            N/A
Top-10 var%           66.8%           72.7%
Components@90%        31              22
Patch isotropy        0.712           0.834
Patch uniformity      -2.891          -3.247
================================================================================
Reading these numbers

Representation rank (60 vs 44). How many dimensions the model actually uses. DINOv2 spreads across 60 effective dimensions out of 1024. I-JEPA uses 44 out of 1280. Zero dead dimensions in both -- no wasted capacity, just different concentrations.

Patch entropy (2.52 vs 2.89). How differentiated the patch representations are. I-JEPA's prediction objective forces fine-grained spatial encoding, so each patch carries more unique information. DINOv2's self-distillation favors globally consistent features -- patches on the same object tend to look alike.

CLS L2 norm (46.3 vs N/A). DINOv2 has a CLS token (one vector summarizing the whole image). I-JEPA doesn't -- it was never designed with one. The tool reports N/A rather than silently dropping the metric.

Top-10 variance / Components@90%. I-JEPA packs 72.7% of variance into 10 components and needs only 22 for 90%. DINOv2 is more spread (66.8% / 31). I-JEPA's representation is lower-dimensional in practice despite having a wider embedding space. Worth thinking about if you're choosing a backbone for a downstream task with limited data.

Isotropy (0.712 vs 0.834). How directionally diverse the patch embeddings are (1 = perfectly isotropic, 0 = all patches point the same way). I-JEPA patches are more directionally diverse -- each patch represents something more distinct.

Uniformity (-2.891 vs -3.247). Wang & Isola (2020) metric for how evenly patches spread on the unit hypersphere. More negative = better spread. I-JEPA distributes patches more uniformly on this image, consistent with its latent-prediction objective discouraging patch-level collapse.

Cross-model similarity

Linear CKA:     0.329    (representation geometry overlap)
k-NN overlap:   0.278    (fraction of shared nearest neighbors)

CKA of 0.329 means some structural overlap but substantially different organization. k-NN overlap of 27.8% means when DINOv2 considers two patches "similar," I-JEPA often disagrees. The trunk patches might cluster with body patches in one model but with boundary patches in the other.

These are genuinely different representations of the same image, not just rotations of each other: linear CKA is invariant to orthogonal transforms, so a pure rotation would score 1.0. Different training objectives, different geometry.

Summary

| Property | DINOv2 | I-JEPA | What it means |
|---|---|---|---|
| Effective rank | 60/1024 | 44/1280 | DINOv2 uses more dimensions |
| Variance concentration | 66.8% in top 10 | 72.7% in top 10 | I-JEPA is more concentrated |
| Patch entropy | 2.52 | 2.89 | I-JEPA differentiates patches more |
| Patch isotropy | 0.712 | 0.834 | I-JEPA patches are more directionally diverse |
| CLS token | Yes (norm 46.3) | No | Different architectures |
| Linear CKA (pairwise) | -- | 0.329 | Different internal geometry |

Commands reference

compare -- side-by-side model comparison

latent-inspector compare <image> --models <model1>,<model2>[,...]
  [--format terminal|json|html|png] [--output <dir>] [--pca-components <n>]

Per-model metrics plus pairwise cross-model similarity. Handles mismatched architectures: dimension-agnostic metrics (CKA, k-NN) work when patch counts match; dimension-dependent and CLS-dependent metrics report N/A with an explanation.

inspect -- single model deep-dive

latent-inspector inspect <image> --model <model>
  [--format terminal|json|html|png] [--output <dir>] [--pca-components <n>]

Full representation analysis: rank, entropy, variance spectrum, patch norm statistics, isotropy, uniformity, spatial coherence, attention concentration (when available), and PCA projection. PNG/HTML output includes a spatial coherence heatmap.

neighbors -- k-NN retrieval across a dataset

latent-inspector neighbors <image> --model <model> --dataset <dir>
  [--k <n>] [--format terminal|json|html|png] [--output <dir>]

Find the k most similar images according to the model. Shows what a model considers "similar." Falls back to mean-patch embeddings when no CLS token is available.
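A minimal sketch of that retrieval, assuming one embedding per image is already in memory as `f64` vectors. `global_embedding` and `top_k_neighbors` are hypothetical names, and ranking by cosine similarity is an assumption here; the text does not state which similarity this command uses.

```rust
/// Cosine similarity between two vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// One vector per image: the CLS token when the model has one, else the mean patch token.
fn global_embedding(cls: Option<&[f64]>, patches: &[Vec<f64>]) -> Vec<f64> {
    if let Some(c) = cls {
        return c.to_vec();
    }
    let mut mean = vec![0.0; patches[0].len()];
    for p in patches {
        for (m, v) in mean.iter_mut().zip(p) {
            *m += v;
        }
    }
    mean.iter().map(|m| m / patches.len() as f64).collect()
}

/// Indices of the `k` dataset images most similar to `query`, most similar first.
fn top_k_neighbors(query: &[f64], dataset: &[Vec<f64>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f64)> = dataset
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine(query, e)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```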

similarity -- cross-model alignment on a dataset

latent-inspector similarity --model-a <model> --model-b <model> --dataset <dir>
  [--format terminal|json|html|png] [--output <dir>]

Dataset-level CKA, k-NN overlap, and (when both models expose CLS) mean CLS cosine similarity. Parallel inference across the dataset.

profile -- representation space profiling

latent-inspector profile --model <model> --dataset <dir>
  [--format terminal|json|html|png] [--output <dir>]

Dataset-level representation fingerprint: isotropy (cosine + partition function), uniformity (Wang & Isola 2020), intrinsic dimensionality (Levina & Bickel 2004 MLE), plus per-image metric aggregates.
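The Levina & Bickel (2004) estimator referenced above can be sketched as follows. For each point it compares the distance to its k-th nearest neighbor, T_k, with the distances T_j to closer neighbors, inverts the mean log-ratio, and then averages the per-point estimates. `intrinsic_dim_mle` is a hypothetical name, `k` is a caller-chosen neighborhood size (the tool's default is not stated here), and the brute-force O(n^2) neighbor search only suits small inputs.

```rust
/// Hypothetical Levina-Bickel MLE of intrinsic dimension with neighborhood size `k`.
/// Assumes distinct points; points whose local estimate is undefined are skipped.
fn intrinsic_dim_mle(points: &[Vec<f64>], k: usize) -> f64 {
    assert!(k >= 2 && points.len() > k, "need k >= 2 and more than k points");
    let (mut total, mut count) = (0.0, 0usize);
    for (i, x) in points.iter().enumerate() {
        // Euclidean distances from x to every other point, ascending.
        let mut dists: Vec<f64> = points
            .iter()
            .enumerate()
            .filter(|(j, _)| *j != i)
            .map(|(_, y)| x.iter().zip(y).map(|(a, b)| (a - b).powi(2)).sum::<f64>().sqrt())
            .collect();
        dists.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let t_k = dists[k - 1];
        // Sum over j = 1..k-1 of ln(T_k / T_j).
        let s: f64 = dists[..k - 1].iter().map(|t_j| (t_k / t_j).ln()).sum();
        if s > 0.0 && s.is_finite() {
            total += (k - 1) as f64 / s; // local estimate at x
            count += 1;
        }
    }
    total / count as f64
}
```

On a straight line the estimate sits near 1 and on a flat grid it is clearly higher; with small `k` the per-point estimates are noisy, which is why they are averaged.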

drift -- track representation changes across checkpoints

latent-inspector drift --model <model> --checkpoints <dir> --dataset <dir>
  [--format terminal|json|html|png] [--output <dir>]

Load .onnx checkpoints from different training stages, compute consecutive CKA. Shows when representations materially shift during training. Natural numeric ordering (step-2.onnx before step-10.onnx).
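The natural numeric ordering mentioned above can be sketched as a comparator that treats runs of ASCII digits as numbers. `natural_cmp` is a hypothetical name; the crate may order checkpoints differently in edge cases (for example non-ASCII digits).

```rust
use std::cmp::Ordering;

/// Drop leading '0' digits so "007" and "7" compare as the same number.
fn trim_zeros(digits: &[u8]) -> &[u8] {
    let start = digits.iter().position(|&c| c != b'0').unwrap_or(digits.len());
    &digits[start..]
}

/// Natural ordering: runs of ASCII digits compare by numeric value, everything
/// else compares byte by byte, so "step-2.onnx" sorts before "step-10.onnx".
fn natural_cmp(a: &str, b: &str) -> Ordering {
    let (mut a, mut b) = (a.as_bytes(), b.as_bytes());
    loop {
        match (a.first(), b.first()) {
            (None, None) => return Ordering::Equal,
            (None, Some(_)) => return Ordering::Less,
            (Some(_), None) => return Ordering::Greater,
            (Some(x), Some(y)) if x.is_ascii_digit() && y.is_ascii_digit() => {
                let la = a.iter().take_while(|c| c.is_ascii_digit()).count();
                let lb = b.iter().take_while(|c| c.is_ascii_digit()).count();
                let (na, nb) = (trim_zeros(&a[..la]), trim_zeros(&b[..lb]));
                // A longer digit run (after trimming zeros) is the larger number.
                let ord = na.len().cmp(&nb.len()).then_with(|| na.cmp(nb));
                if ord != Ordering::Equal {
                    return ord;
                }
                a = &a[la..];
                b = &b[lb..];
            }
            (Some(x), Some(y)) => {
                if x != y {
                    return x.cmp(y);
                }
                a = &a[1..];
                b = &b[1..];
            }
        }
    }
}
```

Comparing digit runs by length and then lexically avoids parsing into an integer type, so arbitrarily long step numbers cannot overflow.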

embed -- export embeddings as JSON Lines

latent-inspector embed <image-or-dir> --model <model>
  [--level global|patches|full] [--output <file.jsonl>]

Export model embeddings for downstream use (Python, JS, etc). Outputs one JSON object per line (JSONL). Three levels: global (CLS/mean-patch vector), patches (full patch matrix), full (both). Writes to stdout by default; use --output for file output. Handles single images and directories (recursive scan).

models -- registry and cache status

latent-inspector models [--verbose] [--download <model>]
  [--format terminal|json|html] [--output <dir>]

Model registry with status, readiness, cache state, evidence status, artifact inventory. Use --download <model> to pre-cache.

validate -- preprocessing and parity checks

latent-inspector validate --model <model>
  [--format terminal|json|html] [--output <dir>] [--refresh-goldens]

Validates integration against checked-in contract and reference artifacts. Use --refresh-goldens after a verified ONNX update.

tui -- interactive terminal UI

latent-inspector tui [<image>] [-m <model1>,<model2>,...]

Interactive views: dashboard, inspector, compare, spectrum, file browser, help. Arrow keys to navigate, number keys to switch views.

Output formats

Every analysis command supports four output formats:

| Format | Flag | Output | Use case |
|---|---|---|---|
| Terminal | --format terminal (default) | Rich Unicode, ASCII fallback | Interactive use |
| JSON | --format json | Structured metrics | Scripting, pipelines |
| HTML | --format html | Self-contained report bundle | Sharing |
| PNG | --format png | PCA projections, heatmaps, charts | Papers, slides |

With --output <dir>, all formats also emit artifacts.json -- a manifest of generated files with byte sizes and SHA-256 digests. HTML bundles include companion JSON. Stable file names and JSON keys are documented in docs/REPORT-SCHEMA.md.

Force ASCII output: LATENT_INSPECTOR_FORCE_ASCII=1.

Metrics glossary

| Metric | What it measures | Range | Intuition |
|---|---|---|---|
| Effective rank | Significant singular values | 1 to embed_dim | Higher = uses more capacity |
| Dead dimensions | Zero-valued embedding dims | 0 to embed_dim | Should be 0 |
| Patch entropy | Diversity of patch features (k-means) | 0 to log2(k) | Higher = more differentiated |
| Attention Gini | Attention weight concentration | 0 to 1 | Higher = more focused |
| CLS L2 norm | Global image vector magnitude | 0+ | Cross-image comparison |
| Patch norm mean/std | Patch vector magnitude distribution | 0+ | Low std = uniform activation |
| Top-10 variance % | Share of variance in the first 10 PCA components | 0-100% | Higher = more concentrated |
| Components@90% | PCA components needed for 90% of variance | 1 to embed_dim | Lower = more compressible |
| Linear CKA | Representation geometry similarity | 0 to 1 | 1 = identical geometry |
| k-NN overlap | Neighborhood agreement | 0 to 1 | 1 = same neighbors |
| Patch correspondence | Hungarian-matched patch similarity | 0 to 1 | Optimal alignment quality |
| Isotropy (cosine) | Embedding directional spread | 0 to 1 | Higher = more directionally diverse |
| Isotropy (partition) | Singular value uniformity | 0 to 1 | Higher = less top-heavy |
| Uniformity | Hypersphere spread (Wang & Isola 2020) | -inf to 0 | More negative = better spread |
| Intrinsic dim | Manifold dimension (Levina & Bickel 2004) | 1+ | Lower than ambient = compressed |
| Spatial coherence | Similarity of adjacent patches on grid | -1 to 1 | Higher = smoother/segmented |
| RankMe | Smooth effective rank (Garrido et al. 2023) | 1 to k | Higher = richer representation |
| Spectral decay (β) | Power-law eigenvalue decay exponent | 0+ | Lower = more uniform spread |
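As a sketch of the spatial coherence metric, assuming patch tokens are stored row-major on a `rows x cols` grid and that "similarity" means cosine similarity between 4-connected neighbors (both assumptions; `spatial_coherence` is a hypothetical name, not the function in coherence.rs):

```rust
/// Cosine similarity between two patch vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// Mean cosine similarity over horizontally and vertically adjacent patches on a
/// row-major `rows x cols` grid; ranges from -1 to 1.
fn spatial_coherence(patches: &[Vec<f64>], rows: usize, cols: usize) -> f64 {
    assert_eq!(patches.len(), rows * cols);
    let (mut sum, mut pairs) = (0.0, 0usize);
    for r in 0..rows {
        for c in 0..cols {
            let i = r * cols + c;
            if c + 1 < cols {
                sum += cosine(&patches[i], &patches[i + 1]); // right neighbor
                pairs += 1;
            }
            if r + 1 < rows {
                sum += cosine(&patches[i], &patches[i + cols]); // neighbor below
                pairs += 1;
            }
        }
    }
    if pairs == 0 { 0.0 } else { sum / pairs as f64 }
}
```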

From pixels to world models

The full pipeline: what happens from image input to cross-model comparison. Read this if you want to understand what the metrics actually measure and why they differ between models.

The representation pipeline

Every vision transformer takes an image and produces patch embeddings: one high-dimensional vector per spatial region.

Image (e.g. 224x224 RGB)
  |
  +- Resize short edge to model's input size, center-crop to square
  |  (src/models/preprocess.rs -- standard ViT pipeline)
  |
  +- Normalize: (pixel / 255 - mean) / std  per channel
  |  (model-specific mean/std from registry)
  |
  +- ONNX Runtime inference
  |  (src/models/loader.rs -> ort crate -> C++ ONNX Runtime backend)
  |
  +- Output: [1, seq_len, embed_dim] tensor
     |
     +- CLS token (index 0) if present  ->  global image representation
     +- Patch tokens (the rest)         ->  per-region representations
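The crop and normalize steps in the pipeline above can be sketched as one loop that also converts interleaved RGB into the channel-planar layout a `[1, 3, H, W]` input expects. `crop_and_normalize` is hypothetical (not the code in src/models/preprocess.rs), the short-edge resize is assumed to have happened already, and the ImageNet mean/std constants are a common ViT default used here only for illustration; the registry supplies per-model values.

```rust
/// Hypothetical helper: center-crop an interleaved RGB8 image whose short edge was
/// already resized to `size`, then normalize into a planar [3, size, size] buffer
/// (the leading batch dimension of 1 is implicit).
fn crop_and_normalize(
    rgb: &[u8],
    width: usize,
    height: usize,
    size: usize,
    mean: [f32; 3],
    std: [f32; 3],
) -> Vec<f32> {
    assert!(width >= size && height >= size);
    assert_eq!(rgb.len(), width * height * 3);
    // Top-left corner of the centered square crop.
    let (x0, y0) = ((width - size) / 2, (height - size) / 2);
    let mut out = vec![0.0f32; 3 * size * size];
    for c in 0..3 {
        for y in 0..size {
            for x in 0..size {
                let src = ((y0 + y) * width + (x0 + x)) * 3 + c;
                let v = rgb[src] as f32 / 255.0;
                // (pixel / 255 - mean) / std, written in CHW order.
                out[c * size * size + y * size + x] = (v - mean[c]) / std[c];
            }
        }
    }
    out
}

/// ImageNet statistics (assumed default for this sketch only).
const IMAGENET_MEAN: [f32; 3] = [0.485, 0.456, 0.406];
const IMAGENET_STD: [f32; 3] = [0.229, 0.224, 0.225];
```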

The patch tokens are the representation. Each is a point in a high-dimensional space (1024-dim for DINOv2, 1280-dim for I-JEPA). The geometry of these points -- how they cluster, how they spread, how they relate to each other -- defines how the model internally represents the image.

Why training objectives produce different geometry

Consider the elephant image:

DINOv2 (self-distillation). A student network matches a slowly-evolving teacher across augmented views. This creates consistency pressure: patches in similar semantic regions get pushed toward similar representations. Elephant body patches cluster together. Background patches cluster together. The result looks like unsupervised segmentation -- no labels needed.

I-JEPA (latent prediction). Given visible patches, predict the representation of masked patches. Unlike MAE (which predicts pixels), I-JEPA predicts in representation space, so it must learn abstract structure. Each patch must encode enough context about its neighborhood to predict what's missing. This is consistent with its higher patch entropy (2.89 vs 2.52): each patch carries more unique information.

V-JEPA 2 (video prediction). Predict future frame representations from past frames. Even on a static image, the encoder carries a prior about how the visual world moves. It sees the elephant as something that could walk away, not just a static arrangement of pixels.

How cross-model comparison works

Two models, two different embedding spaces. DINOv2 lives in R^1024, I-JEPA in R^1280. You can't subtract them. Instead, compare structural properties.

CKA (Centered Kernel Alignment) -- src/analysis/cka.rs

Build a kernel matrix for each model: K[i,j] = dot(patch_i, patch_j). This captures pairwise similarity structure -- which patches are similar to which, regardless of coordinate system. Center both matrices, measure alignment via HSIC:

CKA(X, Y) = HSIC(K_X, K_Y) / sqrt(HSIC(K_X, K_X) * HSIC(K_Y, K_Y))

Invariant to orthogonal transforms and isotropic scaling. Compares geometric structure, not coordinates.
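A minimal sketch of linear CKA as described above, assuming both patch matrices hold the same patches in the same order as `Vec<Vec<f64>>` rows. The function names are hypothetical, not the API of src/analysis/cka.rs. The HSIC normalization constant is dropped because it cancels in the ratio, leaving Frobenius inner products of centered Gram matrices.

```rust
/// Centered linear-kernel Gram matrix H K H, with K[i][j] = dot(patch_i, patch_j).
fn centered_gram(x: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = x.len();
    let mut k = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            k[i][j] = x[i].iter().zip(&x[j]).map(|(a, b)| a * b).sum::<f64>();
        }
    }
    // K is symmetric, so row means equal column means.
    let means: Vec<f64> = k.iter().map(|row| row.iter().sum::<f64>() / n as f64).collect();
    let grand = means.iter().sum::<f64>() / n as f64;
    for i in 0..n {
        for j in 0..n {
            k[i][j] += grand - means[i] - means[j];
        }
    }
    k
}

/// Frobenius inner product <A, B>; proportional to HSIC for centered Gram matrices.
fn frobenius(a: &[Vec<f64>], b: &[Vec<f64>]) -> f64 {
    a.iter().zip(b).flat_map(|(ra, rb)| ra.iter().zip(rb).map(|(p, q)| p * q)).sum()
}

/// Linear CKA between two patch matrices with the same number of patches (rows);
/// the embedding dimensions (columns) may differ, e.g. 1024 vs 1280.
fn linear_cka(x: &[Vec<f64>], y: &[Vec<f64>]) -> f64 {
    assert_eq!(x.len(), y.len(), "CKA needs matching patch counts");
    let (kx, ky) = (centered_gram(x), centered_gram(y));
    frobenius(&kx, &ky) / (frobenius(&kx, &kx) * frobenius(&ky, &ky)).sqrt()
}
```

Rotating, uniformly scaling, or zero-padding one side leaves the Gram matrix unchanged up to a constant, so such pairs score 1.0.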

k-NN overlap -- src/analysis/knn.rs

For each patch, find its 10 nearest neighbors in model A and model B. Count how many overlap. If DINOv2 thinks patches 3, 7, 12 are similar (all on the trunk), does I-JEPA agree? Overlap of 0.278 = 27.8% agreement. Substantial disagreement about what "similar" means.
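The overlap computation above can be sketched as follows, assuming cosine similarity defines "nearest" and that a patch is not its own neighbor (both assumptions; `knn_indices` and `knn_overlap` are hypothetical names). Ties are broken by sort order, so exactly tied similarities can shift the result slightly.

```rust
/// Cosine similarity between two patch vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// For each patch, the indices of its `k` most cosine-similar other patches.
fn knn_indices(x: &[Vec<f64>], k: usize) -> Vec<Vec<usize>> {
    (0..x.len())
        .map(|i| {
            let mut others: Vec<usize> = (0..x.len()).filter(|&j| j != i).collect();
            // Sort by descending similarity to patch i.
            others.sort_by(|&a, &b| cosine(&x[i], &x[b]).partial_cmp(&cosine(&x[i], &x[a])).unwrap());
            others.truncate(k);
            others
        })
        .collect()
}

/// Mean fraction of each patch's k nearest neighbors shared by model A and model B.
fn knn_overlap(a: &[Vec<f64>], b: &[Vec<f64>], k: usize) -> f64 {
    assert_eq!(a.len(), b.len(), "patch counts must match");
    assert!(k < a.len(), "k must be smaller than the number of patches");
    let (na, nb) = (knn_indices(a, k), knn_indices(b, k));
    let shared: usize = na
        .iter()
        .zip(&nb)
        .map(|(ra, rb)| ra.iter().filter(|&&i| rb.contains(&i)).count())
        .sum();
    shared as f64 / (a.len() * k) as f64
}
```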

Patch correspondence -- src/analysis/correspondence.rs

When dimensions match (e.g., DINOv2 and V-JEPA 2, both 1024-dim), compute cosine similarity between every patch pair and find optimal assignment via the Hungarian algorithm. Tells you whether there's a clean mapping between the two representations, or whether they organized the space incompatibly.
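A sketch of the matching step, assuming equal patch counts (a square cost matrix) and using the classic O(n^3) potentials form of the Hungarian algorithm on negated cosine similarity, so that minimizing cost maximizes total similarity. `hungarian` and `patch_correspondence` are hypothetical names, not the functions in src/analysis/correspondence.rs.

```rust
/// Cosine similarity between two patch vectors.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// Hungarian algorithm (O(n^3), row/column potentials) minimizing total cost on a
/// square matrix. Returns `assign[i] = j`, the column matched to row i.
fn hungarian(cost: &[Vec<f64>]) -> Vec<usize> {
    let n = cost.len();
    let (mut u, mut v) = (vec![0.0; n + 1], vec![0.0; n + 1]);
    // p[j]: row matched to column j (1-based, 0 = none); way[j]: previous column on the path.
    let (mut p, mut way) = (vec![0usize; n + 1], vec![0usize; n + 1]);
    for i in 1..=n {
        p[0] = i;
        let mut j0 = 0;
        let mut minv = vec![f64::INFINITY; n + 1];
        let mut used = vec![false; n + 1];
        loop {
            used[j0] = true;
            let (i0, mut delta, mut j1) = (p[j0], f64::INFINITY, 0);
            for j in 1..=n {
                if !used[j] {
                    let cur = cost[i0 - 1][j - 1] - u[i0] - v[j];
                    if cur < minv[j] {
                        minv[j] = cur;
                        way[j] = j0;
                    }
                    if minv[j] < delta {
                        delta = minv[j];
                        j1 = j;
                    }
                }
            }
            for j in 0..=n {
                if used[j] {
                    u[p[j]] += delta;
                    v[j] -= delta;
                } else {
                    minv[j] -= delta;
                }
            }
            j0 = j1;
            if p[j0] == 0 {
                break;
            }
        }
        // Flip the augmenting path back to the start column.
        loop {
            let j1 = way[j0];
            p[j0] = p[j1];
            j0 = j1;
            if j0 == 0 {
                break;
            }
        }
    }
    let mut assign = vec![0; n];
    for j in 1..=n {
        assign[p[j] - 1] = j - 1;
    }
    assign
}

/// Match patches of model A to patches of model B (same dimension, same count) to
/// maximize total cosine similarity; returns (assignment, mean matched cosine).
fn patch_correspondence(a: &[Vec<f64>], b: &[Vec<f64>]) -> (Vec<usize>, f64) {
    let cost: Vec<Vec<f64>> = a.iter().map(|x| b.iter().map(|y| -cosine(x, y)).collect()).collect();
    let assign = hungarian(&cost);
    let mean = assign.iter().enumerate().map(|(i, &j)| -cost[i][j]).sum::<f64>() / a.len() as f64;
    (assign, mean)
}
```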

Per-model health metrics

Effective rank -- src/analysis/rank.rs

SVD on the patch matrix. Threshold singular values at 1% of max, count survivors. Rank 60/1024 means 60 effective directions; the other 964 carry negligible information. Not waste -- just concentration.
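A dependency-free sketch of that thresholding, assuming the patch matrix is used uncentered as the text describes. Instead of a full SVD it takes eigenvalues of the Gram matrix X X^T (the squared singular values of X) with cyclic Jacobi rotations; `symmetric_eigenvalues` and `effective_rank` are hypothetical names, and the Jacobi route is a swapped-in technique, not necessarily what src/analysis/rank.rs does.

```rust
/// Eigenvalues of a symmetric matrix via cyclic Jacobi rotations (no LAPACK).
fn symmetric_eigenvalues(mut a: Vec<Vec<f64>>) -> Vec<f64> {
    let n = a.len();
    for _sweep in 0..100 {
        // Stop once the off-diagonal mass is negligible relative to the whole matrix.
        let (mut off, mut total) = (0.0, 0.0);
        for i in 0..n {
            for j in 0..n {
                total += a[i][j] * a[i][j];
                if i != j {
                    off += a[i][j] * a[i][j];
                }
            }
        }
        if off <= 1e-20 * total {
            break;
        }
        for p in 0..n {
            for q in (p + 1)..n {
                if a[p][q] == 0.0 {
                    continue;
                }
                // Rotation chosen to zero a[p][q] (Numerical Recipes parameterization).
                let theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q]);
                let t = theta.signum() / (theta.abs() + (theta * theta + 1.0).sqrt());
                let c = 1.0 / (t * t + 1.0).sqrt();
                let s = t * c;
                for k in 0..n {
                    let (akp, akq) = (a[k][p], a[k][q]);
                    a[k][p] = c * akp - s * akq;
                    a[k][q] = s * akp + c * akq;
                }
                for k in 0..n {
                    let (apk, aqk) = (a[p][k], a[q][k]);
                    a[p][k] = c * apk - s * aqk;
                    a[q][k] = s * apk + c * aqk;
                }
            }
        }
    }
    (0..n).map(|i| a[i][i]).collect()
}

/// Threshold-based effective rank: count singular values above 1% of the largest.
fn effective_rank(x: &[Vec<f64>]) -> usize {
    // Eigenvalues of X X^T are the squared singular values of X.
    let gram: Vec<Vec<f64>> = x
        .iter()
        .map(|a| x.iter().map(|b| a.iter().zip(b).map(|(p, q)| p * q).sum::<f64>()).collect())
        .collect();
    let sv: Vec<f64> = symmetric_eigenvalues(gram).into_iter().map(|e| e.max(0.0).sqrt()).collect();
    let max = sv.iter().cloned().fold(0.0, f64::max);
    sv.iter().filter(|&&s| s > 0.01 * max).count()
}
```

For 256 patches the Gram matrix is only 256 x 256, smaller than the 1024 x 1024 alternative X^T X, which is why the sketch works on the patch side.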

PCA variance spectrum -- src/analysis/variance.rs, src/analysis/pca.rs

Power method PCA (no LAPACK dependency) on centered patch matrix. Eigenvalue ratios show how information distributes. Steep scree plot = compressible representation. Flat plot = information spread uniformly. Both can be useful depending on your downstream task.
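Given the PCA eigenvalues, the two summary numbers used throughout this README (top-10 variance % and components@90%) follow directly. A minimal sketch, assuming the eigenvalues were already produced by the PCA step; `variance_stats` is a hypothetical name.

```rust
/// From PCA eigenvalues (any order), return (top-10 variance %, components needed for 90%).
fn variance_stats(eigenvalues: &[f64]) -> (f64, usize) {
    let mut ev: Vec<f64> = eigenvalues.iter().map(|&e| e.max(0.0)).collect();
    ev.sort_by(|a, b| b.partial_cmp(a).unwrap()); // descending
    let total: f64 = ev.iter().sum();
    let top10 = 100.0 * ev.iter().take(10).sum::<f64>() / total;
    let mut cumulative = 0.0;
    for (i, e) in ev.iter().enumerate() {
        cumulative += e;
        if cumulative >= 0.9 * total {
            return (top10, i + 1);
        }
    }
    (top10, ev.len())
}
```

For example, eigenvalues 60, 25, 10, 5 need three components to reach 90% of the variance.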

Isotropy and uniformity -- src/analysis/isotropy.rs

Two views of the same question: is the representation using its space well?

  • Isotropy (1 - mean pairwise cosine): are patches directionally diverse, or all clustered in a narrow cone?
  • Uniformity (Wang & Isola 2020): log of average pairwise Gaussian kernel on the unit hypersphere. More negative = better coverage. Collapse to few modes pushes uniformity toward 0.
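Both quantities above can be sketched over distinct patch pairs of L2-normalized embeddings. The names are hypothetical; `t = 2` is the default temperature from Wang & Isola, and whether the tool uses the same value is an assumption.

```rust
/// Scale a vector to unit L2 norm.
fn normalize(v: &[f64]) -> Vec<f64> {
    let norm = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Cosine isotropy: 1 - mean cosine over all distinct patch pairs.
fn cosine_isotropy(patches: &[Vec<f64>]) -> f64 {
    let u: Vec<Vec<f64>> = patches.iter().map(|p| normalize(p)).collect();
    let (mut sum, mut pairs) = (0.0, 0usize);
    for i in 0..u.len() {
        for j in (i + 1)..u.len() {
            sum += dot(&u[i], &u[j]);
            pairs += 1;
        }
    }
    1.0 - sum / pairs as f64
}

/// Wang & Isola (2020) uniformity: ln of the mean of exp(-t * ||x - y||^2) over
/// distinct pairs of L2-normalized patches.
fn uniformity(patches: &[Vec<f64>], t: f64) -> f64 {
    let u: Vec<Vec<f64>> = patches.iter().map(|p| normalize(p)).collect();
    let (mut sum, mut pairs) = (0.0, 0usize);
    for i in 0..u.len() {
        for j in (i + 1)..u.len() {
            let sq: f64 = u[i].iter().zip(&u[j]).map(|(a, b)| (a - b).powi(2)).sum();
            sum += (-t * sq).exp();
            pairs += 1;
        }
    }
    (sum / pairs as f64).ln()
}
```

Identical patches give isotropy 0 and uniformity 0 (total collapse); two orthogonal patches give isotropy 1 and, with t = 2, uniformity -4.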

Patch entropy -- src/analysis/entropy.rs

k-means on patch tokens, then Shannon entropy of cluster assignments. High entropy = patches spread across many clusters. Low entropy = most patches land in the same cluster. Direct measure of how discriminative the representation is at the patch level.
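A minimal sketch of that two-step metric, assuming Lloyd's k-means with a deterministic farthest-point initialization (the tool's initialization, `k`, and iteration count are not stated here, so all three are assumptions). `patch_entropy` and `nearest` are hypothetical names; entropy is reported in bits, matching the 0 to log2(k) range in the glossary.

```rust
/// Index of, and squared distance to, the nearest centroid.
fn nearest(p: &[f64], centroids: &[Vec<f64>]) -> (usize, f64) {
    centroids
        .iter()
        .enumerate()
        .map(|(i, c)| (i, p.iter().zip(c).map(|(a, b)| (a - b).powi(2)).sum::<f64>()))
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .unwrap()
}

/// k-means (Lloyd's algorithm, farthest-point seeding), then Shannon entropy in bits
/// of the cluster-size distribution.
fn patch_entropy(patches: &[Vec<f64>], k: usize, iters: usize) -> f64 {
    let (n, d) = (patches.len(), patches[0].len());
    let k = k.min(n);
    let mut centroids = vec![patches[0].clone()];
    while centroids.len() < k {
        // Next seed: the patch farthest from every existing centroid.
        let far = (0..n)
            .max_by(|&a, &b| {
                let da = nearest(&patches[a], &centroids).1;
                let db = nearest(&patches[b], &centroids).1;
                da.partial_cmp(&db).unwrap()
            })
            .unwrap();
        centroids.push(patches[far].clone());
    }
    for _ in 0..iters {
        let labels: Vec<usize> = patches.iter().map(|p| nearest(p, &centroids).0).collect();
        let mut sums = vec![vec![0.0; d]; k];
        let mut counts = vec![0usize; k];
        for (p, &c) in patches.iter().zip(&labels) {
            counts[c] += 1;
            for (s, v) in sums[c].iter_mut().zip(p) {
                *s += v;
            }
        }
        for c in 0..k {
            if counts[c] > 0 {
                // Empty clusters keep their previous centroid.
                centroids[c] = sums[c].iter().map(|s| s / counts[c] as f64).collect();
            }
        }
    }
    let mut counts = vec![0usize; k];
    for p in patches {
        counts[nearest(p, &centroids).0] += 1;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n as f64;
            -p * p.log2()
        })
        .sum()
}
```

Two equally sized, well-separated groups give exactly 1 bit with k = 2; a single cluster gives 0.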

RankMe (smooth effective rank) -- src/analysis/rankme.rs

Garrido et al. (ICML 2023). Computes exp(H(p)) where p is the normalized singular value distribution and H is Shannon entropy. Unlike threshold-based effective rank, RankMe is smooth and differentiable -- a value of 1 means total collapse (all variance in one dimension), while a value near k means variance is uniformly spread. Better at detecting subtle representation degradation.
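A minimal sketch of the formula, assuming singular values are already available. `rankme` is a hypothetical name, and the small epsilon added to each probability (to keep the logarithm finite for zero singular values) follows the form used by Garrido et al.; the exact constant here is an assumption.

```rust
/// RankMe: exp of the Shannon entropy of the L1-normalized singular values.
fn rankme(singular_values: &[f64]) -> f64 {
    const EPS: f64 = 1e-7; // assumed value; keeps ln() finite for zero singular values
    let total: f64 = singular_values.iter().map(|s| s.abs()).sum();
    let entropy: f64 = singular_values
        .iter()
        .map(|s| {
            let p = s.abs() / total + EPS;
            -p * p.ln()
        })
        .sum();
    entropy.exp()
}
```

Four equal singular values give roughly 4; a single non-zero value gives roughly 1 (total collapse).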

Spectral decay (β) -- src/analysis/spectral_decay.rs

Fits a power law λ_i ~ i^(-β) to the eigenvalue spectrum via least-squares in log-log space. Low β (< 1) means slow decay -- the representation distributes information broadly across dimensions. High β (> 2) means rapid decay -- a few dimensions dominate. Complements RankMe: RankMe tells you how many effective dimensions, spectral decay tells you how sharply the spectrum drops off.
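The log-log fit above is ordinary least squares on the points (ln i, ln λ_i), with β equal to minus the slope. A minimal sketch; `spectral_decay` is a hypothetical name, and skipping non-positive eigenvalues is an assumption about how numerical zeros are handled.

```rust
/// Least-squares fit of ln(lambda_i) = c - beta * ln(i), i = 1, 2, ... over eigenvalues
/// sorted in descending order; returns beta. Non-positive eigenvalues are skipped
/// because their logarithm is undefined.
fn spectral_decay(eigenvalues: &[f64]) -> f64 {
    let mut ev = eigenvalues.to_vec();
    ev.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let pts: Vec<(f64, f64)> = ev
        .iter()
        .enumerate()
        .filter(|&(_, &l)| l > 0.0)
        .map(|(i, &l)| (((i + 1) as f64).ln(), l.ln()))
        .collect();
    let n = pts.len() as f64;
    let mx = pts.iter().map(|p| p.0).sum::<f64>() / n;
    let my = pts.iter().map(|p| p.1).sum::<f64>() / n;
    let sxy: f64 = pts.iter().map(|p| (p.0 - mx) * (p.1 - my)).sum();
    let sxx: f64 = pts.iter().map(|p| (p.0 - mx).powi(2)).sum();
    -sxy / sxx // the fitted slope is -beta
}
```

A spectrum that follows λ_i = i^(-2) exactly recovers β = 2.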

The video model trick

V-JEPA 2 is still a video model, but the current adapter no longer uses the retired 2-frame shortcut. For single-image analysis, the wrapper repeats the image to 16 frames, runs the official video encoder, reshapes the token sequence into 8 temporal groups by 256 spatial patches, then averages over time. That keeps the representation on the same repeated-frame manifold Meta uses for image evaluation while still producing a plain [1, 256, 1024] patch tensor for cross-model comparison.
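The final averaging step can be sketched on a flat output buffer. This assumes the encoder emits its 2048 tokens time-major (all 256 spatial patches for temporal group 0, then group 1, and so on); that layout and the name `average_over_time` are assumptions for illustration, not the crate's code.

```rust
/// Average video tokens over time: `tokens` is a flat [t_groups * n_patches, dim]
/// buffer in time-major order; returns a flat [n_patches, dim] buffer.
fn average_over_time(tokens: &[f32], t_groups: usize, n_patches: usize, dim: usize) -> Vec<f32> {
    assert_eq!(tokens.len(), t_groups * n_patches * dim);
    let frame_len = n_patches * dim;
    let mut out = vec![0.0f32; frame_len];
    for t in 0..t_groups {
        // One temporal group: every spatial patch at time step t.
        let frame = &tokens[t * frame_len..(t + 1) * frame_len];
        for (o, v) in out.iter_mut().zip(frame) {
            *o += v / t_groups as f32;
        }
    }
    out
}
```

With the shapes from the text, `average_over_time(&tokens, 8, 256, 1024)` turns 8 x 256 x 1024 values into a 256 x 1024 patch matrix that can be compared directly with the image encoders.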

Validation

Every report embeds a validation summary (src/validation/). Before trusting metrics, the tool checks:

  1. Preprocessing contract: registered resize/crop/normalize matches checked-in golden artifact
  2. Tensor semantics: ONNX graph exposes expected input/output names and shapes
  3. Reference parity: output matches previously approved references within tolerance

Status levels: validated (passed all checks against ONNX Runtime), stale (reference artifacts from a different backend), unverified (no reference artifacts yet).

Code map

src/
  models/
    registry.rs      Model metadata: architecture, normalization, tensor contracts
    loader.rs        ONNX session, inference (image + video paths), stub backend
    preprocess.rs    Resize + center-crop + normalize -> [1, 3, H, W] tensor
    cache.rs         Download, SHA-256 verify, partial-resume, cache state
  extract/
    features.rs      ModelOutput -> CLS token + patch tokens + attention maps
  analysis/
    pca.rs           Power method PCA (no LAPACK)
    coherence.rs     Spatial coherence (adjacent patch similarity on grid)
    cka.rs           Linear CKA + CLS cosine similarity
    knn.rs           Cosine similarity matrix, top-k neighbors, overlap
    rank.rs          Effective rank via singular value thresholding
    variance.rs      PCA variance spectrum (scree plot data)
    entropy.rs       k-means + Shannon entropy, patch norm statistics
    isotropy.rs      Cosine isotropy, partition function isotropy, uniformity
    attention.rs     Gini coefficient on attention weights
    correspondence.rs  Hungarian-matched patch correspondence
    rankme.rs        Smooth effective rank via Shannon entropy (Garrido 2023)
    spectral_decay.rs  Power-law eigenvalue decay exponent
  viz/
    terminal.rs      Rich Unicode terminal output (ASCII fallback)
    json.rs          Structured JSON
    html.rs          Self-contained HTML report bundles
    png.rs           PCA RGB projections, heatmaps, variance charts
  validation/
    evidence.rs      Freshness checks against golden fixtures
    parity.rs        Output-level comparison against reference artifacts

Development

cargo build --release

# Run without downloading models
LATENT_INSPECTOR_MODEL_BACKEND=stub cargo run -- models
LATENT_INSPECTOR_MODEL_BACKEND=stub cargo run -- compare docs/assets/img/samples/elephant_sample_image.jpg \
  --models dinov2-vit-l14,ijepa-vit-h14

cargo test
cargo fmt -- --check
cargo clippy --all-targets -- -D warnings

# Coverage (excludes TUI surface)
cargo llvm-cov --workspace \
  --ignore-filename-regex '(^|/)src/tui/|(^|/)src/cli/tui.rs$' \
  --fail-under-lines 85 \
  --fail-under-functions 80 \
  --summary-only

# Full CI
make all

The stub backend produces deterministic synthetic outputs for development and testing. Validation summaries downgrade stub-backed results to unverified. The TUI shows demo data without an image; with an image it runs the same live pipeline as the CLI.

License

MIT OR Apache-2.0
