Make schema parsing lazy in Metadata #6117

Merged
yili-db merged 2 commits into
delta-io:master from
0x5eba:sebastien-biollo/lazy-schema-parsing-metadata
Feb 27, 2026

Conversation

@0x5eba 0x5eba commented Feb 24, 2026

Contributor

Which Delta project/connector is this regarding?

  • [ ] Spark
  • [ ] Standalone
  • [ ] Flink
  • [x] Kernel
  • [ ] Other (fill in here)

Description

Make schema parsing lazy in Metadata so that loading a snapshot no longer eagerly deserializes schemaString.

Tables created with earlier Delta versions can have VOID type columns. Previously, Metadata.fromColumnVector() eagerly called DataTypeJsonSerDe.deserializeStructType(), which throws a KernelException for unsupported types like VOID. This blocked the entire snapshot load even for callers that never access the parsed schema.
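
The deferral described above can be sketched with a small memoizing holder. This is a minimal illustration, not the Kernel implementation: `LazyValue`, `MetadataSketch`, and the string-based `parse` stand-in are all hypothetical names, and the real parser and exception type (`DataTypeJsonSerDe`, `KernelException`) are replaced by simplified placeholders.

```java
import java.util.function.Supplier;

/** Hypothetical memoizing holder; an assumption, not the Kernel's own Lazy class. */
final class LazyValue<T> {
  private final Supplier<T> supplier;
  private T value;
  private boolean computed;

  LazyValue(Supplier<T> supplier) {
    this.supplier = supplier;
  }

  synchronized T get() {
    if (!computed) {
      // If the supplier throws, nothing is cached and the exception propagates.
      value = supplier.get();
      computed = true;
    }
    return value;
  }
}

/** Hypothetical stand-in for Metadata, reduced to the fields needed here. */
final class MetadataSketch {
  private final String id;
  private final String schemaString;
  private final LazyValue<String> schema;

  MetadataSketch(String id, String schemaString) {
    this.id = id;
    this.schemaString = schemaString;
    // Construction only records how to parse; no parsing happens yet.
    this.schema = new LazyValue<>(() -> parse(schemaString));
  }

  String getId() {
    return id;
  }

  String getSchemaString() {
    return schemaString;
  }

  String getSchema() {
    return schema.get(); // first call triggers parsing and may throw
  }

  // Placeholder parser: rejects "void" to mimic an unsupported-type failure.
  private static String parse(String json) {
    if (json.contains("\"void\"")) {
      throw new IllegalStateException("unsupported type: void");
    }
    return "parsed:" + json;
  }
}
```

With this shape, constructing the object with a `void` schema string succeeds, `getId()` and `getSchemaString()` return normally, and only `getSchema()` raises the error.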

How was this patch tested?

  • New unit test in MetadataSuite: "schema parsing is lazy - void type does not block non-schema access" — constructs a Metadata via fromRow with a VOID-type schema string, verifies getId(), getConfiguration(), and getSchemaString() work without triggering parsing, then verifies getSchema() throws KernelException with the VOID message.
  • Updated integration test in DeltaTableReadsSuite: "table with void type - schema parsing is lazy" — creates a Delta table with a VOID column, verifies latestSnapshot() succeeds and getVersion() works, then verifies snapshot.getSchema throws the expected KernelException.
  • Existing MetadataSuite tests (configuration merging, serialization round trip) continue to pass.

Does this PR introduce any user-facing changes?

Yes. Previously, calling Table.getLatestSnapshot() (or any snapshot resolution) on a table with unsupported schema types (e.g., VOID) would throw a KernelException immediately. After this change, snapshot loading succeeds and the exception is deferred to the point where the schema is actually accessed (e.g., snapshot.getSchema(), snapshot.getScanBuilder()). Callers that only need metadata like table ID, version, or configuration are no longer blocked.
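
For callers, the practical consequence is that error handling moves from snapshot loading to schema access. A minimal sketch of one way to handle that, assuming the failure surfaces as an unchecked exception: `SchemaAccess` and `tryGet` are hypothetical helpers and not part of the Kernel API, and catching `RuntimeException` broadly is a simplification for illustration.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical caller-side helper; not part of Delta Kernel. */
final class SchemaAccess {
  private SchemaAccess() {}

  /**
   * Runs a schema accessor and turns an unchecked failure into an empty result,
   * so code that only needs other metadata can continue.
   */
  static <T> Optional<T> tryGet(Supplier<T> accessor) {
    try {
      return Optional.ofNullable(accessor.get());
    } catch (RuntimeException e) {
      // Simplification: a real caller would likely catch the specific exception type.
      return Optional.empty();
    }
  }
}
```

A caller could wrap the schema accessor, for example `SchemaAccess.tryGet(() -> snapshot.getSchema())`, while reading version or configuration directly.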

// Logical data schema excluding partition columns
private final Lazy<StructType> dataSchema;

public Metadata(

Collaborator

who calls this now? e.g. just a bunch of tests?

@0x5eba 0x5eba Feb 27, 2026

Contributor Author

It's called by TransactionMetadataFactory and many tests

@scottsand-db scottsand-db left a comment

Collaborator

LGTM!

@yili-db yili-db merged commit 2493c51 into delta-io:master Feb 27, 2026
25 checks passed
huashi-st pushed a commit to huashi-st/delta that referenced this pull request Apr 24, 2026

4 participants