
Conversation

@annrpom
Contributor

@annrpom annrpom commented Dec 12, 2025

Add StatsHistogram, a quantile sketch backed by t-digest, to track byte distribution by TieringAttribute. The histogram supports CDF and quantile queries to estimate bytes above/below thresholds for tiering decisions.

Add TieringHistogramBlockWriter to encode multiple histograms per sstable or blob file, keyed by (KindAndTier, SpanID). The block will support histogram types for sstable keys, inline values, and blob references (hot/cold).
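
For orientation, below is a minimal, self-contained sketch of the kind of CDF/quantile queries the histogram is meant to answer. It uses an exact stand-in rather than the t-digest-backed implementation, and the names (record, bytesBelowThreshold, quantile) are illustrative only, not this PR's API.

package main

import (
	"fmt"
	"sort"
)

// sample pairs a TieringAttribute-like value with the bytes recorded under it.
type sample struct {
	attr  uint64
	bytes uint64
}

// exactHistogram is an exact stand-in for the t-digest-backed StatsHistogram.
type exactHistogram struct {
	totalBytes  uint64 // all recorded bytes, including those with no attribute
	bytesNoAttr uint64 // bytes recorded with attribute == 0 (not tiered)
	samples     []sample
}

func (h *exactHistogram) record(attr, bytes uint64) {
	h.totalBytes += bytes
	if attr == 0 {
		h.bytesNoAttr += bytes
		return
	}
	h.samples = append(h.samples, sample{attr, bytes})
}

// bytesBelowThreshold is the CDF-style query a tiering policy would use:
// how many bytes (excluding bytesNoAttr) have attribute <= threshold.
func (h *exactHistogram) bytesBelowThreshold(threshold uint64) uint64 {
	var n uint64
	for _, s := range h.samples {
		if s.attr <= threshold {
			n += s.bytes
		}
	}
	return n
}

// quantile returns the attribute value at quantile q of the attributed bytes.
func (h *exactHistogram) quantile(q float64) uint64 {
	sort.Slice(h.samples, func(i, j int) bool { return h.samples[i].attr < h.samples[j].attr })
	target := uint64(q * float64(h.totalBytes-h.bytesNoAttr))
	var cum uint64
	for _, s := range h.samples {
		cum += s.bytes
		if cum >= target {
			return s.attr
		}
	}
	return 0
}

func main() {
	var h exactHistogram
	h.record(0, 512)    // no attribute; excluded from quantile queries
	h.record(10, 4096)  // "old" data
	h.record(100, 1024) // "recent" data
	fmt.Println(h.bytesBelowThreshold(50)) // 4096: bytes that would be cold at threshold 50
	fmt.Println(h.quantile(0.5))           // 10: median attribute, weighted by bytes
}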

@cockroach-teamcity
Member

This change is Reviewable

@annrpom annrpom force-pushed the tiering-historgram-block branch from 1a28649 to c2d5a00 Compare December 12, 2025 21:11
@github-actions

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@annrpom annrpom force-pushed the tiering-historgram-block branch 2 times, most recently from 5108683 to 9fce22e Compare December 15, 2025 19:59
@annrpom
Contributor Author

annrpom commented Dec 15, 2025

I'll write the block in a separate PR

@annrpom annrpom marked this pull request as ready for review December 15, 2025 20:10
@annrpom annrpom requested a review from a team as a code owner December 15, 2025 20:10

Member

@RaduBerinde RaduBerinde left a comment


Nice!!

Reviewable status: 0 of 8 files reviewed, 14 unresolved discussions (waiting on @annrpom, @jbowens, and @sumeerbhola)


go.sum line 14 at r1 (raw file):

github.com/RaduBerinde/btreemap v0.0.0-20250419174037-3d62b7205d54 h1:bsU8Tzxr/PNz75ayvCnxKZWEYdLMPDkUgticP4a4Bvk=
github.com/RaduBerinde/btreemap v0.0.0-20250419174037-3d62b7205d54/go.mod h1:0tr7FllbE9gJkHq7CVeeDDFAFKQVy5RnCSSNBOvdqbc=
github.com/RaduBerinde/tdigest v0.0.0-20251022152254-90e030c3a314 h1:RcSNrxDZ1ZyEfpGwrpWqd3YWfMjQbKOtT+p3Wht6XN0=

We should apply the same due diligence as for any external library, since it's a forked library and my changes to it weren't reviewed. Maybe @sumeerbhola or @jbowens can take a brief look over it?


sstable/tieredmeta/histogram.go line 20 at r1 (raw file):

// 2*compression centroids, with ~1.3*compression in practice.
//
// A value of 100 is a common default for t-digest implementations.

, which should provide at least ±1% quantile accuracy (and much better at the tails).


sstable/tieredmeta/histogram.go line 21 at r1 (raw file):

//
// A value of 100 is a common default for t-digest implementations.
const CompressionLevel = 100

[nit] digestDelta


sstable/tieredmeta/histogram.go line 24 at r1 (raw file):

// StatsHistogram is a quantile sketch using t-digest to track the distribution
// of bytes by TieringAttribute. It can efficiently answer questions like:

[nit] for bytes we could say "a bytes value (with a caller-determined meaning)"


sstable/tieredmeta/histogram.go line 38 at r1 (raw file):

	// ZeroBytes tracks bytes with attribute=0 separately, as this typically
	// indicates unset/special values that shouldn't be tiered.
	ZeroBytes uint64

[nit] BytesNoAttr


sstable/tieredmeta/histogram.go line 54 at r1 (raw file):

}

// Quantile returns the attribute value at quantile q (where q is in [0, 1]).

[nit] at quantile ~q


sstable/tieredmeta/histogram.go line 55 at r1 (raw file):

// Quantile returns the attribute value at quantile q (where q is in [0, 1]).
// Example: Quantile(0.5) returns the median attribute value by bytes.

[nit] returns a value close to the median value


sstable/tieredmeta/histogram.go line 64 at r1 (raw file):

// NonZeroBytes returns the total bytes with non-zero attributes.
func (s *StatsHistogram) NonZeroBytes() uint64 {

BytesWithAttr


sstable/tieredmeta/histogram.go line 68 at r1 (raw file):

}

// BytesBelowThreshold estimates how many non-zero bytes have attribute <= threshold.

how many bytes (excluding NoAttrBytes)


sstable/tieredmeta/histogram_block.go line 49 at r1 (raw file):

// Below is the block structure:
//
//	SSTable's TieringHistogramBlock:

IMO in sstables we should be writing only one tiering histogram, the one that matches the TieringID in the current policy. We can consider any KVs with other TierIDs as having "no attribute" when we populate this histogram. I don't think we would realistically see different TierIDs in the same keyspace anyway. WDYT @sumeerbhola?


sstable/tieredmeta/histogram_block.go line 58 at r1 (raw file):

//	└── ...
//
//	Blob file's TieringHistogramBlock:

@sumeerbhola remind me, when do we expect to use a blob's histogram block? I was thinking everything would be done from the sstable's perspective.


sstable/tieredmeta/histogram_block.go line 92 at r1 (raw file):

	SSTableKeyBytes KindAndTier = iota
	// SSTableValueBytes is the histogram for values stored inside the
	// sstable. The tier is implicit based on whether the sstable is in hot or

[nit] I'd remove the last sentence since all sstables are in hot storage currently.


sstable/tieredmeta/histogram_block.go line 94 at r1 (raw file):

	// sstable. The tier is implicit based on whether the sstable is in hot or
	// cold storage.
	SSTableValueBytes

SSTableInPlaceValueBytes


sstable/tieredmeta/histogram_block.go line 97 at r1 (raw file):

	// SSTableBlobReferenceHotBytes is the histogram for blob references from the
	// sstable to hot tier blob files. The size is the size of the value.
	SSTableBlobReferenceHotBytes

I don't think we need SSTableBlobReferenceHotBytes and SSTableBlobReferenceColdBytes. We should just have SSTableBlobValueBytes which tracks the blob value length.

Separately (in the sstable properties and/or TableMetadata), we will store the cutoff that was used, so we'll be able to use BytesBelowThreshold and BytesAboveThreshold to figure out cold vs hot.

Member

@RaduBerinde RaduBerinde left a comment


Reviewable status: 0 of 8 files reviewed, 15 unresolved discussions (waiting on @annrpom, @jbowens, and @sumeerbhola)


internal/base/internal.go line 643 at r1 (raw file):

type TieringAttribute uint64

// TieringSpanID TODO comment

TieringSpanID is an identifier that provides a context for interpreting a TieringAttribute value. At any point in time, the span policies determine a set of non-overlapping key regions with distinct TieringSpanIDs.

Member

@RaduBerinde RaduBerinde left a comment


Reviewable status: 0 of 8 files reviewed, 16 unresolved discussions (waiting on @annrpom, @jbowens, and @sumeerbhola)


sstable/tieredmeta/histogram_block.go line 90 at r1 (raw file):

	// SSTableKeyBytes is the histogram for sstable keys. The tier is implicit
	// based on whether the sstable is in hot or cold storage.
	SSTableKeyBytes KindAndTier = iota

It's unclear to me how we will use this. What we want is to estimate how much physical sstable data is there because of cold KVs. But the key size is just part of that, plus the keys compress against each other.

Maybe it would be easier/cleaner to just track the KV count? And we'd estimate the above as roughly <sstable physical size> * <KV count with attribute below threshold> / <total KV count>
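
A back-of-the-envelope sketch of that estimate (hypothetical names, Go for illustration): scale the sstable's physical size by the fraction of KVs whose attribute falls below the threshold.

package estimate

// estimateColdPhysicalBytes scales the sstable's physical size by the fraction
// of KVs whose tiering attribute falls below the cold threshold. Names are
// hypothetical; this is only the arithmetic suggested above.
func estimateColdPhysicalBytes(physicalSize, kvsBelowThreshold, totalKVs uint64) uint64 {
	if totalKVs == 0 {
		return 0
	}
	return uint64(float64(physicalSize) * float64(kvsBelowThreshold) / float64(totalKVs))
}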

@annrpom annrpom force-pushed the tiering-historgram-block branch from 9fce22e to 1cd44c2 Compare December 17, 2025 21:08
Contributor Author

@annrpom annrpom left a comment


Reviewable status: 0 of 8 files reviewed, 4 unresolved discussions (waiting on @jbowens, @RaduBerinde, and @sumeerbhola)


internal/base/internal.go line 643 at r1 (raw file):

Previously, RaduBerinde wrote…

TieringSpanID is an identifier that provides a context for interpreting a TieringAttribute value. At any point in time, the span policies determine a set of non-overlapping key regions with distinct TieringSpanIDs.

eek i forgot to update; ty!


sstable/tieredmeta/histogram.go line 20 at r1 (raw file):

Previously, RaduBerinde wrote…

, which should provide at least ±1% quantile accuracy (and much better at the tails).

done


sstable/tieredmeta/histogram.go line 21 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] digestDelta

done


sstable/tieredmeta/histogram.go line 24 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] for bytes we could say "a bytes value (with a caller-determined meaning)"

done


sstable/tieredmeta/histogram.go line 38 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] BytesNoAttr

done


sstable/tieredmeta/histogram.go line 54 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] at quantile ~q

done


sstable/tieredmeta/histogram.go line 55 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] returns a value close to the median value

done


sstable/tieredmeta/histogram.go line 64 at r1 (raw file):

Previously, RaduBerinde wrote…

BytesWithAttr

done


sstable/tieredmeta/histogram.go line 68 at r1 (raw file):

Previously, RaduBerinde wrote…

how many bytes (excluding NoAttrBytes)

done


sstable/tieredmeta/histogram_block.go line 58 at r1 (raw file):

Previously, RaduBerinde wrote…

@sumeerbhola remind me, when do we expect to use a blob's histogram block? I was thinking everything would be done from the sstable's perspective.

looking at the prototype, the blob file's histogram block isn't currently used. the tiered storage design doc mentions writing it, but i'll leave this block out for now and can add it back if necessary

from offline convo: the primary use case for a blob histogram block would be observability


sstable/tieredmeta/histogram_block.go line 90 at r1 (raw file):

Previously, RaduBerinde wrote…

It's unclear to me how we will use this. What we want is to estimate how much physical sstable data is there because of cold KVs. But the key size is just part of that, plus the keys compress against each other.

Maybe it would be easier/cleaner to just track the KV count? And we'd estimate the above as roughly <sstable physical size> * <KV count with attribute below threshold> / <total KV count>

would we ever be way off if we were to track just the KV count? in practice, would we have some large variance of value bytes across an sstable that would cause us to severely overestimate or underestimate? maybe like: if we wrote some data, some time passes so that data turns cold, and then we write mvcc tombstones that have no value - won't we then underestimate cold data, since count-based tracking weights tombstones (no value, hot) equally with real KVs (large values, cold)?


sstable/tieredmeta/histogram_block.go line 92 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] I'd remove the last sentence since all sstables are in hot storage currently.

done


sstable/tieredmeta/histogram_block.go line 94 at r1 (raw file):

Previously, RaduBerinde wrote…

SSTableInPlaceValueBytes

done


sstable/tieredmeta/histogram_block.go line 97 at r1 (raw file):

Previously, RaduBerinde wrote…

I don't think we need SSTableBlobReferenceHotBytes and SSTableBlobReferenceColdBytes. We should just have SSTableBlobValueBytes which tracks the blob value length.

Separately (in the sstable properties and/or TableMetadata), we will store the cutoff that was used, so we'll be able to use BytesBelowThreshold and BytesAboveThreshold to figure out cold vs hot.

ah yeah done

@annrpom annrpom requested a review from RaduBerinde December 17, 2025 21:16
Collaborator

@sumeerbhola sumeerbhola left a comment


@sumeerbhola reviewed 8 files and all commit messages, and made 5 comments.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde).


sstable/tieredmeta/histogram_block.go line 90 at r1 (raw file):

Previously, annrpom (annie pompa) wrote…

would we ever be way off if we were to track just the KV count? in practice, would we have some large variance of value bytes across an sstable that would cause us to severely overestimate or underestimate? maybe like: if we wrote some data, some time passes so that data turns cold, and then we write mvcc tombstones that have no value - won't we then underestimate cold data, since count-based tracking weights tombstones (no value, hot) equally with real KVs (large values, cold)?

I'm also confused because the sstable physical size includes inline values.
Say we tracked the original, non-prefix-compressed bytes in the histogram. And we also computed the sum of the key columns in all the data blocks in the sstable (say B). Then the estimate for cold key bytes with a cold threshold T could be (BytesBelowThreshold(T)/TotalBytes)*B.
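
The same estimate as a hypothetical sketch, where keyColumnBytes is B, the summed size of the key columns across the sstable's data blocks:

package estimate

// estimateColdKeyBytes scales B, the summed size of the key columns across the
// sstable's data blocks, by the fraction of histogram bytes at or below the
// cold threshold T. Hypothetical names; bytesBelowThreshold and totalBytes
// stand in for the histogram's BytesBelowThreshold(T) and TotalBytes.
func estimateColdKeyBytes(bytesBelowThreshold, totalBytes, keyColumnBytes uint64) uint64 {
	if totalBytes == 0 {
		return 0
	}
	return uint64(float64(bytesBelowThreshold) / float64(totalBytes) * float64(keyColumnBytes))
}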


sstable/tieredmeta/histogram_block.go line 72 at r2 (raw file):

	// SSTableBlobValueBytes is the histogram for values stored in blob files
	// referenced by the sstable.
	SSTableBlobValueBytes

Now that some files that have hot references will also have cold references, I was trying to think through what we need here and for what purpose. I was imagining the following categories:

  • Value that has both cold reference and hot (in place in sstable, or reference): Useful for observability, for figuring out how much duplication we have. Not used for deciding whether to write cold blob files in a compaction.
  • Value that has only cold reference: Again, only for observability
  • Value that is only hot: Needed for deciding whether to write cold references before starting the compaction. The prototype further distinguished the ones that were a reference versus in place since the former definitely met the value size threshold to be a reference, while the latter could be very small. So for the latter, there was an estimate used in the condition k.KindAndTier == base.SSTableValueBytes && h.MeanSize() < uint64(policy.MinimumSize)

The list in this PR is different than the above.


sstable/tieredmeta/histogram.go line 34 at r2 (raw file):

// multiple files without bucket alignment issues.
type StatsHistogram struct {
	// TotalBytes is the sum of all bytes recorded (including ZeroBytes).

should this say BytesNoAttr instead of ZeroBytes?


sstable/tieredmeta/histogram.go line 56 at r2 (raw file):

}

// Quantile returns the attribute value at quantile ~q (where q is in [0, 1]).

worth saying the quantile is of the non-zero bytes.


sstable/tieredmeta/histogram.go line 146 at r2 (raw file):

			digestSize, len(buf))
	}

is it ok to have digestSize < len(buf) or should that also be an error?

Member

@RaduBerinde RaduBerinde left a comment


@RaduBerinde made 2 comments.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @annrpom, @jbowens, and @sumeerbhola).


sstable/tieredmeta/histogram_block.go line 90 at r1 (raw file):
I forgot about the in-place values. Indeed we would need a size that excludes that.

And we also computed the sum of the key columns in all the data blocks in the sstable (say B)

This would be the size of the encoded column as it appears in the block? That does make sense. Though I'm not sure if scaling according to the sum of key sizes is necessarily better than just by key count.

With the key count, we could also estimate the amount of other per-key data in the block (the tiering attribute, the blob reference).

I am suspecting that in many cases, the extra key storage itself might only be a few bytes per key (e.g. last few bytes of an ID), much less than the other per-key data in the sstable.


sstable/tieredmeta/histogram_block.go line 72 at r2 (raw file):

Previously, sumeerbhola wrote…

Now that some files that have hot references will also have cold references, I was trying to think through what we need here and for what purpose. I was imagining the following categories:

  • Value that has both cold reference and hot (in place in sstable, or reference): Useful for observability, for figuring out how much duplication we have. Not used for deciding whether to write cold blob files in a compaction.
  • Value that has only cold reference: Again, only for observability
  • Value that is only hot: Needed for deciding whether to write cold references before starting the compaction. The prototype further distinguished the ones that were a reference versus in place since the former definitely met the value size threshold to be a reference, while the latter could be very small. So for the latter, there was an estimate used in the condition k.KindAndTier == base.SSTableValueBytes && h.MeanSize() < uint64(policy.MinimumSize)

The list in this PR is different than the above.

I think we should choose either a design where (for the levels where we actually do tiering) all hot values are both in hot and cold, or one where all hot values are only in hot. Mixing them will lead to a lot of complications.

With that in mind, these can all be deduced from BlobValueBytes and the threshold used for this file.

IMO upper level files that aren't tiered wouldn't have this data at all. The cost-benefit is not there - these are normally ~1% of total size so statistics are not very valuable, but they get written much more frequently. At the same time, a single upper-level file can have keys from many different tables and we'd have to look up each table's span config in every compaction.

Collaborator

@sumeerbhola sumeerbhola left a comment


@sumeerbhola made 1 comment.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde).


sstable/tieredmeta/histogram_block.go line 72 at r2 (raw file):

I think we should choose either a design where (for the levels where we actually do tiering) all hot values are both in hot and cold, or one where all hot values are only in hot. Mixing them will lead to a lot of complications.

Say we only were tiering in L6, which is the latest proposal. And we are compacting a 64MiB L5 file with some number of L6 files that sum to 640MiB. Say this compaction is not one that rewrites existing references. And some of those input L6 files were moved down from higher levels so don't have cold references.
Are we going to force writing of cold blob files? If yes, does this actually help with any invariant on L6 files given that move-compacted L6 files may only have hot references? Also, what if the L5 file happened to have only tiny values that don't get moved to a blob file so the cold blob file is also tiny -- if we don't write cold blob files yet, we could wait till we can write a bulkier cold blob file.

In my mind at least, the idea to have both cold and hot blob references for the same data was something to improve the average behavior of the system by bulking up the cold blob files and not an invariant that we needed to rely on. We already had fine grained KindAndTier in the prototype, and I don't think it complicated anything.

IMO upper level files that aren't tiered wouldn't have this data at all. The cost-benefit is not there - these are normally ~1% of total size so statistics are not very valuable, but they get written much more frequently. At the same time, a single upper-level file can have keys from many different tables and we'd have to look up each table's span config in every compaction.

I really don't have an opinion on this.
Just one thing to note: upper level files were not even split based on TieringSpanID (let alone shard/range boundaries), in the prototype, since that can lead to smaller files than necessary. So one could not estimate how much data would have cold references written purely based on L5 sstable stats since there could be values in there with different tiering-span-ids (and some with no tiering policy). Just maintaining the same behavior across all files leads to less special-casing of logic (which has its own simplicity). The sizes of these histograms don't really matter except for compactions, and I suspect we are a long way away from having 1000s of different tiered SQL tables in a cluster.

Anyway, I am fine with whatever y'all decide.

Member

@RaduBerinde RaduBerinde left a comment


@RaduBerinde made 1 comment.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @annrpom, @jbowens, and @sumeerbhola).


sstable/tieredmeta/histogram_block.go line 72 at r2 (raw file):

Previously, sumeerbhola wrote…

I think we should choose either a design where (for the levels where we actually do tiering) all hot values are both in hot and cold, or one where all hot values are only in hot. Mixing them will lead to a lot of complications.

Say we only were tiering in L6, which is the latest proposal. And we are compacting a 64MiB L5 file with some number of L6 files that sum to 640MiB. Say this compaction is not one that rewrites existing references. And some of those input L6 files were moved down from higher levels so don't have cold references.
Are we going to force writing of cold blob files? If yes, does this actually help with any invariant on L6 files given that move-compacted L6 files may only have hot references? Also, what if the L5 file happened to have only tiny values that don't get moved to a blob file so the cold blob file is also tiny -- if we don't write cold blob files yet, we could wait till we can write a bulkier cold blob file.

In my mind at least, the idea to have both cold and hot blob references for the same data was something to improve the average behavior of the system by bulking up the cold blob files and not an invariant that we needed to rely on. We already had fine grained KindAndTier in the prototype, and I don't think it complicated anything.

IMO upper level files that aren't tiered wouldn't have this data at all. The cost-benefit is not there - these are normally ~1% of total size so statistics are not very valuable, but they get written much more frequently. At the same time, a single upper-level file can have keys from many different tables and we'd have to look up each table's span config in every compaction.

I really don't have an opinion on this.
Just one thing to note: upper level files were not even split based on TieringSpanID (let alone shard/range boundaries), in the prototype, since that can lead to smaller files than necessary. So one could not estimate how much data would have cold references written purely based on L5 sstable stats since there could be values in there with different tiering-span-ids (and some with no tiering policy). Just maintaining the same behavior across all files leads to less special-casing of logic (which has its own simplicity). The sizes of these histograms don't really matter except for compactions, and I suspect we are a long way away from having 1000s of different tiered SQL tables in a cluster.

Anyway, I am fine with whatever y'all decide.

Thanks, good points. And in any case, it's better to make fewer assumptions here so we have more freedom to play with the higher-level policies. I retract my comment; we should keep both SSTableBlobReferenceHotBytes and SSTableBlobReferenceColdBytes, which count the total value size for hot and cold references respectively (sorry @annrpom :) )

We'll leave the other question of whether upper-level files have tiering information open for now.

Member

@RaduBerinde RaduBerinde left a comment


:lgtm:

@RaduBerinde made 7 comments.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @annrpom, @jbowens, and @sumeerbhola).


sstable/tieredmeta/histogram.go line 24 at r1 (raw file):

Previously, annrpom (annie pompa) wrote…

done

[nit] value


sstable/tieredmeta/histogram.go line 43 at r2 (raw file):

	// digest is the t-digest for non-zero attributes (read-only).
	// May be nil if no non-zero samples have been recorded.
	digest *tdigest.TDigest

Are there any situations where we would want this to be nil? If not, we should make it a non-pointer


sstable/tieredmeta/histogram_block.go line 76 at r2 (raw file):

)

// TieringHistogramBlockWriter is instantiated for each sstable or blob file

[nit] let's just say is instantiated for each output file that requires tiering histograms.


sstable/tieredmeta/histogram_block.go line 111 at r2 (raw file):

}

func NewTieringHistogramBlockWriter() *TieringHistogramBlockWriter {

[nit] MakeTieringHistogramBlockWriter() TieringHistogramBlockWriter (too many pointers)


sstable/tieredmeta/histogram_block.go line 130 at r2 (raw file):

// Flush returns the encoded block contents.
func (w *TieringHistogramBlockWriter) Flush() []byte {

[nit] Finish(). Should mention if the writer can be reused after that.


sstable/tieredmeta/histogram_block.go line 150 at r2 (raw file):

// TieringHistogramBlockContents is the in-memory contents of the tiering
// histogram block. We may temporarily load it into TableMetadata for

[nit] the comment is likely to become stale, not sure if we'll ever load it into TableMetadata. I'd remove the second sentence. We could also just return map[Key]StatsHistogram since it's pretty self-explanatory.

@annrpom annrpom force-pushed the tiering-historgram-block branch from 1cd44c2 to 3cfbbe3 Compare December 19, 2025 18:52
Contributor Author

@annrpom annrpom left a comment


@annrpom made 6 comments.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @jbowens, @RaduBerinde, and @sumeerbhola).


sstable/tieredmeta/histogram.go line 24 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] value

done


sstable/tieredmeta/histogram.go line 34 at r2 (raw file):

Previously, sumeerbhola wrote…

should this say BytesNoAttr instead of ZeroBytes?

yes - ty for the catch; done


sstable/tieredmeta/histogram.go line 56 at r2 (raw file):

Previously, sumeerbhola wrote…

worth saying the quantile is of the non-zero bytes.

done


sstable/tieredmeta/histogram.go line 146 at r2 (raw file):

Previously, sumeerbhola wrote…

is it ok to have digestSize < len(buf) or should that also be an error?

that should be an error; done
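
For reference, a minimal sketch of the strict check agreed on here; the function name and error wording are hypothetical, not the PR's code:

package sketch

import "fmt"

// checkDigestSize rejects both truncated and oversized buffers: the encoded
// digest size must match the remaining buffer exactly.
func checkDigestSize(digestSize int, buf []byte) error {
	if digestSize != len(buf) {
		return fmt.Errorf("digest size %d does not match buffer length %d", digestSize, len(buf))
	}
	return nil
}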


sstable/tieredmeta/histogram_block.go line 90 at r1 (raw file):

Previously, RaduBerinde wrote…

I forgot about the in-place values. Indeed we would need a size that excludes that.

And we also computed the sum of the key columns in all the data blocks in the sstable (say B)

This would be the size of the encoded column as it appears in the block? That does make sense. Though I'm not sure if scaling according to the sum of key sizes is necessarily better than just by key count.

With the key count, we could also estimate the amount of other per-key data in the block (the tiering attribute, the blob reference).

I am suspecting that in many cases, the extra key storage itself might only be a few bytes per key (e.g. last few bytes of an ID), much less than the other per-key data in the sstable.

let me make a separate PR for key count


sstable/tieredmeta/histogram_block.go line 72 at r2 (raw file):

Previously, RaduBerinde wrote…

Thanks, good points. And in any case, it's better to make fewer assumptions here so we have more freedom to play with the higher-level policies. I retract my comment; we should keep both SSTableBlobReferenceHotBytes and SSTableBlobReferenceColdBytes, which count the total value size for hot and cold references respectively (sorry @annrpom :) )

We'll leave the other question of whether upper-level files have tiering information open for now.

np! done

@annrpom annrpom requested a review from sumeerbhola December 19, 2025 18:52

@annrpom annrpom force-pushed the tiering-historgram-block branch from 3cfbbe3 to e04d365 Compare December 19, 2025 19:06
Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

@sumeerbhola reviewed 3 files and made 2 comments.
Reviewable status: all files reviewed (commit messages unreviewed), 13 unresolved discussions (waiting on @annrpom, @jbowens, and @RaduBerinde).


sstable/tieredmeta/histogram_block.go line 72 at r2 (raw file):

Value that has both cold reference and hot (in place in sstable, or reference): Useful for observability

We don't have this category. And there is a code comment now that says:

// SSTableKeyBytes + SSTableInPlaceValueBytes + SSTableBlobReferenceHotBytes +
// SSTableBlobReferenceColdBytes

Won't this double count without having that as a category?

Member

@RaduBerinde RaduBerinde left a comment


@RaduBerinde made 1 comment.
Reviewable status: all files reviewed (commit messages unreviewed), 13 unresolved discussions (waiting on @annrpom, @jbowens, and @sumeerbhola).


sstable/tieredmeta/histogram_block.go line 72 at r2 (raw file):

Previously, sumeerbhola wrote…

Value that has both cold reference and hot (in place in sstable, or reference): Useful for observability

We don't have this category. And there is a code comment now that says:

// SSTableKeyBytes + SSTableInPlaceValueBytes + SSTableBlobReferenceHotBytes +
// SSTableBlobReferenceColdBytes

Won't this double count without having that as a category?

Hm, yes. I'm worried about the overhead of updating too many histograms with every KV.

  1. We could remove SSTableKeyBytes and replace it with just a per-sstable percentage of data bytes that are due to keys that have tiering attribute below the threshold at the time of compaction? I don't see what the full histogram is buying us.

  2. Do we need to distinguish in-place values vs hot blob references? If not, we could have SSTableHotOnlyValueBytes, SSTableColdOnlyValueBytes and SSTableHotAndColdValueBytes where each KV updates only one of the three.
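
A sketch of option 2 using the names proposed above: three mutually exclusive categories, so each KV updates exactly one histogram and per-file totals don't double-count. These are hypothetical declarations, not the PR's code.

package sketch

// ValueBytesKind enumerates three mutually exclusive value-bytes categories;
// each value falls into exactly one, so per-file totals don't double-count.
type ValueBytesKind uint8

const (
	// SSTableHotOnlyValueBytes: the value lives only in the hot tier, either in
	// place in the sstable or behind a hot blob reference.
	SSTableHotOnlyValueBytes ValueBytesKind = iota
	// SSTableColdOnlyValueBytes: the value is referenced only from a cold blob file.
	SSTableColdOnlyValueBytes
	// SSTableHotAndColdValueBytes: the value is duplicated in both tiers.
	SSTableHotAndColdValueBytes
)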
