feat: TruncateFront & TruncateBack support truncating all entries #31

RangerCD · 2025-02-17T07:21:16Z

Currently, it seems like there is no way to truncate entire log, which causes the duplicate transaction issue (#20).

This PR provides the ability of truncating all entries, by extending the index range 1 step further than the last(TruncateFront) or 1 step before the first(TruncateBack) entry.

For example, if the log has 3 entries[10, 11, 12]:

TruncateFront(13) will truncate all entries, and left firstIndex as 13 and lastIndex as 12.
TruncateBack(9) will truncate all entries, and left firstIndex as 10 and lastIndex as 9.

Why this feature is needed

While I was working on a project that uses this package as WAL implementation, I've met the situation that described in #20. I can't simply mark all jobs as done if the program exits unexpectedly, and next time it starts, it will duplicate the last job. So I need a way to remove this last job from wal once it's done.

With this PR, I can call TruncateFront(index + 1) when I've popped indexth job from wal, and let the program wait for incoming index + 1th job.

What's the difference between #29 and this PR

Thanks to @DStrand1 for providing a solution. However, #29 can't fulfill my requirements, by providing a Clear function.

Complexity of usage

In my use case, I'm running a long-lived goroutine as a consumer(Read), and a gin server as producers(Write).

With Clear from #29, after the consumer finishes all jobs, it will first detect whether wal is empty (by either calling Read(index + 1) with ErrNotFound returned, or checking a seperate counter), and then call Clear to remove all entries. The problem is that, these two steps are not atomic, so I have to add another lock to protect them which is complicated.

With this PR, both TruncateFront and TruncateBack hold internal lock, so they are atomic. And caller don't need to care about how many jobs left in wal, just call Read and TruncateFront in a loop until Read returns ErrNotFound, then wait for incoming jobs.

Corruption-safty

Clear function in #29 isn't corruption-safe, like #29 (comment) pointed out.

This PR modifies TruncateFront and TruncateBack, following the original corruption-safe mechanism.

Index continuity

#29 clears all entries, also resets firstIndex and lastIndex to 1 and 0. Both Write and Read have to reset its own index counter after Clear.

This PR keeps firstIndex and lastIndex continuous, Write and Read can keep using previous index counter.

tidwall · 2025-03-16T14:22:30Z

This looks good on the surface.
We'll need some tests to ensure correctness.

RangerCD · 2025-03-16T14:38:38Z

This looks good on the surface. We'll need some tests to ensure correctness.

I'll add some tests, might also adjust current test cases if needed.

RangerCD · 2025-03-18T02:25:15Z

While I was writing tests, a question raised that needs your help to answer.

With this change, the WAL may be in a state without any entries during use. As a result, the behavior of FirstIndex and LastIndex may need further clarification or modification.

In the following two comments, these two functions are expected to return 0 when the log is empty.

wal/wal.go

Lines 484 to 486 in 56126d5

    
           // FirstIndex returns the index of the first entry in the log. Returns zero 
        
           // when log has no entries. 
        
           func (l *Log) FirstIndex() (index uint64, err error) {

wal/wal.go

Lines 502 to 504 in 56126d5

    
           // LastIndex returns the index of the last entry in the log. Returns zero when 
        
           // log has no entries. 
        
           func (l *Log) LastIndex() (index uint64, err error) {

However, if the WAL is in a continuous read/write state, returning 0 might cause some confusion for users.

Let's temporarily ignore the special handling of empty entries in these two functions and consider the following usage scenario:

There are two Go routines—one continuously appends entries to the WAL, while the other continuously retrieves entries from the WAL. The code is as follows:

func main() {
	notify := make(chan struct{}, 1)
	l, _ := wal.Open("wal", nil)

	go func() {
		for {
			time.Sleep(1 * time.Second)
			idx, _ := l.LastIndex()
			idx++
			l.Write(idx, []byte(strconv.FormatUint(idx, 10)+"hello world"))
			select {
			case notify <- struct{}{}:
			default:
			}
		}
	}()

	go func() {
		for {
			select {
			case <-notify:
			}
			for {
				idx, _ := l.FirstIndex()
				data, err := l.Read(idx)
				if errors.Is(err, wal.ErrNotFound) {
					break
				}
				fmt.Println(string(data))
				l.TruncateFront(idx + 1)
			}
		}
	}()

	select {}
}

Relatively speaking, this is a straightforward implementation of the producer-consumer pattern, where each Go routine only needs to focus on one end of the queue.

In this case, FirstIndex can be understood as the "Next index to read," while LastIndex represents the "Last index that has been written."

Now, returning to the special handling of empty entries in these functions, the above code is actually just one step away from running perfectly:

Since FirstIndex will initially return 0, we need to replace idx, _ := l.FirstIndex() with:

idx, _ := l.FirstIndex()
if idx == 0 {
	idx = 1
}

This isn't particularly complicated, but it may confuse users until they carefully read the source code of FirstIndex to figure out why they can't read the first entry, it's because FirstIndex doesn't only just return the first index to read. And that's what I had experienced a month ago.

Currently, the implementation of these two functions only includes special handling for one specific case: when l.lastIndex == 0, FirstIndex returns 0 instead of the actual l.firstIndex.

For LastIndex, the if branch does not actually change its behavior, because when l.lastIndex == 0, returning 0 and returning l.lastIndex are effectively the same.

Disregarding historical reasons, a feasible implementation would be to return an additional boolean value to indicate whether the WAL is empty.

So the key question that needs to be clarified is:

whether to keep the current behavior and write test cases around it
or introduce a breaking change, either by removing the special handling or by returning an additional value.

tidwall · 2025-08-16T16:43:21Z

Sorry for the slow response.

I think the best route is to always retain a non-zero index after the truncate operation.

For TruncateFront, that ends up clearing the log, the LastIndex should end up being the index provided to the function, and the FirstIndex should be one past the last. This ensures that the next entry to be inserted must always be LastIndex+1, continuing a forward only behavior.

For TruncateBack, that ends up clearing the log, the FirstIndex should end up being the index provided to the function, and the LastIndex should be one before the first. This too allows the next inserted entry to be LastIndex+1.

The problem with this solution is that clearing the log leaves the first/last indexes in a strange overlapping state. I think this could be considerer a breaking change, at least for those users who expect that l.FirstIndex() to return a zero or index that points to valid entry. As such, this change may require an option such as AllowClear, which defaults to false.

RangerCD · 2025-08-16T17:54:48Z

Glad to hear your reply! I will add the feature option and test case on Monday.

And a small mistake here:

For TruncateBack, that ends up clearing the log, the FirstIndex should end up being the index provided to the function, and the LastIndex should be one before the first. This too allows the next inserted entry to be LastIndex+1.

The FirstIndex in this case is actually index+1, because

wal/wal.go

Lines 806 to 808 in 56126d5

    
           // TruncateBack truncates the back of the log by removing all entries that 
        
           // are after the provided `index`. In other words the entry at `index` 
        
           // becomes the last entry in the log.

If provided index equals to FirstIndex, then it will keep the first entry instead of no entry.
Anyway, it doesn't change the conclusion.

tidwall · 2025-08-16T20:41:17Z

Sounds good. Thanks!

RangerCD · 2025-08-19T07:52:37Z

Added feature option AllowEmpty with default value false, and extra tests for this option.

I didn't use AllowClear as suggested, because there is already a ClearCache function. AllowClear might be misleading but actually it's unrelated to this function.

tidwall · 2025-08-20T14:30:24Z

I think the name AllowEmpty make more sense.

tidwall · 2025-08-20T16:39:12Z

After further thought, I'm not actually sure the AllowEmpty option is needed. Initially I suggested it out of an abundance of caution but I really can't see a scenario where it'll affect the current users. I said

I think this could be considerer a breaking change, at least for those users who expect that l.FirstIndex() to return a zero or index that points to valid entry.

But in reality, the only users that would expect the first index to be zero are those who just created a brand new log.
And those who expect it to point to a valid entry are the ones who haven't used the new empty feature yet.

RangerCD · 2025-08-21T04:08:06Z

That makes sense, should I remove this option or set true as default?

I've also investigated WAL usage in https://github.com/open-telemetry/opentelemetry-collector-contrib and https://github.com/influxdata/telegraf . Both of them can handle AllowEmpty set to true without modification, tests all passed. And I believe the usage in these 2 repos can be simplified with this PR.

For https://github.com/open-telemetry/opentelemetry-collector-contrib, it uses its own index counter and prevents overlapping with a max:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/71fc94173de0ee2cf1bd9d38d4c16ea20722a362/exporter/prometheusremotewriteexporter/wal.go#L291-L293

// In normal state, wIndex and rIndex will differ by one. To avoid having -1 as a final value, we set it to 0 as minimum.
lag := max(0, int64(prweWAL.wWALIndex.Load()-prweWAL.rWALIndex.Load()))
prweWAL.telemetry.recordWALLag(ctx, lag)

For https://github.com/influxdata/telegraf, empty file is handled by itself, and expected to be simplified once WAL supports empty state(https://github.com/influxdata/telegraf/pull/17240/files#r2171642847):
https://github.com/influxdata/telegraf/blob/0631dbaed83e2ae84cf1abc79b84d44108f89a62/models/buffer_disk.go#L296-L310

// This is very messy and not ideal, but serves as the only way I can find currently
// to actually treat the walfile as empty if needed, since Truncate() calls require
// that at least one entry remains in them otherwise they return an error.
// Related issue: https://github.com/tidwall/wal/issues/20
func (b *DiskBuffer) handleEmptyFile() {
	if !b.isEmpty {
		return
	}
	if err := b.file.TruncateFront(b.readIndex() + 1); err != nil {
		log.Printf("E! readIndex: %d, buffer len: %d", b.readIndex(), b.length())
		panic(err)
	}
	b.mask = b.mask[1:]
	b.isEmpty = false
}

tidwall · 2025-08-21T05:30:53Z

Thanks for doing the additional research. I think we can remove the option altogether.

RangerCD · 2025-08-21T08:18:09Z

Option removed, along with a couple of redundant if conditions.
Truncating to empty state when the log is in initial empty state doesn't return ErrOutOfRange, but does nothing.

I think these changes complete the feature PR, please remind me if anything is missing.

I suggest to make a release statement for next version tag, since it should still be considered as a breaking change.

TruncateFront & TruncateBack now support truncating all entries.
FirstIndex doesn't handle initial empty log specially anymore, it now only returns the first index ("next to read" in another words).

tidwall · 2025-08-21T15:34:38Z

I'm checking on my side, and I want to see if we can avoid all breaking changes before merging.

tidwall · 2025-08-23T10:06:16Z

Overall the PR works. But I was wrong about removing the AllowEmpty option. Without the option this change does break some existing projects that use this package. I tested with a few additional external projects that import this package and ran into some failed tests, including my own https://github.com/tidwall/raft-wal project. 🤦‍♂️

So I added the option back. I also added an IsEmpty function along with some new tests and refactoring.

I'll merge this PR and then follow it up with my changes.

This commit includes the AllowEmpty option which allows for deleting all entries from the log using the TruncateFront and TruncateBack methods. Using AllowEmpty changes the behavior of the log as it relates to FirstIndex and LastIndex in the following ways: - An empty log will now always have the FirstIndex be equal to LastIndex+1. Witout AllowEmpty, FirstIndex can never be greater than LastIndex. - For a newly created log that has no entries, FirstIndex() and LastIndex() return 1 and 0, respectively. Without AllowEmpty, both return 0. Also in this commit: - IsEmpty method, which returns true if the log has no entries - More unit tests - Various code cleanup See #31

tidwall · 2025-08-23T10:21:57Z

Ok all pushed and a new release was created. 😅
Thanks a ton!

RangerCD · 2025-08-23T10:51:28Z

Thank you for your great package!

feat: TruncateFront & TruncateBack support truncating all entries

6e623cd

imih mentioned this pull request Apr 25, 2025

feat: recover corrupted tail automatically #22

Open

This was referenced Jun 26, 2025

fix(disk-buffer): Remove empty buffer on close influxdata/telegraf#17247

Merged

fix(disk-buffer): Correctly reset the mask after adding to an empty buffer influxdata/telegraf#17240

Merged

RangerCD added 2 commits August 18, 2025 18:07

feat: add AllowEmpty option

2d36aa8

test: add test cases for AllowEmpty

ebb9798

feat: remove AllowEmpty option

a105ffa

tidwall merged commit 850d828 into tidwall:master Aug 23, 2025
1 check passed

feat: TruncateFront & TruncateBack support truncating all entries #31

feat: TruncateFront & TruncateBack support truncating all entries #31

Uh oh!

Conversation

RangerCD commented Feb 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why this feature is needed

What's the difference between #29 and this PR

Complexity of usage

Corruption-safty

Index continuity

Uh oh!

tidwall commented Mar 16, 2025

Uh oh!

RangerCD commented Mar 16, 2025

Uh oh!

RangerCD commented Mar 18, 2025

Uh oh!

tidwall commented Aug 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

RangerCD commented Aug 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tidwall commented Aug 16, 2025

Uh oh!

RangerCD commented Aug 19, 2025

Uh oh!

tidwall commented Aug 20, 2025

Uh oh!

tidwall commented Aug 20, 2025

Uh oh!

RangerCD commented Aug 21, 2025

Uh oh!

tidwall commented Aug 21, 2025

Uh oh!

RangerCD commented Aug 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tidwall commented Aug 21, 2025

Uh oh!

tidwall commented Aug 23, 2025

Uh oh!

Uh oh!

tidwall commented Aug 23, 2025

Uh oh!

RangerCD commented Aug 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

RangerCD commented Feb 17, 2025 •

edited

Loading

tidwall commented Aug 16, 2025 •

edited

Loading

RangerCD commented Aug 16, 2025 •

edited

Loading

RangerCD commented Aug 21, 2025 •

edited

Loading