High-performance, allocation-free text scanning and Arrow ingestion engine for Go.
Carve is a streaming parsing system built around a zero-allocation scanner and a columnar writer that emits Apache Arrow RecordBatches. It replaces regex-heavy parsing pipelines with deterministic byte-level scanning.
Carve is not a regex library.
It is a streaming ingestion engine:
text → Scanner VM → column buffers → Arrow RecordBatch
Designed for:
- Log ingestion
- High-throughput ETL pipelines
- Columnar transformation at ingest time
- ~7–14 ns per line parsing
- 0 allocations in hot path
- deterministic byte-level scanning
- Precomputed scan plan
- delimiter-driven field extraction
- branch-light execution path
- Direct `RecordBatch` emission
- column builders per field
- batch-oriented memory control
- Legacy regex parsing still supported
- schema extraction via named capture groups
- gradual migration path
| Metric | Regex | Scanner |
|---|---|---|
| Parse speed | ~2.5M lines/sec | ~140M lines/sec |
| Memory | High | Zero in hot path |
| Allocations | Thousands | Zero |
| Pipeline | Throughput |
|---|---|
| Regex + Arrow | ~2K lines/sec |
| Scanner + Arrow | ~20K+ lines/sec |
A deterministic byte-state machine:
`Scan(line []byte, out [][]byte) bool`
- uses a precomputed delimiter plan
- slices input without allocations (optional zero-copy mode)
- writes directly into scratch buffers
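
The scanning approach described above can be illustrated with a small, self-contained sketch. This is not Carve's implementation: the `plan` type and its `scan` method are hypothetical names, and the plan is reduced to one delimiter byte per field. It only shows how a precomputed delimiter list lets a scanner slice the input into a caller-provided `out` buffer without copying.

```go
package main

import (
	"bytes"
	"fmt"
)

// plan is a hypothetical precomputed scan plan: the delimiter byte that
// ends each field except the last, which takes the rest of the line.
type plan struct {
	delims []byte
}

// scan splits line into len(p.delims)+1 fields and stores them in out.
// The stored slices alias line, so no field data is copied or allocated.
// It returns false if out is too small or a delimiter is missing.
func (p plan) scan(line []byte, out [][]byte) bool {
	if len(out) < len(p.delims)+1 {
		return false
	}
	rest := line
	for i, d := range p.delims {
		j := bytes.IndexByte(rest, d)
		if j < 0 {
			return false
		}
		out[i] = rest[:j]
		rest = rest[j+1:]
	}
	out[len(p.delims)] = rest
	return true
}

func main() {
	p := plan{delims: []byte{' ', ' '}} // ts, level, then msg as the remainder
	out := make([][]byte, 3)            // allocated once, reused for every line
	line := []byte("2024-01-01T00:00:00Z INFO service started")
	if p.scan(line, out) {
		fmt.Printf("ts=%s level=%s msg=%s\n", out[0], out[1], out[2])
	}
}
```

Because `out` is supplied by the caller and reused, the per-line work is a few `bytes.IndexByte` calls and slice header writes.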
Columnar batch builder:
`WriteLinesSIMD(lines [][]byte, scanner *Scanner)`
- accumulates column data
- emits Arrow RecordBatch on flush
- controls batch size and memory lifecycle
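
To make the column accumulation idea concrete, here is a minimal sketch that uses only the standard library rather than the Arrow Go API. The `column` and `batchWriter` types are hypothetical and are not Carve's writer. The layout (one contiguous data buffer plus an offsets buffer) mirrors how Arrow stores variable-length strings, and `reset` shows one way a writer could keep memory bounded between batches.

```go
package main

import (
	"errors"
	"fmt"
)

// column stores variable-length values in one contiguous data buffer plus
// an offsets buffer with one more entry than there are values.
type column struct {
	data    []byte
	offsets []int32
}

func newColumn() *column { return &column{offsets: []int32{0}} }

// add copies v into the column's data buffer and records its end offset.
func (c *column) add(v []byte) {
	c.data = append(c.data, v...)
	c.offsets = append(c.offsets, int32(len(c.data)))
}

func (c *column) rows() int { return len(c.offsets) - 1 }

func (c *column) value(i int) []byte { return c.data[c.offsets[i]:c.offsets[i+1]] }

// reset empties the column but keeps its capacity for the next batch.
func (c *column) reset() {
	c.data = c.data[:0]
	c.offsets = c.offsets[:1]
}

// batchWriter accumulates rows into one column per field.
type batchWriter struct {
	cols      []*column
	batchSize int
}

func newBatchWriter(numCols, batchSize int) *batchWriter {
	w := &batchWriter{batchSize: batchSize}
	for i := 0; i < numCols; i++ {
		w.cols = append(w.cols, newColumn())
	}
	return w
}

// writeRow appends one value per column and reports whether the batch is full.
func (w *batchWriter) writeRow(fields [][]byte) (bool, error) {
	if len(fields) != len(w.cols) {
		return false, errors.New("field count does not match column count")
	}
	for i, f := range fields {
		w.cols[i].add(f)
	}
	return w.cols[0].rows() >= w.batchSize, nil
}

func main() {
	w := newBatchWriter(2, 2)
	w.writeRow([][]byte{[]byte("INFO"), []byte("started")})
	full, _ := w.writeRow([][]byte{[]byte("WARN"), []byte("slow request")})
	if full {
		// A real writer would hand the buffers to Arrow here before resetting.
		fmt.Printf("flush %d rows, row 1 msg=%s\n", w.cols[1].rows(), w.cols[1].value(1))
		for _, c := range w.cols {
			c.reset()
		}
	}
}
```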
Derived from schema intent:
- maps fields → delimiter positions
- drives scan VM execution
- avoids runtime regex matching
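
As an illustration of how a plan might be derived from a pattern, the sketch below walks the regex AST with the standard `regexp/syntax` package and pairs each named capture group with the literal text that follows it. The `field` type and `derivePlan` function are hypothetical, and the sketch handles only the simplest shape (named groups separated by literals, no anchors or alternation). Carve's actual derivation is not shown in this document and may differ.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// field pairs a named capture group with the literal delimiter after it.
type field struct {
	name  string
	delim string // empty for the last field
}

// derivePlan accepts patterns made of named groups separated by literal
// text and returns one field per group. Anything else is rejected.
func derivePlan(pattern string) ([]field, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, err
	}
	if re.Op != syntax.OpConcat {
		return nil, fmt.Errorf("unsupported pattern shape: %q", pattern)
	}
	var plan []field
	for _, sub := range re.Sub {
		switch sub.Op {
		case syntax.OpCapture:
			if sub.Name == "" {
				return nil, fmt.Errorf("unnamed group in %q", pattern)
			}
			plan = append(plan, field{name: sub.Name})
		case syntax.OpLiteral:
			if len(plan) == 0 {
				return nil, fmt.Errorf("pattern must start with a named group: %q", pattern)
			}
			// The literal after a group becomes that field's delimiter.
			plan[len(plan)-1].delim += string(sub.Rune)
		default:
			return nil, fmt.Errorf("unsupported element in %q", pattern)
		}
	}
	return plan, nil
}

func main() {
	plan, err := derivePlan(`(?P<ts>[^ ]+) (?P<level>[^ ]+) (?P<msg>.*)`)
	if err != nil {
		panic(err)
	}
	for _, f := range plan {
		fmt.Printf("field=%s delim=%q\n", f.name, f.delim)
	}
}
```

Rejecting unsupported shapes, rather than guessing at them, leaves such patterns free to fall back to ordinary regex matching.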

    scanner, err := carve.New(`(?P<ts>[^ ]+) (?P<level>[^ ]+) (?P<msg>.*)`)
    if err != nil {
        panic(err)
    }

    writer := carve.NewWriter(schema, nil, 8192)
    batch, err := writer.WriteLinesSIMD(lines, scanner)
    if err != nil {
        panic(err)
    }
    if batch != nil {
        defer batch.Release()
    }

Carve is built on three principles:
- Regex is expressive but unpredictable. Carve is strict and fast.
- Zero allocations in the hot path ensure stable runtime behavior under load.
- Data is shaped into Arrow as early as possible.
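
One way to check a zero-allocation claim for a scanning function is `testing.AllocsPerRun` from the standard library. The sketch below applies it to a hypothetical `splitSpace` helper (not part of Carve) that slices a line into a reusable output buffer. It is meant only to show the measurement technique.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// splitSpace writes up to len(out) space-separated fields of line into out
// and returns how many it wrote. The last slot receives the remainder.
// The stored slices alias line, so nothing is copied.
func splitSpace(line []byte, out [][]byte) int {
	if len(out) == 0 {
		return 0
	}
	n := 0
	for n < len(out)-1 {
		i := bytes.IndexByte(line, ' ')
		if i < 0 {
			break
		}
		out[n] = line[:i]
		line = line[i+1:]
		n++
	}
	out[n] = line
	return n + 1
}

// allocsPerScan reports the average number of heap allocations per call.
// The input and output buffers are created once, outside the measured code.
func allocsPerScan() float64 {
	line := []byte("2024-01-01T00:00:00Z INFO service started")
	out := make([][]byte, 3)
	return testing.AllocsPerRun(1000, func() {
		splitSpace(line, out)
	})
}

func main() {
	fmt.Println("allocations per scan:", allocsPerScan())
}
```

The same check can live in a regular test so that a change which introduces an allocation fails the build.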
- Requires delimiter-friendly or structured patterns
- ScanPlan derived from regex AST is heuristic-based
- Zero-copy mode requires careful lifetime management
- Arrow builder remains allocation boundary
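
The lifetime caveat for zero-copy mode comes from slice aliasing in Go: a field returned without copying points into the input buffer, so it changes when that buffer is reused for the next line. The sketch below demonstrates the hazard with hypothetical helpers (`firstField`, `retain`). It does not use Carve's API.

```go
package main

import "fmt"

// firstField returns the bytes before the first space.
// The result aliases line; no data is copied.
func firstField(line []byte) []byte {
	for i, b := range line {
		if b == ' ' {
			return line[:i]
		}
	}
	return line
}

// retain copies v so the result stays valid after the source buffer is reused.
func retain(v []byte) []byte {
	return append([]byte(nil), v...)
}

func main() {
	buf := []byte("INFO started")
	view := firstField(buf) // zero-copy view into buf
	kept := retain(view)    // owned copy

	// Simulate a reader that reuses its buffer for the next line.
	copy(buf, "WARN stopped")

	fmt.Printf("view=%s kept=%s\n", view, kept) // prints "view=WARN kept=INFO"
}
```

Any field that must outlive the current line or batch needs a copy like `retain`, or the buffer must not be reused until the consumer is finished with it.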
Use Carve when:
- ingesting large logs or event streams
- regex performance becomes a bottleneck
- Arrow-native pipelines are required
- predictable latency matters
Avoid when:
- patterns are highly irregular or deeply nested
- one-off parsing tasks
Carve v0.4.0 is a performance-stable ingestion engine core.
The scanner layer is effectively production-grade. Future work focuses on:
- scan plan formalization
- SIMD multi-field extraction
- Arrow writer optimization
- streaming batch pipelines
MIT
Built for speed. Designed for structure. Optimized for streams.