Skip to content

gcol33/vectra

Repository files navigation

vectra

querying data that won't fit in memory

CRAN status CRAN downloads Monthly downloads R-CMD-check ASAN/UBSAN Codecov test coverage License: MIT

dplyr-style queries on larger-than-RAM data, backed by a pull-based columnar engine written in C11.

Point vectra at a file too big to load and query it with the verbs you already use. Data flows through the engine one row group at a time, so peak memory stays bounded no matter how large the file gets. Arrow needs compiled binaries that match your platform, DuckDB links a bundled library, Spark wants a JVM. vectra is a standard R extension with no external dependencies: it compiles where R compiles.

library(vectra)

# Lazy scan over a multi-GB CSV; nothing runs until collect()
tbl_csv("measurements.csv") |>
  filter(temperature > 30, year >= 2020) |>
  group_by(station) |>
  summarise(avg_temp = mean(temperature), n = n()) |>
  collect()

One engine, several file formats

.vtr (vectra's own columnar format), CSV, SQLite, and GeoTIFF all open into the same lazy query nodes, so the same pipeline runs against any of them:

# GeoTIFF climate raster as tidy data
tbl_tiff("worldclim_bio1.tif") |>
  filter(band1 > 0) |>
  mutate(temp_c = band1 / 10) |>
  collect()

# SQLite without DBI
tbl_sqlite("survey.db", "responses") |>
  filter(year == 2025) |>
  left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
  collect()

Convert to .vtr once for repeated queries, then read it back fast:

write_vtr(big_df, "data.vtr", batch_size = 100000)

tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  group_by(region) |>
  summarise(total = sum(value), n = n()) |>
  collect()

How the engine stays bounded

The optimizer rewrites the plan before any data moves, and execution streams the result rather than materializing it:

  • Streaming execution: one row group flows through at a time, never the whole file
  • Column pruning: columns you never reference are skipped at the scan
  • Predicate pushdown: per-rowgroup min/max statistics skip row groups that cannot match a filter
  • Hash joins: build the right side, stream the left; join a fact table against a lookup without holding both in memory
  • External sort: a memory budget with automatic spill to disk

explain() shows the optimized plan, including which columns and row groups were pruned:

tbl("data.vtr") |>
  filter(x > 0) |>
  select(id, x) |>
  explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#>   FilterNode [streaming]
#>     ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]

Fuzzy matching in the engine

String distance runs inside the C engine, with no round-trip to R per row:

tbl("taxa.vtr") |>
  filter(levenshtein(species, "Quercus robur") <= 2) |>
  mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(similarity)) |>
  collect()

levenshtein(), dl_dist() (Damerau-Levenshtein), and jaro_winkler() are available in filter() and mutate(), alongside nchar(), substr(), grepl(), gsub() and the rest of the string toolkit.

Star schemas instead of flat-table column creep

Register dimension tables once, then pull columns from any of them; joins are built for you, and unmatched keys are reported:

s <- vtr_schema(
  fact    = tbl("observations.vtr"),
  species = link("sp_id", tbl("species.vtr")),
  site    = link("site_id", tbl("sites.vtr"))
)

lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)

Incremental updates

Append new rows as a new row group without rewriting the file, or take a key-based diff between two snapshots:

append_vtr(new_rows_df, "data.vtr")

d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added)   # rows present in new but not old
d$deleted          # key values present in old but not new

Streaming results into a model fit

collect() brings the whole result into memory. collect_chunked() folds a function over a query one batch at a time, holding a single batch plus the accumulator, so a result larger than memory reduces to a small summary in one pass: a running count, per-group sufficient statistics, or the cross-products behind a linear fit.

# Accumulate X'X and X'y for an exact OLS fit, one streaming batch at a time
acc <- tbl("survey.vtr") |>
  select(mpg, wt, hp) |>
  collect_chunked(
    function(acc, chunk) {
      X <- cbind(1, chunk$wt, chunk$hp)
      list(XtX = acc$XtX + crossprod(X),
           Xty = acc$Xty + crossprod(X, chunk$mpg))
    },
    .init = list(XtX = matrix(0, 3, 3), Xty = matrix(0, 3, 1))
  )
solve(acc$XtX, acc$Xty)

For models that re-read the data on every iteration, chunk_feeder() exposes a query as a resettable generator that biglm::bigglm() drives directly, fitting a GLM on data too large to hold in memory:

src <- function() tbl("occurrences.vtr") |> select(presence, bio1, bio12)
biglm::bigglm(presence ~ bio1 + bio12, data = chunk_feeder(src),
              family = binomial())

offload() spills a prepared query to disk once and streams it back, so each reweighted pass reads the prepared columns from a file rather than rebuilding the pipeline. With a by key it splits the query into per-key shards on disk, each small enough to pull into memory on its own. group_map() then runs your function on each shard, passing the shard as an ordinary data.frame along with its group key, and returns the results keyed by shard. That function is arbitrary R: a model fit, a clustering call, a custom summary, or any computation that needs the whole group in memory at once. group_modify() recombines per-shard data.frames into one table.

# Prepare once, then let bigglm re-read the spill on every pass
s <- offload(tbl("occurrences.vtr") |> select(presence, bio1, bio12))
biglm::bigglm(presence ~ bio1 + bio12, data = chunk_feeder(s),
              family = binomial())

# Per-region: group_map runs any function on each in-memory shard (here a GLM)
p <- offload(tbl("occurrences.vtr"), by = "region")
fits <- group_map(p, function(d, region)
  glm(presence ~ bio1 + bio12, data = d, family = binomial()))

Verbs

vectra covers the dplyr surface most analysis pipelines use: filter(), select(), mutate(), group_by() / summarise() with the common aggregations, all the joins (left_join() through cross_join(), plus fuzzy_join()), arrange() and the slice_*() family, and window functions (row_number(), rank(), lag(), lead(), cumsum(), ntile(), and more). Date/time and string functions evaluate in the engine. select(), rename(), relocate(), and across() accept the full tidyselect helper set (starts_with(), where(), all_of(), and the rest).

The Function Reference lists every verb and expression with examples.

Installation

install.packages("vectra")            # CRAN

install.packages("pak")               # development version
pak::pak("gcol33/vectra")

Documentation

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

I'm a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.

Buy Me A Coffee

License

MIT (see the LICENSE.md file)

Citation

@software{vectra,
  author = {Colling, Gilles},
  title  = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
  year   = {2026},
  url    = {https://github.com/gcol33/vectra}
}

About

Columnar query engine for larger-than-RAM data in R

Topics

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages