USCbiostats/tabixr


tabixr

Fast region queries against BGZF-compressed VCF files using the Tabix index. The C++ layer is built with Rcpp and links against Rhtslib. All file handles and index structures are managed with RAII so memory is released correctly on every call path, including repeated queries in a loop.

Requirements

  • R ≥ 4.0
  • Rhtslib (Bioconductor)
  • Rcpp (CRAN)
  • Rtools (Windows) or standard C++ toolchain (Linux/macOS)

Installation

# Install dependencies first if needed
install.packages("BiocManager")
BiocManager::install("Rhtslib")
install.packages("Rcpp")

# Install tabixr from the source tarball
install.packages("tabixr_0.2.0.tar.gz", repos = NULL, type = "source")

Usage

library(tabixr)

vcf <- "path/to/your/file.vcf.gz"  # .tbi index must exist alongside it

# Chromosome names in the index
vcf_seqnames(vcf)

# Sample names from the #CHROM header line
vcf_samples(vcf)

# All header lines (## metadata + #CHROM column header)
vcf_header(vcf)

# Query a genomic region — returns a data.frame
df <- query_vcf(vcf, chrom = "chr21", start = 9411245, end = 9412000)

# Query a specific set of positions — opens the file and index only once
positions <- c(9411245L, 9411354L, 9411690L)
df <- query_vcf_positions(vcf, chrom = "chr21", positions = positions)

Use query_vcf_positions when looking up a pre-defined list of positions (e.g. a GWAS hit list). It is significantly faster than calling query_vcf once per position because the VCF file and Tabix index are opened only once for the entire vector.
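A hit list often arrives as a table spanning several chromosomes, while query_vcf_positions takes a single chrom per call. The sketch below shows one way to prepare the input. It assumes a hypothetical data.frame named hits with columns chrom and pos (neither is part of tabixr). Sorting and de-duplicating are not documented requirements of the package; they only keep the request tidy.

```r
# Hypothetical GWAS hit list; the column names `chrom` and `pos` are assumptions.
hits <- data.frame(
  chrom = c("chr21", "chr21", "chr22", "chr21"),
  pos   = c(9411690, 9411245, 16050075, 9411245),
  stringsAsFactors = FALSE
)

# One sorted, de-duplicated integer vector of positions per chromosome.
positions_by_chrom <- lapply(
  split(hits$pos, hits$chrom),
  function(p) sort(unique(as.integer(p)))
)

vcf <- "path/to/your/file.vcf.gz"

# Run the queries only when tabixr and the file are actually available.
if (requireNamespace("tabixr", quietly = TRUE) && file.exists(vcf)) {
  results <- lapply(names(positions_by_chrom), function(chr) {
    tabixr::query_vcf_positions(vcf, chrom = chr,
                                positions = positions_by_chrom[[chr]])
  })
  df <- do.call(rbind, results)  # one combined data.frame across chromosomes
}
```

Each chromosome still costs one call, but within a chromosome the file and index are opened once for the whole vector, as described above.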

Return value

Both functions return a data.frame with column names taken from the VCF #CHROM header line:

Column          Type       Notes
CHROM           character
POS             integer
ID              character
REF             character
ALT             character
QUAL            character
FILTER          character
INFO            character  Raw KEY=VALUE;... string
FORMAT          character  e.g. GT:GP:DS
sample columns  character  One column per sample, named from the header

Both endpoints of a region query are inclusive. Multi-allelic sites contribute one row per ALT allele. A query that matches no records returns a zero-row data.frame with the correct column names and types.
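Because QUAL and INFO come back as character, some post-processing is usually needed before analysis. The following is a minimal base-R sketch, not part of tabixr: parse_info is a hypothetical helper, and the mock data.frame only imitates the documented columns with made-up values.

```r
# Mock result with some of the documented columns; values are illustrative only.
df <- data.frame(
  CHROM = "chr21",
  POS   = c(9411245L, 9411354L),
  QUAL  = c("100", "."),
  INFO  = c("AC=2;AF=0.5;DB", "AC=1;AF=0.25"),
  stringsAsFactors = FALSE
)

# QUAL is character; "." is the VCF missing-value marker, so map it to NA first.
qual <- df$QUAL
qual[qual == "."] <- NA_character_
df$QUAL_num <- as.numeric(qual)

# Split one INFO string into a named character vector.
# Flags without "=" (such as DB) get the value "TRUE".
parse_info <- function(info) {
  fields <- strsplit(info, ";", fixed = TRUE)[[1]]
  keys   <- sub("=.*$", "", fields)
  values <- ifelse(grepl("=", fields, fixed = TRUE),
                   sub("^[^=]*=", "", fields),
                   "TRUE")
  setNames(values, keys)
}

info_list <- lapply(df$INFO, parse_info)

# Pull one key into a numeric column; this assumes every row carries AF.
df$AF <- vapply(info_list, function(x) as.numeric(x[["AF"]]), numeric(1))
```

Keeping INFO as a raw string in the returned data.frame and parsing only the keys you need avoids building wide, mostly empty columns for large headers.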

Input file requirements

  • The VCF must be compressed with bgzip (not plain gzip).
  • A Tabix index (.tbi) must exist in the same directory, named after the VCF file with .tbi appended, e.g. chr21.vcf.gz → chr21.vcf.gz.tbi.
  • To create these from an uncompressed VCF using htslib tools:
    bgzip file.vcf
    tabix -p vcf file.vcf.gz
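
The requirements above can be checked from R before querying. This is a hedged sketch in base R, not a tabixr function: is_bgzf and check_vcf_input are hypothetical helpers. The BGZF test looks at the first block header (gzip magic bytes, FLG = 4, and a "BC" extra subfield at byte offset 12) and assumes "BC" is the first extra subfield, which is how bgzip writes it.

```r
# Does the file start with a BGZF block header?
# Layout: 1f 8b (gzip magic), 08 (deflate), 04 (FEXTRA flag set),
# then the extra subfield tag "BC" (0x42 0x43) at bytes 13-14 (1-based).
is_bgzf <- function(path) {
  hdr <- readBin(path, what = "raw", n = 14L)  # reads raw bytes, no decompression
  length(hdr) == 14L &&
    identical(hdr[1:4], as.raw(c(0x1f, 0x8b, 0x08, 0x04))) &&
    identical(hdr[13:14], as.raw(c(0x42, 0x43)))
}

# Hypothetical pre-flight check; returns a character vector of problems
# (length zero when the input looks usable).
check_vcf_input <- function(vcf) {
  if (!file.exists(vcf)) {
    return("VCF file not found")
  }
  problems <- character(0)
  if (!is_bgzf(vcf)) {
    problems <- c(problems, "file is not BGZF-compressed (use bgzip, not gzip)")
  }
  if (!file.exists(paste0(vcf, ".tbi"))) {
    problems <- c(problems, "Tabix index (.tbi) not found next to the VCF")
  }
  problems
}
```

For example, check_vcf_input("path/to/your/file.vcf.gz") returns character(0) when the file passes both checks.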
