will-rowe/hulk

Histosketching Using Little Kmers


HULK is still under development - features and improvements are being added, so please check back soon.

Overview

HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis. HULK generates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.
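As a rough illustration of what "generating a k-mer spectrum from a FASTQ data stream" involves, the sketch below reads FASTQ records one at a time and tallies k-mers with an exact counter. It is not HULK's implementation: the function names, the default k, and the choice to skip k-mers containing ambiguous bases are all assumptions made for this example.

```python
import io
from collections import Counter


def fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record in a stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record holds the bases
            yield line.strip().upper()


def kmers(seq, k):
    """Yield every k-mer in seq that contains only A, C, G or T."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: drop ambiguous k-mers
            yield kmer


def kmer_counts(handle, k=7):
    """Count k-mers record by record, never holding the whole file."""
    counts = Counter()
    for seq in fastq_sequences(handle):
        counts.update(kmers(seq, k))
    return counts


# Two toy reads standing in for a FASTQ stream (e.g. sys.stdin).
example = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTNACG\n+\nIIIIIIII\n"
)
counts = kmer_counts(example, k=4)
```

An exact counter grows with the number of distinct k-mers, which is the cost a fixed-size sketch is meant to avoid.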

It works by using count-min sketching to create a k-mer spectrum from a data stream. After some reads have been added to a k-mer spectrum, HULK begins to process the counter frequencies and populates a histosketch. Similarly to MinHash sketches, histosketches can be used to estimate similarity between microbiome samples.
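To make the count-min step concrete, here is a minimal count-min sketch in Python. It is a generic textbook version, not HULK's code: the class name, the default width and depth, and the use of salted BLAKE2b as the hash family are assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters that approximates item frequencies."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        """Yield one (row, column) pair per row, using a per-row salt."""
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """Increment one counter in every row for this item."""
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        """Return the minimum counter, which never undercounts the item."""
        return min(self.table[row][col] for row, col in self._columns(item))


cms = CountMinSketch(width=64, depth=3)
for kmer in ["ACGT", "ACGT", "CGTA"]:
    cms.add(kmer)
```

Memory use is fixed at width × depth counters however many k-mers are streamed in. Hash collisions can only inflate an estimate, so taking the minimum across rows gives the tightest available upper bound.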

The advantages of HULK include:

  • it's fast and can run on a laptop in minutes
  • hulk sketches are compact and of a fixed size
  • it works on data streams and does not require complete data instances
  • it can use concept drift for histosketching
  • you get to type hulk smash into the command line...

Finally, you can use hulk sketches with a Machine Learning classifier to bin microbiome samples (see BANNER). More info on this coming soon...

Installation

Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.

Source

HULK is written in Go (v1.9). To compile from source you will first need the Go toolchain. Once you have it, try something like this to compile:

# Clone this repository
git clone https://github.com/will-rowe/hulk.git

# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...

# Run the unit tests
go test -v ./...

# Compile the program
go build ./

# Call the program
./hulk --help

Quick Start

HULK is called by typing hulk, followed by the subcommand you wish to run. There are three main subcommands: sketch, distance and smash. This quick start will show you how to get things running but it is recommended to follow the documentation.

# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -p 8 -s 256 -o microbiome.sketch

# Get similarity measures between two hulk sketches
hulk distance -1 a.sketch -2 b.sketch

# Get a pairwise Jaccard Similarity matrix for a set of hulk sketches
hulk smash -d ./dir-with-sketches-in -o smash-matrix.csv
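For intuition about what a pairwise similarity matrix over fixed-size sketches looks like, the sketch below compares toy sketches position by position, the way a Jaccard similarity is commonly estimated from MinHash-style signatures. It does not read hulk sketch files or reproduce the exact metric hulk computes: the in-memory list format, the function names, and the CSV layout are all hypothetical.

```python
import csv
import io


def sketch_similarity(a, b):
    """Estimate similarity as the fraction of matching sketch positions."""
    if len(a) != len(b) or not a:
        raise ValueError("sketches must be non-empty and the same size")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def pairwise_matrix(sketches):
    """Return sorted sample names and their all-vs-all similarity matrix."""
    names = sorted(sketches)
    matrix = [
        [sketch_similarity(sketches[r], sketches[c]) for c in names]
        for r in names
    ]
    return names, matrix


def matrix_to_csv(names, matrix):
    """Render the matrix as CSV text with a header row and row labels."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([""] + names)
    for name, row in zip(names, matrix):
        writer.writerow([name] + [f"{value:.2f}" for value in row])
    return buf.getvalue()


# Toy sketches of size 4 (hypothetical values, not real hulk output).
sketches = {"a": [3, 7, 7, 1], "b": [3, 7, 2, 1], "c": [9, 9, 9, 9]}
names, matrix = pairwise_matrix(sketches)
csv_text = matrix_to_csv(names, matrix)
```

Because every sketch has the same length, each comparison costs the same regardless of how many reads went into the samples.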

Further Information & Citing

Please see the documentation on Read the Docs for more detail. A tutorial is forthcoming.

A preprint describing HULK is in preparation and I'll post a link soon...
