Quilt is a data package manager, inspired by the likes of pip and npm. Just as software package managers provide versioned, reusable building blocks for execution, Quilt provides versioned, reusable building blocks for analysis.
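To make the analogy concrete, here is a minimal sketch of the basic workflow, assuming the Quilt 2.x Python API; the package name `uciml/iris` and the node path `iris.tables.iris` are illustrative:

```python
import quilt

# Download and cache a data package once, much like installing a library.
quilt.install("uciml/iris")

# Import the packaged data like a Python module; the node call below is
# assumed to return a pandas DataFrame, ready for analysis.
from quilt.data.uciml import iris

df = iris.tables.iris()
print(df.head())
```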
- **Reproducibility** - Imagine source code without versions. Ouch. Why live with un-versioned data? Versioned data makes analysis reproducible by creating unambiguous references to potentially complex data dependencies (see the pinning sketch after this list).
- **Less data cleaning** - Finding, cleaning, and organizing data consumes 79% of the average data scientist's time. If data is cleaned once and packaged for posterity, it frees up time for analysis. Quilt further makes it possible to import data and start working immediately, skipping the usual scripts for downloading, cleaning, and parsing data (as in the workflow sketch above).
- **De-duplication** - Data fragments are hashed with SHA-256. Duplicate data fragments are written to disk only once per user, so large, repeated data fragments consume less disk space and network bandwidth (see the content-addressing sketch after this list).
- **Faster analysis** - Serialized data loads 5 to 20 times faster than parsing the underlying raw files. Moreover, specialized columnar storage formats like Apache Parquet minimize I/O bottlenecks so that tools like Presto DB and Hive run faster (see the Parquet comparison after this list).
- **Collaboration and transparency** - Data likes to be shared. Quilt offers a centralized data warehouse for finding and sharing data sets.
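A hedged sketch of what an unambiguous data dependency looks like in practice, assuming `quilt.install()` accepts a `hash=` argument to pin a specific package snapshot; the keyword name, package name, and hash value are placeholders:

```python
import quilt

# Placeholder package name and (truncated) SHA-256 package hash.
PKG = "uciml/iris"
PKG_HASH = "fb47ba..."  # pin the exact snapshot the analysis was run against

# Assumed keyword argument: installing by hash means every collaborator
# re-running the analysis gets byte-identical input data.
quilt.install(PKG, hash=PKG_HASH)
```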
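The de-duplication idea can be illustrated independently of Quilt's internals: a content-addressed store names each fragment by its SHA-256 digest, so an identical fragment is only ever written once. A minimal sketch (not Quilt's actual storage code):

```python
import hashlib
from pathlib import Path

STORE = Path("object_store")
STORE.mkdir(exist_ok=True)

def put_fragment(data: bytes) -> str:
    """Write a fragment under its SHA-256 digest; reuse it if it already exists."""
    digest = hashlib.sha256(data).hexdigest()
    path = STORE / digest
    if not path.exists():          # duplicate fragments cost no extra disk
        path.write_bytes(data)
    return digest                  # the digest doubles as an unambiguous reference

a = put_fragment(b"2015,1.7\n2016,2.1\n")
b = put_fragment(b"2015,1.7\n2016,2.1\n")  # same bytes -> same object, written once
assert a == b
```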
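The speed claim comes from skipping parsing: a typed, columnar format like Parquet is read straight into memory instead of being re-parsed from text on every load. A small pandas comparison (requires pyarrow or fastparquet; the 5-20x figure will vary with data and hardware):

```python
import time
import numpy as np
import pandas as pd

# Write the same million-row frame as CSV and as Parquet.
df = pd.DataFrame({"x": np.random.rand(1_000_000), "y": np.random.rand(1_000_000)})
df.to_csv("data.csv", index=False)
df.to_parquet("data.parquet")

t0 = time.perf_counter()
pd.read_csv("data.csv")            # re-parses text on every load
t_csv = time.perf_counter() - t0

t0 = time.perf_counter()
pd.read_parquet("data.parquet")    # reads typed columns directly
t_parquet = time.perf_counter() - t0

print(f"CSV read: {t_csv:.2f}s, Parquet read: {t_parquet:.2f}s")
```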