This repository was archived by the owner on May 4, 2019. It is now read-only.

Interesting fingerprint method #3

@delagoya

Thought I would put this paper out there. It is very interesting to me, given that we have to hash a huge amount of data.

Fast probabilistic file fingerprinting for big data.
Tretyakov K, Laur S, Smant G, Vilo J, Prins P.
Abstract
BACKGROUND:
Biological data acquisition is raising new challenges, both in data analysis and handling. 
Not only is it proving hard to analyze the data at the rate it is generated today, but simply 
reading and transferring data files can be prohibitively slow due to their size. This primarily 
concerns logistics within and between data centers, but is also important for workstation 
users in the analysis phase. Common usage patterns, such as comparing and transferring 
files, are proving computationally expensive and are tying down shared resources.

RESULTS:
We present an efficient method for calculating file uniqueness for large scientific data files, 
that takes less computational effort than existing techniques. This method, called Probabilistic
Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes 
file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, 
it has a flat performance characteristic, correlated with data variation rather than file size. 
We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing 
techniques, with provably negligible risk of collisions. We measure the performance of the 
algorithm on a number of data storage and access technologies, identifying its strengths 
as well as limitations.

CONCLUSIONS:
Probabilistic fingerprinting may significantly reduce the use of computational resources when 
comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the 
speed of common file-related workflows, both in the data center and for workbench analysis. The 
implementation of the algorithm is available as an open-source tool named pfff, as a 
command-line tool as well as a C library. 

The tool can be downloaded from http://biit.cs.ut.ee/pfff.

PMID: 23445565 [PubMed - indexed for MEDLINE] PMCID: PMC3582436 Free PMC Article

Pubmed: http://www.ncbi.nlm.nih.gov/pubmed/23445565
Journal Link: http://www.biomedcentral.com/1471-2164/14/S2/S8
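
To make the sampling idea in the abstract concrete, here is a minimal Python sketch. It is not the pfff implementation and does not reproduce the paper's exact scheme. The names `sampled_fingerprint` and `miss_probability`, the `num_samples` and `seed` parameters, and the choice of SHA-256 are all assumptions made for illustration. It hashes the file size plus single bytes read at seeded pseudo-random offsets, so the read cost depends on the number of samples rather than the file size.

```python
import hashlib
import os
import random


def sampled_fingerprint(path, num_samples=1024, seed=0):
    """Hash the file size plus bytes read at pseudo-random offsets.

    Illustrative only: the name, num_samples, and seed are assumptions
    of this sketch, not the interface of the pfff tool.
    """
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    # Mix in the size so files of different lengths hash different input.
    digest.update(size.to_bytes(8, "big"))
    if size == 0:
        return digest.hexdigest()
    # The same seed and size yield the same offsets, which is what makes
    # fingerprints of two files comparable.
    rng = random.Random(seed)
    offsets = sorted(rng.randrange(size) for _ in range(num_samples))
    with open(path, "rb") as handle:
        for offset in offsets:
            handle.seek(offset)
            digest.update(handle.read(1))
    return digest.hexdigest()


def miss_probability(differing_fraction, num_samples):
    """Chance that every independent uniform sample hits an unchanged byte."""
    return (1.0 - differing_fraction) ** num_samples
```

Under the simplifying assumption of independent uniform samples, two equal-sized files that differ in a fraction d of their bytes go undetected with probability (1 - d)^k for k samples. This is why the approach relies on the data having enough variation: a single flipped byte in a very large file would most likely be missed. Sorting the offsets keeps the seeks moving forward through the file, which is a design choice of this sketch rather than something the abstract specifies.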
