Skip to content

spullara/corpuscompression

Repository files navigation

Copyright 2011 Sam Pullara

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


The GZIP algorithm uses a corpus built up from the file that you are compressing. If you are trying to compress
very small objects, you have the issue that overhead and how little it knows working against you. This
library lets you predefine a corpus for compression and decompression which can give you the benefits
of a custom compression algorithm without the effort.

There are two test examples:

Corpus: this is a test of the emergency broadcast system. this is only a test.
Data: testing... emergency! but I'm broadcasting.
Result: 32 characters compressed vs 43 uncompressed

Corpus: links.txt
Data: http://twitpic.com/7i7o21
Result: 12 characters compressed vs 25 uncompressed vs 45 compressed using GZIP directly

Corpus: links.txt
Data: http://oink.com/i/4eb383adee44744a9c000d0c
Result: 58 characters compressed vs 67 uncompressed vs 86 compressed using GZIP directly

Corpus: tweets.txt
Data: targettweet.txt
Result: 390 characters compressed vs 1990 uncompressed vs 891 compressed using GZIP directly

I think there is a big gap between the ultimate utility and the current random corpus of URLs
that I grabbed. Optimizing and evolving that corpus over time would likely be necessary.

About

Achieve better compression for small objects with a predefined corpus

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages