GitHub - spullara/corpuscompression: Achieve better compression for small objects with a predefined corpus

spullara / corpuscompression Public

Notifications You must be signed in to change notification settings
Fork 0
Star 7

Achieve better compression for small objects with a predefined corpus

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
src		src
README		README
links.txt		links.txt
links2.txt		links2.txt
links3.txt		links3.txt
links4.txt		links4.txt
pom.xml		pom.xml
targettweet.txt		targettweet.txt
thrifttweet.bin		thrifttweet.bin
thrifttweettarget.bin		thrifttweettarget.bin
tweets.txt		tweets.txt

Repository files navigation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

The GZIP algorithm uses a corpus built up from the file that you are compressing. If you are trying to compress
very small objects, you have the issue that overhead and how little it knows working against you. This
library lets you predefine a corpus for compression and decompression which can give you the benefits
of a custom compression algorithm without the effort.

There are two test examples:

Corpus: this is a test of the emergency broadcast system. this is only a test.
Data: testing... emergency! but I'm broadcasting.
Result: 32 characters compressed vs 43 uncompressed

Corpus: links.txt
Data: http://twitpic.com/7i7o21
Result: 12 characters compressed vs 25 uncompressed vs 45 compressed using GZIP directly

Corpus: links.txt
Data: http://oink.com/i/4eb383adee44744a9c000d0c
Result: 58 characters compressed vs 67 uncompressed vs 86 compressed using GZIP directly

Corpus: tweets.txt
Data: targettweet.txt
Result: 390 characters compressed vs 1990 uncompressed vs 891 compressed using GZIP directly

I think there is a big gap between the ultimate utility and the current random corpus of URLs
that I grabbed. Optimizing and evolving that corpus over time would likely be necessary.