Skip to content

tmalaska/hadcom.utils

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HadCom.utils

Overview

This project creates a runnable jar file that can do some common advance functionality with hadoop.

This first version was built for CDH 4. I will be making it work for CDH 3 shortly.

##Functionality ###Put a collections of layered functionality for advance putting. For details on how to use this functionality click here The user will be able to use all the following:

Layer 1: Reading

CSV files, Delimiter Files, Flat Files, Variable Length Delimiter files, Variable Length Flat Files

Layer 2: Aggregating

Many files into a few

Appending file name to every row of aggregated files

Layer 3: Threading

Run in single or multi thread mode

Each thread writing to a different HDFS file to increase write speed

Layer 4: Listening

Report progress to console

Layer 5: Compresing

Use Snappy, Gzip, or Bzip2

Layer 6: Writing

Sequence, Avro Files, Rc Files, or to HBase

###Route This allows you to make one or more directories pumps files into HDFS as you favorite splittable formates (sequence, avro, or rc) Like the "put" functionality the route logic is also layered.

Layer 1: Route

Event driven

Schedule driven

Layer 2: Put Threads

Define number of put threads in the thread pool

Layer 3: Put

Get all the functionality and options from the above put command

###Get hadoop fs -get is good but. What if you want to get a sequence, avro, or rc file? And what if you want to be able to read the results? Well then you can use these get methods to uncompress sequence, avro or rc files into text to your local drive.

###Out hadoop fs -text only goes so far this takes us to the next step by being able to output rc files and avro files in clear text. Click here for more information.

###Env

Converting a {key}|{field}|{value} env files to an avro file with a generated schema

Converting a multiple row type file to multiple avro files each having a generated schema

###NonSplittableGzip

Converts a non-splittable gzip file stored in hdfs to a sequence file of your choose of compression (snappy, gzip, bzip2)

###NonSplittableZip

Converts a non-splittable zip file stored in hdfs to a sequence file(s) of your choose of compression (snappy, gzip, bzip2). There will be a sequence file for every file in the original zip file.

About

Advanced common functionality for hadoop

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages