kafka2hadoop

This project is aimed at making kafka->hadoop flow simple.
This is a java project, a hadoop map/reduce job, it loads kafka message to hadoop hdfs.

Features

automatically track topic reading offsets on hdfs, use text, every time only read new message, when nothing new, do nothing.
when something goes wrong, you can edit the offsets information by any editor.
use map/reduce job to load message, the task number can be controled through map/reduce way.
you can filter the message by extending or change job class : Shuffle.java

Content

a kafka offset manage util : InputGenerator
a hadoop map/reduce job related file
a simple bash script we use hourly
third party jars

Example

See run_hourly.sh for details, we only run it hourly through conttab, nothing more is needed.

Build From Source

If you want to build it from source, follow the steps below:

for it is a java project, you need jdk first
for it use hadoop hdfs, it need HADOOP_HOME set
unjar scala-libary.jar kafka_2.8.0-0.8.0-bete1.jar metrics-core.jar here(these 3 jars can be found from kafka project), all of these are needed to run kafka2hadoop map/reduce job
ensure hadoop related jars is in CLASSPATH
just run 'make', then you will get shuffle.jar
put shuffle.jar in your CLASSPATH, then you can use InputGenerator, and shuffle.jar is the usable hadoopjar

TODO

read broker from input
set reduce number from inputGenrator file, or make inputGenerator and mapreduce job single step(through bash script)
when broker down, do nothing, don't change hdfs
reduce read broker and port from map output
remore core-site.xml hdfs-site.xml absolute path from InputGenerator.java
move filter function to Shuffle.java
make it easier to use(run_hourly.sh)
[] give reduce offset details, besides message only
[] remove magic numbers
[] when different broker has different port, InputGenerator won't work
[] remove third party jar content out, read from hdfs is the best solution
[] use direct partitioner
[] move kafka related function to new Class KafkaHelper.java
[] make job name changable, by -D?

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitignore		.gitignore
InputGenerator.java		InputGenerator.java
LICENSE		LICENSE
Makefile		Makefile
OffsetRange.java		OffsetRange.java
OffsetSplitMapper.java		OffsetSplitMapper.java
README.md		README.md
Shuffle.java		Shuffle.java
ShuffleReducer.java		ShuffleReducer.java
run_hourly.sh		run_hourly.sh
shuffle.jar		shuffle.jar

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kafka2hadoop

Features

Content

Example

Build From Source

TODO

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

kafka2hadoop

Features

Content

Example

Build From Source

TODO

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages