This project implements a streaming pipeline with Kafka and Spark. A few objectives:
- Integrate Kafka and Spark
- Compare Spark's continuous streaming (Spark 2.3+) with micro-batching
- Conduct ETL and store the results in Parquet / Cassandra
I became interested in broadening my Spark experience and the tech stack around it. I was inspired by https://www.youtube.com/watch?v=y3O94MnO_IU and https://www.youtube.com/watch?v=wQfm4P23Hew
Uber's data platform has transformed from a batch-oriented architecture to a streaming one. These changes reduced data latency from 24 hours to under 1 hour:
- Kafka -> Spark Streaming (ETL) -> Parquet
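Below is a minimal sketch of that pipeline using Structured Streaming: it reads from Kafka, casts the binary payload to strings, and appends Parquet files. The broker address, topic name, and paths are placeholders of my own, and the job needs the `spark-sql-kafka-0-10` package on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("KafkaToParquet")
      .getOrCreate()

    // Source: subscribe to a Kafka topic; key/value arrive as binary columns
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "events")                        // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

    // Sink: append Parquet files; the checkpoint gives exactly-once file output
    val query = events.writeStream
      .format("parquet")
      .option("path", "/tmp/events-parquet")            // placeholder output dir
      .option("checkpointLocation", "/tmp/events-ckpt") // placeholder checkpoint dir
      .start()

    query.awaitTermination()
  }
}
```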
Before Spark 2.3 (released in early 2018), Spark Streaming was built around micro-batching: when creating a StreamingContext, you had to set a batch interval.
```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every micro-batch covers one second of incoming data
val ssc = new StreamingContext("local[*]", "PrintTweets", Seconds(1))
```
But with Spark 2.3, Structured Streaming adds a continuous processing mode for true low-latency streaming. Details can be found at https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
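As a sketch of the difference, the same Structured Streaming query runs continuously just by changing the trigger. Per the Databricks post, continuous mode in 2.3 supports map-like operations and Kafka sources/sinks (not the file sink), so this example writes back to Kafka; the topics, broker address, and checkpoint path are again placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ContinuousKafka")
  .getOrCreate()

// Identical query shape to a micro-batch job; only the trigger changes.
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                        // placeholder input topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value")         // Kafka sink expects a `value` column
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "events-out")                        // placeholder output topic
  .option("checkpointLocation", "/tmp/continuous-ckpt") // placeholder checkpoint dir
  .trigger(Trigger.Continuous("1 second")) // checkpoint interval, not a batch interval
  .start()

query.awaitTermination()
```

In micro-batch mode each record waits for its batch to close before being processed; in continuous mode long-running tasks process records as they arrive, which is how Spark 2.3 reaches ~1 ms end-to-end latency at the cost of at-least-once (rather than exactly-once) guarantees.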