
Cassandra: A Distributed Database With No Single Point of Failure

Cassandra is a distributed database with no single point of failure. It favors availability over consistency, allowing queries to specify consistency levels. Cassandra has no master node; every node runs the same software and performs the same functions. It uses a non-relational data model similar to BigTable and HBase, with a limited CQL query language. Cassandra is well-suited for fast access to rows of information and integrating with Spark for analytics on replicated data.

Uploaded by

Bora Yüret

CASSANDRA

A distributed database with no single point of failure

Cassandra – NoSQL with a twist

■ Unlike HBase, there is no master node at all – every node runs exactly the
same software and performs the same functions
■ Data model is similar to BigTable / HBase
■ It’s non-relational, but has a limited CQL query language as its interface
Cassandra’s Design Choices

■ The CAP Theorem says you can only have 2 out of 3: consistency, availability,
partition-tolerance
– And partition-tolerance is a requirement with “big data,” so you really
only get to choose between consistency and availability
■ Cassandra favors availability over consistency
– It is “eventually consistent”
– But you can specify your consistency requirements as part of your
requests. So really it’s “tunable consistency”
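
To make "tunable consistency" concrete, here is a minimal sketch of the usual overlap rule: a read is guaranteed to see the latest write when the replicas contacted for the read and for the write must overlap (R + W > N). The function names are hypothetical, and the sketch assumes a single datacenter with replication factor `rf`; it only models the counting, not Cassandra itself.

```python
def replicas_required(level: str, rf: int) -> int:
    """Replicas that must respond for a request at the given consistency level."""
    level = level.upper()
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        needed = fixed[level]
    elif level == "QUORUM":
        needed = rf // 2 + 1          # strict majority of the replicas
    elif level == "ALL":
        needed = rf
    else:
        raise ValueError(f"unsupported consistency level: {level}")
    if needed > rf:
        raise ValueError(f"{level} needs {needed} replicas but rf is {rf}")
    return needed


def reads_see_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    r = replicas_required(read_level, rf)
    w = replicas_required(write_level, rf)
    return r + w > rf
```

With `rf = 3`, QUORUM reads plus QUORUM writes overlap (2 + 2 > 3), while ONE plus ONE does not, which is the "eventually consistent" end of the dial.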
Where Cassandra Fits in CAP tradeoffs

[Figure: CAP triangle with corners labeled Availability, Consistency, and Partition-Tolerance]
Cassandra architecture

[Figure: six identical nodes arranged in a ring]
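
The masterless ring can be sketched as a toy token ring: each node owns a position on the ring, a row's partition key is hashed to a token, and the row is stored on the next nodes clockwise. This is a simplified illustration, not Cassandra's implementation. The `Ring` class is hypothetical, MD5 stands in for Cassandra's default Murmur3 partitioner, and each node gets a single token (no virtual nodes).

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: every node is a peer and there is no master."""

    def __init__(self, nodes, replication_factor=3):
        if not nodes:
            raise ValueError("a ring needs at least one node")
        self.replication_factor = min(replication_factor, len(nodes))
        # One token per node, sorted so the list can be walked clockwise.
        self._tokens = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(value: str) -> int:
        # Stand-in hash (assumption): first 8 bytes of MD5 as an integer.
        digest = hashlib.md5(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def replicas(self, partition_key: str):
        """Nodes that hold the row: the first owner plus the next ones clockwise."""
        start = bisect.bisect_left(self._tokens, (self._token(partition_key), ""))
        count = len(self._tokens)
        return [self._tokens[(start + i) % count][1]
                for i in range(self.replication_factor)]
```

Any node can compute `replicas(key)` on its own, which is why a request can be sent to any node in the cluster.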
Cassandra and your cluster

■ Cassandra’s great for fast access to rows of information


■ Get the best of both worlds – replicate Cassandra to another ring that is used
for analytics and Spark integration

[Figure: two rings of six nodes each]
CQL (Wait, I thought this was NoSQL!)
■ Cassandra’s API is CQL, which lets its drivers look like existing database
drivers to applications.
■ CQL is like SQL, but with some big limitations!
– NO JOINS
■ Your data must be de-normalized
■ So, it’s still non-relational
– All queries must be on some primary key
■ Secondary indices are supported, but…
■ CQLSH can be used on the command line to create tables, etc.
■ All tables must be in a keyspace – keyspaces are like databases
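
As a rough illustration of the points above, the sketch below holds a few CQL statements as Python strings: a keyspace, a table inside it, and a query restricted to the primary key. The keyspace name `movielens`, the `users` columns, and the helper `setup_and_query` are assumptions made for the example. The `session` argument is expected to behave like a session from the Python `cassandra-driver` package, whose `execute(query, parameters)` accepts `%s` placeholders.

```python
# Keyspaces are like databases; every table lives inside one.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS movielens
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

# Hypothetical schema: one denormalized row per user, keyed by user_id.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS movielens.users (
    user_id int,
    age int,
    gender text,
    occupation text,
    zip text,
    PRIMARY KEY (user_id)
)
"""

# Queries are restricted to the primary key; there is no JOIN to reach other tables.
SELECT_BY_KEY = "SELECT * FROM movielens.users WHERE user_id = %s"


def setup_and_query(session, user_id):
    """Create the keyspace and table if needed, then fetch one user by key."""
    session.execute(CREATE_KEYSPACE)
    session.execute(CREATE_TABLE)
    return session.execute(SELECT_BY_KEY, (user_id,))
```

The same `CREATE` statements can be typed directly into CQLSH. With the driver installed, a session would typically come from `Cluster(["127.0.0.1"]).connect()` in `cassandra.cluster`.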
Cassandra and Spark

■ DataStax offers a Spark-Cassandra connector


■ Allows you to read and write Cassandra tables as DataFrames
■ Is smart about passing queries on those DataFrames down to the appropriate
level
■ Use cases:
– Use Spark for analytics on data stored in Cassandra
– Use Spark to transform data and store it into Cassandra for transactional
use
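
A minimal PySpark sketch of both use cases, assuming the DataStax connector is on the Spark classpath. The connector exposes Cassandra tables through the `org.apache.spark.sql.cassandra` data source with `keyspace` and `table` options. The helper names and the `movielens` / `users` names in the usage note are hypothetical.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def cassandra_options(keyspace: str, table: str) -> dict:
    """Options that tell the connector which Cassandra table to use."""
    return {"keyspace": keyspace, "table": table}


def read_table(spark, keyspace: str, table: str):
    """Load a Cassandra table as a DataFrame (analytics use case)."""
    return (spark.read
            .format(CASSANDRA_FORMAT)
            .options(**cassandra_options(keyspace, table))
            .load())


def write_table(df, keyspace: str, table: str) -> None:
    """Append a DataFrame's rows to an existing Cassandra table."""
    (df.write
       .format(CASSANDRA_FORMAT)
       .mode("append")
       .options(**cassandra_options(keyspace, table))
       .save())
```

Typical usage would be `users = read_table(spark, "movielens", "users")` followed by ordinary DataFrame operations, with the Spark session configured via `spark.cassandra.connection.host` to point at a Cassandra node.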
Let’s Play

■ Install Cassandra on our virtual Hadoop node


■ Set up a table for MovieLens users
■ Write into that table and query it from Spark!
