Introduction to
Apache Cassandra
1
Me
Robert Stupp
Freelancer, Coder, Architect
@snazy snazy@snazy.de
Contributor to Apache Cassandra,
3.0 UDFs (CASSANDRA-7395 + related)
Databases, Network, Backend
2
Agenda
Apache Cassandra History
Design Principles
Outstanding differences
CQL Intro
Access C*
Clusters
Cassandra Future
3
Apache Cassandra
History
4
Apache Cassandra
started at Facebook
inspired by Amazon Dynamo and Google Bigtable
Note: Facebook initially had
two data centers.
5
2.1 released in Sep 2014
6
Apache Cassandra
Design Principles
7
Hardware failures
can and will occur!
Cassandra handles failures.
From single node to whole data center.
From client to server.
8
The complicated part
when learning Cassandra,
is to understand
Cassandra’s simplicity
9
Keep it simple
all nodes are equal
master-less architecture
no name nodes
no SPOF (single point of failure)
no read before modify
(prevent race conditions)
10
Keep it running
No need to take cluster down … e.g.
during maintenance
during software update
Rolling restart is your friend
11
Outstanding
Differences
12
Cassandra
Highly scalable
runs with a few nodes
up to clusters of 1000+ nodes!
Linear scalability (proven!)
Multi datacenter aware (world-wide!)
No SPOF
13
Cassandra @ Apple
14
Linear Scalability
15
Scaling Cassandra
More data?
-> add more nodes
Faster access?
-> add more nodes
16
Read / Write
performance
Reads are fast
Writes are even faster
17
Durability
Writes are durable - period.
18
Availability @
Netflix
Chaos
Monkey
kills nodes randomly
19
Availability @
Netflix
Chaos
Gorilla
kills whole availability zones randomly
20
Availability @
Netflix
Chaos
Kong
kills whole regions
21
Availability @
Netflix
http://de.slideshare.net/planetcassandra/active-active-c-behind-the-scenes-at-netflix
22
32 node cluster (Raspberry Pis)
@DataStax
23
Most outstanding
Great documentation
Many blog posts
Many presentations
Many videos
Regular webinars
Huge, active and healthy community
24
Data Distribution
25
DHT
Data is organized in a
"Distributed Hash Table"
(hash over row key)
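The idea of hashing the row key onto a ring can be sketched in a few lines. This is only an illustration, not Cassandra's partitioner: the 32-bit token space, the eight evenly spaced tokens, the use of MD5 and the function names are all assumptions made for the example.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumed token space for this sketch
NODE_COUNT = 8       # assumed cluster size
# Node i is assigned token i * RING_SIZE / NODE_COUNT (evenly spaced).
NODE_TOKENS = [i * RING_SIZE // NODE_COUNT for i in range(NODE_COUNT)]


def token_for(row_key: str) -> int:
    """Hash the row key to a position on the ring (MD5 is an assumption)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes -> [0, 2**32)


def owner_of_token(token: int) -> int:
    """Index of the first node whose token is >= the given token."""
    idx = bisect.bisect_left(NODE_TOKENS, token)
    return idx % NODE_COUNT  # wrap around past the last token


def owner_of(row_key: str) -> int:
    """Node index that owns the given row key."""
    return owner_of_token(token_for(row_key))
```

Because the owner depends only on the key's hash, any node can work out where a row lives without asking a master.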
26
DHT
[Figure: ring of numbered nodes]
27
Replication
28
Replication Factor 2
[Figure: ring of numbered nodes showing the replicas of Row A and Row B]
29
Replication Factor 3
[Figure: ring of numbered nodes showing the replicas of Row A and Row B]
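Replica placement on the ring can be sketched as "the owning node plus the next nodes clockwise". This is a simplified illustration that ignores racks and data centers; the function name and the eight-node default are assumptions, not part of the slides.

```python
from typing import List


def replicas_for(owner: int, replication_factor: int, node_count: int = 8) -> List[int]:
    """Return the owner plus the next RF-1 nodes clockwise on the ring."""
    if not 1 <= replication_factor <= node_count:
        raise ValueError("replication factor must be between 1 and node_count")
    # Walk clockwise from the owner, wrapping around at the end of the ring.
    return [(owner + step) % node_count for step in range(replication_factor)]
```

With replication factor 3, a row owned by node 6 would also be stored on nodes 7 and 0 in this sketch.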
30
Consistency
Consistency defined per request
Several consistency levels (CLs)
for different needs
31
Eventual consistency
is not
hopefully consistent
EC means there’s a time gap until updates
are consistently readable
32
Consistency Levels
ANY (only for writes)
ONE, LOCAL_ONE,
TWO, THREE, (not recommended)
ALL, (not recommended)
QUORUM, LOCAL_QUORUM, EACH_QUORUM
SERIAL, LOCAL_SERIAL
33
Consistency
Data is always replicated
CL defines how many replicas must
fulfill the request
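How many replicas a consistency level needs can be written down as simple arithmetic. The sketch below covers only the count-based levels of a single data center; the function names are made up for illustration, and ANY and the SERIAL levels are deliberately left out.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge a request."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,  # strict majority
        "ALL": replication_factor,
    }
    needed = levels[consistency_level]
    if needed > replication_factor:
        raise ValueError("consistency level needs more replicas than exist")
    return needed


def read_overlaps_write(write_cl: str, read_cl: str, replication_factor: int) -> bool:
    """True if every read must touch at least one replica that saw the write."""
    written = required_replicas(write_cl, replication_factor)
    read = required_replicas(read_cl, replication_factor)
    return written + read > replication_factor
```

For replication factor 3, QUORUM writes plus QUORUM reads overlap (2 + 2 > 3), while ONE plus ONE does not, which is where the time gap of eventual consistency shows up.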
34
Write
[Figure: a write arriving at the ring of numbered nodes]
35
Write
[Figure: a write arriving at the ring of numbered nodes]
36
Multi DC setup
DC 1 DC 2
37
Multi DC replication
Write
DC 1 DC 2
38
Multi DC replication
Write
DC 1 DC 2
39
Multi DC replication
Write
DC 1 DC 2
40
Replication &
Consistency
Define # of replicas
using replication factor
Define required consistency
per request
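With more than one data center, the replication factor is set per data center and the quorum levels differ in which replicas they count. A minimal sketch, assuming two data centers named DC1 and DC2 with three replicas each (names and numbers are illustrative only):

```python
from typing import Dict


def local_quorum(rf_per_dc: Dict[str, int], local_dc: str) -> int:
    """LOCAL_QUORUM: majority of the replicas in the coordinator's data center."""
    return rf_per_dc[local_dc] // 2 + 1


def each_quorum(rf_per_dc: Dict[str, int]) -> Dict[str, int]:
    """EACH_QUORUM: a majority in every data center."""
    return {dc: rf // 2 + 1 for dc, rf in rf_per_dc.items()}


def global_quorum(rf_per_dc: Dict[str, int]) -> int:
    """QUORUM: majority of all replicas, regardless of data center."""
    return sum(rf_per_dc.values()) // 2 + 1


# Assumed example setup: three replicas in each of two data centers.
EXAMPLE_RF = {"DC1": 3, "DC2": 3}
```

In this setup LOCAL_QUORUM waits for 2 replicas nearby, while QUORUM waits for 4 of the 6 replicas and therefore has to cross data centers.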
41
CQL Introduction
CQL = Cassandra Query Language
42
“CQL is SQL
minus joins,
minus subqueries,
plus collections”
(plus user types,
plus tuple types)
43
Why CQL?
Introduces a schema to Cassandra
Familiar syntax
Easy to understand
DML operations are atomic
44
Data model
(hierarchical view)
Keyspace (schema)
Table (column family)
Row
partition key (part of primary key)
static columns
clustering key (part of primary key)
columns
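The hierarchy above can be pictured as nested maps: a table maps partition keys to partitions, and each partition holds its static columns once plus rows ordered by clustering key. The class and field names below are invented for this sketch and say nothing about Cassandra's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Partition:
    """All rows that share one partition key."""
    static_columns: Dict[str, Any] = field(default_factory=dict)    # stored once per partition
    rows: Dict[Any, Dict[str, Any]] = field(default_factory=dict)   # clustering key -> columns

    def upsert(self, clustering_key: Any, **columns: Any) -> None:
        """Insert or update a row; no read-before-write check is needed."""
        self.rows.setdefault(clustering_key, {}).update(columns)

    def ordered_rows(self) -> List[Tuple[Any, Dict[str, Any]]]:
        """Rows sorted by clustering key, the order they are read back in."""
        return [(key, self.rows[key]) for key in sorted(self.rows)]


# table: partition key -> Partition (a keyspace would map table names to such dicts)
Table = Dict[Any, Partition]
```

Looking up a partition key gives direct access to one partition, and the clustering key determines the order of rows inside it.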
45
CQL / DDL
Similar to SQL
CREATE TABLE …
ALTER TABLE …
DROP TABLE …
46
CQL / DML
Similar to SQL
INSERT …
UPDATE …
DELETE …
SELECT …
47
CQL / BATCH
Group related modifications
(INSERT, UPDATE, DELETE)
Atomic operation
48
CQL types
boolean, int (32bit), bigint (64bit),
float, double,
decimal ("BigDecimal"),
varint ("BigInteger"),
ascii, text (= varchar), blob,
inet, timestamp, uuid, timeuuid
49
CQL collection
types
list < foo >
set < foo >
map < foo , bar >
Since C* 2.1 collections can contain
any type - even other collections
(nested ones must be declared frozen).
50
CQL composite
types
user types (C* 2.1)
are composite types with named fields
tuple types (C* 2.1)
are fixed-length lists of typed values without field names
51
CQL / user types
CREATE TYPE address (
street text,
zip int,
city text);
CREATE TABLE users (
username text,
addresses map<text, frozen<address>>,
...
52
Cassandra
Data Modeling
Access by key
no access by arbitrary WHERE clause
Duplicate data (it’s ok!)
Aggregate data
Build application maintained indexes
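"Duplicate data" and "application maintained indexes" boil down to writing the same record once per access pattern. A minimal sketch with plain dictionaries standing in for two tables; the table and field names are hypothetical.

```python
from typing import Dict, Optional

# One "table" per query: each is keyed by what the application looks up.
users_by_username: Dict[str, Dict[str, str]] = {}
users_by_email: Dict[str, Dict[str, str]] = {}


def save_user(username: str, email: str, city: str) -> None:
    """Write the same record to every table that serves a query."""
    record = {"username": username, "email": email, "city": city}
    users_by_username[username] = dict(record)  # duplicate on purpose
    users_by_email[email] = dict(record)


def find_by_username(username: str) -> Optional[Dict[str, str]]:
    """Access by key only; there is no scan over arbitrary columns."""
    return users_by_username.get(username)


def find_by_email(email: str) -> Optional[Dict[str, str]]:
    return users_by_email.get(email)
```

The write path gets a little more work so that every read is a single lookup by key.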
53
RDBMS modeling
54
C* modeling
55
Data Modeling
with RDBMS
Driven by
"How can I store
something right?"
"What answers
do I have?"
56
Data Modeling
with NoSQL
Driven by
"How can I access
something right?"
"What questions
do I have?"
57
Data Modeling
Basics
Work top-down. Think about:
What does the application do?
What are the access patterns?
Now design data model
58
Data Modeling
http://de.slideshare.net/planetcassandra/cassandra-day-sv-2014-fundamentals-of-apache-cassandra-data-modeling
http://de.slideshare.net/planetcassandra/data-modeling-with-travis-price
59
Accessing
Cassandra
60
Command Line
cqlsh
CQL shell
nodetool
node/cluster administration
61
GUI: DevCenter
Visual query tool
62
Stress test?
Cassandra 2.1 comes with an improved
stress tool (cassandra-stress)
Simulate read+write workload
Uses configurable data
Works against older C* versions, too
63
DataStax APLv2
Open Source Drivers
for Java
for Python
for C#
for Scala / Spark
https://github.com/datastax/
or http://www.datastax.com/download
64
Native protocol
C*’s own net protocol for clients
Request multiplexing
Schema change notifications
Cluster change notifications
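Request multiplexing means several requests can be in flight on one connection, with responses matched back by an identifier rather than by arrival order. The sketch below illustrates only that bookkeeping; the class, its methods and the in-memory "sent" list are assumptions for the example and not a driver API.

```python
from typing import Any, Callable, Dict, List, Tuple


class MultiplexedConnection:
    """Tracks in-flight requests on a single connection by stream id."""

    def __init__(self) -> None:
        self._next_stream_id = 0
        self._pending: Dict[int, Callable[[Any], None]] = {}
        self.sent: List[Tuple[int, str]] = []  # stands in for frames written to a socket

    def send(self, query: str, on_result: Callable[[Any], None]) -> int:
        """Register a callback and 'send' the query tagged with a new stream id."""
        stream_id = self._next_stream_id
        self._next_stream_id += 1
        self._pending[stream_id] = on_result
        self.sent.append((stream_id, query))
        return stream_id

    def on_response(self, stream_id: int, payload: Any) -> None:
        """Deliver a response to the request that carried the same stream id."""
        callback = self._pending.pop(stream_id)
        callback(payload)
```

Responses may arrive in any order; a slow query does not block the faster ones sharing the connection.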
65
Third Party Drivers
for a huge number of languages
66
Mappers
High level mappers exist at least for
Java
Special case: Scala
due to its strong+complex type
model (DataStax OSS Spark driver)
67
Spark + Hadoop
Yes - works really well
Note: Spark can be up to 100x faster than Hadoop MapReduce (in-memory workloads)
68
Clusters
69
Cluster sizes
C* works with a few nodes
C* works with several hundred /
thousand nodes
70
Cluster setup
Configure for multiple data centers
Plan for multi-DC setup :)
71
Cluster experience
Remember: A single Cassandra
cluster works over multiple data
centers all over the world
"Disaster proven"
Hurricanes
Amazon DC outages
72
Apache Cassandra
Future
73
Cassandra 3.0
(in development)
Subject to change!!!
User Defined Functions
Aggregate functions
Functional indexes
Workload recording + playback
Better SSTables, fully off-heap row cache,
better serial consistency
Indexes w/ high cardinality
74
Get active!
75
Cassandra Community
http://cassandra.apache.org/
http://planetcassandra.org/ - Blog
http://www.slideshare.net/planetcassandra/presentations
http://de.slideshare.net/DataStax/presentations
76
Cassandra Community
https://www.youtube.com/user/PlanetCassandra
https://www.youtube.com/user/DataStax
http://www.datastax.com/dev/blog/
http://www.datastax.com/docs/
User Mailing List
user@cassandra.apache.org
77
Free C* Training!
http://planetcassandra.org/cassandra-training/
78
Get involved!
Ask questions,
submit RFEs or share experiences on the
user mailing list
user@cassandra.apache.org
Answers arrive quickly!
79
Live Demo
User Defined Functions
80
C* 3.0 UDFs
Users create functions using
CREATE FUNCTION …
LANGUAGE …
AS …
Java, JavaScript, Scala, Groovy,
JRuby, Jython
Functions work on all nodes
81
C* 3.0 UDFs
Example
CREATE FUNCTION sin(input double)
RETURNS double
LANGUAGE javascript
AS 'Math.sin(input)';
This is JavaScript!
82
UDFs for what?
Targeted for C* 3.0
Own aggregation code - e.g.
SELECT sum(value) FROM table
WHERE …;
Functional indexes - e.g.
CREATE INDEX idx
ON table ( myFunction(colname) );
83
Thanks
for your attention
Download Apache Cassandra at
http://cassandra.apache.org/
Robert Stupp
@snazy
snazy@snazy.de
de.slideshare.net/RobertStupp
84
Q & A
85
BACKUP SLIDES
User-Defined-Functions
Demo
87