Valentina.Crisan@academyofdata.com
Throughout the course we will work with 2 data sets:
MovieLens data set (real data): 6k users, 4k movies, 1 million ratings - we copy only 10k
entries
Weather data (randomly generated): 500k entries
Movielens data details:
Movies.csv extract
mid title genres year
1 Toy Story (1995) {Animation,Children's,Comedy} 1995
2 Jumanji (1995) {Adventure,Children's,Fantasy} 1995
3 Grumpier Old Men (1995) {Comedy,Romance} 1995
4 Waiting to Exhale (1995) {Comedy,Drama} 1995
5 Father of the Bride Part II (1995) {Comedy} 1995
Users.csv extract
uid gender age occupation zip
1 F 1 10 48067
2 M 56 16 70072
3 M 25 15 55117
4 M 45 7 0246
Ratings2.csv
ruid rmid rating timestamp
1 1193 5 978300760
1 661 3 978302109
1 914 3 978301968
1 3408 4 978300275
1 2355 5 978824291
Denormalized data - Ratings.csv data
uid age gender occupation zip rating datetime mid title year genres
5234 25 F 0 60657 1 2000-06-19 20:40:40 31 Dangerous Minds (1995) 1995 {Drama}
1835 25 M 19 11501 3 2000-11-22 08:11:31 31 Dangerous Minds (1995) 1995 {Drama}
6035 25 F 1 78734 2 2000-04-26 01:25:35 31 Dangerous Minds (1995) 1995 {Drama}
4637 35 F 9 48302 3 2000-07-19 18:48:41 31 Dangerous Minds (1995) 1995 {Drama}
3841 45 M 18 26101 3 2000-08-11 13:14:31 31 Dangerous Minds (1995) 1995 {Drama}
2242 18 M 4 60115 5 2000-11-19 02:40:43 31 Dangerous Minds (1995) 1995 {Drama}
weather-data.csv
w_id | event_time | temperature
------------+---------------------------------+-------------
PNOSCPQCNX | 2015-01-01 00:02:54.000000+0000 | 20
PNOSCPQCNX | 2015-01-01 00:10:00.000000+0000 | 23
PNOSCPQCNX | 2015-01-01 00:29:53.000000+0000 | 20
PNOSCPQCNX | 2015-01-01 01:04:22.000000+0000 | 12
PNOSCPQCNX | 2015-01-01 01:17:15.000000+0000 | 18
PNOSCPQCNX | 2015-01-01 02:52:25.000000+0000 | 23
PNOSCPQCNX | 2015-01-01 03:09:17.000000+0000 | 11
PNOSCPQCNX | 2015-01-01 03:14:57.000000+0000 | 13
PNOSCPQCNX | 2015-01-01 04:13:33.000000+0000 | 14
PNOSCPQCNX | 2015-01-01 04:47:56.000000+0000 | 13
All the above files can be found in the data directory
(both cloud and local installations).
cd /data
ls -l
Note: in the intro part we will play a bit with different
data types and insert data manually, before we start
working with the data sets above.
CQL exercises
Reference for syntax questions:
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useAboutCQL.html
Run cqlsh
1. Create a Keyspace - for local installations use
metro_systems; for cloud installations use your name
(e.g. my keyspace is valentina_c), since all cloud users need
to create a separate working space. Use that name
instead of metro_systems in the commands below.
CREATE KEYSPACE metro_systems WITH REPLICATION = { 'class' :
'SimpleStrategy' , 'replication_factor' : 1 };
Desc keyspaces // will list all keyspaces on the cluster
describe metro_systems; // you can see the Replication factor & strategy
set at keyspace level
Exit cqlsh and run the nodetool status command - this should give you an image of your cluster
(even if it's a 1-node cluster). Note the datacenter name, you will need it for NetworkTopologyStrategy.
Re-enter cqlsh
Let's create another keyspace, this time with the NetworkTopologyStrategy replication strategy
and a replication factor:
CREATE KEYSPACE metro_systems1 WITH REPLICATION = { 'class' :
'NetworkTopologyStrategy', 'datacenter1' : 1 };
describe metro_systems1;
Let’s alter a created keyspace and change the property durable writes:
ALTER KEYSPACE metro_systems1 WITH REPLICATION = { 'class' :
'NetworkTopologyStrategy' , 'datacenter1' : 1} and DURABLE_WRITES =
false;
OR if we want to modify the replication factor from 1 to 3:
ALTER KEYSPACE metro_systems1 WITH REPLICATION = { 'class' :
'NetworkTopologyStrategy' , 'datacenter1' : 3};
Valentina.Crisan@academyofdata.com
For cloud users (use your own keyspace name):
ALTER KEYSPACE valentina_c_1 WITH REPLICATION = { 'class' :
'NetworkTopologyStrategy' , 'datacenter1' : 3};
Desc metro_systems1
Let’s drop keyspace metro_systems1:
DROP KEYSPACE metro_systems1;
Stop: let's discuss what the durable_writes option in the keyspace description means.
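One way to check this option yourself (a minimal sketch, assuming Cassandra 3.x, where the schema is exposed through the system_schema keyspace):

```cql
-- list every keyspace with its durable_writes flag and replication settings
SELECT keyspace_name, durable_writes, replication
FROM system_schema.keyspaces;
```

With durable_writes = false, writes to that keyspace skip the commit log, so data that was only in the memtable can be lost if the node crashes.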
2. Let’s create and modify a column family
First, connect to the created keyspace: i.e. all further CQL commands will be executed in the
context of the chosen keyspace.
USE metro_systems;
2.1. Create table users with the following structure: userid (uuid), name (text), gender (text),
age (int), occupation (text), zipcode (int) and a compound primary key: partition key userid,
clustering key name.
CREATE TABLE users (
userid uuid,
name text,
gender text,
age int,
occupation text,
zipcode int,
PRIMARY KEY ((userid),name));
DESCRIBE users; // you can see all properties of the table
2.2. ALTER: Add/Delete/Rename columns, Alter Column type;
ALTER TABLE users ADD relatives int; // add a new column to the table
SELECT * FROM users;
ALTER TABLE users RENAME userid TO uid; //rename column
ALTER TABLE users ALTER relatives type blob; // alter column type from int to blob
Describe users; // to see the types of the columns
Please note: not all column type changes are possible (for example, varchar to ascii), and on
recent Cassandra versions (3.0.11 / 3.10 and later) altering a column's type is no longer
supported at all, so the command above may fail depending on your version.
ALTER TABLE users DROP relatives; //delete the new added column to the table
SELECT * FROM users;
Describe users;
DROP TABLE users; // deletes all the data of the table including the table structure &
indexes
Desc tables; // no tables
2.3. Insert values in the created table
Assuming you dropped the table users previously, let’s re-create it.
CREATE TABLE users (
uid uuid,
name text,
gender text,
age int,
occupation text,
zipcode int,
PRIMARY KEY ((uid),name));
We have a table, now let’s insert data manually. We will learn later also how to insert data
from a file, but for the moment let’s do some manual inserts/updates.
Observation: the UUID can be generated automatically with uuid ();
INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Andrew', 25, 'M', 'plummer',
11501);
INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(uuid(),'Dana', 24, 'F','data scientist', 11501);
SELECT * FROM users;
TRACING ON; // allows us to see the events Cassandra performs in the background.
INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Lili', 24, 'F', 'actress', 11503);
// note that I am inserting data in the same partition as the first
insert. Given that the Clustering Key is different the data will be added
to the existing partition.
TRACING OFF;
SELECT * FROM users; // remember the way Cassandra stores data is
different than the CQL view.
Remember for the next section: Cassandra doesn’t read (by default) before
performing writes.
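A quick sketch of what "no read before write" means in practice, reusing the users table above: inserting twice with the same primary key does not raise a duplicate-key error as it would in a relational database; the second write simply overwrites the first.

```cql
-- two inserts with the same primary key (uid, name)
INSERT INTO users (uid, name, age) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Andrew', 25);
INSERT INTO users (uid, name, age) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Andrew', 99);

-- a single row comes back, with age = 99 (the later write wins)
SELECT age FROM users
WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 AND name = 'Andrew';
```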
2.4. Let's also do some upserts
UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 and
name = 'Andrew'; // we update the age of an existing user
Select * from users; // to see the change
UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a77 and
name = 'Andrew'; // this UUID - the partition key - does not exist, so the
update will perform an insert
SELECT * from users; // let’s check that everything went well
UPDATE users SET age = 44 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 and
name = 'Elaine'; // the partition key exists but not with this clustering
key, so a new entry inside the existing partition will be generated
Select * from users;
Select * from users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66; // to
see all entries in this partition
Let's do some more inserts. The first row inserted will have a new partition key and clustering
key. The second row will have the same partition key but a different clustering key.
insert into users (uid, name, age, gender, occupation, zipcode) values
(uuid(),'John',35,'M','data analyst',11502); // see the uuid generated here
and use it for the next insert
insert into users (uid, name, age, gender, occupation, zipcode) values
(90278d0f-6053-4c87-a1f3-69a05853d2a8,'Elaine',34,'F','engineer',11502);
Select * from users;
Insert into users (uid,name, age) values(
90278d0f-6053-4c87-a1f3-69a05853d2a8, 'Elaine',43); // see how insert can
be used as an update as well
Before we move on, type exit in cqlsh and let's see some stats for our table.
nodetool cfstats keyspace.table // generic command, see below for the metro_systems
keyspace
nodetool cfstats metro_systems.users // for cloud users use yourname.users
See that the data is still in the memtable (no SSTables generated yet); see also the estimated
number of partitions (Number of keys (estimate)).
2.5. Lightweight transactions (LWT)
Re-enter cqlsh and your respective keyspace
Use metro_systems;
Let's try an update on another non-existing entry, but this time with LWT:
UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a88
and name = 'Andrew' IF EXISTS;
Now let’s try an update on an existing entry with LWT:
UPDATE users SET age = 44 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a77
and name = 'Andrew' IF EXISTS;
Exit cqlsh.
Let's force a flush to write the data on disk in an SSTable. Run:
nodetool flush metro_systems users // for cloud users please make sure you use
your own keyspace
Go to /var/lib/data/cassandra/ (cd /var/lib/data/cassandra/). Choose your keyspace directory
and the directory of the table. Use ls -l to list the SSTable files generated after the flush.
If you would like to see what an SSTable looks like, you can use: sstabledump
nameofthedatafile // e.g. sstabledump mc-1-big-Data.db
sstabledump -e nameofthedatafile // will show us the partitions in the specific SSTable
STOP here for theory update. Queries section.
3. Let’s do some queries on the data - we will come back to
Queries, this is just to get us going with the syntax
cd /
Re-enter cqlsh and use your keyspace;
SELECT * FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66;
// You can also pick the columns to display instead of choosing all data.
SELECT age,gender FROM users where uid =
5132b130-ae79-11e4-ab27-0800200c9a66;
SELECT * FROM users LIMIT 3; // you can limit the number of returned
values
//You can fine-tune the display order using the ORDER BY clause, either descending
or ascending. The partition key must be defined in the WHERE clause and the ORDER
BY clause defines the clustering column to use for ordering. Order By works only on
clustering keys.
SELECT * FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66 ORDER
BY name DESC;
// You can query for a partition key that is part of a list of partition keys
SELECT * FROM users where uid IN (5132b130-ae79-11e4-ab27-0800200c9a66,
edb032e8-5c4e-403f-8dfc-43c2dcd69dba);
You can return the number of records that match a query:
SELECT COUNT(*) FROM users where uid =
5132b130-ae79-11e4-ab27-0800200c9a66;
You can ask for the distinct partitions in a table (DISTINCT works only on partition
keys):
select distinct uid from users;
And also count how many distinct partitions exist:
select distinct uid, count(*) from users;
3.1. CREATE AN INDEX
SELECT * FROM users WHERE gender = 'M'; // WILL NOT WORK
//let's create an index
CREATE INDEX users_by_gender ON users(gender);
SELECT * FROM users WHERE gender = 'F';
//let's count how many records are in the query outcome
SELECT count(*) FROM users WHERE gender = 'F';
Tracing on;
select * from users where gender = 'F'; // to see all the operations C* is doing in the
background (even for a 1 node cluster you can see loads of operations).
Can you see that there are 2 tables queried on the node in order to get
the data: the index table and the users table?
Executing single-partition query on users.users_by_gender
Executing single-partition query on users -- appears 3 times. Can you tell me why?
We will come back on this point in section Indexing in C*.
4. Collections
SETS
ALTER TABLE users ADD emails set <text>;
Let’s insert some data in the set:
insert into users (uid,name,age,emails,gender,occupation,zipcode)
values(uuid(),'Lilian',
18,{'lilian@gmail.com','lilian_briggs@oxford.com'},'F', 'student', 11503);
Select * from users;
Let’s add some data to existing entries:
UPDATE users SET emails = emails + {'elaine@yahoo.com'} WHERE uid =
89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';
Let’s delete some data from existing entries:
UPDATE users SET emails = emails - {'elaine@yahoo.com'} WHERE uid =
89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';
SELECT * FROM USERS;
create index on users(emails);
select * from users where emails CONTAINs 'lilian@gmail.com';
LISTS
LET’S ADD A LIST COLUMN TO TABLE USERS:
ALTER TABLE users ADD top_movies list<text>;
LET’S ADD SOME DATA TO EXISTING ENTRIES:
update users set top_movies = ['titanic','gatsby'] where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';
Select * from users;
update users set top_movies = ['romeo and juliet']+ top_movies where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';
Select * from users;
update users set top_movies = top_movies + ['the revenant'] where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';
Let’s replace position 1 Titanic with Blood Diamonds:
update users set top_movies[1]='blood diamonds' where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';
Select * from users;
Let’s delete entry 1 from the list. //blood diamonds
delete top_movies[1] from users where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';
select name , top_movies from users where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // we select only the
collection
You can index a collection, let’s query on a movie name:
create index top_movies on users(top_movies);
select * from users where top_movies contains 'gatsby';
MAPS
LET’S ADD A MAP COLUMN TO TABLE USERS:
ALTER TABLE users ADD movies map<timestamp, text>;
LET’S INSERT DATA:
update users set movies = {'2016-06-05':'rated
movie','2016-06-06':'screened trailer'} where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';
Select movies from users;
You can also use a collection as (part of) a primary key, as long as it is frozen - we will try
this in section 6.1.
STOP here for theory update.
5. Copy data from/to a CSV file
Let’s start working on some real data.
5.1. Copy from a CSV file
Drop table users;
Let’s re-create table users with below syntax and import data from a CSV file.
CREATE TABLE users(
uid int,
age int,
gender text,
occupation int,
zipcode text,
PRIMARY KEY (uid));
We have the CSV files that we will be working with in the data directory (cloud or local
installations). We need Users.csv.
copy users(uid,gender,age,occupation,zipcode) from '/data/users.csv' with
header = true;
select * from users limit 10;
5.2. Copy to a CSV file
COPY users(uid,gender,age) TO '/data/yourname.csv' WITH HEADER=TRUE;
6.1 Collections as Primary Key
Did you know we can have a collection as part of the Primary Key? So if we want to
know movies by genre, we can define genres as part of the primary key.
CREATE TABLE movies_by_genres (
movieid int,
title text,
genres frozen <set<text>>,
Year int,
PRIMARY KEY (genres, movieid));
copy movies_by_genres(movieid,title,genres,year) from '/data/movies.csv' with header =
true;
Select * from movies_by_genres limit 10;
6.2. Let's also do an exercise on User Defined Types
CREATE TYPE address (
Street text,
Number int,
Flat text);
CREATE TABLE users_udt (
Uid int,
Address frozen<address>,
Primary key (uid));
// When using the frozen keyword, you cannot update parts of a
user-defined type value. The entire value must be overwritten. Cassandra
treats the value of a frozen, user-defined type like a blob.
INSERT INTO users_udt(uid,address) VALUES (1, {street : 'Pipera', number :
34, flat : '34C'});
Select * from users_udt;
//Let’s see if we can select only parts of the UDT
select address.street from users_udt where uid = 1;
// Let’s see if we can update parts of the UDT
update users_udt set address.street = 'Vitan' where uid = 1; // will not
work
Important: from Cassandra 3.6, user-defined types may be stored in a non-frozen
form, allowing individual fields to be updated and deleted in UPDATE
statements and DELETE statements, respectively (CASSANDRA-7423). Let's
test.
CREATE TABLE users_udt_1 (
Uid int,
Address address,
Primary key (uid));
INSERT INTO users_udt_1(uid,address) VALUES (1, {street : 'Pipera', number
: 34, flat : '34C'});
Select * from users_udt_1;
update users_udt_1 set address.street = 'Vitan' where uid = 1;
Select * from users_udt_1;
7. Let’s do some data deletes:
SELECT * from users limit 10;
Delete value of a cell:
select * from users where uid = 4317;
delete age from users where uid = 4317;
select * from users where uid = 4317;
Delete a partition:
delete from users where uid = 4317;
Before we move on
//see table statistics
nodetool cfstats metro_systems.users
// see number of keys = number of partitions.
nodetool cfhistograms metro_systems users
or
nodetool tablehistograms metro_systems users
Select token(uid) from users where uid=1; // we want to see which token is
allocated for this partition
Basic data model -- update theory
Exercise 8: Time series: weather stations
CREATE TABLE temperature (
w_id text,
event_time timestamp,
temperature int,
PRIMARY KEY (w_id,event_time)
) with clustering order by (event_time DESC);
Let’s import data in the temperature table:
copy temperature(w_id,event_time,temperature) from '/data/weatherdata.csv' with
header = true and maxrows = 100000;
Select * from temperature limit 30;
Let's look at the data (time & temperature) for a single weather station; what would the
query look like?
select event_time, temperature from temperature where w_id = 'PNOSCPQCNX';
What about all the temperatures between 2016-01-10 03:49:46 and 2016-01-11
16:29:59?
select temperature from temperature where w_id = 'PNOSCPQCNX' AND
event_time > '2016-01-10 03:49:46' AND event_time < '2016-01-11 16:29:59';
Average temperatures between 2016-01-10 03:49:46 and 2016-01-11 16:29:59?
select avg(temperature) from temperature where w_id = 'PNOSCPQCNX' AND
event_time > '2016-01-10 03:49:46' AND event_time < '2016-01-11 16:29:59';
Create table temperature_by_day (a solution for partitions that grow too wide):
CREATE TABLE temperature_by_day (
w_id text,
date date,
event_time timestamp,
temperature int,
PRIMARY KEY ((w_id,date),event_time)
);
Let’s import data in the new table:
copy temperature_by_day (w_id,event_time,date,temperature) from
'/data/weatherdata-ext.csv' with header = true and maxrows = 100000;
Query the weather data for a single day and then for 2 consecutive days:
SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date='2015-03-19';
SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date in
('2015-03-19','2015-03-18');
Let's see how to query using a non-primary-key column:
This query will not work, because we are trying to query without the partition key:
select * from temperature_by_day where temperature = 29;
In order for it to work, we either allow filtering of the data:
select * from temperature_by_day where temperature = 29 allow filtering; // use
this with caution, it brings all the table data locally in one coordinator.
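A gentler variant (a sketch, using the same table): if you also restrict the query to a single partition, ALLOW FILTERING only has to scan that one partition instead of the whole table:

```cql
-- filtering on temperature, but only inside one (w_id, date) partition
SELECT * FROM temperature_by_day
WHERE w_id = 'FFOERIRWOL' AND date = '2015-03-19' AND temperature = 29
ALLOW FILTERING;
```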
Or we create an index on temperature
Create index on temperature_by_day(temperature);
select * from temperature_by_day where temperature = 29;
Deleting old entries - WE CAN SET EXPIRING DATA AT TABLE LEVEL OR
PER ENTRY
EXPIRE DATA PER TABLE
CREATE TABLE latest_temperatures (
w_id text,
event_time timestamp,
temperature int,
PRIMARY KEY (w_id,event_time)
) WITH CLUSTERING ORDER BY (event_time DESC) AND default_time_to_live = 10;
copy latest_temperatures(w_id,event_time,temperature) from '/data/weatherdata.csv'
with header = true and maxrows = 10000;
Select * from latest_temperatures;
Repeat the select after 10 seconds - the rows will have expired.
EXPIRE DATA PER ENTRY
INSERT INTO latest_temperatures(w_id,event_time,temperature) VALUES
('12345', '2015-09-17 21:29:35', 20) USING TTL 20;
Select * from latest_temperatures; // run again after 20 s to see the row expire
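You can also watch the expiry countdown directly: the built-in TTL() function returns the remaining seconds before a cell expires (a minimal sketch on the table above):

```cql
-- remaining time-to-live, in seconds, for each temperature cell
SELECT w_id, event_time, temperature, TTL(temperature)
FROM latest_temperatures;
```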