8.2. CQL Exercises

Valentina.Crisan@academyofdata.com

Throughout the course we will work with two sets of data:

Movielens data set (real data): 6k users, 4k movies, 1 million ratings - we copy only 10k entries
Weather data (randomly generated): 500k entries

Movielens data details:

Movies.csv extract

mid  title                                genres                           year
1    Toy Story (1995)                     {Animation,Children's,Comedy}    1995
2    Jumanji (1995)                       {Adventure,Children's,Fantasy}   1995
3    Grumpier Old Men (1995)              {Comedy,Romance}                 1995
4    Waiting to Exhale (1995)             {Comedy,Drama}                   1995
5    Father of the Bride Part II (1995)   {Comedy}                         1995

Users.csv extract

uid  gender  age  occupation  zip
1    F       1    10          48067
2    M       56   16          70072
3    M       25   15          55117
4    M       45   7           0246

Ratings2.csv extract

ruid  rmid  rating  timestamp
1     1193  5       978300760
1     661   3       978302109
1     914   3       978301968
1     3408  4       978300275
1     2355  5       978824291

Denormalized data - Ratings.csv extract

uid   age  gender  occupation  zip    rating  datetime             mid  title                   year  genres
5234  25   F       0           60657  1       2000-06-19 20:40:40  31   Dangerous Minds (1995)  1995  {Drama}
1835  25   M       19          11501  3       2000-11-22 08:11:31  31   Dangerous Minds (1995)  1995  {Drama}
6035  25   F       1           78734  2       2000-04-26 01:25:35  31   Dangerous Minds (1995)  1995  {Drama}
4637  35   F       9           48302  3       2000-07-19 18:48:41  31   Dangerous Minds (1995)  1995  {Drama}
3841  45   M       18          26101  3       2000-08-11 13:14:31  31   Dangerous Minds (1995)  1995  {Drama}
2242  18   M       4           60115  5       2000-11-19 02:40:43  31   Dangerous Minds (1995)  1995  {Drama}

weather-data.csv extract

 w_id       | event_time                      | temperature
------------+---------------------------------+-------------
 PNOSCPQCNX | 2015-01-01 00:02:54.000000+0000 |          20
 PNOSCPQCNX | 2015-01-01 00:10:00.000000+0000 |          23
 PNOSCPQCNX | 2015-01-01 00:29:53.000000+0000 |          20
 PNOSCPQCNX | 2015-01-01 01:04:22.000000+0000 |          12
 PNOSCPQCNX | 2015-01-01 01:17:15.000000+0000 |          18
 PNOSCPQCNX | 2015-01-01 02:52:25.000000+0000 |          23
 PNOSCPQCNX | 2015-01-01 03:09:17.000000+0000 |          11
 PNOSCPQCNX | 2015-01-01 03:14:57.000000+0000 |          13
 PNOSCPQCNX | 2015-01-01 04:13:33.000000+0000 |          14
 PNOSCPQCNX | 2015-01-01 04:47:56.000000+0000 |          13

All the above files can be found in the data directory (both cloud and local installations):

cd /data
ls -l

Note: for the intro part we play a bit with different data types and insert data manually, before working with the above data.

CQL exercises

Reference for syntax questions:
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useAboutCQL.html

Run cqlsh

1. Create a keyspace. For local installations use metro_systems, but for cloud installations use your name (e.g. my keyspace is valentina_c), since all cloud users need to create a separate working space. You will use that name instead of metro_systems in the commands below.

CREATE KEYSPACE metro_systems WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

desc keyspaces;   // will list all keyspaces on the cluster

describe metro_systems; // you can see the replication factor & strategy set at keyspace level

Exit cqlsh and run the nodetool status command - this should give you an image of your cluster (even if it's a 1-node cluster). Note the datacenter name.

Re-enter cqlsh

Let's create another keyspace, this time with the NetworkTopologyStrategy replication strategy and replication factor 1. The datacenter name used in the replication map must match the one reported by nodetool status.

CREATE KEYSPACE metro_systems1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 };

describe metro_systems1;

Let's alter the keyspace we just created and change the durable_writes property:

ALTER KEYSPACE metro_systems1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 } AND DURABLE_WRITES = false;

OR, if we want to modify the replication factor from 1 to 3:

ALTER KEYSPACE metro_systems1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };

// cloud users run the same command on their own keyspace, e.g.:

ALTER KEYSPACE valentina_c_1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };

desc metro_systems1;

Let's drop keyspace metro_systems1:

DROP KEYSPACE metro_systems1;

Stop: let's discuss what the durable_writes option in the keyspace description means.
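For reference: with durable_writes = true (the default), every mutation goes to the commit log before the memtable, so it survives a node crash; with false, the commit log is skipped and recent writes can be lost. A minimal sketch of creating a keyspace with the option turned off (the keyspace name here is just an example):

CREATE KEYSPACE scratch_space WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 } AND DURABLE_WRITES = false; // writes to this keyspace skip the commit log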

2. Let's create and modify a column family

First, connect to the created keyspace - i.e. all further CQL commands will be executed in the context of the chosen keyspace.

USE metro_systems;

2.1. Create table users with the following structure: userid (uuid), name (text), gender (text), age (int), occupation (text), zipcode (int), and a compound primary key: partition key userid and clustering key name.

CREATE TABLE users (
    userid uuid,
    name text,
    gender text,
    age int,
    occupation text,
    zipcode int,
    PRIMARY KEY ((userid), name));

DESCRIBE users; // you can see all properties of the table

2.2. ALTER: add/delete/rename columns, alter a column type

ALTER TABLE users ADD relatives int;   // add a new column to the table

SELECT * FROM users;

ALTER TABLE users RENAME userid TO uid; // rename a column

ALTER TABLE users ALTER relatives TYPE blob;  // alter the column type from int to blob

describe users; // to see the types of the columns

Please note: not all type changes are possible (for example varchar to ascii), and newer Cassandra releases (3.0.11 / 3.10 and later) removed ALTER ... TYPE altogether.

ALTER TABLE users DROP relatives;   // delete the newly added column

SELECT * FROM users;

describe users;

DROP TABLE users; // deletes all the data of the table, including the table structure & indexes

desc tables;  // no tables

2.3. Insert values in the created table

Assuming you dropped the table users previously, let's re-create it:

CREATE TABLE users (
    uid uuid,
    name text,
    gender text,
    age int,
    occupation text,
    zipcode int,
    PRIMARY KEY ((uid), name));

We have a table, now let's insert data manually. We will also learn later how to insert data from a file, but for the moment let's do some manual inserts/updates.

Observation: the UUID can be generated automatically with uuid();

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES (5132b130-ae79-11e4-ab27-0800200c9a66, 'Andrew', 25, 'M', 'plumber', 11501);

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES (uuid(), 'Dana', 24, 'F', 'data scientist', 11501);

SELECT * FROM users;

TRACING ON; // lets you see the events happening in the background in Cassandra

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES (5132b130-ae79-11e4-ab27-0800200c9a66, 'Lili', 24, 'F', 'actress', 11503);
// note that I am inserting data in the same partition as the first insert. Given that the clustering key is different, the data will be added to the existing partition.

TRACING OFF;

SELECT * FROM users; // remember that the way Cassandra stores data is different from the CQL view.

Remember for the next section: Cassandra doesn't read (by default) before performing writes.

2.4. Let's also do some upserts

UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 AND name = 'Andrew'; // we update the age of an existing user

Select * from users; // to see the change

UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a77 AND name = 'Andrew'; // this UUID - the partition key - does not exist, thus the update will do an insert

SELECT * from users; // let's check that everything went well

UPDATE users SET age = 44 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 AND name = 'Elaine'; // the partition key exists but not with this clustering key; a new entry inside the existing partition will be generated

Select * from users;

Select * from users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66; // to see all entries in this partition

Let's do some more inserts. The first row inserted will have a new partition key and clustering key. The second row will have the same partition key but a different clustering key.

insert into users (uid, name, age, gender, occupation, zipcode) values (uuid(),'John',35,'M','data analyst',11502); // see the uuid generated here and use it for the next insert

insert into users (uid, name, age, gender, occupation, zipcode) values (90278d0f-6053-4c87-a1f3-69a05853d2a8,'Elaine',34,'F','engineer',11502);
Select * from users;

Insert into users (uid, name, age) values (90278d0f-6053-4c87-a1f3-69a05853d2a8, 'Elaine', 43); // see how insert can be used as an update as well

Before we move on, type exit in cqlsh and let's see some stats for our table:

nodetool cfstats keyspace.table   // generic command; see below for the metro_systems keyspace

nodetool cfstats metro_systems.users  // cloud users use yourname.users

See that the data is still in the memtable and no SSTables have been generated; see also the estimated number of partitions (Number of keys (estimate)).

2.5. Lightweight transactions (LWT)

Re-enter cqlsh and your respective keyspace:

Use metro_systems;

Let's try an update on another non-existing entry, but with LWT:

UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a88 AND name = 'Andrew' IF EXISTS;

Now let's try an update on an existing entry with LWT:

UPDATE users SET age = 44 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a77 AND name = 'Andrew' IF EXISTS;
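A related conditional form you can try (not part of the original exercise list): INSERT ... IF NOT EXISTS only writes the row when no row with that primary key exists, and returns an [applied] column telling you whether it did:

INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES (5132b130-ae79-11e4-ab27-0800200c9a66, 'Andrew', 25, 'M', 'plumber', 11501) IF NOT EXISTS; // [applied] = False - this row already exists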

Exit cqlsh.

Let's force a flush to write the data on disk in an SSTable. Run:

nodetool flush metro_systems users // cloud users, please make sure you use your own keyspace

Go to cd /var/lib/data/cassandra/. Choose your keyspace directory and the directory of the table. Use ls -l to list the SSTable files generated after the flush.

If you would like to see what an SSTable looks like, you can use sstabledump:

sstabledump nameofthedatafile // e.g. sstabledump mc-1-big-Data.db

sstabledump -e nameofthedatafile // will show us the partitions in the specific SSTable

STOP here for theory update. Queries section.

3. Let's do some queries on the data - we will come back to queries, this is just to get us going with the syntax

cd /

Re-enter CQL with the cqlsh command and use your keyspace;

SELECT * FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66;

// You can also pick the columns to display instead of choosing all data.

SELECT age, gender FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66;

SELECT * FROM users LIMIT 3; // you can limit the number of returned values

// You can fine-tune the display order using the ORDER BY clause, either descending or ascending. The partition key must be defined in the WHERE clause, and the ORDER BY clause defines the clustering column to use for ordering. ORDER BY works only on clustering keys.

SELECT * FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66 ORDER BY name DESC;

// You can query for a partition key that is part of a list of partition keys

SELECT * FROM users where uid IN (5132b130-ae79-11e4-ab27-0800200c9a66, edb032e8-5c4e-403f-8dfc-43c2dcd69dba);

You can return the number of records that fulfil a request:

SELECT COUNT(*) FROM users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66;

You can ask for the distinct partitions in a table (DISTINCT works only on partition keys):

select distinct uid from users;

And also count how many distinct partitions exist:

select distinct uid, count(*) from users;

3.1. CREATE AN INDEX

SELECT * FROM users WHERE gender = 'M'; // WILL NOT WORK

// let's create an index

CREATE INDEX users_by_gender ON users(gender);

SELECT * FROM users WHERE gender = 'F';

// let's count how many records are in the query outcome

SELECT count(*) FROM users WHERE gender = 'F';

Tracing on;

select * from users where gender = 'F';  // to see all the operations C* is doing in the background (even on a 1-node cluster you can see loads of operations).

Can you see that two tables are queried on the node in order to get the data - the index table and the users table?
  Executing single-partition query on users.users_by_gender
  Executing single-partition query on users -- appears 3 times. Can you tell why?
We will come back to this point in the section Indexing in C*.

4. Collections

SETS

ALTER TABLE users ADD emails set<text>;

Let's insert some data in the set:

insert into users (uid,name,age,emails,gender,occupation,zipcode) values (uuid(),'Lilian',18,{'lilian@gmail.com','lilian_briggs@oxford.com'},'F','student',11503);

Select * from users;

Let's add some data to existing entries:

UPDATE users SET emails = emails + {'elaine@yahoo.com'} WHERE uid = 89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';

Let's delete some data from existing entries:

UPDATE users SET emails = emails - {'elaine@yahoo.com'} WHERE uid = 89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';

SELECT * FROM users;

create index on users(emails);

select * from users where emails CONTAINS 'lilian@gmail.com';

LISTS

LET'S ADD A LIST COLUMN TO TABLE USERS:

ALTER TABLE users ADD top_movies list<text>;

LET'S ADD SOME DATA TO EXISTING ENTRIES:

update users set top_movies = ['titanic','gatsby'] where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select * from users;

update users set top_movies = ['romeo and juliet'] + top_movies where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // prepend

Select * from users;

update users set top_movies = top_movies + ['the revenant'] where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // append

Let's replace position 1 (titanic) with blood diamonds:

update users set top_movies[1] = 'blood diamonds' where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select * from users;

Let's delete entry 1 from the list: // blood diamonds

delete top_movies[1] from users where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

select name, top_movies from users where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // we select only the collection

You can index a collection; let's query on a movie name:

create index top_movies on users(top_movies);

select * from users where top_movies contains 'gatsby';

MAPS

LET'S ADD A MAP COLUMN TO TABLE USERS:

ALTER TABLE users ADD movies map<timestamp, text>;

LET'S INSERT DATA:

update users set movies = {'2016-06-05':'rated movie','2016-06-06':'screened trailer'} where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';

Select movies from users;
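Individual map entries can also be set or removed by key - not in the original hand-out, but a quick sketch worth trying (the date keys are just examples):

update users set movies['2016-06-07'] = 'bought ticket' where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // set a single entry by key

delete movies['2016-06-05'] from users where uid = 89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine'; // remove a single entry by key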

You can use a collection as a primary key.

STOP here for theory update.

5. Copy data from/to a CSV file

Let's start working on some real data.

5.1. Copy from a CSV file

Drop table users;

Let's re-create table users with the below syntax and import data from a CSV file:

CREATE TABLE users(
    uid int,
    age int,
    gender text,
    occupation int,
    zipcode text,
    PRIMARY KEY (uid));

WE HAVE IN THE DATA DIRECTORY (CLOUD OR LOCAL INSTALLATIONS) THE CSV FILES THAT WE WILL BE WORKING WITH. WE NEED Users.csv.

copy users(uid,gender,age,occupation,zipcode) from '/data/users.csv' with header = true;

select * from users limit 10;

5.2. Copy to a CSV file

COPY users(uid,gender,age) TO '/data/yourname.csv' WITH HEADER=TRUE;
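COPY accepts more options than HEADER (DELIMITER, MAXROWS, SKIPROWS, among others); for example, to export with a pipe separator instead of a comma:

COPY users(uid,gender,age) TO '/data/yourname.csv' WITH HEADER=TRUE AND DELIMITER='|';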

6.1. Collections as primary key

Did you know we can have a collection as part of the primary key? It has to be frozen. So if we want to look up movies by their genres, we could define genres as part of the primary key.

CREATE TABLE movies_by_genres (
    movieid int,
    title text,
    genres frozen<set<text>>,
    year int,
    PRIMARY KEY (genres, movieid));

copy movies_by_genres(movieid,title,genres,year) from '/data/movies.csv' with header = true;

Select * from movies_by_genres limit 10;
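Since genres is the partition key, you can now look movies up by an exact genre set; a small sketch using values from the Movies.csv extract above (the whole frozen set must match - this is not a per-genre search):

select * from movies_by_genres where genres = {'Comedy','Romance'};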

6.2. Let's also do an exercise on user-defined types

CREATE TYPE address (
    street text,
    number int,
    flat text);

CREATE TABLE users_udt (
    uid int,
    address frozen<address>,
    PRIMARY KEY (uid));

// When using the frozen keyword, you cannot update parts of a user-defined type value. The entire value must be overwritten. Cassandra treats the value of a frozen, user-defined type like a blob.

INSERT INTO users_udt(uid,address) VALUES (1, {street : 'Pipera', number : 34, flat : '34C'});

Select * from users_udt;

// Let's see if we can select only part of the UDT
select address.street from users_udt where uid = 1;

// Let's see if we can update part of the UDT

update users_udt set address.street = 'Vitan' where uid = 1; // will not work, the UDT is frozen

Important: from CQL 3.4.2 (Cassandra 3.6), user-defined types may be stored in a non-frozen form, allowing individual fields to be updated and deleted in UPDATE statements and DELETE statements, respectively (CASSANDRA-7423). Let's test.

CREATE TABLE users_udt_1 (
    uid int,
    address address,
    PRIMARY KEY (uid));

INSERT INTO users_udt_1(uid,address) VALUES (1, {street : 'Pipera', number : 34, flat : '34C'});

Select * from users_udt_1;

update users_udt_1 set address.street = 'Vitan' where uid = 1; // works, the UDT is non-frozen

Select * from users_udt_1;
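The same CASSANDRA-7423 change also allows deleting a single field of a non-frozen UDT; a quick sketch to try:

delete address.number from users_udt_1 where uid = 1; // only the number field becomes null

Select * from users_udt_1;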



7. Let's do some data deletes:

SELECT * from users limit 10;

Delete the value of a cell:

select * from users where uid = 4317;

delete age from users where uid = 4317;

select * from users where uid = 4317;

Delete a partition:

delete from users where uid = 4317;

Before we move on:

// see table statistics

nodetool cfstats metro_systems.users

// see number of keys = number of partitions

nodetool cfhistograms metro_systems users
or
nodetool tablehistograms metro_systems users

Select token(uid) from users where uid = 1; // we want to see which token is allocated for this partition

Basic data model -- update theory

Exercise 8: Time series: weather stations

CREATE TABLE temperature (
    w_id text,
    event_time timestamp,
    temperature int,
    PRIMARY KEY (w_id,event_time)
) with clustering order by (event_time DESC);

Let's import data in the temperature table:

copy temperature(w_id,event_time,temperature) from '/data/weatherdata.csv' with header = true and maxrows = 100000;

Select * from temperature limit 30;

Let's look for the data (time & temperature) of a single weather station - what would the query look like?

select event_time, temperature from temperature where w_id = 'PNOSCPQCNX';
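Because the table clusters event_time in descending order, the newest readings come back first, so the latest measurement per station is one cheap query away; a small sketch:

select event_time, temperature from temperature where w_id = 'PNOSCPQCNX' limit 1; // newest reading, thanks to CLUSTERING ORDER BY (event_time DESC)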

What about all the temperatures between 2016-01-10 03:49:46 and 2016-01-11 16:29:59?

select temperature from temperature where w_id = 'PNOSCPQCNX' AND event_time > '2016-01-10 03:49:46' AND event_time < '2016-01-11 16:29:59';

Average temperature between 2016-01-10 03:49:46 and 2016-01-11 16:29:59?

select avg(temperature) from temperature where w_id = 'PNOSCPQCNX' AND event_time > '2016-01-10 03:49:46' AND event_time < '2016-01-11 16:29:59';

Create table temperature_by_day (a solution for wide rows: partitions are bucketed per station per day):

CREATE TABLE temperature_by_day (
    w_id text,
    date date,
    event_time timestamp,
    temperature int,
    PRIMARY KEY ((w_id,date),event_time)
);

Let's import data in the new table:

copy temperature_by_day (w_id,event_time,date,temperature) from '/data/weatherdata-ext.csv' with header = true and maxrows = 100000;

Query the weather data for a single day and then for 2 consecutive days:

SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date='2015-03-19';

SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date in ('2015-03-19','2015-03-18');

Let's see how to query using a non-primary-key column.

This query will not work, because we are trying to query without the partition key:

select * from temperature_by_day where temperature = 29;

To make it work, either we allow filtering of the data:

select * from temperature_by_day where temperature = 29 allow filtering; // use this with caution, it brings all the table data locally onto one coordinator.

Or we create an index on temperature:

create index on temperature_by_day(temperature);

select * from temperature_by_day where temperature = 29;

Deleting old entries - WE CAN SET EXPIRING DATA AT TABLE LEVEL OR PER ENTRY

EXPIRE DATA PER TABLE

CREATE TABLE latest_temperatures (
    w_id text,
    event_time timestamp,
    temperature int,
    PRIMARY KEY (w_id,event_time)
) WITH CLUSTERING ORDER BY (event_time DESC) AND default_time_to_live = 10;

copy latest_temperatures(w_id,event_time,temperature) from '/data/weatherdata.csv' with header = true and maxrows = 10000;

Select * from latest_temperatures;

Repeat the select after 10 seconds - the rows will have expired.

EXPIRE DATA PER ENTRY

INSERT INTO latest_temperatures(w_id,event_time,temperature) VALUES ('12345', '2015-09-17 21:29:35', 20) USING TTL 20;

Select * from latest_temperatures; // run again after 20 s - the entry will be gone
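If you want to watch the countdown, CQL can report the remaining time to live of a cell; a quick sketch (run it before the 20 seconds elapse):

SELECT w_id, event_time, TTL(temperature) FROM latest_temperatures WHERE w_id = '12345'; // remaining seconds before the cell expires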
