
Snowflake Mini Project

Project Activities –

1) Warehouse Creation:

Create a warehouse named SFTRNG. Document the parameters with which the WH is created and the reasons behind the options used.

Solution:

Warehouse: A virtual warehouse in Snowflake is a cluster of compute resources provisioned on demand to execute user queries.

Parameters used:

1. Name: the name given to the warehouse. Here I used SFTRNG as the warehouse name.

2. Size: sets the size of the warehouse. Here the warehouse size is Small (2 credits/hour).

3. Maximum Clusters: defines the maximum number of clusters that can be used while processing queries.

4. Minimum Clusters: defines the minimum number of clusters that can be used while processing queries.

5. Scaling Policy: there are two scaling policies, Standard (the default) and Economy. I used the default (Standard) policy.

6. Auto Suspend: defines the idle time after which the warehouse automatically shuts down when it is not in use.

7. Auto Resume: set to Yes by default; it automatically resumes the warehouse when we run a query against it.
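
The warehouse was created through the web UI; as a rough sketch, the equivalent SQL would look like the following (the cluster counts and the auto-suspend value are assumptions, since only the size and scaling policy are stated above; MIN_CLUSTER_COUNT / MAX_CLUSTER_COUNT only apply on editions that support multi-cluster warehouses):

CREATE WAREHOUSE IF NOT EXISTS SFTRNG
  WAREHOUSE_SIZE = 'SMALL'      -- 2 credits/hour
  MIN_CLUSTER_COUNT = 1         -- assumed minimum cluster count
  MAX_CLUSTER_COUNT = 2         -- assumed maximum cluster count
  SCALING_POLICY = 'STANDARD'   -- default scaling policy
  AUTO_SUSPEND = 300            -- assumed idle time in seconds before auto-suspend
  AUTO_RESUME = TRUE;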

2) Resource Monitor:

Create a Resource Monitor named TRNG_TRACK to track the WH SFTRNG. Document the parameters with which the Resource Monitor is created and the reasons behind the options used.

Solution:

Parameters used:

1. Name: the name used is TRNG_TRACK.

2. Credit quota: sets the credit limit for the resource monitor. Here it is set to 2.

3. Monitor level: the monitor level can be either warehouse or account; since we want to monitor a warehouse, I chose warehouse.

4. Actions and Notifications: used to send notifications and take actions when a given share of the warehouse credit quota has been used.
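
The monitor was created through the UI as well; a minimal SQL sketch of the same configuration (the trigger thresholds shown are assumptions, as the document only states the quota and the notification behaviour):

CREATE OR REPLACE RESOURCE MONITOR TRNG_TRACK
  WITH CREDIT_QUOTA = 2
  TRIGGERS ON 80 PERCENT DO NOTIFY     -- assumed notification threshold
           ON 100 PERCENT DO SUSPEND;  -- assumed action when the quota is reached

-- attach the monitor to the SFTRNG warehouse
ALTER WAREHOUSE SFTRNG SET RESOURCE_MONITOR = TRNG_TRACK;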

3) Unloading data:

Unload the data from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."LINEITEM"

1) Whole data into multiple CSV files with delimiter | (pipe)

2) Whole data into a single file with delimiter |-|

Document the queries used to unload the data. The unload has to happen to two places, the local system and S3, assigning appropriate names.

Solution:

Unloading data to the local system:

1) Whole data into multiple CSV files with delimiter | (pipe):

select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."LINEITEM";

create or replace file format my_unload_format
type = 'CSV'
field_delimiter = '|'
compression = NONE;

create or replace stage my_unload_stage
file_format = my_unload_format;

-- clean up any files left over from earlier runs
remove @my_unload_stage pattern='.*.csv.gz';

-- SINGLE = FALSE splits the unload into multiple files
copy into @my_unload_stage
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."LINEITEM"
single = FALSE
max_file_size = 5368709120;

list @my_unload_stage;

-- download the unloaded files to the local system (run from SnowSQL)
get @my_unload_stage/data file://C:\Users\VMAdmin\Documents\snowflake\unload\assign1;

Command prompt and output (screenshots):

2) Whole data into a single file with delimiter |-|:

select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."LINEITEM";

create or replace file format my_unload_format
type = 'CSV'
field_delimiter = '|-|'
compression = NONE;

create or replace stage my_unload_stage
file_format = my_unload_format;

remove @my_unload_stage pattern='.*.csv.gz';

-- SINGLE = TRUE produces one output file; MAX_FILE_SIZE is raised to the 5 GB limit
copy into @my_unload_stage
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."LINEITEM"
single = TRUE
max_file_size = 5368709120;

list @my_unload_stage;

get @my_unload_stage/data file://C:\Users\VMAdmin\Documents\snowflake\unload\assign2;

Unloading data to S3 (external storage):

For unloading to AWS I followed the steps below:

1) Logged in to the AWS account, created a bucket called “sfsneha2022”, and inside that bucket created a folder named “snehaunload”.

2) After that, created an IAM policy:

3) Code used in policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion"
      ],
      "Resource": "arn:aws:s3:::sfsneha2022/snehaload/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::sfsneha2022",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "snehaload/*"
          ]
        }
      }
    }
  ]
}

4) Created a role for this policy.

5) Permissions given to this role:

6) Trust relationship entities used:

7) Code of the trusted entities:


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::104779808655:user/9g9b-s-insr9275"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "ZZ49050_SFCRole=3_p2K0gjOuTuytHgvsVQdEgBJ+AB8="
        }
      }
    }
  ]
}

8) Code run in the Snowflake worksheet (the storage integration SFTRNG_aws referenced here is the one created in the section on loading data from AWS):

create or replace stage my_ext_unload_stage
url = 's3://sfsneha2022/snehaload/unload/'
storage_integration = SFTRNG_aws
file_format = MYCSVFORMAT;

copy into @my_ext_unload_stage
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."CUSTOMER";

9) Output: the files are unloaded into the AWS S3 bucket.

4) DB objects creation:

Create a database SFTRNG. Create a schema TRNG.

Create a table LINEITEM_Local by doing data profiling and determining the max length of each column. (Note: do not use the default column data types.)

Solution:

Creating database SFTRNG:

Creating schema TRNG:

Creating table LINEITEM_Local:
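
The document shows the creation of these objects as screenshots only; the sketch below is an assumption of what the DDL would look like, with the column lengths coming from profiling the TPC-H LINEITEM sample data (for example with MAX(LENGTH(...))) rather than from the default types:

-- example profiling query used to pick the VARCHAR lengths
SELECT MAX(LENGTH(L_SHIPINSTRUCT)), MAX(LENGTH(L_SHIPMODE)), MAX(LENGTH(L_COMMENT))
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."LINEITEM";

CREATE DATABASE IF NOT EXISTS SFTRNG;
CREATE SCHEMA IF NOT EXISTS SFTRNG.TRNG;

CREATE OR REPLACE TABLE SFTRNG.TRNG.LINEITEM_LOCAL (
  L_ORDERKEY      NUMBER(12,0),
  L_PARTKEY       NUMBER(12,0),
  L_SUPPKEY       NUMBER(12,0),
  L_LINENUMBER    NUMBER(2,0),
  L_QUANTITY      NUMBER(12,2),
  L_EXTENDEDPRICE NUMBER(12,2),
  L_DISCOUNT      NUMBER(12,2),
  L_TAX           NUMBER(12,2),
  L_RETURNFLAG    VARCHAR(1),
  L_LINESTATUS    VARCHAR(1),
  L_SHIPDATE      DATE,
  L_COMMITDATE    DATE,
  L_RECEIPTDATE   DATE,
  L_SHIPINSTRUCT  VARCHAR(25),
  L_SHIPMODE      VARCHAR(10),
  L_COMMENT       VARCHAR(44)
);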

5) Loading data:

From Local – load all the multiple files from the local folder we unloaded into the LINEITEM_Local table.

From GCS – load the single whole file from the GCS bucket we unloaded into the LINEITEM_GCP table.

From AWS – load the multiple files from the AWS S3 bucket we unloaded into the LINEITEM_AWS table.

Solution:

Loading data from local storage:

create or replace file format mycsvformat
type = 'CSV'
field_delimiter = '|'
skip_header = 1;

create or replace stage my_csv_stage
file_format = mycsvformat;

-- upload the unloaded files from the local system to the internal stage (run from SnowSQL)
put file://C:\Users\VMAdmin\Documents\snowflake\unload\assignment1\data*.csv @my_csv_stage;

list @my_csv_stage;

copy into LINEITEM_local from @my_csv_stage
on_error = 'continue'
FORCE = TRUE;

select * from "SFTRNG"."TRNG"."LINEITEM_LOCAL";

-- clean up the stage after loading
remove @my_csv_stage;

Loading data from GCP:

1) Logged in to the GCP dashboard.

2) Created a bucket called “snehatrng” and in that bucket created a new folder called “snehaload”.

3) Uploaded all the CSV files from the local system to the GCP bucket.

4) After that, created a role called “sneharole” in the IAM & Admin section.

5) Gave the following permissions to the role.

6) The role was created successfully.

7) Then went to the browser section and selected my bucket to add the principal.

8) Wrote the following code in a Snowflake worksheet to load the data:


create or replace storage integration SFTRNG_GCS
type = external_stage
storage_provider = gcs
enabled = true
storage_allowed_locations = ('gcs://snehatrng/snehaload/');

desc integration SFTRNG_GCS;
-- from the DESC output: STORAGE_GCP_SERVICE_ACCOUNT = agsmzvjnaw@awsapsouth1-1-153a.iam.gserviceaccount.com
-- (this is the service account granted access to the GCS bucket as principal)

create stage my_gcs_stage
storage_integration = SFTRNG_GCS
url = 'gcs://snehatrng/snehaload/'
file_format = MYCSVFORMAT;

list @my_gcs_stage;

remove @my_gcs_stage/contacts5.csv;
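
The COPY into the target table is not shown in the document; a minimal sketch, assuming a LINEITEM_GCP table with the same structure as LINEITEM_LOCAL already exists (and noting that if the |-| delimited single file is what is being loaded, a file format with field_delimiter = '|-|' would be needed instead of MYCSVFORMAT):

-- load the staged file(s) into the target table
copy into LINEITEM_GCP
from @my_gcs_stage
on_error = 'continue';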
Loading data from AWS:

1) Referring to the above question on unloading to AWS S3 for creating the bucket, policy and role.

2) Created a new folder called “snehaload” in the bucket “sfsneha2022”.

3) Uploaded the contacts files from local storage to the AWS bucket.

4) Wrote the following code in a Snowflake worksheet:

create or replace storage integration SFTRNG_aws
type = external_stage
storage_provider = s3
enabled = true
storage_aws_role_arn = 'arn:aws:iam::279714573403:role/sneharole'
storage_allowed_locations = ('s3://sfsneha2022/snehaload/');

desc integration SFTRNG_aws;
-- from the DESC output (these values go into the IAM role's trust policy):
-- STORAGE_AWS_IAM_USER_ARN = arn:aws:iam::104779808655:user/9g9b-s-insr9275
-- STORAGE_AWS_EXTERNAL_ID  = ZZ49050_SFCRole=3_p2K0gjOuTuytHgvsVQdEgBJ+AB8=

create or replace stage my_s3_stage
storage_integration = SFTRNG_aws
url = 's3://sfsneha2022/snehaload/'
file_format = MYCSVFORMAT;

list @my_s3_stage;
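
As with GCS, the final COPY into the target table is not shown; a sketch, assuming a LINEITEM_AWS table with the same structure as LINEITEM_LOCAL exists:

copy into LINEITEM_AWS
from @my_s3_stage
pattern = '.*[.]csv'      -- assumed pattern: load only the CSV files in the folder
on_error = 'continue';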

6) COPY/CLONE:

Do the copy or clone from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."ORDERS" as ORDERS_BKUP.

Mention which approach is taken and state the reason.

Solution:

select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."ORDERS";

-- clone the local ORDERS table (a copy of the sample ORDERS; its creation is shown in section 8)
create or replace table orders_bkup clone orders;

I used the CLONE command because a zero-copy clone does not take additional storage space in the database (the clone shares the source table's micro-partitions until either table is modified) and, being a metadata-only operation, it also uses fewer warehouse credits than copying the data.
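
As a quick sanity check (not part of the original steps), the clone should return the same row count as its source immediately after cloning, even though no data was physically copied:

select (select count(*) from ORDERS)      as orders_rows,
       (select count(*) from ORDERS_BKUP) as orders_bkup_rows;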

7) Time Travel:

Drop the table ORDERS_BKUP and note the query id. Restore the ORDERS_BKUP table and note the query id. Delete the data from ORDERS_BKUP and note the query id. Restore the data into ORDERS_BKUP using time travel and note the query id and query.

Solution:

drop table ORDERS_BKUP; --01a25551-0000-2023-0000-0002cc42f471

UNDROP TABLE ORDERS_BKUP; --01a25551-0000-2022-0002-cc42000104a2

SELECT * FROM ORDERS_BKUP; --01a25551-0000-2022-0002-cc42000104ae

DELETE from ORDERS_BKUP; --01a25554-0000-2022-0002-cc4200010532

-- time travel to a time just before the delete was run
select * from ORDERS_BKUP AT (timestamp => '2022-02-15 19:37:02.57'::timestamp); --01a2555c-0000-2023-0000-0002cc42f5b1

-- time travel to 1 minute ago (i.e. before we ran the delete)
select * from ORDERS_BKUP AT (offset => -60*1); --01a2555d-0000-2022-0002-cc42000105f6

-- time travel relative to the query id of the delete statement
select * from ORDERS_BKUP AT (statement => '01a25554-0000-2022-0002-cc4200010532'); --01a2555d-0000-2022-0002-cc4200010616
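
The statement that actually puts the deleted rows back is not listed above; one way to do it, as a sketch, is to read the table state just before the DELETE using BEFORE with the DELETE's query id and insert that result back into the table:

-- restore the rows removed by statement 01a25554-0000-2022-0002-cc4200010532
insert into ORDERS_BKUP
select * from ORDERS_BKUP before (statement => '01a25554-0000-2022-0002-cc4200010532');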

8) Performance:

Write a query to find the number of orders each customer has made and the total price of the orders from the ORDERS_CLUSTER table. Note the total micro-partitions being used and the micro-partitions being scanned. Increase the size of the WH and note the performance statistics. Note the difference with and without usage of the result cache. Determine if clustering can improve the performance.

Solution:

CREATE OR REPLACE TABLE ORDERS AS
SELECT * FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."ORDERS";

CREATE OR REPLACE TABLE ORDERS_CLUSTER AS
SELECT * FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10"."ORDERS";

ALTER TABLE ORDERS_CLUSTER CLUSTER BY (O_ORDERDATE, O_ORDERSTATUS);

SELECT * FROM ORDERS where (O_ORDERDATE, O_ORDERSTATUS) = ('1992-01-20', 'F');

SELECT * FROM ORDERS_CLUSTER where (O_ORDERDATE, O_ORDERSTATUS) = ('1992-01-20', 'F');

-- disable the result cache so repeated runs hit the warehouse
alter session set USE_CACHED_RESULT = FALSE;
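
The aggregation the task asks for (number of orders per customer and their total price) is not listed above; a sketch against ORDERS_CLUSTER:

select O_CUSTKEY,
       count(*)          as number_of_orders,
       sum(O_TOTALPRICE) as total_price
from ORDERS_CLUSTER
group by O_CUSTKEY
order by total_price desc;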

Total micro-partitions and micro-partitions scanned:

1) Total micro-partitions are 33 and micro-partitions scanned are also 33. Here 45.29% of the data is fetched from cache.

2) Total micro-partitions are 33 and micro-partitions scanned are also 33. Here 100% of the data is fetched from cache.

3) Total micro-partitions are 25 and micro-partitions scanned are 2. Here 0.00% of the data is fetched from cache.

4) Total micro-partitions are 25 and micro-partitions scanned are 2. Here 100% of the data is fetched from cache.

Since the clustered table scans only 2 of its 25 micro-partitions for the same filter, while the unclustered table scans all 33, clustering on (O_ORDERDATE, O_ORDERSTATUS) does improve partition pruning and query performance here.
