11/17/24, 5:44 PM                                                                 lab1_jupyter_importing-data
Importing Data in R
                          Estimated time needed: 15 minutes
                          Objectives
                          After completing this lab you will be able to:
                              Import CSV and Excel files
                              Access rows and columns of a dataset
                              Access R's built-in datasets
                          Table of Contents
                                About the Dataset
                                Reading CSV Files
                                Reading Excel Files
                                Accessing Rows and Columns from dataset
                                Accessing Built-in Datasets in R
                                                                   About the Dataset
                          Movies dataset
                          Here we have a dataset that includes one row for each movie, with several columns for
                          each movie characteristic:
                              name - Name of the movie
                              year - Year the movie was released
                              length_min - Length of the movie (minutes)
                              genre - Genre of the movie
                              average_rating - Average rating on IMDB
                              cost_millions - Movie's production cost (millions in USD)
https://labs.cognitiveclass.ai/v2/tools/jupyterlab?ulid=ulid-6080b253d3a4cb69b2885c96311644ebe83b4200             1/7
                                foreign - Is the movie foreign (1) or domestic (0)?
                                age_restriction - Age restriction for the movie
                          Let's learn how to import and read data from two common types of files used to store
                          tabular data (data stored in a table or a spreadsheet):
                               CSV files (.csv)
                               Excel files (.xls or .xlsx)
                          To begin, we'll need to download the data!
                                                                  Download the Data
                          We've made it easy for you to get the data, which we've hosted online. Simply run the
                          code cell below (Shift + Enter) to download the data to your current folder.
           In [1]: # Download datasets
                          # CSV file
                          download.file("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/dataset/movies-db.csv",
                                        destfile="movies-db.csv")
                          # XLS file
                          download.file("https://cf-courses-data.s3.us.cloud-object-storage.appdom
                                        destfile="movies-db.xls")
                          If you ran the cell above, you have now downloaded the following files to your
                          current folder:
                                  movies-db.csv
                                  movies-db.xls
                                                                   Reading CSV Files
                          What are CSV files?
                          Let's read data from a CSV file. CSV (Comma Separated Values) is one of the most
                          common formats of structured data you will find. These files contain data in a table
                          format, where in each row, columns are separated by a delimiter -- traditionally, a
                          comma (hence comma-separated values).
                          Usually, the first line in a CSV file contains the column names for the table itself. CSV
                          files are popular because you do not need a particular program to open them.
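                          As a minimal sketch of this layout, the snippet below holds a tiny CSV in a string
                          and parses it with read.csv through its text argument. The three rows are made-up
                          placeholders, not taken from movies-db.csv.

```r
# A tiny CSV held in a string: the first line is the header,
# and each following line is one row with comma-separated columns.
# The rows are made-up placeholders, not taken from movies-db.csv.
csv_text <- "name,year,length_min
Movie A,2001,95
Movie B,2010,120
Movie C,1995,88"

# read.csv() normally takes a file path; the `text` argument lets it
# parse a string instead, which keeps this sketch self-contained.
tiny <- read.csv(text = csv_text)

names(tiny)  # column names come from the header line
nrow(tiny)   # number of data rows (the header is not counted)
```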
                          Reading CSV files in R
                          In the movies-db.csv file, the first line of text is the header (names of each of the
                          columns), followed by rows of movie information.
                          To read CSV files into R, we use the core function read.csv.
                          read.csv is easy to use. All you need is the filepath to the CSV file. Let's try loading the
                          file using the filepath to the movies-db.csv file we downloaded earlier:
           In [ ]: # Load the CSV table into the my_data variable.
                   my_data <- read.csv("movies-db.csv")
                   my_data
                          The data was loaded into the my_data variable. But instead of viewing all the data at
                          once, we can use the head function to take a look at only the top six rows of our table,
                          like so:
           In [ ]: # Print out the first six rows of my_data
                   head(my_data)
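                          head also accepts an n argument to control how many rows are shown, and tail does
                          the same from the bottom of the table. A small sketch, using a made-up placeholder
                          data frame rather than the movies data:

```r
# Placeholder data frame with ten made-up rows (not from movies-db.csv).
demo <- data.frame(id = 1:10, value = (1:10) * 10)

head(demo)         # first six rows by default
head(demo, n = 3)  # first three rows only
tail(demo, n = 2)  # last two rows
```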
                          Additionally, you may want to take a look at the structure of your newly created table. R
                          provides us with a function that summarizes an entire table's properties, called str .
                          Let's try it out.
           In [ ]: # Prints out the structure of your table.
                   str(my_data)
                          When we loaded the file with the read.csv function, we only had to pass it one
                          argument -- the path to our desired file.
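                          The file path is the only required argument, but read.csv accepts optional ones as
                          well. The sketch below is an illustration only: it writes a small semicolon-separated
                          file with made-up rows to a temporary location, then reads it back while setting
                          header, sep, and stringsAsFactors explicitly.

```r
# Write a small semicolon-separated file to a temporary location.
# The rows are made-up placeholders, not taken from movies-db.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c("name;year", "Movie A;2001", "Movie B;2010"), tmp)

# Besides the file path, read.csv() accepts optional arguments such as:
#   header           - whether the first line holds column names (default TRUE)
#   sep              - the column delimiter (default ",")
#   stringsAsFactors - whether text columns become factors
semi <- read.csv(tmp, header = TRUE, sep = ";", stringsAsFactors = FALSE)

str(semi)
unlink(tmp)  # remove the temporary file
```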
                          Coding Exercise: in the code cell below, get the summary of the my_data data frame.
           In [ ]: # Write your code below. Don't forget to press Shift+Enter to execute the ce
                                                                  Reading Excel Files
                          Reading XLS (Excel Spreadsheet) files is similar to reading CSV files, but there's one
                          catch -- base R does not have a native function to read them. Thankfully, R has an
                          extremely large repository of user-contributed packages, called CRAN. From there, we
                          can download a package that lets us read XLS files.
                          To download a package, we use the install.packages function (this may take a few
                          minutes because it is a large library). Once installed, you do not need to install the
                          same library again unless you uninstall it.
                          Whenever you use a library that is not native to R, you have to load it into the R
                          environment after you install it. In other words, you install it only once, but you
                          must load it into R in every new session. To do so, use the library function, which
                          makes everything in that library available in R.
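                          One way to combine the two steps is sketched below. The helper name load_or_install
                          is hypothetical (it is not part of R or of this lab); it installs a package only when
                          it is missing and then loads it. The demonstration call uses "stats", which ships
                          with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # one-time download from CRAN (needs internet access)
  }
  library(pkg, character.only = TRUE)  # needed in every new session
}

# "stats" ships with R, so this call only loads it and installs nothing.
load_or_install("stats")

# For this lab you would call: load_or_install("readxl")
```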
           In [ ]: # Load the "readxl" library into the R environment.
                   library(readxl)
                          Now that we have our library and its functions ready, we can move on to actually reading
                          the file. In readxl , there is a function called read_excel , which does all the work for
                          us. You can use it like this:
           In [ ]: # Read data from the XLS file and attribute the table to the my_excel_data v
                   my_excel_data <- read_excel("movies-db.xls")
                          Since my_excel_data is now a dataframe in R, much like the one we created out of
                          the CSV file, all of the native R functions can be applied to it, like head and str .
           In [ ]: # Prints out the structure of your table.
                   # Tells you how many rows and columns there are, and the names and type of e
                   # This should be the very same as the other table we created, as they are th
                   str(my_excel_data)
                          Much like the read.csv function, read_excel takes as its main parameter the path
                          to the desired file.
                            [Tip] A library is a collection of functions and classes that perform specific
                            operations. You can install and use libraries to add functions that are not
                            included in core R. For example, the readxl library adds functions to read data
                            from Excel files.
                            It's important to know that there are many other libraries too which can be used for a
                            variety of things. There are also plenty of other libraries to read Excel files -- readxl is
                            just one of them.
                                                     Accessing Rows and Columns
                          Whenever we use functions to read tabular data in R, the data is stored by default in
                          a Data Frame -- R's primary structure for tabular data. Data Frames are extremely
                          versatile, and R gives us many options to manipulate them.
                          Suppose we want to access the "name" column of our dataset. We can directly
                          reference the column name on our data frame to retrieve this data, like this:
           In [ ]: # Retrieve a subset of the data frame consisting of the "name" column
                   my_data['name']
                          Another way to do this is the $ notation, which returns the column as a vector:
           In [ ]: # Retrieve the data for the "name" column in the data frame.
                   my_data$name
                          You can also do the same thing using double square brackets, which return the
                          name column as a vector.
           In [ ]: my_data[["name"]]
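                          The difference between the three forms is the type of object returned. The sketch
                          below checks this with class on a made-up placeholder data frame (not the movies
                          data).

```r
# Placeholder data frame with made-up rows (not from movies-db.csv).
movies <- data.frame(
  name = c("Movie A", "Movie B", "Movie C"),
  length_min = c(95, 120, 88),
  stringsAsFactors = FALSE
)

class(movies["name"])    # "data.frame": single brackets keep the table shape
class(movies$name)       # "character": $ returns the column as a vector
class(movies[["name"]])  # "character": double brackets also return a vector

# The two vector forms return the same values.
identical(movies$name, movies[["name"]])
```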
                          Similarly, any particular row of the dataset can also be accessed. For example, to get the
                          first row of the dataset with all column values, we can use:
           In [ ]: # Retrieve the first row of the data frame.
                   my_data[1,]
                          The first value before the comma represents the row of the dataset and the second
                          value (which is blank in the above example) represents the column of the dataset to be
                          retrieved. By setting the first number as 1 we say we want data from row 1. By leaving the
                          column blank we say we want all the columns in that row.
                          We can specify more than one column or row by using c, the concatenate function. By
                          using c to combine several names or indices into a vector, we tell R which rows or
                          columns we want out of the data frame. Let's try it out.
           In [ ]: # Retrieve the first row of the data frame, but only the "name" and "length_
                   my_data[1, c("name","length_min")]
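                          The same pattern extends to several rows at once. A short sketch, again on a
                          made-up placeholder data frame rather than the movies data:

```r
# Placeholder data frame with made-up rows (not from movies-db.csv).
movies <- data.frame(
  name = c("Movie A", "Movie B", "Movie C"),
  year = c(2001, 2010, 1995),
  length_min = c(95, 120, 88),
  stringsAsFactors = FALSE
)

movies[c(1, 3), ]                         # rows 1 and 3, all columns
movies[, c("name", "year")]               # all rows, two columns
movies[c(1, 3), c("name", "length_min")]  # rows 1 and 3, two columns
movies[2:3, ]                             # a range of consecutive rows
```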
                                                   Accessing Built-in Datasets in R
                          R provides various built-in datasets for users to utilize for different purposes. To
                          see which datasets are available, R provides a simple function -- data -- that lists
                          the names of all available datasets, each with a short description. The ones in the
                          datasets package are all inbuilt.
           In [ ]: # Displays a list of the inbuilt datasets. Opens in a new "window".
                   data()
                          As you can see, there are many different datasets already inbuilt in the R environment.
                          Having to go through each of them to take a look at their structure and try to find out
                          what they represent might be very tiring. Thankfully, R has documentation present for
                          each inbuilt dataset. You can take a look at that by using the help function.
                          For example, if we want to know more about the women dataset, we can use the
                          following function:
           In [ ]: # Opens up the documentation for the inbuilt "women" dataset.
                   help(women)
                          Since the datasets listed are inbuilt, you do not need to import or load them to use them.
                          If you reference them by their name, R already has the data frame ready.
           In [ ]: women
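                          Because an inbuilt dataset is an ordinary data frame, the functions used earlier
                          work on it directly. The sketch below inspects women and also shows, as one
                          possible approach, how the listing returned by data can be limited to the datasets
                          package and searched by name.

```r
# `women` ships with R in the datasets package, so no import is needed.
head(women, n = 3)  # first three rows
str(women)          # column names and types
dim(women)          # number of rows and columns

# data() can also be limited to a single package; the `results` matrix
# it returns has an "Item" column holding the dataset names.
builtin <- data(package = "datasets")$results[, "Item"]
"women" %in% builtin
```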
                          Coding Exercise: in the code cell below, get the CO2 dataset
           In [ ]: # Write your code below. Don't forget to press Shift+Enter to execute the ce
                          Scaling R with big data
                          As you learn more about R, if you are interested in exploring platforms that can help you
                          run analyses at scale, you might want to sign up for a free account on IBM Watson
                          Studio, which allows you to run analyses in R with two Spark executors for free.
                          Authors
                          Hi! It's Iqbal Singh and Walter Gomes, the authors of this notebook. We hope you found
                          it easy to learn how to import data into R! Feel free to connect with us if you have any
                          questions.
                          Other Contributors
                          Yan Luo
                                             © IBM Corporation 2021. All rights reserved.
                          <!--
                          Change Log
                                          Date (YYYY-MM-DD) Version Changed By Change Description
                                          2021-03-04        2.0     Yan        Added coding tasks
                          --!>