Reading data
Foundations of Data Analytics
Data Preparation
Vellore Institute of Technology, Chennai
July 29, 2020
Outline
Motivation
Introduction
Reading data
Motivation
Types of Data
Structured data - Excel file
Semi-structured data - JSON, XML file
- https://json.org/example.html
Unstructured data - text file
- https://rdp.cme.msu.edu/tutorials/init_process/RDPtutorial_INITIAL-PROCESS.html
Data storage
In databases, e.g. NoSQL stores such as MongoDB
In websites
Introduction
First step in Data Analytics
Collecting data from different sources, which may be in various
formats such as flat files (.csv, .txt), Excel files, JSON, XML, etc.
Data Collection
Data Cleaning
Data Understanding
Raw data -> Clean data -> Data Analysis
Reading data
Reading flat files
Reading Excel files
Reading XML file
Reading JSON file
Reading flat files
Text or CSV files - Use read.table()
The read.table function is one of the most commonly used
functions for reading data. It has a few important arguments:
file, the name of a file, or a connection
header, logical indicating if the file has a header line
sep, a string indicating how the columns are separated
colClasses, a character vector indicating the class of each
column in the dataset
nrows, the maximum number of rows to read from the file
comment.char, a character string indicating the comment
character
skip, the number of lines to skip from the beginning
stringsAsFactors, should character variables be coded as
factors? (The default is FALSE since R 4.0.0.)
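The arguments listed above can be tried out on a small file. The following is a minimal sketch, not part of the original slides: the file contents, the semicolon separator and the column names are made up for illustration, and the file is written to a temporary location so the example is self-contained.

```r
# Write a small, hypothetical flat file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Exported from a hypothetical loan system",  # metadata line, dropped via skip
  "# the rows below are illustrative only",    # comment line, dropped via comment.char
  "id;amount;purpose",
  "1;5000;car",
  "2;12000;house",
  "3;800;education",
  "4;2500;car"
), path)

loans <- read.table(
  path,
  header = TRUE,                # first remaining line holds the column names
  sep = ";",                    # columns are separated by semicolons
  skip = 1,                     # skip the metadata line at the top
  comment.char = "#",           # ignore lines starting with '#'
  nrows = 3,                    # read at most three data rows
  colClasses = c("integer", "numeric", "character"),
  stringsAsFactors = FALSE      # keep text columns as character
)
str(loans)
```

Supplying colClasses spares read.table from guessing the column types, which can help noticeably on large files.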
Reading flat files (contd.)
# Using read.table()
loan <- read.table("loans data.csv", header = TRUE, sep = ",")
str(loan)
# Using read.csv(), which defaults to header = TRUE and sep = ","
loan <- read.csv("loans data.csv")
str(loan)
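To see that the two calls are interchangeable for a plain comma-separated file, here is a small sketch that writes a made-up file to a temporary directory and compares the results. The file contents are hypothetical; only the file name follows the slide.

```r
# Create a tiny, hypothetical CSV file in a temporary directory
path <- file.path(tempdir(), "loans data.csv")
writeLines(c(
  "id,amount,purpose",
  "1,5000,car",
  "2,12000,house"
), path)

# read.table() needs header and sep spelled out ...
via_table <- read.table(path, header = TRUE, sep = ",")
# ... while read.csv() already uses them as defaults
via_csv <- read.csv(path)

# Compare the two data frames
all.equal(via_table, via_csv)
```

The two functions do differ in other defaults (for example quoting and comment handling), so the results may diverge on messier files.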
Reading Excel files
# Reading an Excel file
# Install the xlsx package once (it needs a Java runtime)
install.packages("xlsx")
# Load the package
library(xlsx)
# Read the first sheet, treating the first row as column names
loan <- read.xlsx("loan.xlsx", sheetIndex = 1, header = TRUE)
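If Java is not available, one possible alternative is the readxl package. It is not used in these slides, so the sketch below is an assumption rather than the course's method. It reads an example workbook that ships with readxl, and only runs when the package is installed.

```r
# Alternative sketch: readxl (an assumption, not the package used above)
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook bundled with readxl
  path <- readxl::readxl_example("datasets.xlsx")

  # List the sheet names, then read the first sheet
  sheets <- readxl::excel_sheets(path)
  first_sheet <- readxl::read_excel(path, sheet = 1)

  print(sheets)
  str(first_sheet)
} else {
  message("readxl is not installed; run install.packages(\"readxl\") first")
}
```

For your own data, replace the example path with the path to your file, e.g. "loan.xlsx".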
Reading XML file
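# Reading an XML file
# Install the XML package once, then load it
install.packages("XML")
library(XML)
# Load httr to fetch the file over HTTPS
library(httr)
fileurl <- "https://www.w3schools.com/xml/simple.xml"
xmldata <- GET(fileurl)
# Parse the response body as text rather than the response object
doc <- xmlTreeParse(content(xmldata, as = "text", encoding = "UTF-8"),
                    asText = TRUE, useInternalNodes = TRUE)
root <- xmlRoot(doc)
xmlName(root)  # name of the root element
names(root)    # names of its child nodes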
Reading XML file (contd.)
#Accessing parts of xml file in the same way as list
root[[1]] #accessing 1st food
root[[1]][[1]] #accessing name of the 1st food
#Extracting parts of XML file
xmlSApply(root,xmlValue)
#Extracting individual nodes of XML file
xpathSApply(root,"//name",xmlValue)
xpathSApply(root,"//price",xmlValue)
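The extracted node values can be combined into a data frame for analysis. The sketch below works offline on a short inline XML string modelled on the menu file above. The two entries and the variable names are illustrative assumptions.

```r
library(XML)

# A small inline document, modelled on the breakfast menu (illustrative data)
xml_text <- '<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price></food>
  <food><name>French Toast</name><price>$4.50</price></food>
</breakfast_menu>'

# Parse the string and get the root node
doc <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)

# Pull out the name and price nodes with XPath
food_names <- xpathSApply(root, "//name", xmlValue)
food_prices <- xpathSApply(root, "//price", xmlValue)

# Strip the dollar sign and convert prices to numbers
menu <- data.frame(
  name = food_names,
  price = as.numeric(sub("$", "", food_prices, fixed = TRUE)),
  stringsAsFactors = FALSE
)
menu
```

Once the values are in a data frame, the usual tools such as summary() or mean(menu$price) apply.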
Reading JSON file
# Loading the jsonlite package
library(jsonlite)
jdata <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jdata)
# Extracting nested objects
names(jdata$owner)
jdata$owner$login
# Converting a data frame to JSON text (toJSON returns a string, it does not write a file)
jfile <- toJSON(iris, pretty = TRUE)
cat(jfile)
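The same steps can be tried without network access. This sketch parses a small inline JSON string with a nested owner object, flattens it, and then actually writes JSON to a file and reads it back. The repository names and logins are made up for illustration.

```r
library(jsonlite)

# Inline JSON with a nested "owner" object (illustrative data)
json_text <- '[
  {"name": "repo-a", "owner": {"login": "alice", "id": 1}},
  {"name": "repo-b", "owner": {"login": "bob", "id": 2}}
]'

repos <- fromJSON(json_text)
names(repos)          # top-level fields
repos$owner$login     # nested field, accessed as in the slide

# Turn the nested data frame into ordinary columns
flat <- flatten(repos)
names(flat)

# Write a data frame to a JSON file and read it back
out <- tempfile(fileext = ".json")
write_json(head(iris, 3), out, pretty = TRUE)
back <- fromJSON(out)
str(back)
```

Note that factors such as Species come back as character columns after the round trip.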
References
Getting and Cleaning Data - Coursera
XML package: http://www.omegahat.net/RSXML/Tour.pdf
jsonlite package: https://www.r-bloggers.com/new-package-jsonlite-a-smarter-json-encoderdecoder/