Week #4 R Programming
External Data & Looping
Dr. Azhari, MT
Department of Computer Science & Electronics
Faculty of Mathematics and Natural Sciences
Universitas Gadjah Mada
Sample Data: Professional Salary Survey
Source: https://data.world/finance/data-professional-salary-survey
1 2019_Data_Professional_Salary_Survey_Responses.xlsx
2 Data_Professional_Salary_Survey_Responses.xlsx
R programming over Excel for Data Analysis
1 R can handle very large datasets
2 R can automate and calculate much faster than Excel
3 R source code is reproducible
4 Community libraries of R source code are available to all
5 R provides more complex and advanced data visualization
6 R is free, Excel is not
Source: https://www.gapintelligence.com/blog/understanding-r-programming-over-excel-for-data-analysis/
R Package for Reading Excel Data
1 R packages are collections of functions and data sets developed by the community. They increase the power of R by improving existing base R functionality or by adding new functionality. For example, if you usually work with data frames, you have probably heard of dplyr or data.table, two of the most popular R packages.
2 A package is a suitable way to organize your own work and, if you want to, share it with others. Typically, a package includes code (not only R code!), documentation for the package and the functions inside it, some tests to check that everything works as it should, and data sets.
Import Excel File into R
1. Click File
2. Click Import Dataset
3. Click From Excel
4. Browse to and select your file
5. (Preview area that displays your Excel file)
6. (The R instructions generated by RStudio)
7. Click Import
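The R instructions that RStudio generates in step 6 are built on the readxl package. A minimal sketch of an equivalent script is shown here, assuming readxl is installed and the survey workbook sits in the working directory; the path and the variable names `survey_path` and `survey` are illustrative, not taken from the slides.

```r
# Hypothetical location of the survey workbook; adjust to your own path.
survey_path <- "2019_Data_Professional_Salary_Survey_Responses.xlsx"

if (requireNamespace("readxl", quietly = TRUE) && file.exists(survey_path)) {
  # Read the first worksheet into a data frame (a tibble).
  survey <- readxl::read_excel(survey_path, sheet = 1)
  print(dim(survey))    # number of rows and columns
  print(head(survey))   # first six rows
} else {
  message("readxl is not installed or the file was not found: ", survey_path)
}
```

The guard keeps the script from stopping with an error when the package or the file is missing.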
Replace Values in a DataFrame in R
Data frames are generic data objects in R that are used to store tabular data.
Data frames are considered the most popular data objects in R programming because it is more comfortable to analyze data in tabular form.
Data frames can also be thought of as matrices in which each column can hold a different data type.
Data frames are made up of three principal components: the data, the rows, and the columns.
https://www.geeksforgeeks.org/dataframe-operations-in-r/
Replace Values in a DataFrame in R
(1) Replace a value across the entire DataFrame:
df[df == "Old Value"] <- "New Value"
(2) Replace a value under a single DataFrame column:
df["Column Name"][df["Column Name"] == "Old Value"] <- "New Value"
https://datatofish.com/replace-values-dataframe-r/
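The two replacement patterns above can be tried on a small data frame. This is a sketch only: the data frame, its column names, and its values are made up for illustration.

```r
# Illustrative data frame; the column names and values are made up.
df <- data.frame(
  City   = c("Bandung", "Jogja", "Bandung"),
  Status = c("Old Value", "ok", "Old Value"),
  stringsAsFactors = FALSE
)

# (1) Replace a value across the entire data frame.
df[df == "Old Value"] <- "New Value"

# (2) Replace a value in a single column only.
df["City"][df["City"] == "Jogja"] <- "Yogyakarta"

print(df)
```

Pattern (1) compares every cell with the old value, so it also changes matching cells in columns you may not have intended to touch; pattern (2) limits the comparison to one column.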
R Missing Value | NA
1 In R, missing values are coded by the symbol NA. To identify missing values in your dataset, use the function is.na(). When you import a dataset from other statistical applications, the missing values might be coded with a number, for example 99.
2 NA is a valid logical object. Where a component of x or y is NA, the result will be NA if the outcome is ambiguous. In other words, NA & TRUE evaluates to NA, but NA & FALSE evaluates to FALSE.
3 Missing values are inevitable in data science, and handling them is a constant issue. In Boolean logic, NA behaves differently from most other values: the result depends on the operator and on the other operand. Whether this is useful or not depends on the scenario, but the behavior is something to keep in mind.

cat("\n\n")
ID <- c(1,2,3,4,5,6,7,8)
Name <- c("John", "Tim", NA, "Stone", "Andini", "Rossi", NA, NA)
Sex <- c("male", "male", "female", "male", "female", "male", "female", "male")
Age <- c(52, 23, 20, 21, 23, NA, NA, 19)
Salary <- c(3520.2, NA, 2890.3, 3025.2, 3320.5, 2985.8, NA, 2020.8)
dtFriend <- data.frame(ID, Name, Sex, Age, Salary)
print(dtFriend)
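The logical rules for NA described above can be checked directly. The sketch below also shows one way to recode a numeric missing-value code as NA; the code 99 follows the example in the text, while the vector `age` is made up.

```r
# Logical operations with NA: the result is NA only when it is ambiguous.
na_and_true  <- NA & TRUE    # NA: the answer depends on the unknown value
na_and_false <- NA & FALSE   # FALSE: FALSE whatever the unknown value is
na_or_true   <- NA | TRUE    # TRUE: TRUE whatever the unknown value is
na_or_false  <- NA | FALSE   # NA: the answer depends on the unknown value
print(c(na_and_true, na_and_false, na_or_true, na_or_false))

# Recode a numeric missing-value code (assumed here to be 99) as NA.
age <- c(52, 23, 99, 21)
age[age == 99] <- NA
print(is.na(age))
```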
Function max() and min() in R
vHighestValue <- max(x, na.rm = FALSE)
vLowestValue <- min(x, na.rm = FALSE)
▪ x = a vector or a data frame.
▪ na.rm = whether to remove NA values. If FALSE (the default), NA values are kept and the result is NA when any are present; if TRUE, NA values are removed from the vector or data frame before the maximum or minimum is computed.

cat("\n\n")
#creates a vector
midTestScore <- c(78.8, 65.0, 78.9, 84, 92.1, 73.2, 58.9, 87.6, 88.3)
print(midTestScore)
#returns the max value & min value present in the vector
maxScore <- max(midTestScore)
minScore <- min(midTestScore)
cat(
  "\nThe highest Score : ", maxScore,
  "\nThe lowest Score : ", minScore,
  "\n"
)

cat("\n\n")
#creates a vector that contains missing values
finalTestScore <- c(88.8, 77.0, NA, 86, 94.6, 72.2, NA, 80.3, 88.8)
print(finalTestScore)
#min() without na.rm returns NA; max() with na.rm = TRUE ignores the NA values
minScore <- min(finalTestScore)
maxScore <- max(finalTestScore, na.rm = TRUE)
cat(
  "\nThe lowest Score : ", minScore,
  "\nThe highest Score : ", maxScore,
  "\n"
)
Change & Update Item value of Dataframe
# 1 create the data frame
cat("\n\n")
ID <- c(1,2,3,4,5,6,7,8)
Name <- c("John", "Tim", NA, "Stone", "Andini", "Rossi", NA, NA)
Sex <- c("male", "male", "female", "male", "female", "male", "female", "male")
Age <- c(52, 23, 20, 21, 23, NA, NA, 19)
Salary <- c(3520.2, NA, 2890.3, 3025.2, 3320.5, 2985.8, NA, 2020.8)
dtFriend <- data.frame(ID, Name, Sex, Age, Salary)
print(dtFriend)

# 2 check and count the missing values
t1CheckofNA <- is.na(dtFriend)
t2CountofNA <- sum(is.na(dtFriend))
t3CountAveofNA <- mean(is.na(dtFriend))
t4MaxofSalary <- max(dtFriend$Salary)
t5MaxofSalary <- max(dtFriend$Salary, na.rm = TRUE)

# 3 print the results
cat("\nt1 Check missing values :", t1CheckofNA,
    "\nt2 the number of missing values:", t2CountofNA,
    "\nt3 the proportion of missing values:", t3CountAveofNA,
    "\nt4 the maximum of salary with missing values:", t4MaxofSalary,
    "\nt5 the maximum of salary without missing values:", t5MaxofSalary,
    "\n"
)

# 4 update missing values, correct the item data
dtFriend$Name[dtFriend$ID == 3] <- "Andri"
dtFriend$Name[dtFriend$ID == 7] <- "Anna"
dtFriend$Salary[dtFriend$Name == "Tim"] <- 1818.5
dtFriend$Age[dtFriend$Age == 52] <- 25
dtFriend$Age[dtFriend$ID == 6] <- 20
print(dtFriend)
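The counts above cover the whole data frame at once. As a complementary sketch, the missing values can also be counted per column and the complete rows selected with base R's colSums() and complete.cases(). The data frame is a trimmed copy of dtFriend (the Sex column is left out), and the names `naPerColumn` and `completeRows` are introduced here for illustration.

```r
# Trimmed copy of the dtFriend data used above.
dtFriend <- data.frame(
  ID     = 1:8,
  Name   = c("John", "Tim", NA, "Stone", "Andini", "Rossi", NA, NA),
  Age    = c(52, 23, 20, 21, 23, NA, NA, 19),
  Salary = c(3520.2, NA, 2890.3, 3025.2, 3320.5, 2985.8, NA, 2020.8)
)

# Number of missing values in each column.
naPerColumn <- colSums(is.na(dtFriend))
print(naPerColumn)

# Keep only the rows that have no missing value in any column.
completeRows <- dtFriend[complete.cases(dtFriend), ]
print(completeRows)
```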
Function max() and min() in R
cat("\n\n")
#creates a character vector with some names
ComunityFriends <- c('John','Angelina','Smuts','Garena','Lucifer', 'Andini')
firstOrderedName <- min(ComunityFriends)
lastOrderedName <- max(ComunityFriends)
print(ComunityFriends)
cat(
  "\nEarliest Ordered Name :", firstOrderedName,
  "\nLatest Ordered Name :", lastOrderedName,
  "\n"
)
R Looping Structure
In R, for loops take an iterator variable and assign it successive values from a sequence or vector. For loops are most commonly used for iterating over the elements of an object (list, vector, etc.)

1 for (elementdata in Listofdata) {
    instruction 1
    instruction 2
    :
  }

2 while ( condition ) {
    instruction 1
    instruction 2
    :
  }

3 repeat {
    statement/instruction 1
    statement/instruction 2
    :
    if( condition ) {
      break
    }
  }
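The slides give worked examples for the for loop only. A minimal sketch of the other two structures is shown here: both loops add the integers 1 to 5, and the variable names are chosen for illustration.

```r
# while: the condition is tested before every pass.
i <- 1
whileTotal <- 0
while (i <= 5) {
  whileTotal <- whileTotal + i
  i <- i + 1
}

# repeat: runs until break is reached, so the exit test is written by hand.
j <- 1
repeatTotal <- 0
repeat {
  repeatTotal <- repeatTotal + j
  j <- j + 1
  if (j > 5) {
    break
  }
}

cat("while total :", whileTotal, "\nrepeat total:", repeatTotal, "\n")
```

A repeat loop always executes its body at least once, whereas a while loop may execute zero times when the condition is false at the start.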
Example: R Looping Structure
list_days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
list_months <- list("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
list_years <- c(2021, 2022, 2023)
cat("\n")
for (aDay in list_days) {
  print(aDay)
}
for (aMonth in list_months) {
  cat(aMonth, " ")
  if (aMonth == "Jun") cat("\n")
}
cat("\n\n")
for (year in list_years) {
  for (aMonth in list_months) {
    cat(aMonth, " ", year, "\n")
  }
}

qualities <- c('funny', 'cute', 'friendly')
animals <- c('koala', 'cat', 'dog', 'panda')
for (x in qualities) {
  for (y in animals) {
    print(paste(x, y))
  }
}
statisticsClass <- data.frame(
Exam1 = c(97,91,33,86,40,48,27,53,58,31),
Exam2 = c(68,85,27,43,46,100,92,73,98, 91),
Quiz = c(75,90,88,81,78,87,90,69,NA,NA)
)
print(statisticsClass)
statisticsClass$avg <- (statisticsClass$Exam1 + statisticsClass$Exam2)/2
max.scores <- statisticsClass[statisticsClass$avg==max(statisticsClass$avg),]
print(max.scores)
R If-then-else Structure
# if
if (condition is true) {
do something
:
}
# if ... else
if (condition is true) {
do something
:
} else { # that is, if the condition is false,
do something different
:
}
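A concrete instance of the if ... else template may help. This sketch classifies one score against a threshold of 80 (the threshold used in the looping examples of these slides), then shows the vectorised base R function ifelse() applied to several scores at once; the variable names are illustrative.

```r
# if ... else on a single value.
studentScore <- 76.6
if (studentScore >= 80) {
  scoreGroup <- "A"
} else {
  scoreGroup <- "B"
}
cat("Score", studentScore, "belongs to group", scoreGroup, "\n")

# ifelse() applies the same test to every element of a vector.
scores <- c(82.5, 79.0, 88.2)
groups <- ifelse(scores >= 80, "A", "B")
print(groups)
```

Note that if () expects a single TRUE or FALSE, so a whole vector of scores needs either a loop or ifelse().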
R Looping: Examples
1 Counting the number of Students
ScoreAccountingClass <- c(82.5, 92.3, 83.0, 79.0, 88.2, 81.7, 76.6)
nStudents <- 0
for (studentScore in ScoreAccountingClass) {
  print(studentScore)
  nStudents <- nStudents + 1
}
cat("\nNumber of Students :", nStudents, "\n")

2 Counting the number of Students who have score >= 80, and who have score < 80 (below 80)
cat("\n\n")
ScoreAccountingClass <- c(82.5, 92.3, 83.0, 79.0, 88.2, 81.7, 76.6)
nStudentsA <- 0
nStudentsB <- 0
for (studentScore in ScoreAccountingClass) {
  print(studentScore)
  if (studentScore >= 80) {
    nStudentsA <- nStudentsA + 1
  } else {
    nStudentsB <- nStudentsB + 1
  }
}
cat("\nNumber of Students Group A:", nStudentsA,
    "\nNumber of Students Group B:", nStudentsB,
    "\nTotal Students (Group A + Group B) :", nStudentsA + nStudentsB,
    "\n"
)
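The loops above spell out the counting step by step. For comparison, the same counts can be obtained without an explicit loop, as a sketch using base R's length() and sum() on the same score vector.

```r
ScoreAccountingClass <- c(82.5, 92.3, 83.0, 79.0, 88.2, 81.7, 76.6)

# length() counts the elements; sum() of a logical vector counts the TRUEs.
nStudents  <- length(ScoreAccountingClass)
nStudentsA <- sum(ScoreAccountingClass >= 80)
nStudentsB <- sum(ScoreAccountingClass < 80)

cat("\nNumber of Students :", nStudents,
    "\nNumber of Students Group A:", nStudentsA,
    "\nNumber of Students Group B:", nStudentsB,
    "\n")
```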
Example Flow Chart
1 Flowchart to calculate the total number of students
Start
Input: dataSetScore
nStudents <- 0
Loop (for each studentScore of dataSetScore):
    nStudents <- nStudents + 1
Output: nStudents
Stop
Example Flow Chart
2 Flowchart to calculate the total number of students whose Accounting score is greater than or equal to 80 (score Accounting >= 80), and the total number of students whose Accounting score is below 80 (score Accounting < 80)
Start
Input: dataSetScore
nStudentsA <- 0
nStudentsB <- 0
Loop (for each studentScore of dataSetScore):
    studentScore >= 80 ?
        Yes: nStudentsA <- nStudentsA + 1
        No: nStudentsB <- nStudentsB + 1
Output: nStudentsA, nStudentsB
Stop
R Looping: Examples
3 Counting the number of Students, Calculation of Total Score & Average Score
▪ Initialize the count variable with zero, and the total variable with zero.
▪ Inside the loop, the count variable is increased by one and each student score is added to the total variable.
▪ Calculate the average outside the for loop block.

ScorepfAccountingClass <- c(82.5, 92.3, 83.0, 79.0, 88.2, 81.7, 76.6)
nStudents <- 0
totScores <- 0
for (studentScore in ScorepfAccountingClass) {
  print(studentScore)
  nStudents <- nStudents + 1
  totScores <- totScores + studentScore
}
AverageScore <- (totScores/nStudents)
cat("\nAccounting Mid Class Score")
cat("\n", ScorepfAccountingClass,
    "\nNumber of Students: ", nStudents,
    "\nTotal of Scores: ", totScores,
    "\nAverage of Mid test: ", AverageScore,
    "\n"
)
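As a cross-check on the loop, the total and the average can be compared with base R's sum() and mean(). This is a sketch on the same scores; the names `scores`, `loopAverage`, and `builtinAverage` are introduced here for illustration.

```r
scores <- c(82.5, 92.3, 83.0, 79.0, 88.2, 81.7, 76.6)

# Accumulate the count and the total inside the loop.
nStudents <- 0
totScores <- 0
for (studentScore in scores) {
  nStudents <- nStudents + 1
  totScores <- totScores + studentScore
}

# The average is computed once, after the loop has finished.
loopAverage <- totScores / nStudents

# Built-in equivalents for comparison.
builtinTotal   <- sum(scores)
builtinAverage <- mean(scores)

cat("Loop average    :", loopAverage,
    "\nBuilt-in average:", builtinAverage, "\n")
```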
Example Flow Chart
3 Flowchart to calculate the total score, the number of students, and the average score
Start
Input: dataSetScore
nStudents <- 0
totScores <- 0
Loop (for each studentScore of dataSetScore):
    nStudents <- nStudents + 1
    totScores <- totScores + studentScore
AverageScore <- (totScores/nStudents)
Output: nStudents, totScores, AverageScore
Stop
Flow Chart
Symbols of a Flowchart
Start, Stop : symbol for start, stop
StudentName, StudentAddress : symbol for input, output
Total <- 25 * discount : symbol for processing, calculating
City == "Bandung" ? : symbol for filtering, conditional, control
Calculate Correlation() : symbol for sub program, sub processing
(arrow) : symbol for flow (to next)
Loop : symbol for looping (for), marking the start and end of the for block

3 Example flowchart to calculate the total score, the number of students, and the average score
Start
Input: dataSetScore
nStudents <- 0
totScores <- 0
Loop (for each studentScore of dataSetScore):
    nStudents <- nStudents + 1
    totScores <- totScores + studentScore
averageScore <- (totScores/nStudents)
Output: nStudents, totScores, averageScore
Stop
Install package from local source
install.packages(path_to_source, repos = NULL, type="source")
install.packages("~/Downloads/dplyr-master.zip", repos=NULL, type="source")
install.packages("tidyr")

Name <- c("John", "Tim", NA)
Sex <- c("men", "men", "women")
Age <- c(45, 53, NA)
dt <- data.frame(Name, Sex, Age)
print(dt)
is.na(dt)
sum(is.na(dt))
mean(is.na(dt))
# Recode a numeric missing-value code as NA:
dt$Age[dt$Age == 99] <- NA
# Replace a value across the entire DataFrame:
df[df == "Old Value"] <- "New Value"
# Replace a value under a single DataFrame column:
df["Column Name"][df["Column Name"] == "Old Value"] <- "New Value"
https://datascienceplus.com/missing-values-in-r/
How to Analyze a Single Variable using Graphs in R
There are 4 types of plots that we can use to examine a single variable:
· Histograms
· Index plots
· Time-series plots
· Pie Charts
# How to create Histogram in R
# by Michaelino Mervisiano
datavar <- rnorm(1000, 2.5)
hist(datavar, main="Awesome Histogram",
     col="Blue", prob=TRUE,
     xlab="Random Numbers from a Normal Distribution with Mean 2.5")
https://datascienceplus.com/how-to-analyses-a-single-variable-using-graphs-in-r/
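Only the histogram is shown in code above. The other three plot types in the list can be sketched with base R graphics as follows. The seed, the sample size of 100, the grouping breaks, and the plot titles are assumptions chosen for illustration, and pdf(NULL) is used only so that the script runs without a display; remove that line (and the dev.off() call) to see the plots in RStudio.

```r
set.seed(1)                      # assumed seed, for a repeatable example
datavar <- rnorm(100, 2.5)       # random numbers with mean 2.5

pdf(NULL)                        # null device: draw without opening a window

# Index plot: each value plotted against its position in the vector.
plot(datavar, main = "Index Plot", xlab = "Index", ylab = "Value")

# Time-series plot: the same values treated as an ordered series.
plot(ts(datavar), main = "Time-series Plot", ylab = "Value")

# Pie chart: values grouped into three illustrative ranges first.
valueGroups <- table(cut(datavar,
                         breaks = c(-Inf, 2, 3, Inf),
                         labels = c("below 2", "2 to 3", "above 3")))
pie(valueGroups, main = "Pie Chart")

invisible(dev.off())             # close the null device
```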