DSC551: Programming for Data Science (R Programming)
5. Base Graphics & Descriptive Statistics

Lecturer, Department of Statistics


2024-10-01

Asmui Rahim, DSC551:R , Oct 2024


Introduction
1. Graphical functions in R: Base graphics, high-level plotting functions (e.g., pie charts, bar
plots), low-level plotting functions (e.g., adding points, lines), and graphical parameters
(e.g., color, font) for customizing plots.
2. Descriptive statistics: Summarizing and describing datasets using measures like mean,
median, variance, etc.
3. Techniques for summarizing categorical data (frequency tables, contingency tables, pie
charts, bar charts).
4. Methods for summarizing numerical data (stem-and-leaf plots, box plots, histograms,
scatter plots).


Graphical Functions
Base graphics.
High-level plotting functions, e.g., pie charts.
Low-level plotting functions.
Graphical parameters.


High-level Plotting Functions
Foundation for creating various types of plots.
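
As a minimal sketch of what "high-level" means, each call below starts a new plot on its own. The built-in mtcars dataset is used as a stand-in (an assumption for illustration; it is not the course dataset).

```r
# High-level plotting functions: each call creates a complete new plot.
# Assumption: the built-in mtcars data stands in for the course data.
cyl_counts <- table(mtcars$cyl)   # number of cars per cylinder count

plot(mtcars$wt, mtcars$mpg)       # scatter plot of two numeric vectors
hist(mtcars$mpg)                  # histogram of one numeric vector
boxplot(mtcars$mpg)               # box and whisker plot
barplot(cyl_counts)               # bar chart from a frequency table
pie(cyl_counts)                   # pie chart from the same table
```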


Some options are identical for several graphical functions.


Low-level Plotting Functions
Low-level plotting commands are graphical functions that add to an already existing graph.
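
A short sketch of the idea, again assuming the built-in mtcars data purely for illustration: one high-level call creates the graph, and the low-level calls that follow draw on top of it.

```r
# Assumption: built-in mtcars data, used for illustration only.
plot(mtcars$wt, mtcars$mpg, main = "Low-level additions")  # high-level: creates the graph

fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red")                   # add the fitted regression line
abline(h = mean(mtcars$mpg), lty = 2)      # add a dashed horizontal line at the mean

heavy <- mtcars$wt > 5                     # flag the heaviest cars
points(mtcars$wt[heavy], mtcars$mpg[heavy],
       pch = 19, col = "blue")             # redraw those points as filled blue circles
text(4.5, 30, "mpg falls as weight rises") # add a label at data coordinates (4.5, 30)
legend("topright", legend = c("fit", "mean"),
       col = c("red", "black"), lty = c(1, 2))
```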




Graphical Parameters
The presentation of graphics can be improved with graphical parameters.
Used either as options of graphic functions, or with par() to change them permanently, i.e.,
subsequent plots will be drawn with the parameters specified by the user.
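
The two ways of setting a parameter can be sketched as follows. The dataset (built-in mtcars) and the chosen parameter (col.main) are assumptions made for illustration.

```r
# par() returns the previous values, so they can be restored later.
old_par <- par(mfrow = c(1, 2), col.main = "blue")

# Uses the value set with par(): the title is drawn in blue.
plot(mtcars$wt, mtcars$mpg, main = "Set with par()")

# Passing the parameter as an option affects this plot only.
plot(mtcars$wt, mtcars$mpg, main = "Set in the call", col.main = "red")

par(old_par)  # restore the settings that were in place before
```

Saving and restoring the old settings keeps a "permanent" change from leaking into later plots.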




Plotting symbols are chosen with pch=, and colours with col=.
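
A minimal sketch that draws symbols 1 to 25, one per pch value. The colours "blue" and "yellow" are arbitrary choices for illustration; bg= only affects the fillable symbols 21 to 25.

```r
# One point per symbol code; the x position equals the pch value.
x <- 1:25
plot(x, rep(1, 25), pch = x, col = "blue", bg = "yellow",
     yaxt = "n", xlab = "pch value", ylab = "",
     main = "Plotting symbols")
```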


Descriptive Statistics
Used to summarize and describe the main features of a dataset.
Provide a concise overview of the data, helping to identify patterns and trends.
Example using telco.csv dataset.

mydata1 <- read.csv("telco.csv", stringsAsFactors = TRUE)

names(mydata1)
[1] "Gender"        "Programs"      "Car_Ownership" "Telco_Prefer"
[5] "Usage_GB"      "Hour_Perday"

str(mydata1)
'data.frame': 45 obs. of 6 variables:
 $ Gender       : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 1 2 2 1 1 ...
 $ Programs     : Factor w/ 4 levels "Account","Business",..: 4 2 3 4 3 3 1 4 4 3 ...
 $ Car_Ownership: Factor w/ 2 levels "No","Yes": 2 2 1 1 1 2 2 1 2 1 ...
 $ Telco_Prefer : Factor w/ 4 levels "Celcom","DiGi",..: 1 2 1 3 4 1 3 2 2 3 ...
 $ Usage_GB     : num 14.6 15.7 14.8 15.4 12.9 22.4 28 19.2 25.4 25.3 ...
 $ Hour_Perday  : num 3 3.8 4 3.5 3 6 6.5 5.5 6 6 ...


Explore Dataset
Some functions:

Function          Description
head(mydata1)     Returns the first parts of a data frame.
tail(mydata1)     Returns the last parts of a data frame.
dim(mydata1)      Returns the number of rows and columns of the data frame.
str(mydata1)      Displays the internal structure of the data frame.
names(mydata1)    Returns a vector containing the names of the variables.
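
These functions can be tried without any file. The sketch below assumes the built-in iris data frame as a stand-in for mydata1.

```r
# Assumption: the built-in iris data frame replaces mydata1 here.
first_rows <- head(iris, 3)   # first 3 rows (the default is 6)
last_rows  <- tail(iris, 3)   # last 3 rows
size       <- dim(iris)       # c(number of rows, number of columns)
var_names  <- names(iris)     # character vector of variable names

print(first_rows)
print(last_rows)
print(size)
print(var_names)
str(iris)                     # compact display of each column's type
```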


Type of Variables
Categorical variable: Also known as qualitative variables. Values represent categories or
groups, rather than numerical measurements.
Example: Gender, blood type, ratings
Can be nominal or ordinal.
Numerical variable: Also known as quantitative variables. The values are numbers that
represent a measurable quantity.
Example: Height, Weight, Temperature
Can be interval or ratio.
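
In R, these variable types roughly map to factors, ordered factors and numeric vectors. A minimal sketch, with all values made up for illustration:

```r
# Hypothetical values, invented for this example only.
blood  <- factor(c("A", "O", "B", "O", "AB"))           # nominal: categories with no order
rating <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)                        # ordinal: categories with an order
height_cm <- c(158.5, 171.0, 165.2)                     # numerical: measured quantity

is.ordered(blood)       # unordered factor
is.ordered(rating)      # ordered factor
rating < "high"         # order comparisons work only for ordered factors
is.numeric(height_cm)
```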


Summarizing and Displaying
Categorical Data
1. Frequency and contingency table.
2. Pie chart.
3. Bar chart.


Frequency and Contingency Table
table() produces a frequency table or contingency table of counts for each level of the
specified factors.

table(mydata1$Gender)

Female   Male
    29     16

table(mydata1$Programs)

   Account   Business   Sciences Statistics
        11          8         13         13

Contingency table

table(mydata1$Gender,mydata1$Programs)

         Account Business Sciences Statistics
  Female       8        5       10          6
  Male         3        3        3          7

table(mydata1$Gender,mydata1$Car_Ownership)

         No Yes
  Female 16  13
  Male    7   9


For a contingency table of more than two variables, use the ftable() function.

ftable(mydata1$Gender,mydata1$Programs,mydata1$Car_Ownership)

                   No Yes
Female Account      5   3
       Business     1   4
       Sciences     6   4
       Statistics   4   2
Male   Account      1   2
       Business     1   2
       Sciences     1   2
       Statistics   4   3


Pie Chart
Illustrates how different categories contribute to the whole. Each slice of the pie represents a
proportion or percentage of the total.
Easy and simple comparison. Works best with a small number of categories. Too many slices
make the chart cluttered and difficult to interpret.

# One colour and one label per level of Telco_Prefer, in level order
colors <- c("Blue","Yellow","Red","Orange")
my_labels <- c("Celcom", "DiGi", "Maxis", "U-Mobile")

pie(table(mydata1$Telco_Prefer),
    col=colors,
    main="Telco Preference by Students")
# Low-level call: adds the legend to the existing pie chart
legend("topright", my_labels, fill=colors)


Bar Chart
Comparing categories and how values vary across them.
Showing changes over time.
Displaying data with both positive and negative values.

barplot(table(mydata1$Programs),
        col=c("lawngreen","indianred","khaki","midnightblue"),
        ylim=c(0,15),ylab="Number of Students",
        xlab="Programs",
        main="Number of Students by Program")


Cluster Bar Chart
Compare multiple subcategories within each main category.

# beside=TRUE draws the Gender bars side by side within each program
barplot(table(mydata1$Gender,mydata1$Programs),
        beside=TRUE,col=c("deeppink","cyan"),
        legend.text=TRUE,ylab="Number of Students",
        xlab="Programs",
        main="Number of Students by Gender and Program")


Summarizing and Displaying
Numerical Data
1. Statistics summary.
2. Stem and leaf plot.
3. Box and whisker plot.
4. Histogram.
5. Scatter plot.


Some functions for summarizing numerical data, e.g., summary(), median(), range() and var().
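
A minimal sketch of common summary functions from base R. It assumes the built-in mtcars$mpg vector as example data rather than the course dataset.

```r
# Assumption: built-in data (miles per gallon of 32 cars) used for illustration.
x <- mtcars$mpg

mean(x)        # arithmetic mean
median(x)      # middle value
range(x)       # minimum and maximum
var(x)         # sample variance (units are squared)
sd(x)          # sample standard deviation, the square root of var(x)
quantile(x)    # minimum, quartiles and maximum
IQR(x)         # interquartile range: 3rd quartile minus 1st quartile
summary(x)     # minimum, quartiles, mean and maximum in one call
```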


Examples:

summary(mydata1)
    Gender         Programs   Car_Ownership   Telco_Prefer     Usage_GB
 Female:29   Account   :11    No :23         Celcom  :10    Min.   : 5.00
 Male  :16   Business  : 8    Yes:22         DiGi    :13    1st Qu.:13.70
             Sciences  :13                   Maxis   :16    Median :17.80
             Statistics:13                   U-Mobile: 6    Mean   :18.13
                                                            3rd Qu.:23.20
                                                            Max.   :32.40
  Hour_Perday
 Min.   :1.50
 1st Qu.:3.40
 Median :4.50
 Mean   :4.46
 3rd Qu.:6.00
 Max.   :8.00

summary(mydata1$Usage_GB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5.00   13.70   17.80   18.13   23.20   32.40

median(mydata1$Usage_GB)
[1] 17.8

range(mydata1$Usage_GB)
[1] 5.0 32.4

# Variance is in squared units, so it is labelled GB^2 rather than GB
print(paste("Variance for the Usage GB variable is",
            round(var(mydata1$Usage_GB),2),
            "GB^2"))
[1] "Variance for the Usage GB variable is 42.08 GB^2"

Stem and Leaf Plot
Visualize numerical data by organizing it based on its digits.

stem(mydata1$Usage_GB)

  The decimal point is 1 digit(s) to the right of the |

  0 | 5889
  1 | 00022334
  1 | 555566677788999
  2 | 0002223444
  2 | 556888
  3 | 02


Box and Whisker Plot
Visually summarizing and comparing the distribution of numerical data.
Concise way to see key aspects of the data, such as the median, quartiles and potential
outliers.

boxplot(mydata1$Usage_GB,
        ylab="Internet Quota Usage in GB",
        main="Figure 1")
boxplot(mydata1$Usage_GB,
        xlab="Internet Quota Usage in GB",
        main="Figure 2",
        horizontal=TRUE)


Two graphs can be placed side by side in one figure by using par(mfrow=c(1,2)).

par(mfrow=c(1,2))
boxplot(mydata1$Usage_GB,
        ylab="Internet Quota Usage in GB",
        main="Figure 1")
boxplot(mydata1$Usage_GB,
        xlab="Internet Quota Usage in GB",
        main="Figure 2",
        horizontal=TRUE)


Histogram
Visualizing the distribution of numerical data.
Visual representation of how often different values occur within a dataset.

# prob=TRUE puts density instead of frequency on the y-axis,
# so the density curve added below is on the same scale
hist(mydata1$Usage_GB, prob=TRUE)
lines(density(mydata1$Usage_GB),lwd=4,col="red")


Scatter Plot
Visualize the relationship between two numerical variables.
Easy to see patterns, trends and correlations.

plot(mydata1$Hour_Perday, mydata1$Usage_GB,
     main="Hour per Day vs Usage in GB",
     pch=19, col="blue")

# Fit a simple linear regression and add the fitted line to the plot
fit <- lm(Usage_GB ~ Hour_Perday, data=mydata1)
abline(fit,col="red",lwd=4)


More Examples:

par(mfrow=c(2,2))
plot(mydata1$Hour_Perday,type="l", main="lines")
plot(mydata1$Hour_Perday,type="b", main="both")
plot(mydata1$Hour_Perday,type="s", main="steps")
plot(mydata1$Hour_Perday,type="h", main="high density")


par(mfrow=c(2,2))
plot(mydata1$Hour_Perday,type="l",lty=1,col=1,lwd=3,main="lines")
plot(mydata1$Hour_Perday,type="b",lty=2,col=2,lwd=3,main="both")
plot(mydata1$Hour_Perday,type="s",lty=3,col=3,lwd=3,main="steps")
plot(mydata1$Hour_Perday,type="h",lty=4,col=4,lwd=3,main="high density")


End of slides
