Introduction to R Programming
"R is an interpreted computer programming language which was created by Ross Ihaka and
Robert Gentleman at the University of Auckland, New Zealand." The R Core Team
currently develops R. It is also a software environment for statistical analysis,
graphical representation, reporting, and data modeling. R is an implementation of the S
programming language combined with lexical scoping semantics.
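Lexical scoping means that a function looks up its free variables in the environment where it was defined, not where it is called. A minimal sketch of the idea follows; make_counter, count, and counter are illustrative names that do not come from the text above.

```r
# make_counter() returns a function that remembers `count`,
# because `count` lives in the environment where that function was created.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # update `count` in the enclosing environment
    count
  }
}

counter <- make_counter()
print(counter())  # first call
print(counter())  # second call sees the updated count
```

Each call to make_counter() creates a separate environment, so two counters made this way do not share state.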
R is a popular programming language used for statistical computing and graphical
presentation.
Its most common use is to analyze and visualize data.
It is a great resource for data analysis, data visualization, data science, and machine learning.
It provides many statistical techniques (such as statistical tests, classification, clustering, and
data reduction).
It is easy to draw graphs in R, such as pie charts, histograms, box plots, and scatter plots.
It works on different platforms (Windows, Mac, Linux).
It is open-source and free.
It has strong community support.
It has many packages (libraries of functions) that can be used to solve different problems.
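A few of the features listed above can be tried in a handful of lines. The sketch below assumes R's built-in iris data set and uses only functions from the base and stats packages; the variable names are illustrative.

```r
# Built-in iris data: four numeric measurements plus a Species factor
measurements <- iris[, 1:4]

# Statistical test: do setosa and versicolor differ in mean sepal length?
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]
test_result <- t.test(setosa, versicolor)
print(test_result$p.value)

# Clustering: k-means with 3 clusters (seed fixed so the run is repeatable)
set.seed(42)
clusters <- kmeans(measurements, centers = 3, nstart = 10)
print(table(clusters$cluster, iris$Species))

# Data reduction: principal component analysis on scaled measurements
pca <- prcomp(measurements, scale. = TRUE)
print(summary(pca))

# Graphs: a histogram and a box plot from base graphics
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")
boxplot(Sepal.Length ~ Species, data = iris)
```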
R Versions:
2000 - R version 1.0.0 is released to the public.
2003 - The R Foundation is formed to hold and administer the R software copyright and
to provide support for the R language project.
2004 - R version 2.0.0 is released.
2009 - The R Journal, an open-access journal for statistical computing and research, is
established.
2013 - R version 3.0.0 is released.
2020 - R version 4.0.0 is released.
June 2023 - R version 4.3.1 is released; it is the current version at the time of writing.
According to the Comprehensive R Archive Network (CRAN), there are nearly 20,000 R packages
available.
Today, R is one of the most important tools used by researchers, data
analysts, statisticians, and marketers for retrieving, cleaning, analyzing, visualizing, and
presenting data.
The most important task in data science is the way we deal with the data: importing, cleaning,
feature engineering, and feature selection. It should be our primary focus. A data scientist's job
is to understand the data, manipulate it, and find the best approach to analyzing it.
For machine learning, many widely used algorithms are available in R.
The keras and tensorflow packages allow us to build deep learning models from R.
R has the xgboost package for XGBoost, a gradient boosting algorithm that is popular in Kaggle competitions.
R can interoperate with other languages and call code written in Python, Java, and C++. The big data world is
also accessible to R: we can connect R to big data platforms such as Spark or Hadoop.
In brief, R is a great tool to investigate and explore data.
Elaborate analyses such as clustering, correlation, and data reduction can be done with R.
RStudio IDE
RStudio is an integrated development environment that lets us interact with R more readily.
RStudio is similar to the standard RGui, but it is considered more user-friendly.
This IDE has various drop-down menus, windows with multiple tabs, and many customization
options. When we open RStudio for the first time, we see three windows.
The fourth window is hidden by default. We can open it by clicking
the File drop-down menu, then New File, and then R Script.
Program-1
Load the "iris.csv" file and display the name and type of each column. Find statistics
such as min, max, range, mean, median, variance, and standard deviation for each column
of data.
Procedure:
Step-1: Open RStudio
Once a script file is open, RStudio shows four windows.
1st window: Contains the code of the opened file.
2nd window: Shows the environment (the objects currently in memory), along with the History,
Connections, and Tutorial tabs.
3rd window: The console, where we type and run commands.
4th window: Lets us browse the files on the computer.
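Step-2: Load the iris dataset. The copy that ships with Weka is an ARFF file, which read.csv()
cannot parse correctly, so read it with read.arff() from the foreign package. If the data is
saved as a CSV file, use read.csv() with that file's path instead.
library(foreign)  # provides read.arff()
iris <- read.arff("C:/Program Files/Weka-3-8-6/data/iris.arff")
data <- iris  # working copy of the loaded data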
Step-3: View the data
View(iris)
Step-3.1: Summary of the iris data set
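A minimal sketch of this step, assuming iris is the data frame loaded in Step-2 (R's built-in iris data set behaves the same way); iris_summary is an illustrative name.

```r
# summary() reports min, quartiles, median, mean, and max for numeric columns,
# and level counts for factor columns
iris_summary <- summary(iris)
print(iris_summary)
```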
Step-4: Names and types of each column in the dataset
str(iris)
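str() prints the structure to the console. If the names and types are needed as values, one option is sketched below, again assuming iris is the data frame from Step-2; col_names and col_types are illustrative names.

```r
col_names <- names(iris)          # character vector of column names
col_types <- sapply(iris, class)  # class of each column, e.g. "numeric" or "factor"
print(col_names)
print(col_types)
```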
Step-5: Find statistics such as minimum, maximum, range, mean, median, variance, and
standard deviation
Step-5.1: Minimum value of each attribute in the iris data set
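One way to do this, assuming iris is the data frame from Step-2, is to keep only the numeric columns and apply min() to each; numeric_cols and col_min are illustrative names.

```r
# Keep numeric columns only; the species (class) column has no minimum
numeric_cols <- iris[sapply(iris, is.numeric)]
col_min <- sapply(numeric_cols, min)
print(col_min)
```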
Step-5.2: Maximum value of each attribute in the iris data set
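The same pattern works with max(). This sketch assumes iris is the data frame from Step-2; numeric_cols and col_max are illustrative names.

```r
numeric_cols <- iris[sapply(iris, is.numeric)]  # numeric columns only
col_max <- sapply(numeric_cols, max)
print(col_max)
```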
Step-5.3: Range value of each attribute in the data set
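In R, range() returns the minimum and maximum together, not their difference. The sketch below shows both readings of "range", assuming iris is the data frame from Step-2; col_range and col_spread are illustrative names.

```r
numeric_cols <- iris[sapply(iris, is.numeric)]  # numeric columns only

# One column per attribute: row 1 is the minimum, row 2 is the maximum
col_range <- sapply(numeric_cols, range)
print(col_range)

# Width of the range (maximum minus minimum) for each attribute
col_spread <- sapply(numeric_cols, function(x) diff(range(x)))
print(col_spread)
```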
Step-5.4: Mean value of each attribute in the iris data set
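A possible command for this step, assuming iris is the data frame from Step-2; col_mean is an illustrative name.

```r
numeric_cols <- iris[sapply(iris, is.numeric)]  # numeric columns only
col_mean <- sapply(numeric_cols, mean)
print(col_mean)
```

colMeans(numeric_cols) gives the same values in a single call.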
Step-5.5: Median value of each attribute in the iris data set
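The median can be computed the same way. This sketch assumes iris is the data frame from Step-2; col_median is an illustrative name.

```r
numeric_cols <- iris[sapply(iris, is.numeric)]  # numeric columns only
col_median <- sapply(numeric_cols, median)
print(col_median)
```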
Step-5.6: Variance value of each attribute in the iris data set
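A sketch using var(), which computes the sample variance (dividing by n - 1). It assumes iris is the data frame from Step-2; col_var is an illustrative name.

```r
numeric_cols <- iris[sapply(iris, is.numeric)]  # numeric columns only
col_var <- sapply(numeric_cols, var)
print(col_var)
```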
Step-5.7: Standard Deviation of each attribute in the iris data set
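A sketch using sd(), the square root of the sample variance. It assumes iris is the data frame from Step-2; col_sd is an illustrative name.

```r
numeric_cols <- iris[sapply(iris, is.numeric)]  # numeric columns only
col_sd <- sapply(numeric_cols, sd)
print(col_sd)
```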
Analysis:
The program provides a summary of the data structure and
descriptive statistics for each numeric column.
This gives a basic understanding of the data distribution,
central tendency (mean and median), and variability (range, variance, and standard deviation).
Overall, this program performs a basic exploratory data analysis
(EDA) on the iris data set.