
2. Representation of Data

The document discusses various methods for representing data, including stem-and-leaf diagrams, box-and-whisker plots, histograms, and cumulative frequency graphs. Each method is explained with its purpose, construction steps, and examples, highlighting their suitability for different types of data. Additionally, it covers the concept of skewness in data distributions and how to interpret it using measures of central tendency.

Uploaded by

周佳文
Copyright
© All Rights Reserved

Section 2.

Representation of data

2.1 Stem-and-leaf diagrams

• Represent discrete data: specific values you cannot subdivide, typically integers you can count.

• Show all the raw data, grouped into class intervals of equal class width.
Note: the class width here is the difference between two consecutive lower (or two consecutive upper) class limits.

E.g., in Fig 2.3, we divide the following discrete data points into 4 Classes of
width 10:

Classes: 0-9, 10-19, 20-29, 30-39.

• Consists of:
1) a stem: defines the scale for the data
2) leaves: where the data values are plotted in ascending order
3) a key: explains how to read the data
4) numbers in brackets: how many values fall in each class interval; not always included, but useful for a large dataset.

• What stem-and-leaf diagrams are used for:

1) The data is arranged into classes, so it is easy to see the modal class interval. E.g., the modal class in Fig 2.3 is the class 20-29, where 23 is the mode.
2) Since the data is in ascending order, it is easy to identify the median, quartiles (LQ and UQ), maximum and minimum.
E.g., find them for Fig 2.3:

LQ = ________, Median = ________, UQ = ________, therefore IQR = ________
Maximum = ________, Minimum = ________, therefore Range = ________

3) Outliers (data points that differ significantly from the other observations) can be easily identified and removed.
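The construction described above can be sketched in a few lines of Python. This is only an illustration: the data values are hypothetical (they are not the values from Fig 2.3), the function names `stem_and_leaf` and `render` are made up for this sketch, and it assumes non-negative integers with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict
from statistics import median


def stem_and_leaf(values):
    """Map each stem (tens) to its leaves (units) in ascending order."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    # Keep empty stems too, so no class is missing from the scale.
    return {stem: groups.get(stem, [])
            for stem in range(min(groups), max(groups) + 1)}


def render(diagram):
    """Return the diagram as text, with the class count in brackets."""
    lines = []
    for stem, leaves in diagram.items():
        leaf_text = " ".join(str(leaf) for leaf in leaves)
        lines.append(f"{stem} | {leaf_text} ({len(leaves)})")
    return "\n".join(lines)


# Hypothetical data set (assumption, for illustration only).
data = [23, 7, 15, 31, 23, 12, 29, 3, 18, 27, 23, 34, 21]

diagram = stem_and_leaf(data)
print(render(diagram))
print("Key: 2 | 3 represents 23")
print("Median:", median(data), " Range:", max(data) - min(data))
```

Because the leaves are already sorted, the median, maximum and minimum can be read straight off the printed rows, and the row with the largest bracketed count is the modal class.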

Example:

2.2 Back-to-back stem and leaf diagrams

• Useful when data is to be split into 2 comparable categories such as boys/girls, children/adults, Chinese/Russian, etc.
E.g.,

Note:
1) There is one stem shared by girls and boys, a separate set of leaves for each group, and one key that explains both.
2) The leaves on the left-hand side of the stem (Boys) increase from the center outwards.
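A minimal sketch of the layout rule in the note, assuming two hypothetical lists of non-negative integers (the names `boys`, `girls` and `back_to_back` are illustrative only). The left-hand leaves are stored in descending order, so that when printed they increase from the center outwards.

```python
def back_to_back(left, right):
    """Return rows of (left leaves, stem, right leaves) sharing one stem."""
    stems = [v // 10 for v in left + right]
    rows = []
    for stem in range(min(stems), max(stems) + 1):
        # Left leaves run right-to-left on paper, so sort them descending.
        left_leaves = sorted((v % 10 for v in left if v // 10 == stem),
                             reverse=True)
        right_leaves = sorted(v % 10 for v in right if v // 10 == stem)
        rows.append((left_leaves, stem, right_leaves))
    return rows


# Hypothetical data (assumption, for illustration only).
boys = [12, 15, 23, 27, 29, 31]
girls = [14, 18, 21, 24, 35, 36]

for left_leaves, stem, right_leaves in back_to_back(boys, girls):
    left_text = " ".join(map(str, left_leaves))
    right_text = " ".join(map(str, right_leaves))
    print(f"{left_text:>10} | {stem} | {right_text}")
print("Key: 3 | 2 | 1 represents 23 for boys and 21 for girls")
```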

Example:

The following stem and leaf diagrams show the times taken by some children
and adults to complete a level on a computer game.

Key: 2 | 3 represents a time of 23 seconds

(a) Compare the times taken to complete the level between the children and
the adults.

(b) It is later discovered that two of the adults’ times, 23 and 42 seconds, had been omitted from the diagram. Briefly explain whether adding these times would change the adults’ median time.

2.3 Box-and-Whisker Plots

• A graph that clearly shows key statistics including median, quartiles, minimum, maximum and outliers.

• Used for both continuous and discrete data.

• Does not show any other individual data items.

• Steps to draw a box-and-whisker plot:

1) Write the data in order from smallest to largest: the individual data points if the data is discrete, or the classes if it is grouped.
2) Draw a scale based on the classes and label it.
3) Determine the Lower Quartile Q1, Median Q2, Upper Quartile Q3, minimum and maximum.
4) Draw the box using Q1, Q2 (Median) and Q3.
5) Draw the whiskers out to the minimum and maximum.

• Box plots are often used to compare two sets of data:
1) Both plots are drawn one above the other, on the same scale along the x-axis.
2) It is easy to see the main shape of each data distribution.

Example:

The incomplete box plot below shows the tail lengths in cm of some students’
pets.

(a) Given that the median tail length was 21 cm, complete the box plot. Mark
the key statistics including Median, UQ, LQ, Max and Min.

(b) Find the range and interquartile range of the tail lengths

2.4 Histograms

• Displays grouped continuous or discrete data. A histogram does not allow gaps between the class intervals. If there is a gap, redefine the Classes by taking the midpoint between adjacent Classes as defined in the earlier sections, regardless of whether the data type is discrete or continuous.
E.g., if the given classes leave a gap, whether continuous or discrete (such as 0-9, 10-19), you should transform them into 0 ≤ x < 9.5 and 9.5 ≤ x < 19.5.

• Consists of an x-axis and a y-axis, where:

1) on the x-axis, the class intervals are plotted in order. Note: the intervals are not necessarily of equal width.

2) on the y-axis, the frequency density of each class (the frequency per unit of the data in that class) is plotted as the height of a rectangle/bar.

E.g

• Steps to make a histogram:
• Always check that there are no gaps between classes.

1) Find the class width of each group by subtracting the lower boundary from the upper boundary.

2) Calculate the frequency density: frequency ÷ class width.

3) Label the class intervals on the x-axis and plot the frequency density of each class as the height of its bar on the y-axis. Note: the bars may have different widths.
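The steps above can be sketched in Python as a quick check of the arithmetic. The grouped table here is hypothetical, and `has_gaps` and `frequency_densities` are illustrative names; each class is given as (lower boundary, upper boundary, frequency).

```python
def has_gaps(classes):
    """True if any class does not start where the previous one ended."""
    return any(prev[1] != nxt[0] for prev, nxt in zip(classes, classes[1:]))


def frequency_densities(classes):
    """Frequency density = frequency / class width, one value per class."""
    densities = []
    for lower, upper, frequency in classes:
        width = upper - lower
        densities.append(frequency / width)
    return densities


# Hypothetical grouped data with unequal class widths (assumption).
classes = [(0, 9.5, 5), (9.5, 19.5, 8), (19.5, 39.5, 12)]

assert not has_gaps(classes)  # always check for gaps first
for (lower, upper, frequency), density in zip(classes,
                                              frequency_densities(classes)):
    print(f"{lower} <= x < {upper}: width {upper - lower}, "
          f"frequency {frequency}, density {density:.3f}")
```

Multiplying each density by its class width gives back the frequency, which is why the area of a bar, not its height, represents the frequency.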

Example:

2.5 Cumulative Frequency Graph (c.f.)

• Used with data that has been organized into a grouped frequency table. Because the raw values are not available, it is not possible to find the actual values of the mean, median and quartiles: we can only estimate them.

• Consists of:
1) on the x-axis, the Classes are plotted. Note that the sample size is usually a large number, so you need to examine the scale carefully before labelling the axes.

2) on the y-axis, the cumulative frequency is plotted: the number of data points/observations up to a certain data value, taken at the upper boundary of each class. It counts both the frequency of the data in that specific class and that of all data in the Classes below it.

• Steps to draw a cumulative frequency graph:
1) Draw the x- and y-axes with scales based on the classes. Label the x-axis with the random variable and its unit. Label the y-axis with cumulative frequency.

2) Plot the points (x, y) on your graph, where x is the upper class boundary of each class and y is the cumulative frequency up to that point.

3) Connect the points with a smooth curve rather than straight lines. (IMPORTANT)
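As a rough sketch of step 2, the points to plot can be generated from a grouped frequency table with a running total. The table is hypothetical, and starting the curve at (lowest class boundary, 0) is a common convention assumed here rather than something stated in these notes.

```python
from itertools import accumulate

# Hypothetical grouped frequency table (assumption, for illustration only).
lowest_boundary = 0
upper_boundaries = [10, 20, 30, 40, 50]
frequencies = [4, 10, 16, 7, 3]

# Running total: each value counts its own class and all classes below it.
cumulative = list(accumulate(frequencies))

# x = upper class boundary, y = cumulative frequency up to that boundary.
points = [(lowest_boundary, 0)] + list(zip(upper_boundaries, cumulative))

for x, y in points:
    print(f"plot ({x}, {y})")
print("Total frequency n =", cumulative[-1])
```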

• Find approximate statistics from the c.f. graph, where n is the total frequency:

1) Lower Quartile Q1:

Draw a horizontal straight line at cumulative frequency n/4. The x value at its intersection point with the curve is the estimate of Q1.

2) Median Q2:

Draw a horizontal straight line at cumulative frequency n/2. The x value at its intersection point with the curve is the estimate of Q2.

3) Upper Quartile Q3:

Draw a horizontal straight line at cumulative frequency 3n/4. The x value at its intersection point with the curve is the estimate of Q3.
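Reading a value off the graph can be imitated numerically. The sketch below swaps the hand-drawn smooth curve for straight segments between the plotted points (linear interpolation), so its answers may differ slightly from a reading taken from a curve. The points are hypothetical and `estimate_from_cf` is an illustrative name.

```python
def estimate_from_cf(points, target):
    """Estimate the x value at which the cumulative frequency reaches target.

    points: (upper boundary, cumulative frequency) pairs in ascending order.
    Uses straight lines between points, not a smooth curve (assumption).
    """
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= target <= y1:
            if y1 == y0:
                return x0
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("target is outside the range of the graph")


# Hypothetical cumulative frequency points (assumption).
points = [(0, 0), (10, 4), (20, 14), (30, 30), (40, 37), (50, 40)]
n = points[-1][1]  # total frequency

q1 = estimate_from_cf(points, n / 4)
q2 = estimate_from_cf(points, n / 2)
q3 = estimate_from_cf(points, 3 * n / 4)
print("Q1:", q1, " Q2:", q2, " Q3:", q3, " IQR:", q3 - q1)
```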

Example:
The cumulative frequency graph below shows the lengths in cm, l, of a group of puppies in a training group.

(a) Given that the group was one of the groups used in the data
collection, find the number of puppies that were in this group.

(b) Use the graph to find an estimate for the interquartile range of the
puppies.

(c) x% of the puppies are longer than 53.5 cm. Use your graph to find an estimate for the value of x.

2.6 Summary of Comparison between Graphs/Plots

• Stem-and-leaf diagrams:
1) used with discrete data of a single variable (a back-to-back diagram splits the data into two categories, but it is still a single variable)
2) show all the raw data and the shape of the data distribution
3) used for datasets of small sample size (less than 30)

• Box-and-Whisker Plots:
1) used with discrete or continuous data of a single variable
2) show the range, the IQR, and the quartiles Q1, Q2 and Q3
3) useful for comparing data patterns quickly

• Histograms:
1) used with grouped continuous (more commonly) or discrete data of a single variable
2) can be used with varying Class widths/group sizes
3) show the frequency of each Class, represented by the area of each bar

• Cumulative frequency graphs:
1) used with grouped continuous data of a single variable
2) show the cumulative frequencies that fall below the upper boundary of each Class

Example:

A student is collecting information on his friends’ interests and believes that his friends who only have dogs spend more time outside than his friends who only have cats. He has surveyed 20 friends with only cats and 20 friends with only dogs and has written down the total amount of time, rounded to the nearest hour, each of them spent outside last week. Describe, with a reason, which diagram would be best for the student to use to display the data.

• Skewness

1) Skewness describes the way in which data in a non-symmetrical distribution is leaning.

-- A distribution that has its tail on the right side has positive skew:
skewed to the right
Tail extends to the right side; the majority of data concentrates in lower values.

-- A distribution that has its tail on the left side has negative skew:
skewed to the left
Tail extends to the left side; the majority of data concentrates in higher values.

2) If the distribution is shown on a box plot, looking at the difference between the quartiles can help decide how it is skewed:
-- If the median is closer to the lower quartile, then the distribution has positive skew
Q3 - Q2 > Q2 - Q1

→ The majority of lower values are consistent and less varied on the left side: smaller Q2 - Q1
→ The tail of higher values is spread out and more varied on the right side: larger Q3 - Q2

-- If the median is closer to the upper quartile, then the distribution has negative skew
Q3 - Q2 < Q2 - Q1

→ The majority of higher values are consistent and less varied on the right side: smaller Q3 - Q2
→ The tail of lower values is spread out and more varied on the left side: larger Q2 - Q1
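The quartile comparison above amounts to a simple check, sketched here in Python. The function name `skew_from_quartiles` and the example quartile values are hypothetical; returning "symmetric" when the two gaps are equal is an assumption of this sketch, since the notes only describe the two skewed cases.

```python
def skew_from_quartiles(q1, q2, q3):
    """Classify skew by comparing the two halves of the box."""
    upper_gap = q3 - q2  # width of the box above the median
    lower_gap = q2 - q1  # width of the box below the median
    if upper_gap > lower_gap:
        return "positive"
    if upper_gap < lower_gap:
        return "negative"
    return "symmetric"  # assumed label for equal gaps


# Hypothetical quartiles (assumption, for illustration only).
print(skew_from_quartiles(10, 14, 25))  # median nearer Q1
print(skew_from_quartiles(10, 21, 25))  # median nearer Q3
```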

3) Looking at the values of the statistics can help you decide whether a distribution is positively skewed or negatively skewed
-- In a positively skewed distribution
mode < median < mean

→ The majority of the data concentrates on lower values, but the mean is pushed up by the more varied higher values on the right, which gives a higher mean.
→ The tail doesn’t influence the median as much.
→ The majority of the data concentrates on lower values, which gives a lower mode.

-- In a negatively skewed distribution
mean < median < mode

→ The majority of the data concentrates on higher values, but the mean is pulled down by the more varied lower values on the left, which gives a lower mean.
→ The tail doesn’t influence the median as much.
→ The majority of the data concentrates on higher values, which gives a higher mode.
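The ordering of mode, median and mean can be checked on a small data set. The values below are hypothetical, chosen so that one set has a long right tail and the other is its mirror image; `skew_from_averages` is an illustrative name. The ordering is a rule of thumb and is not guaranteed to hold for every data set, so the sketch returns "unclear" when neither ordering applies.

```python
from statistics import mean, median, mode


def skew_from_averages(values):
    """Classify skew from the order of mode, median and mean."""
    mo, md, mn = mode(values), median(values), mean(values)
    if mo < md < mn:
        return "positive"
    if mn < md < mo:
        return "negative"
    return "unclear"


# Hypothetical data with a long right tail (assumption).
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
# Mirror image of the same data, giving a long left tail.
left_tailed = [15 - v for v in right_tailed]

for name, values in [("right-tailed", right_tailed),
                     ("left-tailed", left_tailed)]:
    print(name, "mode:", mode(values), "median:", median(values),
          "mean:", mean(values), "->", skew_from_averages(values))
```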

Example:

The graph below shows the distribution according to height of a group of jockeys at a south
Florida horse track. Select the statement that correctly describes a relationship between measures
of central tendency for this distribution.

A. The mean is less than the mode.
B. The mode and the mean are the same.
C. The median is greater than the mode.
D. The median and the mean are the same.

The graph below shows the distribution of a group of Gator fans according to the number of junk cars in their back yards. Select the statement that correctly gives a relationship between measures of central tendency for this distribution.

A. The mean is the same as the median.
B. The mean is less than the mode.
C. The mode is less than the mean.
D. The median is less than the mean.
