DISTRIBUTING DATA
● Summarising data:
- Large sets of numbers are hard to learn from.
- We need to take steps to simplify and summarise the numbers
into something more digestible.
- A good start is simply to count how often each number occurs.
- For example, we count how often each number occurs in a set of
dice rolls (Table 1).
- Raw data is converted into a summarised frequency table, which
can be further summarised into a chart.
- A chart is a quick and easy visual way to get some intuition
about a data set.
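The counting step above can be sketched in a few lines of Python. The rolls below are made-up illustration data, not the values from Table 1, and the text bar chart is only a stand-in for a real plot.

```python
from collections import Counter

# Hypothetical raw data: 20 rolls of a six-sided die (not the course's Table 1).
rolls = [3, 1, 6, 4, 2, 6, 5, 3, 3, 1, 2, 6, 4, 4, 5, 3, 2, 6, 1, 3]

# Count how often each face occurs; a face that never occurs gets a count of 0.
counts = Counter(rolls)
frequency_table = {face: counts.get(face, 0) for face in range(1, 7)}

# Print the frequency table with a crude text bar chart next to each count.
for face, count in frequency_table.items():
    print(f"{face}: {count:2d} {'#' * count}")
```

The dictionary is the summarised frequency table, and the row of `#` characters is the chart built from it.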
● Histograms for continuous data:
- The x-axis is split into ‘bins’.
- Each bin covers a set range.
- Here the bins have a width of 2
- This bin covers the range 23 to 25
with its centre on 24.
- Histograms suit continuous data such as height, i.e. anything
that can’t be classified into discrete values. We divide the range
of values into bins of width 2, so a value of 1.2 or 2.3 falls
within the 1-3 bin.
- And this one covers 1 to 3 with its centre on 2.
- We’ve omitted some x-axis labels to save space.
- If we get a 14 it will land in the 13-15 bin.
- Data point is added to the bin corresponding to its value.
- Each data point in the dataset is added to its bin until the whole
dataset is displayed.
- We can gradually see the shape of the distribution appear as
more data comes in.
- The final shape of the distribution is very informative. Here we
can see that values of 0 occur regularly and values of +30 or -30
are extremely rare.
- We can choose how many bins to use. This is an important
choice.
- More bins show the distribution but can get noisy.
- Fewer bins are less noisy but can also be less informative.
- Sometimes you will see different y-axes on histograms.
- Some histograms show the total count, and others a normalised
proportion.
- This isn’t critical either way. The shape is the important
information, so some programs (like Jamovi) don’t label the
y-axis at all.
- The relative height of the bars is what actually matters.
- This histogram is centred on zero, has most of its values
between -10 and +10, and is symmetrical.
- If we change the mean, the shape of the distribution stays the
same, but the centre of mass shifts so that the highest bars
occur where the most likely values are.
- If we change the variance, the distribution is stretched or
compressed to reflect how spread out the values in the data set
are.
- If we change the skewness (which is about the symmetry):
1. A negatively skewed distribution has a long tail pointing
towards negative values.
2. A positively skewed distribution has a long tail pointing
towards positive values.
- If we change the kurtosis (which reflects the ‘peakedness’ of the
data set): data with high kurtosis has a sharp peak and heavy tails,
while data with low kurtosis has a flatter, broader peak and light
tails.
- Shape is what is important.
- Histograms can show many different distributions of data.
- Here is a ‘normal’ distribution. This is a special example that
we’ll be looking at a lot during the course.
- However, there are many others, which can all be visualised
using histograms.
1. ‘Fat-tailed’ Distribution
2. Skewed Distribution
3. Uniform Distribution
4. Bimodal Distribution
5. Triangular Distribution
6. Decaying Distribution
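The binning rule described above (bins of width 2 with edges at ..., -1, 1, 3, ..., so that 14 lands in the 13-15 bin) can be sketched as follows. The function name `bin_edges_for`, the choice of half-open bins, and the simulated data are all assumptions made for illustration; Jamovi and other packages may place bin edges differently.

```python
import math
import random
from collections import Counter

def bin_edges_for(value, width=2.0, origin=-1.0):
    """Return the (low, high) edges of the bin that `value` falls into.

    Bins are half-open [low, high): the lower edge belongs to the bin.
    `origin` is any one bin edge; -1 gives edges at ..., -1, 1, 3, 5, ...
    """
    index = math.floor((value - origin) / width)
    low = origin + index * width
    return (low, low + width)

# Simulated data set (an assumption): 1000 values centred on 0.
random.seed(0)
data = [random.gauss(0, 10) for _ in range(1000)]

# Add each data point to its bin, using the width-2 bins from the notes.
hist = Counter(bin_edges_for(x) for x in data)

# Fewer, wider bins: less noisy, but also less detailed.
coarse = Counter(bin_edges_for(x, width=10.0, origin=-5.0) for x in data)

for (low, high), count in sorted(hist.items()):
    print(f"{low:6.1f} to {high:6.1f}: {'#' * (count // 5)}")
```

Comparing `len(hist)` with `len(coarse)` shows the trade-off when choosing the number of bins: the same data is spread over many narrow bins or a few wide ones.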
● Summary:
1. Histograms visualise the distribution of a dataset.
2. Increasing the number of bins in a histogram gives more
resolution.
3. …but can make the distribution noisier as well (particularly with a small
number of samples).
4. The shape of many histograms (but not all!) is well described by the
mean, variance, skewness and kurtosis.
5. There are many types of distribution that reflect differences in the
underlying dataset.
6. It’s a way of quickly gaining an intuition about what type of
distribution we’re working with in a particular data set.
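As a rough illustration of point 4, the four shape summaries can be computed directly from a list of values. This is a minimal sketch using the simple "divide by n" (population) formulas and excess kurtosis (kurtosis minus 3, so a normal distribution scores about 0); the function name `describe_shape` is made up here, and statistics packages often apply bias corrections, so their numbers can differ slightly.

```python
import random
import statistics

def describe_shape(values):
    """Return (mean, variance, skewness, excess kurtosis) of `values`.

    Uses population formulas: every sum is divided by n.
    """
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    sd = variance ** 0.5
    # Standardise each value, then average the 3rd and 4th powers.
    z = [(v - mean) / sd for v in values]
    skewness = sum(s ** 3 for s in z) / n
    excess_kurtosis = sum(s ** 4 for s in z) / n - 3.0
    return mean, variance, skewness, excess_kurtosis

# Simulated examples (assumptions): one symmetric, one with a long right tail.
random.seed(1)
symmetric = [random.gauss(0, 10) for _ in range(5000)]
right_skewed = [random.expovariate(0.1) for _ in range(5000)]

for name, sample in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    mean, var, skew, kurt = describe_shape(sample)
    print(f"{name}: mean={mean:.2f} variance={var:.2f} "
          f"skewness={skew:.2f} excess kurtosis={kurt:.2f}")
```

The symmetric sample should show skewness near zero, while the sample with the long right tail should show clearly positive skewness.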
● Learning statistics with Jamovi:
❖ Data collection can be thought of as a kind of measurement.
❖ Measurement is assigning numbers or labels to some kind of
‘stuff’.
❖ Example:
➔ My age is 33 years (measured in years, it could have been 1,
2, 3, …).
➔ When asked if I like anchovies, I might have said that I
do, or I do not, or I have no opinion, or I sometimes do.
➔ My chromosomal gender is male (it could be male (XY) or
female (XX), but there are other possibilities too, such as
Klinefelter’s syndrome (XXY)).
➔ My self-identified gender is female (it could be male,
female, neither, or transgender).
➔ In each example, the attribute (e.g. age) is ‘the thing to be
measured’ and the value (e.g. 33 years) is ‘the measurement
itself’.
❖ The way in which you specify the allowable measurement values
is important. Age can be in years, but can also be in ‘years
and months’, so 2 years and 11 months is written 2:11. For
newborns it might even be in ‘days’ or ‘hours’.
❖ We can define the concept ‘age’ in 2 different ways: the length
of time since conception and the length of time since birth.
This won’t make a huge difference in adults but might in
newborns: a baby born 2 weeks early and a baby born 1 week
late have different ages since conception on the day they are
born, even though both have the same age since birth.
❖ Methodology also matters:
1. Ask the person directly. This is fast, cheap and easy, but it
only works with people who can understand the question, and
they might lie.
2. Ask an authority such as a parent for a child’s age. This is
fast and easy, but parents might not know the time since
conception, so you might have to ask an obstetrician.
3. Look up birth or death certificates for accurate
info, but this is time-consuming and frustrating.
❖ Operationalisation (closely related to measurement) is the
process by which we take a meaningful but somewhat vague concept
and turn it into a precise measurement:
1. Be precise about what you are measuring, e.g. age as time
since birth or time since conception.
2. Determine the methodology (asking adults, using certificates,
and what question will be asked).
3. Define the set of allowable values that the measurement
can take. Age: in years, years and months, or hours?
Gender: do we allow only male and female, allow other
options, or allow any response? If so, how will we interpret
the answers?
❖ A theoretical construct. This is the thing that you’re trying to take
a measurement of, like “age”, “gender” or an “opinion”. A
theoretical construct can’t be directly observed, and often
they’re actually a bit vague.
❖ A measure. The measure refers to the method or the tool that
you use to make your observations. A question in a survey, a
behavioural observation or a brain scan could all count as a
measure.
❖ An operationalisation. The term “operationalisation” refers to the
logical connection between the measure and the theoretical
construct, or to the process by which we try to derive a measure
from a theoretical construct.
❖ A variable is what we end up with when we apply our measure to
something in the world. That is, variables are the actual “data”
that we end up with in our data sets.
❖ Different types of variables can be distinguished using scales of
measurement.
1. A nominal scale variable (categorical variable) is one in
which there is no particular relationship between the
different possibilities. E.g. eye colour: blue, green, brown.
We can’t really say a blue eye is bigger than a brown eye.
Similarly with gender, we don’t say male is worse than
female. The order in which the categories are listed in a
table doesn’t matter, as it doesn’t change the meaning.
You also can’t take an average: an ‘average eye colour’
doesn’t make sense.
2. An ordinal scale variable is one in which there is a natural,
meaningful way to order the different possibilities, but you
can’t do anything else. E.g. if we ask people which statement
they agree with:
- Temperatures are rising because of human activity
- Temperatures are rising but we don’t know why
- Temperatures are rising but not because of humans
- Temperatures are not rising
- This order runs from the statement that best matches what
the science is saying to the one that matches it least;
any other order wouldn’t make sense. Once we get all
the results, we can use the natural ordering of these
items to construct sensible groupings, which we couldn’t
do if the items were in a different order. We also
can’t generate an average: its value would indicate
nothing.
3. An interval scale variable is one in which the numerical value is
genuinely meaningful and the differences between the
numbers are interpretable, but the variable doesn’t have a
‘natural’ 0 value. E.g. temperature: 15° yesterday and
18° today means a 3° difference, so addition and
subtraction are meaningful. But 0° doesn’t mean ‘no
temperature’; it’s just the point at which water freezes.
4. A ratio scale variable is one whose numerical value is
meaningful, and it is the only scale on which 0 really means
zero. E.g. response time: you can add, subtract and divide,
and 0 means zero seconds.
❖ Variables can also be continuous or discrete:
- A continuous variable is one in which, for any two values
that you can think of, it’s always logically possible to have
another value in between.
- A discrete variable is, in effect, a variable that isn’t
continuous. For a discrete variable it’s sometimes the case
that there’s nothing in the middle.
- Nominal: discrete. Ordinal: discrete. Interval: discrete
or continuous. Ratio: discrete or continuous.
❖ The Likert scale: used to ask psychology questions, with options
such as: 1. Strongly disagree, 2. Disagree, 3. Neither agree nor
disagree, 4. Agree and 5. Strongly agree. These are discrete, as
you can’t answer 2.5. They are not nominal, since the items are
ordered, and not ratio scale, since there is no natural zero. They
are usually treated as quasi-interval scales: strictly ordinal,
but often analysed as if they were interval.
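To make the Likert discussion concrete, here is a small sketch of which summaries each scale type allows. The responses are invented for illustration. The mode only needs categories (nominal), the median needs an order (ordinal), and the mean only makes sense if we are willing to treat the 1 to 5 codes as an interval scale, which is the quasi-interval assumption.

```python
import statistics

LIKERT_LABELS = {
    1: "Strongly disagree",
    2: "Disagree",
    3: "Neither agree nor disagree",
    4: "Agree",
    5: "Strongly agree",
}

# Hypothetical responses from 11 participants, coded 1 to 5.
responses = [4, 5, 3, 4, 2, 4, 5, 1, 4, 3, 2]

# Mode: the most common category. Valid even for nominal data.
mode_code = statistics.mode(responses)

# Median: needs the categories to be ordered. median_low always returns
# one of the observed codes, so it never produces an answer like 2.5.
median_code = statistics.median_low(responses)

# Mean: treats the gaps between codes as equal (interval assumption).
mean_code = statistics.fmean(responses)

print("mode:  ", LIKERT_LABELS[mode_code])
print("median:", LIKERT_LABELS[median_code])
print(f"mean code (only if treated as interval): {mean_code:.2f}")
```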
DATA AND VARIABLES
● Dataset
❖ A collection of data acquired for a specific purpose.
❖ May relate to multiple experiments or hypotheses.
● Variable
❖ A number that can ‘vary’ (e.g. take a high or low value)
depending on an attribute that we’re trying to measure.
❖ We typically measure several variables from each participant.
❖ These typically form one column in a data file.
● Types of variables
❖ Nominal: No relationship between different possibilities in scale.
Sometimes called ‘Categorical’ data. E.g. country of origin.
❖ Ordinal: A natural order between possibilities but nothing else.
Can’t interpret the ‘magnitude’ of differences. E.g. Likert scales.
❖ Interval: The possibilities are ordered and have interpretable
magnitudes, though ‘zero’ does not have special meaning. E.g.
temperature.
❖ Ratio: Like interval data, but now zero is directly interpretable
and we can interpret ratios between values. E.g. reaction times.
❖ Continuous: A variable that can change freely to take any value.
E.g. temperature: 4 °C, 10.34 °C.
❖ Discrete: A numeric variable that takes one of a fixed set of
values. E.g. number of cars.
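The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below assumes the temperatures are in degrees Celsius and uses made-up reaction times. Differences on an interval scale survive a change of units, but ratios do not; on a ratio scale, ratios survive too.

```python
import math

def celsius_to_fahrenheit(c):
    """Standard conversion: F = C * 9/5 + 32."""
    return c * 9 / 5 + 32

# Interval scale: temperature (assumed to be in degrees Celsius).
yesterday_c, today_c = 15.0, 18.0
diff_c = today_c - yesterday_c
diff_f = celsius_to_fahrenheit(today_c) - celsius_to_fahrenheit(yesterday_c)

# The difference is the same physical amount in either unit (x 9/5).
print(math.isclose(diff_f, diff_c * 9 / 5))

# The ratio depends on the unit, so "18 is 1.2 times as hot as 15" is
# not a meaningful statement.
ratio_c = today_c / yesterday_c
ratio_f = celsius_to_fahrenheit(today_c) / celsius_to_fahrenheit(yesterday_c)
print(math.isclose(ratio_c, ratio_f))

# Ratio scale: reaction times (hypothetical values, in seconds).
rt_fast_s, rt_slow_s = 0.4, 0.8
ratio_s = rt_slow_s / rt_fast_s
ratio_ms = (rt_slow_s * 1000) / (rt_fast_s * 1000)

# The ratio is the same in seconds and milliseconds, so "twice as slow"
# is meaningful.
print(math.isclose(ratio_s, ratio_ms))
```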
● Choice of measurement
❖ How we choose to measure a variable can affect its type:
- A person’s true height is continuous & ratio as it is an
attribute which has an exact value with an interpretable
zero that we could measure to any precision.
- But, if an experiment chooses to record height to the
nearest centimeter, then that variable would be discrete &
ratio.
- If an experiment chooses to record only whether the
person is ‘tall’ or ‘short’, then the variable would be
discrete & ordinal.
- These labels are not fixed or absolute, they are only a
guide.
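The height example can be sketched in code: the same underlying attribute gives three different variable types depending on how it is recorded. The heights and the 170 cm cut-off for ‘tall’ are arbitrary assumptions chosen only for illustration.

```python
# Hypothetical 'true' heights in centimetres (continuous & ratio).
true_heights_cm = [158.73, 171.20, 182.55, 165.01]

# Recorded to the nearest centimetre: discrete & ratio.
rounded_cm = [round(h) for h in true_heights_cm]

# Recorded only as 'tall' or 'short': discrete & ordinal.
# The cut-off is an arbitrary choice made for this sketch.
TALL_CUTOFF_CM = 170
labels = ["tall" if h >= TALL_CUTOFF_CM else "short" for h in true_heights_cm]

for true, rounded, label in zip(true_heights_cm, rounded_cm, labels):
    print(f"{true:7.2f} cm -> {rounded} cm -> {label}")
```

Each step throws away information: the rounded values can no longer distinguish 171.20 from 171.40, and the labels can no longer distinguish 171 from 183.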
EXPERIMENTS IN PSYCHOLOGY
● What is a hypothesis?
❖ A proposed prediction about a phenomenon
❖ A scientific hypothesis is a testable prediction derived from
previous data or theory.
❖ The scientific process is defined by generating and testing
different hypotheses.
❖ Statistics are formal methods for comparing different
hypotheses using data.
❖ Examples:
- “If I drop a ball it will fall towards the ground”
- “Drinking coffee will improve how people drive when they
are tired”
- “People are better at perceiving things that happen at regular
times”
- “A new drug will reduce the symptoms of depression in
secondary school pupils”
● The null hypothesis
❖ How would our data look if there was no effect and everything
was down to chance?
❖ We need something to compare our hypothesis to.
❖ This is typically a description of how the data would look if there
was no effect at all: a ‘Null’ hypothesis.
❖ Statistical tests work to reject the Null hypothesis. If the null
hypothesis is not supported by the data, then we accept the
experimental hypothesis.
❖ Example:
- “If I drop a ball, it will stay hanging in the air”
- “People will drive the same whether they have consumed
any coffee or not”
- “People do not use the timing of events to guide their
perception”
- “A new drug will make no difference to the symptoms of
depression in secondary school pupils”
● What is an experiment?
❖ A scientific procedure used to test a specific hypothesis.
❖ An experiment is when we manipulate something in a controlled
way to see its effect on something else.
❖ Using terms from the previous section, we manipulate one
variable to see its effect on another variable.
❖ Sometimes the data to test a hypothesis may already exist, but
often we will need to design an experiment to test it.
● The role of variables
❖ Different variables have different roles within an experiment.
❖ In an experiment, we manipulate one variable to see its effect
on another. The variable we manipulate is the ‘Independent’
variable (or ‘Predictor’), and the variable we measure is the
‘Dependent’ variable (or ‘Outcome’).
❖ We may have more than one outcome or predictor in a single
study.
- Worked example:
❖ Hypotheses:
- Experimental Hypothesis H1:
- Revising with background music leads to
lower marks in the exam.
- Null hypothesis H0:
- Revising with background music does NOT
lead to lower marks in the exam.
- Exam marks are the same whether people
revised with background music or not.
❖ Experiment
- Dependent variable:
- Exam marks
- Independent variable:
- Revision method - music or no music
- Experiment
- Between groups comparison, one group
revises with background music and the other
in silence. We compare exam marks between
the two groups.
- If the groups have the same marks, then we accept
the null hypothesis - if not we can accept the
experimental hypothesis.
- But how do we know if the groups are different or
not…?
● Statistics
❖ Statistics enable us to make comparisons based on data and to
tell whether a hypothesis is supported by the data.
❖ Test statistic: a value that quantifies how close the data are to
what the null hypothesis predicts.
❖ P-value: the probability that a test statistic at least as
extreme as the one observed could occur, if the null hypothesis
is true.
❖ If the test statistic is small and likely to have happened just by
chance, then we cannot reject the null hypothesis (informally, we
accept it).
❖ If the test statistic is large and very unlikely to have occurred by
chance, then we reject the null hypothesis and can accept the
experimental hypothesis.
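One way to see how a test statistic and a p-value fit together is a permutation test on the background-music example. This is only a sketch of one possible test, not necessarily the one used in the course: the exam marks are invented, the test statistic is the difference in group means, and the function names are made up here. Shuffling the group labels simulates a world where the null hypothesis is true, and the p-value is the fraction of shuffles that give a difference at least as extreme as the real one (two-sided).

```python
import random
import statistics

# Hypothetical exam marks for the two groups (invented for illustration).
music = [52, 61, 48, 55, 63, 50, 58, 47, 54, 60]
silence = [65, 59, 70, 62, 68, 57, 72, 66, 61, 64]

def mean_difference(group_a, group_b):
    """Test statistic: mean of group_a minus mean of group_b."""
    return statistics.fmean(group_a) - statistics.fmean(group_b)

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Return (observed statistic, two-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = mean_difference(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the group labels are arbitrary,
        # so we reassign them at random.
        rng.shuffle(pooled)
        shuffled = mean_difference(pooled[:n_a], pooled[n_a:])
        if abs(shuffled) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles

observed, p_value = permutation_p_value(music, silence)
print(f"difference in means = {observed:.1f}, p = {p_value:.4f}")
```

A small p-value means a difference this large would rarely arise from chance alone, which is evidence against the null hypothesis.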
● Big picture
❖ We need to think through the whole research cycle
➔ What literature are we following up?
➔ What is our hypothesis, or is this exploratory research?
➔ What data would we need to answer our question?
➔ How can we collect this data?
➔ What statistics do we need to carry out?
➔ How do we describe and report our research?
➔ How do we share our results with the world?