BECS 184 - Guess Paper PDF
game of the commoners. Today one has the computing power to easily load software of one's choice or need onto one's PC. There is a plethora of ready-made computer packages available today, and one can find different statistical packages for applications in different disciplines. We will describe two such packages that are readily available, popular and user-friendly. We will also give a glimpse of some other packages in the subsequent section.
Microsoft Excel: Microsoft Excel is a big worksheet (it can take data rows in thousands across 256 columns).
This worksheet can be used for data entry and for performing calculations at the click of a button. It has a “paste function” feature where you can paste any formula from a big list of inbuilt functions. MS Excel can be used to create tables and graphs and to perform statistical calculations. The work done in MS Excel can be easily copied and pasted into many Windows-based programs for further analysis. According to Pottel, “Spreadsheets are a useful
and popular tool for processing and presenting data. In fact, Microsoft Excel spreadsheets have become
somewhat of a standard for data storage, at least for smaller data sets. The fact that the program is often being
packaged with new computers, which increases its easy availability, naturally encourages its use for statistical
analysis. However, many statisticians find this unfortunate, since Excel is clearly not a statistical package.
There is no doubt about that, and Excel has never claimed to be one. But one should face the facts that due to
its easy availability many people, including professional statisticians, use Excel, even on a daily basis, for
quick and easy statistical calculations. Therefore, it is important to know the flaws in Excel, which, unfortunately, still exist today. … Excel is clearly not an adequate statistics package because many statistical methods are simply not available. This lack of functionality makes it difficult to use it for more than computing summary statistics and simple linear regression and hypothesis testing”. However, in MS Excel 2003 some aspects of the statistical functions, including rounding of results and precision, have been enhanced.
The MS Excel worksheet is a collection of cells. As we have said earlier, there are 65,536 (rows) × 256
(columns) cells in an MS Excel worksheet. Each row or column can be used to enter data belonging to one
category. Data entry in MS Excel is as simple as writing on a piece of paper. MS Excel assigns each column a
field depending upon the type of data. It supports various data formats; one can choose a data format by
formatting the cells. Once the type of cells is defined it is easy to enter the data without taking care of the
format. MS Excel can perform usual calculations on the data so entered. It has an insert function (fx) icon that
contains many inbuilt functions like sum, count, max/min, standard deviation etc. In fact it has a plethora of
built-in functions that perform special calculations without even typing the formula. To perform a
calculation one has to select a function and specify the range of values on which it has to be applied. These
functions are known as paste functions. We will concentrate on the statistical functions and see some of the
major statistical functions of MS Excel. As you can see in figure 9.2, once you go to the function menu and choose the “statistical” category, you will be asked to select a function. Suppose you have chosen the t-test. You will be told on the same screen that the t-test returns the probability associated with a Student's t-test. If you are still not comfortable with the description, you may select help on this function, which is at the bottom left of the screen. MS Excel also has a built-in statistical package for taking you into further details of data analysis. It provides a set of data analysis tools called the Analysis ToolPak,
which you can use to save steps when you develop complex statistical analyses. You provide the data and
parameters for each analysis; the tool uses the appropriate statistical macro functions and then displays the
results in an output table. Some tools generate charts in addition to output tables. The table given below
highlights the functions and uses of these tools.
You have seen that MS Excel can do most of the common statistical calculations. Two more features are worth mentioning when one talks about the statistical functions of MS Excel: cross-tabulations and pivot tables, and the graphical features.
MS Excel can be used to create cross tabulations or two-way frequency tables across categorical variables. In
MS Excel there is a pivot table wizard which helps in creating tables in multiple dimensions. Let us explain these
concepts with the help of an example. The data given below is the percentage contribution of a country to
world research in a particular subject area.
Now suppose you want to make a pivot table that would enable you to visualise a country whose contributions differ between the disciplines of physics and chemistry. You can simply drag the subject field into the rows and columns. This would enable you to see, for example, that countries such as Hungary and India have contributions that differ between the two disciplines.
There could be many more such cross-tabulations depending upon the need of the researcher. The most
pleasant part of working with MS Excel is the ease with which you can drag these fields and have a
customised layout as per your wish. It is said that in MS Excel, the most demanding work is to input the data,
the analysis being the easiest. You can use pivot tables effectively to present data where two-dimensional
tables are important. One of the major advantages of this feature is that once the table is prepared, we can
change the summary from one characteristic to another.
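For readers working in Python rather than Excel, the same cross-tabulation idea can be sketched with the pandas library; the country, subject and share values below are hypothetical and not taken from the unit's table.

    import pandas as pd

    # Hypothetical data: a country's percentage share of world research output
    # in two subject areas (illustrative numbers only).
    data = pd.DataFrame({
        "country": ["Hungary", "Hungary", "India", "India"],
        "subject": ["Physics", "Chemistry", "Physics", "Chemistry"],
        "share":   [0.8, 1.4, 2.1, 3.0],
    })

    # Countries as rows and subjects as columns, mirroring what the Excel
    # pivot table wizard produces by dragging the fields.
    table = pd.pivot_table(data, values="share", index="country",
                           columns="subject", aggfunc="sum")
    print(table)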
Similarly, if you want to make a graphical presentation of the data then you can go to the chart wizard and
choose the chart that you want to make. MS Excel has a built-in facility to create graphs and charts. There are
several types of charts and graphs supported by MS Excel like bar charts, line charts, pie charts and scatter
diagrams etc. The chart wizard menu can be summoned by clicking on the graph icon from the menu bar.
SPSS: SPSS is short for Statistical Package for the Social Sciences. It is a very popular package due to its features and compatibility with other Windows-based programs. In the late 1960s, three Stanford University graduate students developed the SPSS statistical software system.
Once you open a data file, you can go to the Analyze menu and start working on the statistical aspects of the data. You can find frequencies, cross-tabulations, ratios, etc. You can see that there is a long list of statistical analyses in the Analyze menu. We will give you a glimpse of what these functions do.
One can see that many of these tools are available in MS Excel also, but the difference is that the output given
by SPSS contains many more details regarding the statistical aspects of the findings. For example, the cross-tabs procedure forms two-way and multi-way tables in both MS Excel and SPSS, but in SPSS it also provides a variety of tests and measures of association for two-way tables. The supporting statistics provided in SPSS include
Pearson chi-square, likelihood-ratio chi-square, linear-by-linear association test, Fisher’s exact test, Yates’
corrected chi-square, Pearson’s r, Spearman’s rho, contingency coefficient, phi, Cramér’s V, symmetric and
asymmetric lambdas, Goodman and Kruskal’s tau, uncertainty coefficient, gamma, Somers’ d, Kendall’s tau-
b, Kendall’s tau-c, eta coefficient, Cohen’s kappa, relative risk estimate, odds ratio, McNemar test, and
Cochran’s and Mantel-Haenszel statistics. SPSS is thus more comprehensive. SPSS also supports several
statistical graphs. It displays many statistics on the graph itself. It has a feature that helps you to find a chart
that is most suitable for your data, called “Chart Galleries by Data Structure”. Now suppose you have chosen “single categorical variable” as the gallery that best describes your data, and on the next screen you choose “Simple Pareto Counts or Sums for Groups of Cases”. SPSS will then describe what this graph does: “Creates a bar chart summarizing categories of a single variable, sorted in descending order. A line shows the cumulative sum.” In this way, SPSS acts as a mentor as well, which is probably one reason for its success. SPSS has a
menu called “Statistics Coach”, which asks questions about your data like “What do you want to do with
your data?” It asks you further questions about your data in four steps and then suggests the right kind of analysis for your dataset. The output of SPSS appears as pivot tables, which can be cut and pasted into Word
documents, Excel worksheets and PowerPoint presentations. According to Wegman and Solka (2005), “SPSS
supports numerous add-on modules including one for regression, advanced models, classification trees, table
creation, exact tests, categorical analysis, trend analysis, conjoint analysis, missing value analysis, map-based
analysis, and complex samples analysis”.
Advantages of Content Analysis
Coding allows the research team to “examine, compare, conceptualize and categorize data”. In regard to analyzing qualitative data on older persons, categories could, for instance, be health issues, care provided by relatives or institutions, income security in old age, or the household situation in which older persons live. It is helpful to begin the coding process as early as possible, i.e. after interviews and focus group discussions have been transcribed. Early
coding would permit the research team to categorize data and to perceive the social reality of older persons
through those categories. Coding would allow patterns to emerge from the field notes and other collected material. Established codes should be reviewed to ensure that changes in coding could be made in case it seems prudent to do so. Coding is therefore a highly flexible approach to making sense of collected qualitative
data. Various codified categories could be connected. The evaluation team should explore possible linkages
and how categories could be related to each other. Coding, however, does not substitute for analysis. Since it
is only a mechanism to categorize data, the findings still have to be interpreted. Content analysis is the coding
of documents and transcripts, to obtain counts of words and/or phrases for purposes of statistical analysis.
The evaluation team creates a dictionary, which clusters words and phrases into conceptual categories for
purposes of counting. Based upon this, the recurrence of often used words and phrases can be utilized and
would inform the team of important topics that are mentioned repeatedly by older persons during interviews
and focus group discussions. Narrative analysis is another method to analyze qualitative research data. It
attempts to analyze a chronologically told story, with a focus on how elements of the story are sequenced and
why some elements are evaluated differently from others. Narrative analysis is seen as an alternative to semi-
structured interviews, allowing for the uninterrupted flow of information. Some proponents of narrative
analysis see it as a truly participatory and empowering research methodology insofar as it gives respondents
the venue to articulate their own viewpoints without any structure restricting their expressions on a particular
subject. Four models of narrative analysis can be distinguished: thematic analysis (emphasis on what is said
compared to how it is said), structural analysis (emphasis on the way a story is told), interactional analysis
(emphasis on the dialogue) and performative analysis (emphasis on performance, such as gestures used).
Notes and transcriptions of semi-structured interviews are analyzed to interpret the findings of a story told
concerning a particular event. Problems with narrative analysis are that the memory can deceive the narrator
regarding the accuracy of a story told. Some researchers call for the introduction of questions at the end of a
story to clarify any outstanding issues. Another criticism of narrative analysis is that stories told are treated
uncritically and are only recorded without any analysis. Qualitative data analysis is not controlled by the
same strict rules as quantitative analysis. The nature of qualitative data contributes to the evolving nature of
analysis and to a less structured approach. Coding, content analysis and narrative analysis seem to be rather
tentative approaches to interpret collected data and to invoke meaning of the material assembled.
Nevertheless, qualitative evaluation should be able to portray a social reality in greater complexity compared
to quantitative methods. Participatory research is capable of generating more nuances concerning the lives of
older persons. Such an approach should however be flanked by quantitative methods that would complement
the qualitative data. Quantitative methods of data collection will be the focus of the next section of this paper.
Quantity Indices: The major focus of consideration and comparison in these indices is the quantities, either of a single commodity or of a group of commodities. For example, the focus may be to understand the changes in the quantity of paddy production in India over different time periods. For this purpose, a single commodity's quantity index will have to be constructed. Alternatively, the focus may be to understand the changes in food grain production in India; in this case, all commodities which are categorized under food grains will be considered while constructing the quantity index.
Value Indices: Value indices measure the combined effects of price and quantity changes. For many situations, either a price index or a quantity index alone may not be enough for the purpose of a comparison. For example, an index may be needed to compare the cost of living for a specific group of persons in a city or a region. Here a comparison of the expenditure of a typical family of the group is more relevant. Since this involves comparing expenditure, it is the value index which will have to be constructed. These indices are useful in production decisions, because they avoid the effects of inflation.
obvious that we are referring to the production of all those items that are produced by the industrial sector.
However, production of some of these items may be increasing while that of others may be decreasing or may
remain constant. The rate of increase or decrease, and the units in which these items are expressed, may differ. For instance, cement may be quoted per kg, cloth per metre, cars per unit, etc. In such a
situation, when the purpose is to measure the changes in the average level of prices or production of
industrial products for comparison over time or with respect to geographic location, it is not appropriate to apply the technique of measures of central tendency, because it is not useful when series are expressed in different units and/or relate to different items. It is in these situations that we need a specialised average, known as
index numbers. These are often termed as ‘economic barometers’. An index number may be defined as a
special average which helps in comparison of the level of magnitude of a group of related variables under two or more situations. Index numbers are a series of numbers devised to measure changes over a specified time period (the time period may be daily, weekly, monthly, yearly, or any other regular time interval), or to make comparisons with reference to one variable or a group of related variables. Thus, each number in a series of index numbers expresses the level of the variable in a given period relative to its level in the base period:
Index number = (value of the variable in the current period ÷ its value in the base period) × 100
Index numbers have become indispensable for analyzing data related to business and economic activity. This statistical
tool can be used in several ways as follows:
1) Decision makers use index numbers as part of intermediate computations to understand other
information better. Nominal income can be transformed into real income. Similarly, nominal sales
into real sales & so on …, through an appropriate index number. Consumer price index, also known
as cost of living index, is arrived at for a specified group of consumers in respect of prices of specific
commodities and services which they usually purchase. This index serves as an indicator of ‘real’
wages (or income) of the consumers. For example, an individual earns Rs. 100/- in the year 1970 and
his earnings increase to Rs. 300/- in the year 1980. If during this period, consumer price index
increases from 100 to 400 then the consumer is not able to purchase the same quantity of different
commodities with Rs. 300, which he was able to purchase in the year 1970 with his income of Rs. 100/-
. This means that real income has declined. Thus, real income can be calculated by dividing the actual income by the consumer price index and multiplying by 100:
Real income in 1980 = (Nominal income ÷ Consumer price index) × 100 = (300 ÷ 400) × 100 = Rs. 75
Therefore, the consumer’s real income in the year 1980 is Rs. 75/- as compared to his income of Rs.
100/- in the year 1970. We can also say that because of price increase, even though his income has
increased, his purchasing power has decreased.
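As a minimal sketch, the deflation calculation from the example above can be written in a few lines of Python (the figures are the ones used in the text):

    # Real income = (nominal income / consumer price index) * 100
    nominal_income_1980 = 300      # Rs., income in 1980
    cpi_1980 = 400                 # consumer price index in 1980 (1970 = 100)

    real_income_1980 = nominal_income_1980 / cpi_1980 * 100
    print(real_income_1980)        # 75.0, i.e. Rs. 75 at 1970 prices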
2) Different types of price indices are used for wage and salary negotiations and for compensating for price rise in the form of DA (Dearness Allowance).
3) Various indices are useful to the Government in framing policies. Some of these include taxation
policies, wage and salary policies, economic policies, custom and tariffs policies etc.
4) Index numbers can also be used to compare cost of living across different cities or regions for the
purpose of making adjustments in house rent allowance, city compensatory allowance, or some other
special allowance.
5) Indices of Industrial Production, Agricultural Production, Business Activity, Exports and Imports are
useful for comparison across different places and are also useful in framing industrial policies,
import/export policies etc.
6) BSE SENSEX is an index of share prices for shares traded in the Bombay Stock Exchange. This helps
the authorities in regulating the stock market. This index is also an indicator of general business
activity and is used in framing various government policies. For example, if the share prices of most
of the companies comprising any particular industry are continuously falling, the government may
think of changes in its policies specific to that industry with a view to helping it.
Sometimes, it is useful to correlate the index related to one industry with the index of another industry or activity so as to understand and predict changes in the first industry. For example, the cement industry can keep track of the index of construction activity. If the index of construction activity is rising, the cement industry can expect an increase in the demand for cement.
Ans. Data in statistics is sometimes classified according to how many variables are in a particular study. For
example, “height” might be one variable and “weight” might be another variable. Depending on the number
of variables being looked at, the data might be univariate, or it might be bivariate. When you conduct a study
that looks at a single variable, that study involves univariate data. For example, you might study a group of
college students to find out their average SAT scores or you might study a group of diabetic patients to find
their weights. Bivariate data is when you are studying two variables. For example, if you are studying a
group of college students to find out their average SAT score and their age, you have two pieces of the
puzzle to find (SAT score and age). Or if you want to find out the weights and heights of college students,
then you also have bivariate data. Bivariate data could also be two sets of items that are dependent on each
other. For example:
• Ice cream sales compared to the temperature that day.
• Traffic accidents along with the weather on a particular day.
Bivariate data has many practical uses in real life. For example, it is pretty useful to be able to predict when a
natural event might occur. One tool in the statistician’s toolbox is bivariate data analysis. Sometimes,
something as simple as plotting one variable against another on a Cartesian plane can give you a clear picture
of what the data is trying to tell you. For example, the scatterplot below shows the relationship between the
time between eruptions at Old Faithful vs. the duration of the eruption.
Waiting time between eruptions and the duration of the eruption for the Old Faithful Geyser in Yellowstone National
Park, Wyoming, USA. This scatterplot suggests there are generally two “types” of eruptions: short-wait-short-duration,
and long-wait-long-duration.
Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of statistical analysis,
used to find out if there is a relationship between two sets of values. It usually involves the variables X and Y.
For example, you might study the relationship between caloric intake and weight. Caloric intake would be your independent variable, X, and weight would be your dependent variable, Y.
Bivariate analysis is not the same as two sample data analysis. With two sample data analysis (like a two
sample z test in Excel), the X and Y are not directly related. You can also have a different number of data
values in each sample; with bivariate analysis, there is a Y value for each X. Let’s say you had a caloric intake
of 3,000 calories per day and a weight of 300lbs. You would write that with the x-variable followed by the y-
variable: (3000,300).
Two sample data analysis:
Sample 1: 100,45,88,99
Sample 2: 44,33,101
Bivariate analysis
(X,Y)=(100,56),(23,84),(398,63),(56,42)
Types of Bivariate Analysis
Common types of bivariate analysis include:
1. Scatter Plots: These give you a visual idea of the pattern that your variables follow.
A simple scatterplot.
2. Regression Analysis: Regression analysis is a catch all term for a wide variety of tools that you can use to
99 a
determine how your data points might be related. In the image above, the points look like they could follow
an exponential curve (as opposed to a straight line). Regression analysis can give you the equation for that curve or line.
3. Correlation Coefficients: Calculating correlation coefficients is usually done on a computer, although you can also work out a correlation coefficient by hand. This coefficient tells you if the variables are related. Basically, a zero means they aren't correlated (i.e. related in some way), while a 1 (either
positive or negative) means that the variables are perfectly correlated (i. e. they are perfectly in sync with each
other).
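A minimal Python sketch of computing a correlation coefficient for bivariate data; the temperature and sales figures below are hypothetical.

    import numpy as np

    # Hypothetical bivariate data: daily temperature (X) and ice cream sales (Y).
    x = np.array([18, 22, 25, 28, 31, 34])
    y = np.array([110, 145, 160, 190, 220, 250])

    # Pearson correlation coefficient: values near +1 or -1 indicate a strong
    # linear relationship, values near 0 indicate little or no linear relationship.
    r = np.corrcoef(x, y)[0, 1]
    print(round(r, 3))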
summarize the data using statistics that represent the majority of people in the population for whom the
question is being asked.
Reasons to Use Univariate Data: Data is gathered for the purpose of answering a question, or more
specifically, a research question. Univariate data does not answer research questions about relationships
between variables, but rather it is used to describe one characteristic or attribute that varies from observation
to observation. To describe how net worth varies, we would use univariate data to find the statistics that
represent the center value for all American households along with how the other values spread from that
center value. A researcher would want to conduct a univariate analysis for two purposes. The first purpose
would be to answer a research question that calls for a descriptive study on how one characteristic or attribute
varies, such as describing how net worth varies from American family to American family. A second purpose
would be to examine how each characteristic or attribute varies before including two variables in a study
using bivariate data or more than two variables in a study using multivariate data (bivariate data being for a
2-variable relationship and multivariate data being for a more than 2-variable relationship). For example, it
would be beneficial to examine how net worth per family varies before including it in an analysis that
correlates it with a second variable, say, educational attainment.
Q. Explain the concept of cumulative and relative frequency.
Ans. It is often useful to express class frequencies in different ways. Rather than listing the actual frequency
opposite each class, it may be appropriate to list either cumulative frequencies or relative frequencies, or both.
Cumulative Frequencies. As the name indicates, these cumulate the frequencies, starting at either the lowest or highest value. The cumulative frequency of a given class interval thus represents the total of all the previous class frequencies including the class against which it is written. To illustrate the concept, the class frequencies are simply added one class at a time, carrying the running total forward.
Relative Frequencies. Very often, the frequencies in a frequency distribution are converted to relative
frequencies to show the percentage for each class. If the frequency of each class is divided by the total number
of observations (total frequency), then this proportion is referred to as relative frequency. To get the
percentage for each class, multiply the relative frequency by 100. As an example, cumulative frequencies, relative frequencies and percentages can be worked out as in the sketch below:
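Since the worked table of the original unit is not reproduced here, the following minimal Python sketch uses hypothetical class frequencies to show how cumulative frequencies, relative frequencies and percentages are obtained:

    # Hypothetical class frequencies for a five-class frequency distribution.
    frequencies = [5, 12, 18, 9, 6]
    total = sum(frequencies)                 # total number of observations (50)

    cumulative = []
    running = 0
    for f in frequencies:
        running += f                         # add each class to the running total
        cumulative.append(running)

    relative = [f / total for f in frequencies]     # proportion in each class
    percentage = [100 * p for p in relative]        # relative frequency x 100

    print(cumulative)    # [5, 17, 35, 44, 50]
    print(percentage)    # [10.0, 24.0, 36.0, 18.0, 12.0]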
some subjects only get a light brush, others get wrapped up like a clam in the tentacles’ vice-like grip. Data
science falls into the latter category. If you want to do data science, you’re going to have to deal with math. If
you’ve completed a math degree or some other degree that provides an emphasis on quantitative skills,
you’re probably wondering if everything you learned to get your degree was necessary. I know I did. And if
you don’t have that background, you’re probably wondering: how much math is really needed to do data
science?
In this post, we’re going to explore what it means to do data science and talk about just how much math you
need to know to get started. Let’s start with what “data science” actually means. You probably could ask a
dozen people and get a dozen different answers! Here at Dataquest, we define data science as the discipline of
using data and advanced statistics to make predictions. It’s a professional discipline that’s focused on creating
understanding from sometimes-messy and disparate data (although precisely what a data scientist is tackling
will vary by employer). Statistics is the only mathematical discipline we mentioned in that definition, but data
science also regularly involves other fields within math. Learning statistics is a great start, but data science
also uses algorithms to make predictions. These algorithms are called machine learning algorithms and there
are literally hundreds of them. Covering how much math is needed for every type of algorithm in depth is not within the scope of this post, but I will discuss how much math you need to know for each of the following commonly used algorithms:
• Naive Bayes
• Linear Regression
• Logistic Regression
• Neural Networks
• K-Means Clustering
• Decision Trees
Naïve Bayes’ Classifiers: Naïve Bayes’ classifiers are a family of algorithms based on the common principle
that the value of a specific feature is independent of the value of any other feature. They allow us to predict
the probability of an event happening based on conditions we know about the events in question. The name comes from the “naive” assumption that the features are treated as independent of one another, combined with the use of Bayes' theorem. Math we need: If you want to understand how Naive Bayes classifiers work, you need to understand the
fundamentals of probability and conditional probability. To get an introduction to probability, you can check
out our course on probability. You can also check out our course on conditional probability to get a thorough
understanding of Bayes’ Theorem, as well as how to code Naive Bayes from scratch.
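As a minimal sketch (assuming scikit-learn is available; the feature values and labels are hypothetical), a Gaussian Naive Bayes classifier can be fitted and used for prediction in a few lines:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # Hypothetical training data: two numeric features per observation and a
    # binary class label.
    X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
    y = np.array([0, 0, 1, 1])

    # Gaussian Naive Bayes treats each feature as conditionally independent
    # given the class, which is the "naive" assumption described above.
    model = GaussianNB()
    model.fit(X, y)

    print(model.predict([[1.1, 2.0]]))        # this point lies near the class-0 examples
    print(model.predict_proba([[1.1, 2.0]]))  # posterior probability for each class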
Linear Regression: Linear regression is the most basic type of regression. It allows us to understand the
relationships between two continuous variables. In the case of simple linear regression, this means taking a
set of data points and plotting a trend line that can be used to make predictions about the future. Linear
regression is an example of parametric machine learning. In parametric machine learning, the result of the training process is a mathematical function that best approximates the patterns found in the training set. That mathematical function can then be used to make predictions about
expected future results. In machine learning, mathematical functions are referred to as models. In the case of
linear regression, the model can be expressed as:
y = β0 + β1x, where β0 is the intercept, β1 is the slope of the fitted line, and x is the independent variable.
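A minimal sketch of fitting this model by ordinary least squares, assuming NumPy is available; the data points are hypothetical.

    import numpy as np

    # Hypothetical data roughly following y = 1 + 2x with some noise.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

    # Ordinary least squares fit of y = b0 + b1*x: np.polyfit returns the
    # coefficients that minimise the sum of squared errors.
    b1, b0 = np.polyfit(x, y, deg=1)
    print(round(b0, 2), round(b1, 2))     # fitted intercept and slope

    # The fitted function can then be used to predict expected future values.
    print(round(b0 + b1 * 6.0, 2))        # prediction at x = 6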
Logistic Regression: Logistic regression focuses on estimating the probability of an event occurring in cases
where the dependent variable is binary (i. e. , only two values, 0 and 1, represent outcomes). Like linear
regression, logistic regression is an example of parametric machine learning. Thus, the result of the training
process for these machine learning algorithms is a mathematical function that best approximates the patterns
in the training set. But where a linear regression model outputs a real number, a logistic regression model
outputs a probability value. Just as a linear regression algorithm produces a model that is a linear function, a
logistic regression algorithm produces a model that is a logistic function. You might also hear it referred to as a sigmoid function, which squashes all values to produce a probability result between 0 and 1. The sigmoid function can be represented as follows: S(x) = 1 / (1 + e^(−x)).
So why does the sigmoid function always return a value between 0 and 1? Remember from algebra that
raising any number to a negative exponent is the same as raising that number’s reciprocal to the
corresponding positive exponent. Math we need: We’ve discussed exponents and probability here, and you’ll
want to have a solid understanding of both Algebra and probability to get a working knowledge of what is
happening in logistic algorithms. If you want to get a deep conceptual understanding, I would recommend
learning about Probability Theory as well as Discrete Mathematics or Real Analysis.
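A minimal sketch of the sigmoid function in Python, showing that its output always lies strictly between 0 and 1:

    import math

    def sigmoid(z):
        # Squashes any real number into the interval (0, 1).
        return 1.0 / (1.0 + math.exp(-z))

    # Large negative inputs approach 0, large positive inputs approach 1,
    # and an input of 0 maps to exactly 0.5.
    for z in (-5.0, 0.0, 5.0):
        print(z, round(sigmoid(z), 4))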
Neural Networks: Neural networks are machine learning models that are very loosely inspired by the
structure of neurons in the human brain. These models are built by using a series of activation units, known
as neurons, to make predictions of some outcome. Neurons take some input, apply a transformation function,
and return an output. Neural networks excel at capturing nonlinear relationships in data and aid us in tasks
such as audio and image processing. While there are many different kinds of neural networks (recurrent neural networks, feed-forward neural networks, etc.), they all rely on that
fundamental concept of transforming an input to generate an output. When looking at a neural network of
any kind, we’ll notice lines going everywhere, connecting every circle to another circle. In mathematics, this is
what is known as a graph, a data structure that consists of nodes (represented as circles) that are connected by
edges (represented as lines). Keep in mind, the graph we're referring to here is not the same as the graph of a
linear model or some other equation. If you’re familiar with the Traveling Salesman Problem, you’re probably
familiar with the concepts of graphs. At its core, a neural network is a system that takes in some data, performs
some linear algebra and then outputs some answers. Linear algebra is the key to understanding what is going
on behind the scenes in neural networks. Linear Algebra is the branch of mathematics concerning linear
equations such as y=mx+b and their representations through matrices and vector spaces. Because linear
Algebra concerns the representation of linear equations through matrices, a matrix is the fundamental idea
that you’ll need to know to even begin understanding the core part of neural networks. A matrix is a
rectangular array consisting of numbers, symbols, or expressions, arranged in rows and columns. Matrices are
described in the following fashion: rows by columns. For example, a matrix with three rows and three columns is called a 3 by 3 matrix. When dealing with neural networks, each
feature is represented as an input neuron. Each numerical value of the feature is multiplied by a weight vector to produce the value of the output neuron. Mathematically, the process is written like this:
output = f(w1·x1 + w2·x2 + … + wn·xn + b), where the x's are the input values, the w's are the weights, b is a bias term and f is the activation function.
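A minimal sketch of this weighted-sum-plus-activation step for a single neuron, assuming NumPy; the inputs, weights and bias below are hypothetical.

    import numpy as np

    # Hypothetical single neuron: three input features, one weight per input, a bias.
    x = np.array([0.5, 1.0, 2.0])        # input features
    w = np.array([0.4, -0.2, 0.1])       # weights
    b = 0.3                              # bias term

    # Weighted sum followed by a sigmoid activation, matching the expression above.
    z = np.dot(w, x) + b
    output = 1.0 / (1.0 + np.exp(-z))
    print(round(float(output), 4))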
K-Means Clustering: The K Means Clustering algorithm is a type of unsupervised machine learning, which is
used to categorize unlabeled data, i. e. data without defined categories or groups. The algorithm works by
finding groups within the data, with the number of groups represented by the variable k. It then iterates
through the data to assign each data point to one of k groups based on the features provided. K-means
clustering relies on the notion of distance throughout the algorithm to “assign” data points to a cluster. If
you’re not familiar with the notion of distance, it refers to the amount of space between two given items.
Within mathematics, any function that describes the distance between any two elements of a set is called a
distance function or metric. There are two types of metrics: the Euclidean metric and the taxicab metric. The
Euclidean metric is defined as follows:
d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
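A minimal sketch of the Euclidean distance and of the k-means assignment step, with hypothetical points and centroids:

    import numpy as np

    # Euclidean distance between two points, as defined above.
    def euclidean(p, q):
        return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

    # Assignment step of k-means with k = 2: each point is assigned to the
    # nearest centroid.
    centroids = [np.array([1.0, 1.0]), np.array([6.0, 6.0])]
    points = [np.array([1.5, 2.0]), np.array([5.0, 6.5]), np.array([2.0, 1.0])]

    for p in points:
        distances = [euclidean(p, c) for c in centroids]
        print(p, "-> cluster", int(np.argmin(distances)))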
Decision Tree: A decision tree is a flow-chart-like tree structure that uses a branching method to illustrate
every possible outcome of a decision. Each node within the tree represents a test on a specific variable - and
each branch is the outcome of that test. Decision trees rely on a theory called information theory to determine
how they’re constructed. In information theory, the more one knows about a topic, the less new information
one can know. One of the key measures in information theory is known as entropy. Entropy is a measure
which quantifies the amount of uncertainty for a given variable. Entropy can be written like this:
H(X) = − Σ p(x) log2 p(x), where the sum runs over the possible values x of the variable and p(x) is the probability of each value.
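A minimal sketch of the entropy calculation for a discrete variable, using base-2 logarithms so the result is in bits:

    import math

    def entropy(probabilities):
        # H = -sum(p * log2(p)) over the values with non-zero probability.
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally uncertain
    print(entropy([0.9, 0.1]))    # about 0.469 bits: much less uncertainty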
Q. Write a note describing the importance of statistics in data analysis.
Ans. It is important to understand the ideas behind the various techniques, in order to know how and when
to use them. One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
It is important to accurately assess the performance of a method, to know how well or how badly it is
working. Additionally, this is an exciting research area, having important applications in science, industry,
and finance. Ultimately, statistical learning is a fundamental ingredient in the training of a modern data
scientist. Examples of Statistical Learning problems include:
• Identify the risk factors for prostate cancer.
• Classify a recorded phoneme based on a log-periodogram.
• Predict whether someone will have a heart attack on the basis of demographic, diet and clinical
measurements.
• Establish the relationship between salary and demographic variables in population survey data.
1. Linear Regression: In statistics, linear regression is a method to predict a target variable by fitting the best
linear relationship between the dependent and independent variable. The best fit is done by making sure that
the sum of all the distances between the shape and the actual observations at each point is as small as
possible. The fit of the shape is “best” in the sense that no other position would produce less error given the
choice of shape. Two major types of linear regression are Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression uses a single independent variable to predict a dependent variable by fitting a best linear relationship. Multiple Linear Regression uses more than one independent variable to predict a dependent variable by fitting a best linear relationship.
2. Classification: Classification is a data mining technique that assigns categories to a collection of data in order to aid in more accurate predictions and analysis. Also sometimes called a decision tree, classification is one of several methods intended to make the analysis of very large datasets effective. Two major classification techniques stand out: Logistic Regression and Discriminant Analysis.
Logistic Regression is the appropriate regression analysis to conduct when the dependent variable is
dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic
regression is used to describe data and to explain the relationship between one dependent binary variable and
one or more nominal, ordinal, interval or ratio-level independent variables. Types of questions that a logistic
regression can examine:
• How does the probability of getting lung cancer (Yes vs No) change for every additional pound of
overweight and for every pack of cigarettes smoked per day?
• Do body weight, calorie intake, fat intake, and participant age have an influence on heart attacks (Yes vs
No)?
In Discriminant Analysis, 2 or more groups or clusters or populations are known a priori and 1 or more new
observations are classified into 1 of the known populations based on the measured characteristics.
Discriminant analysis models the distribution of the predictors X separately in each of the response classes,
and then uses Bayes’ theorem to flip these around into estimates for the probability of the response category
given the value of X. Such models can either be linear or quadratic.
• Linear Discriminant Analysis computes “discriminant scores” for each observation to classify what
response variable class it is in. These scores are obtained by finding linear combinations of the
independent variables. It assumes that the observations within each class are drawn from a
multivariate Gaussian distribution and that the covariance of the predictor variables is common across all levels of the response.
• Quadratic Discriminant Analysis also assumes that the observations within each class are Gaussian, but they are not assumed to have common variance across each of the k levels in Y.
3. Resampling Methods: Resampling is the method that consists of drawing repeated samples from the
original data samples. It is a non-parametric method of statistical inference. In other words, the method of
resampling does not involve the utilization of the generic distribution tables in order to compute approximate
probability values. Resampling generates a unique sampling distribution on the basis of the actual data. It
uses experimental methods, rather than analytical methods, to generate the unique sampling distribution. It
yields unbiased estimates as it is based on the unbiased samples of all the possible results of the data studied
by the researcher. In order to understand the concept of resampling, you should understand the terms
Bootstrapping and Cross-Validation:
• Bootstrapping is a technique that helps in many situations, such as validation of predictive model performance, ensemble methods, and estimation of the bias and variance of a model. It works by sampling with replacement from the original data, taking the “not chosen” data points as test cases. We can do this several times and calculate the average score as an estimate of our model performance.
• On the other hand, cross-validation is a technique for validating model performance, and it is done by splitting the training data into k parts. We take k − 1 parts as our training set and use the “held out” part as our test set. We repeat that k times, holding out a different part each time. Finally, we take the average of the k scores as our performance estimate (a minimal sketch follows this list).
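A minimal sketch of k-fold cross-validation, assuming scikit-learn is available and using a small hypothetical linear dataset:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Hypothetical data: one predictor and a roughly linear response.
    X = np.arange(20, dtype=float).reshape(-1, 1)
    y = 3.0 * X.ravel() + np.random.default_rng(0).normal(0.0, 1.0, 20)

    # 5-fold cross-validation: train on 4 parts, score on the held-out part,
    # repeat 5 times and average the scores.
    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print(scores.mean())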
Usually for linear models, ordinary least squares is the major criterion considered for fitting them to the data.
The next 3 methods are the alternative approaches that can provide better prediction accuracy and model
interpretability for fitting linear models.
4. Subset Selection: This approach identifies a subset of the p predictors that we believe to be related to the
k
response. We then fit a model using least squares on the subset of features.
99 a
h
ic
hr
S
• Best-Subset Selection: Here we fit a separate OLS regression for each possible combination of
the p predictors and then look at the resulting model fits. The algorithm is broken up into 2 stages: (1)
Fit all models that contain k predictors, where k is the max length of the models, (2) Select a single
model using cross-validated prediction error. It is important to use testing or validation error, and not
training error to assess model fit because RSS and R² monotonically increase with more variables. The
best approach is to cross-validate and choose the model with the highest R² and lowest RSS on testing
error estimates.
• Forward Stepwise Selection considers a much smaller subset of the p predictors. It begins with a model containing no predictors, then adds predictors to the model one at a time until all of the predictors are in the model. At each step, the variable added is the one that gives the greatest additional improvement to the fit, and variables are added until no further variable improves model fit according to cross-validated prediction error.
• Backward Stepwise Selection begins with all p predictors in the model, then iteratively removes the least useful predictor one at a time.
• Hybrid Methods follow the forward stepwise approach; however, after adding each new variable,
the method may also remove variables that do not contribute to the model fit.
5. Shrinkage: This approach fits a model involving all p predictors, however, the estimated coefficients are
shrunken towards zero relative to the least squares estimates. This shrinkage, aka regularization, has the effect
of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be
estimated to be exactly zero. Thus this method also performs variable selection. The two best-known
techniques for shrinking the coefficient estimates towards zero are the ridge regression and the lasso.
• Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Ridge regression, like OLS, seeks coefficient estimates that reduce RSS; however, it also adds a shrinkage penalty on the size of the coefficients. This penalty has the effect of shrinking the coefficient estimates towards zero. Without going into the math, it is useful to know that ridge regression shrinks the features with the smallest column space variance. Like principal component analysis, ridge regression projects the data into d-directional space and then shrinks the coefficients of the low-variance components more than the high-variance components, which correspond to the smallest and largest principal components respectively.
• Ridge regression has at least one disadvantage: it includes all p predictors in the final model. The penalty term will set many of them close to zero, but never exactly to zero. This isn't generally a problem for prediction accuracy, but it can make the model's results more difficult to interpret. Lasso overcomes this disadvantage and is capable of forcing some of the coefficients to zero, provided that s is small enough. Since s = 1 results in regular OLS regression, as s approaches 0 the coefficients shrink towards zero. Thus, lasso regression also performs variable selection.
6. Dimension Reduction: Dimension reduction reduces the problem of estimating p + 1 coefficients to the
simple problem of M + 1 coefficients, where M < p. This is attained by computing M different linear
combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear
regression model by least squares. Two approaches for this task are principal component regression and partial
least squares.
• One can describe Principal Components Regression as an approach for deriving a low-dimensional
set of features from a large set of variables. The first principal component direction of the data is the one along
which the observations vary the most. In other words, the first PC is a line that fits as close as possible
to the data. One can fit p distinct principal components. The second PC is a linear combination of the
variables that is uncorrelated with the first PC, and has the largest variance subject to this constraint.
The idea is that the principal components capture the most variance in the data using linear
combinations of the data in subsequently orthogonal directions. In this way, we can also combine the
effects of correlated variables to get more information out of the available data, whereas in regular
least squares we would have to discard one of the correlated variables.
• The PCR method that we described above involves identifying linear combinations of X that best
represent the predictors. These combinations (directions) are identified in an unsupervised way, since
the response Y is not used to help determine the principal component directions. That is, the
response Y does not supervise the identification of the principal components, thus there is no
guarantee that the directions that best explain the predictors also are the best for predicting the
response (even though that is often assumed). Partial least squares (PLS) is a supervised alternative
to PCR. Like PCR, PLS is a dimension reduction method, which first identifies a new smaller set of
features that are linear combinations of the original features, then fits a linear model via least squares
to the new M features. Yet, unlike PCR, PLS makes use of the response variable in order to identify
the new features.
7. Nonlinear Models: In statistics, nonlinear regression is a form of regression analysis in which observational
data are modeled by a function which is a nonlinear combination of the model parameters and depends on
one or more independent variables. The data are fitted by a method of successive approximations. Below are a
couple of important techniques to deal with nonlinear models:
• A function on the real numbers is called a step function if it can be written as a finite linear
combination of indicator functions of intervals. Informally speaking, a step function is a piecewise
constant function having only finitely many pieces.
• A piecewise function is a function which is defined by multiple sub-functions, each sub-function
applying to a certain interval of the main function’s domain. Piecewise is actually a way of expressing
60 co
the function, rather than a characteristic of the function itself, but with additional qualification, it can
describe the nature of the function. For example, a piecewise polynomial function is a function that is
a polynomial on each of its sub-domains, but possibly a different one on each.
• A spline is a special function defined piecewise by polynomials. In computer graphics, spline refers
to a piecewise polynomial parametric curve. Splines are popular curves because of the simplicity of
their construction, their ease and accuracy of evaluation, and their capacity to approximate complex
shapes through curve fitting and interactive curve design.
• A generalized additive model is a generalized linear model in which the linear predictor depends
linearly on unknown smooth functions of some predictor variables, and interest focuses on inference
about these smooth functions.
8. Tree-Based Methods: Tree-based methods can be used for both regression and classification problems.
These involve stratifying or segmenting the predictor space into a number of simple regions. Since the set of
splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are
known as decision-tree methods. The methods below grow multiple trees which are then combined to yield a
single consensus prediction.
• Bagging is a way to decrease the variance of your prediction by generating additional data for training from your original dataset, using combinations with repetitions to produce multisets of the same cardinality/size as your original data. By increasing the size of your training set in this way you can't improve the model's predictive force, but you can decrease the variance, narrowly tuning the prediction to the expected outcome.
• Boosting is an approach to calculate the output using several different models and then average the
result using a weighted average approach. By combining the advantages and pitfalls of these
approaches by varying your weighting formula you can come up with a good predictive force for a
wider range of input data, using different narrowly tuned models.
The random forest algorithm is actually very similar to bagging. Also here, you draw random bootstrap
samples of your training set. However, in addition to the bootstrap samples, you also draw a random subset
of features for training the individual trees; in bagging, you give each tree the full set of features. Due to the
random feature selection, you make the trees more independent of each other compared to regular bagging,
which often results in better predictive performance (due to better variance-bias trade-offs) and it’s also faster,
because each tree learns only from a subset of features.
9. Support Vector Machines (SVM): SVM is a classification technique that is listed under supervised learning models in machine learning. In layman's terms, it involves finding the hyperplane (a line in 2D, a plane in 3D and a hyperplane in higher dimensions; more formally, a hyperplane is an n−1 dimensional subspace of an n-dimensional space) that best separates two classes of points with the maximum margin. Essentially, it is a constrained optimization problem where the margin is maximized subject to the constraint that it perfectly classifies the data (hard margin). The data points that “support” this hyperplane on either side are called the support vectors. For
cases where the two classes of data are not linearly separable, the points are projected to an exploded (higher
dimensional) space where linear separation may be possible. A problem involving multiple classes can be
broken down into multiple one-versus-one or one-versus-rest binary classification problems.
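A minimal sketch of a linear support vector classifier, assuming scikit-learn; the two-dimensional points below are hypothetical and linearly separable.

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical, linearly separable 2-D data with two classes.
    X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # A linear SVM; the support vectors are the points lying closest to the
    # separating hyperplane.
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)

    print(clf.support_vectors_)           # the points that define the margin
    print(clf.predict([[3, 2], [7, 6]]))  # classify two new points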
10. Unsupervised Learning: So far, we only have discussed supervised learning techniques, in which the
groups are known and the experience provided to the algorithm is the relationship between actual entities
and the group they belong to. Another set of techniques can be used when the groups (categories) of data are
not known. They are called unsupervised as it is left on the learning algorithm to figure out patterns in the
data provided. Clustering is an example of unsupervised learning in which different data sets are clustered
into groups of closely related items. Below is the list of most widely used unsupervised learning algorithms:
• Principal Component Analysis helps in producing a low-dimensional representation of the dataset by identifying a set of linear combinations of features which have maximum variance and are mutually uncorrelated. This linear dimensionality-reduction technique can be helpful in understanding latent interactions between the variables in an unsupervised setting (a minimal sketch follows this list).
• k-Means clustering: partitions data into k distinct clusters based on distance to the centroid of a
cluster.
• Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree.
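A minimal sketch of principal component analysis, assuming scikit-learn; the correlated features below are generated purely for illustration.

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data: 6 observations of 3 features, the first two strongly correlated.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(6, 1))
    X = np.hstack([base,
                   2.0 * base + rng.normal(scale=0.1, size=(6, 1)),
                   rng.normal(size=(6, 1))])

    # Project onto the two directions of maximum variance; the components are
    # mutually uncorrelated linear combinations of the original features.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(pca.explained_variance_ratio_)   # share of variance captured by each component
    print(X_reduced.shape)                 # (6, 2)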
This was a basic run-down of some basic statistical techniques that can help a data science program manager
and or executive have a better understanding of what is running underneath the hood of their data science
teams. Truthfully, some data science teams purely run algorithms through python and R libraries. Most of
them don’t even have to think about the math that is underlying. However, being able to understand the
basics of statistical analysis gives your teams a better approach. Having insight into the smallest parts allows for
easier manipulation and abstraction. I hope this basic data science statistical guide gives you a decent
understanding!
Ans. After having addressed the key components of the bottom-up, participatory approach, we will focus
now on gathering of information through the various participatory methods of data collection. Participatory
data collection, or research, is generally associated with qualitative methods of information gathering.
Qualitative methods in comparison to quantitative ones tend to be more concerned with words than numbers.
Qualitative methods are therefore based on data collection and analysis which focus on interpreting the
meaning of social phenomena based on the views of the participants of a particular social reality.
Participatory approaches contain a variety of data collection methods: (a) participatory listening and
observation; (b) visual tools such as maps, daily activity diagrams, institutional diagrams and Venn diagrams,
flow diagrams and livelihood analysis; (c) semistructured interviews; and (d) focus group discussions.
Among the participatory methods of evaluation, semi-structured interviews and focus groups are the most
often used instruments for gathering the views of participants on certain topics and issues. Participatory
listening and observation and various visual tools would normally be undertaken at the initial stages of the
evaluation process as they often provide the basis for the design of in-depth questionnaires for semi-
structured interviews and the conduct of focus groups.
While quantitative questionnaires are structured in the variety of answers that a respondent chooses from,
qualitative surveys and focus groups allow for more nuanced, semi-structured and open-ended responses.
The objective of qualitative designs is to capture values, attitudes and preferences of participants to permeate
the ‘how’ and the ‘why’ underlying a phenomenon. Since data resulting from qualitative research approaches
does not lend itself to numerical coding, evaluation of qualitative findings is more complex compared to
quantitative research results. Tables, rows of data, or correlations are therefore not generated by qualitative
research. Information has to be grouped under topical headings and generalized in its diversity.
To capture the full extent of a specific social reality, many research designs are based on a combination of
quantitative and qualitative methods. Data collated by quantitative research methods is rarely sufficient to
provide a full explanation of an observable social issue. Based on their experience, researchers have realized
the importance of integrating quantitative analysis with qualitative methods while trying to provide policy
makers with a comprehensive portrait of the socio-economic situation of various social groups. Such an
integrative approach would also be of use in reviewing and appraising the implementation of MIPAA.
a) Participatory listening and observation: Listening and observation skills are the basis for attaining a
comprehensive understanding of the situation of older persons in a particular community and to viewing
social reality through the eyes of older persons. These skills are of great use for any participatory research
60 co
design and should be applied for the duration of any project. Participatory listening and observation assumes
that “the participant observer/ethnographer immerses him- or herself in a group for an extended period of
time, observing behaviour, listening to what is said in conversations both between others and with the
fieldworkers,and asking questions”. It is therefore “a major research strategy which aims to gain a close and
intimate familiarity with a given area of study through intensive involvement with people in their natural
environment.” A bottom-up, participatory research project in a particular community may be started by
familiarizing oneself with the environment. This is usually done in a guided walk – or transect walk – that
often involves an individual or a group of people who would guide the researcher(s) through a community to
observe and talk about things of local importance. The organizational set-up of a community, the quality of
housing and the availability of social services for older persons can be studied on such a walk. As a result,
maps could be drawn reflecting the crucial local institutions that are relevant to older persons. With regard to
participatory listening it is important for the listener to ensure that his/her appearance and manner are
conducive to the research environment and are acceptable to the older persons themselves. Every person
should be encouraged to speak, and interest in what is said should be demonstrated at all times. Non-verbal
communication such as body language should be given due attention as well. The researcher(s) should seek
clarification if needed to understand correctly what an individual tries to express. Expressive or verbal
judgments of what older persons have said should be avoided.
Participatory observation complements the listening component. People or events should be observed at
different times of the day and at different days of the week to ensure that a balanced impression has been
gained. Observations and conversations should be written down in field notes as soon as possible since
human memory can be deceptive. Particular attention should be given to power relationships among older
persons, what roles various individuals play in the community, what activities and tasks are performed at
what frequency, and what issues engender excitement, irritation, agreement or disagreement among older
persons. Participatory observation and listening form the basis from which further and more complex
inquiries depart. What has been observed and heard is often the starting point for semi-structured interviews
and focus group discussions during which observations can be checked and clarified in interview questions to
determine whether the researcher has accurately interpreted what he/she has seen and heard.
b) Visual tools: “Visual tools – such as maps, diagrams, seasonal calendars and daily activity charts – are
important elements of participatory research. They enable older people to explore complex relationships and
link issues in ways not possible through verbal methods alone, generating a deeper analysis of local issues. ”
A common participatory approach in visualizing is to draw figures, maps and diagrams and/or to use tools
such as stones, sticks or other objects to demonstrate the layout of a particular community. One of the
advantages of using visual tools is that illiterate members of a community would also have the opportunity to
participate in the evaluation exercises, so that a balanced representation of older persons within the
community could be ensured. That means that older persons from various socio-economic strata and from
different geographic areas of the community should have the opportunity to participate. Age and sex
distribution should be accurately represented as well.
Maps can be informative tools showing characteristics of a location, where evaluation of MIPAA is being
undertaken. HelpAge International distinguishes between resource maps and mobility maps. The former
show where (older) people live as well as the general infrastructure of a community, while the latter outline
movements within a community. In addition to these two methods, body maps could be an important source of information about the health status of older persons, which could be depicted on a large map of the human
body. However, body mapping should be approached with utmost sensitivity. Although there would be a
general introduction to the mapping exercise by the researcher(s), the mapping itself should be conducted by
people living in the location of evaluation and the evaluation team shall not interfere during the mapping
activity. Since different groups of older persons would be asked to participate in the mapping exercise, it
might be expected that different maps would highlight the different perceptions within a community. The
mapping exercise should also include an inquiry about historical changes of a community that could be
reflected in mapping as well. To understand how members of a community spend their time, daily activity
diagrams are helpful. Daily work patterns and other activities of older persons could be recorded with the
assistance of such a method by using little stones that would symbolize time spent on particular activities. Of
special interest would be gender differences with regard to time use as well as how much older persons
contribute to household and community activities. In addition, changes of time use can be demonstrated by
inviting older persons to reflect on their whole lives and how much their daily activities have varied over the
course of time, discerning trend lines and creating historical profiles. Caution is in order when asking
participants about their (extended) past, since human memory can be very deceptive.
Similarly, institutional diagrams would illustrate key institutions and individuals within the community. By
drawing rectangles of different sizes, older persons would demonstrate the influence and power that certain
local institutions and individuals possess. Connections between institutional and individual power are of
interest to the researcher(s) as well: Venn diagrams are used to explain changes in relationships between
institutions, groups and individuals. With regard to Venn diagrams, the same procedure of using rectangles
of varying sizes should be utilized. The rectangles would represent different institutions (with the larger
rectangles representing institutions that play a more important role in the community). The distances among
the rectangles would represent the level of contact among various institutions. Overlapping of rectangles
would symbolize the extent to which the various parts of different institutions collaborate on particular
issues. An example of two overlapping institutions could be the local police force and the local government.
Since questions regarding power within a community are often sensitive, it may be prudent to engage in such exercises with particular care. Flow diagrams, a further visual tool, show the impact of a particular change, policy or programme on people’s lives, for example the impact of new health policy on older persons’ wellbeing.6
Events (problems, issues), their causes and effects can be visualized by lines of varying thickness expressing
their significance. They would also be used to identify the extent to which issues are interrelated. The opinions of participants on the effectiveness of policies can be captured in flow diagrams. Similarly, the effectiveness of policies affecting the lives of older persons can be ranked and scored on a matrix to establish which policies are viewed as successful or failing in delivering what was promised to older persons. In that sense, flow diagrams and ranking and scoring matrices would be promising tools for monitoring existing or future
policies and programmes specifically geared towards older persons. Livelihood analysis aims at learning
about people’s income (cash and in kind) and expenditure. It can also be seen as a participatory, economic
household analysis, since older persons would be asked to list how many household members reside where they live. Participants would draw three circles and divide the first according to sources of income, the second according to the kinds of expenditure on which the resources are spent, and the third according to how much of the available resources each household member spends (a simple way of tabulating such an exercise is sketched at the end of this subsection). The final maps, daily activity
diagrams, institutional diagrams and Venn diagrams, flow diagrams and outcome of the livelihood analysis
that have been created by various groups and individuals should be copied or photographed by the
evaluation team. The results will be valuable in influencing the design of semi-structured interviews and in
conducting focus group discussions since a rather diverse body of base information has been gathered by
visual tools. More focused in-depth data collection can follow once the listener has attained a more nuanced
understanding of a particular community and its older persons.
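Purely as an illustration of how an evaluation team might later tabulate the three livelihood-analysis circles described above, the short Python sketch below converts hypothetical income, expenditure and spending figures into percentage shares. The categories, the amounts and the helper function shares() are invented for demonstration only and are not drawn from any actual evaluation.

# Hypothetical tabulation of a livelihood-analysis exercise
# (invented figures; each dictionary mirrors one of the three circles)
income_sources = {"pension": 3000, "farm produce": 1500, "remittances": 500}
expenditure = {"food": 2200, "health": 1200, "housing": 800, "other": 800}
spending_by_member = {"older couple": 2600, "adult son": 1800, "grandchild": 600}

def shares(circle):
    """Return each category's share of its circle as a percentage."""
    total = sum(circle.values())
    return {name: round(100 * value / total, 1) for name, value in circle.items()}

for title, circle in [("Income sources", income_sources),
                      ("Expenditure", expenditure),
                      ("Spending by household member", spending_by_member)]:
    print(title, shares(circle))

Recording the circles in this simple numerical form makes it easier to compare the pictures produced by different groups before the semi-structured interviews are designed.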
c) Semi-structured interviews: “Semi-structured interviews – conversations based on a set of guideline
questions – are a key technique in participatory research, and a powerful way of learning about the views of
older people. ”7 Although all guideline questions will be asked during an interview – albeit with the
possibility of varying order – new questions may arise during each interview. Therefore, the interview
process is flexible compared to the rigidly structured interviews that we will turn to in the next section. This
kind of flexibility will allow the interviewee to describe events, observations and issues in very personal terms
and he/she will thus be less restricted and able to respond to questions in his/her own words. The set of questions,
however, will ensure comparability of data when the interviews are analyzed. The guideline questions of the
interview should be organized according to topical areas of inquiry that should succeed each other in a logical
fashion. The language used should be comprehensible and jargon free. It is obvious that the interviewer has to
be able to speak the language of the community in which he/she will conduct semi-structured interviews. An
ability to (a) ask short, simple and easy questions, (b) listen attentively, (c) steer the interview sensitively in the desired direction and (d) remember what was said earlier and correctly interpret the respondent’s statements during the interview is of paramount importance for the interviewer. Questions that would lead
the respondent in a particular direction (Do you agree that…. ?) should be avoided. At the outset of an
interview, it is important to select appropriate participants, to explain why the researcher(s) conduct this
interview, to record the interviewee’s name, age, gender and, importantly, whether the individual belongs to
certain community institutions, how large the residential household is and how the interviewee locates
him/herself within the community. Being outfitted with good quality recording equipment and making sure
that the interview location is quiet and private are practical issues that are important for successful
interviewing.
d) Focus group discussions: Focus group discussions are “a research strategy which involves intensive
discussion and interviewing of small groups of people, on a given ‘focus’ or issue, usually on a number of
occasions over a period of time.”10 The difference between individual semi-structured interviews and focus
group discussions is that the latter gives an opportunity to follow the group dynamic that evolves during the
discussion. How interviewees react to each other’s responses and form their opinions, often in reaction to what other participants have expressed, is of core interest during a focus group discussion. Since participants
may argue about certain aspects of an issue that is being discussed during a focus group, the reactions
expressed and opinions voiced may be more realistic compared to an individual interview. In addition, views
of participants can be challenged by others more profoundly than in a semi-structured interview. Thus, focus
group discussions ideally complement semi-structured individual interviews.
The moderator who facilitates the focus group should try not to be too intrusive and should rely on a rather
unstructured setting for the discussions to extract the opinions, views and perspectives of the participants.
He/she should have a rather small number of guiding questions to stir the discussion and should intervene
minimally. Only when the discussion veers clearly off track or when there are unproductive silences, should
the moderator get involved. The moderator should record the discussions on audio equipment and make
notes on the non-verbal behaviour of the participants. Naturally, the main interest would be on the range of
opinions expressed, who the opinion leaders are and how the participants express their views during a focus group discussion. As with semi-structured interviewing, it is not necessary to transcribe the entire discussion;
the focus should be on the most important parts of a focus group to document what was said. Evaluation of
living conditions of older persons, for methodological reasons, should be based on numerous focus group
discussions. 11 There is no clear guide on how many discussions on a particular topic are sufficient, but in
case of measuring the quality of life of older persons, it seems that a more limited number of discussions would be in order, since only older persons would participate, compared to a sample reflecting the entire society. If a starting hypothesis exists (i.e. that the income of older persons has decreased due to pension scheme reforms), ‘theoretical saturation’ could be applied here as well: if the evaluation team repeatedly hears similar or identical responses in focus group discussions, it can conclude that there is no further need to continue with more discussions. The size of each focus group should range between six and ten participants to allow every
speaker enough time to express him/herself. The participants shall be selected randomly on a variety of
characteristics: older age (60 and above) being the most obvious, but also based on differences of educational
attainment, income and occupation, marital status and sex. Since participatory research on views of older
persons will be organized within a community or locale, it is evident that many of the participants in focus
group discussions will know each other in advance. It is recommended to start a focus group by thanking the
participants for taking part in the discussion, by explaining the evaluation purpose and design, and the
reasons for recording the session. In addition, anonymity during evaluation should be assured and certain
conventions (e. g. only one speaker at a time) of focus group discussions should be outlined. Forms could be
filled out that would provide the evaluation team with general socio-economic (educational attainment,
occupation) and demographic (age, sex) data of the participants. Thereafter, participants would introduce
themselves to the group and attach name tags. A free flow of discussion topics should be facilitated by the
moderator using a set of guided questions. Every participant should have the opportunity to express his/her opinion uninterrupted, and quieter participants should be encouraged to speak as well. As with the semi-structured interviews, the language used by the moderator should be clear and jargon-free. In addition, the guided questions should be relevant to the group assembled. Thoughtful questions
would engender a lively debate and avoid replies such as ‘yes’ or ‘no’ from the participants. A successful
focus group discussion would allow the moderator to see the debated issues through the eyes of the
participants and to glean a much deeper understanding of issues concerning the lives of older persons.
Ans. There are three major issues which may be faced in the construction of index numbers. They are: 1)
Collection of Data; 2) Selection of Base Year and 3) Selection of Appropriate Index. Let us discuss them in
detail:
1) Collection of Data: Data collection through a sample method is one of the issues in the construction of
index numbers. The data has to be as reliable, adequate, accurate, comparable, and representative as possible.
Here a large number of questions need to be answered. The answers ultimately depend on the purpose and
individual judgement. For example, one needs to decide the following:
I. Identification of Commodities to be Included: How many and which categories of commodities should be included? A large number of items may be present, and it is not possible to include all of them; only those items which would make the index more representative deserve to be included in its construction. For example, if we are required to construct an index for shares on the Bombay Stock Exchange, where several shares are listed and traded, it is not possible to include all of them.
Therefore, it has to be decided which sample of shares (maybe 30 or 40) should represent the general movement of share prices on the Bombay Stock Exchange. It is worthwhile to note that the selection of items must be deliberate and in keeping with the relevance and significance of each individual item to the purpose for which the index is constructed.
II. Sources of Data: From where to collect data? It is an important and difficult issue. The source
depends on the information requirement. For example, one may need to collect prices and quantities
consumed related to certain commodities for a consumer price index. However, there may be a large
number of retailers and wholesalers, selling the commodities, and quoting different prices. To get the
details, only a few representative shops (which represent the typical purchasing points of the people
under question) need to be selected. Thus, based on a representative sample survey, the sources chosen should be those from which accurate, adequate, and timely data can be obtained.
III. Timings of Data Collection: It is equally important to collect the data at an appropriate time. Referring to the example of the consumer price index, prices are likely to vary on different days of the month. For certain commodities, prices may vary at different times of the same day. For example, vegetable prices are usually high in the morning when fresh vegetables arrive and are low in the late
evening when sellers are closing for the day and wish to clear the perishable stock. For each
commodity, individual judgement needs to be exercised to represent reality and to serve the purpose
for which an index is to be used.
2) Selection of Base Year: A base period is the reference period for comparing and analysing the changes in prices or quantities in a given period. For many index number series, the value of a particular time period, usually a year, is taken as the reference period against which all subsequent index numbers in the series are expressed. In some cases, the average of more than one period may be taken as the base. In yet other cases, we may be required to compare one index number series against another series. In such a context, a ‘base’ common to all series is more appropriate.
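To illustrate with purely hypothetical figures (not taken from the study material): if a commodity cost Rs. 50 in the base year and Rs. 60 in the current year, its price relative would be \( I_t = \frac{p_t}{p_0} \times 100 = \frac{60}{50} \times 100 = 120 \). The base period itself always takes the value 100, so an index of 120 signals a rise of 20 per cent over the base period.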
In the light of the above considerations, therefore, the period/year selected as the base period/year must be a ‘normal’ period. A normal period is a period with price or quantity figures that are neither too low nor too high. It should not have been affected by abnormal occurrences, such as floods (if interested in agricultural production), wars, sudden recession, etc. What is normal should also be decided keeping in view the purpose for which the index is constructed.
3) Selection of Appropriate Index: Whether to use an unweighted or weighted index is a difficult question to answer. It depends on the purpose
for which the index number is required to be used. For example, if we are interested in an index for the purpose of negotiating wages or compensating for a price rise, only a weighted index would be worthwhile to use. Which weights are to be used? Whether base-year quantities, current-year quantities or some other weights are to be used is an important question to answer. Weights which realistically reflect the relative importance of the items included in the construction of an index are perhaps the only answer. The purpose for which an index is needed will, of course, remain a vital factor to reckon with.
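As a rough illustration of the weighting question discussed above, the Python sketch below computes a simple (unweighted) aggregative price index alongside a Laspeyres index (base-year quantities as weights) and a Paasche index (current-year quantities as weights) for a small basket. The commodities, prices and quantities are invented for demonstration only and are not taken from the study material.

# Illustrative comparison of unweighted and weighted price indices
# (hypothetical data: base-year price p0 and quantity q0,
#  current-year price p1 and quantity q1 for each commodity)
commodities = {
    "rice":  (40.0, 10.0, 48.0, 9.0),
    "cloth": (120.0, 2.0, 150.0, 2.0),
    "fuel":  (70.0, 5.0, 91.0, 4.0),
}

def simple_aggregative(data):
    """Unweighted index: sum of current prices over sum of base prices, times 100."""
    return 100 * sum(p1 for p0, q0, p1, q1 in data.values()) / \
           sum(p0 for p0, q0, p1, q1 in data.values())

def laspeyres(data):
    """Base-year weighted index: sum(p1*q0) / sum(p0*q0), times 100."""
    return 100 * sum(p1 * q0 for p0, q0, p1, q1 in data.values()) / \
           sum(p0 * q0 for p0, q0, p1, q1 in data.values())

def paasche(data):
    """Current-year weighted index: sum(p1*q1) / sum(p0*q1), times 100."""
    return 100 * sum(p1 * q1 for p0, q0, p1, q1 in data.values()) / \
           sum(p0 * q1 for p0, q0, p1, q1 in data.values())

print(f"Simple aggregative index: {simple_aggregative(commodities):.1f}")
print(f"Laspeyres index (base-year weights): {laspeyres(commodities):.1f}")
print(f"Paasche index (current-year weights): {paasche(commodities):.1f}")

The Laspeyres index asks how much the base-year basket would cost at current prices, while the Paasche index prices the basket actually bought in the current year; the gap between the two figures shows how much the choice of weights can matter.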