Perspective on Data Science

Roger D. Peng1 and Hilary S. Parker2

1 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA; email: rdpeng@jhu.edu
2 Independent Consultant, San Francisco, California 94102, USA
1. INTRODUCTION
Data science is perhaps one of the most generic and vaguely defined fields of study to have come
about in the past 50 years. In a recent paper titled “50 Years of Data Science,” David Donoho
(2017) cites a number of definitions that ultimately could be interpreted to include essentially
any scientific activity. To the extent that there is a common theme to the various definitions of
data science, it is the decoupling of operational activity on data from the scientific, policy, or
business considerations that may surround such activity. The key recent development is that this
operational activity has grown significantly more complex with the increase in dataset sizes and the
growth in computational power (Peng 2011). It is therefore valuable to consider what distinguishes
data science as a field of study and whether people who consider themselves members of the field
share anything in common or even agree on the definition of the field.
Equal in difficulty to the task of defining the field of data science is defining what it is that
data scientists do. The University of California–Berkeley School of Information’s website ti-
tled “What is Data Science?” (https://ischoolonline.berkeley.edu/data-science/what-is-data-
science/) states that
Data scientists examine which questions need answering and where to find the related data. They have
business acumen and analytical skills as well as the ability to mine, clean, and present data. Businesses use
data scientists to source, manage, and analyze large amounts of unstructured data. Results are then syn-
thesized and communicated to key stakeholders to drive strategic decision-making in the organization.
Such a definition is not uncommon in our experience and encompasses a wide range of possible
activities that require a diverse set of skills from an individual.
The common theme in descriptions of the job of the data scientist is a kind of beginning-to-end
narrative, whereby data scientists have a hand in many, if not all, aspects of a process that involves
data. The only aspects in which they are not involved are the choice of the question itself and the
decision that is ultimately made upon seeing the results. In fact, based on our experience, many
real-world situations draw the data scientist into participating in those activities as well.
Having a vague and generic definition for a field of study, especially one as newly formed
as data science, can provide security and other advantages in the short term. Drawing as many
people as possible into a field can build strength and momentum that might subsequently lead to
greater resources. It would seem premature to narrowly define a field in the early stages and risk
excluding individuals who might make useful contributions. However, retaining such a vague defi-
nition of a field introduces challenges that ultimately limit progress. Over time, the ever-changing
and quickly advancing nature of the field leads to a greater number of activities and tools being
included in the field, making it increasingly difficult to define the core elements of the field. A
persistent temptation to define the field as simply the union of all activities by members of the
field can lead to a kind of field-specific entropy.
The inclusion of a large number of activities into the definition of a field incurs little cost until
one is confronted with teaching the field to newcomers (Kross & Guo 2019). Students, or learners
of any sort, arrive with limited time and resources and typically need to learn the essentials, or
core, of the field. What exactly composes this core set of elements can be greatly influenced by
the instructor’s personal experience in the field of data science, which is unsurprising given the
heterogeneity of people included in the field. The result is that different individuals can end up
telling very different stories about what makes up the core of data science (Wing 2020). Computer
scientists, statisticians, engineers, information scientists, and mathematicians (to name but a few)
will all focus on their specific perspective and relay their experience of what is important. Such
a fracturing of the teaching of data science suggests that there is little about data science that is
generalizable and that material should essentially be taught on a need-to-know or just-in-time
basis. In that case, there is little rationale for a standardized formal education in data science.
It would be naive to think that the definition of a field ever becomes stable or that members
of a field ever reach agreement on its definition. In fact, it is healthy for members of any field to
question the fundamental core of the field itself and consider whether anything should be added
or subtracted. In a 1962 paper in the Annals of Mathematical Statistics, John Tukey debated whether
data analysis was a part of statistics or a field unto its own (Tukey 1962). More recently, software
engineering and computer science topics have been added to numerous statistics curricula as a
fundamental aspect of the field (Nolan & Temple Lang 2010). No single paper or review will
settle the matter of what makes the core of the field, but constant discussion and iteration have
their own intellectual benefits and ensure that a field does not stagnate to the point of irrelevance.
At this point, it may be worthwhile to examine the long list of data science activities, tools,
skills, and job requirements and determine if there exist any common elements—things that all
data scientists do or tools that all data scientists use. The extent to which we can draw anything
out of such an exercise will give us some indication of whether a core of data science exists and
where its boundaries may lie.
Figure 1
Data analytic triangle (elements include the truth, statistical theory, the state of scientific theory, and the expected outcome). Data analyses attempt to discover the truth by iterating between observed data analytic output and expected outcomes. Explaining any deviation between the expected outcome and the observed output is a key job of the data scientist.
The difference between the truth and one’s data analytic output can typically be explained
by statistical theory, which explicitly accounts for random deviations in the data. The difference
between the truth and what one expects to see from an analysis can be explained by our under-
standing of the scientific theory underlying the phenomenon we are studying. If we have little
understanding of the phenomenon, then our expectations may be quite diffuse or far from the
truth and there might be little that surprises us in the data. If our understanding is thought to be
quite good, then our expectations might be more narrow and perhaps closer to the truth.
How can we explain the difference between the data analytic output and our expectations for
what we should observe? We would argue that answering this question is a fundamental task for the
data scientist. It presents itself in almost every data analysis, large or small, and appears repeatedly.
Given that we generally do not observe the truth, the output and the expected outcome are all we
have to work with, and juggling these two quantities in our minds is a constant challenge. If the
data analytic output is close to our expectations, so that the output is as-expected, we might be
inclined to move on to the next step in our analysis. If the output is far from our expectation, so
that the output is unexpected, then we might be inclined to pause and investigate the cause of this
unexpected outcome.
2.1.1. Example: estimating pollution levels in a city. Consider a study designed to estimate
ambient particulate matter air pollution levels in a city. A simple initial approach might involve
deploying a single pollution sensor outdoors that collects hourly measurements of pollution. In
order to estimate the daily average level, we could take measurements from a single day’s sam-
pling period and average over the 24 hourly measurements. We know that in the United States,
the national ambient air quality standard for daily average fine particulate matter is 35 µg/m3 .
Therefore, we might expect that the measurement of the daily average that we take should be less
than 35 µg/m3 .
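To make this comparison concrete, a minimal R sketch of the check might look like the following; the hourly values here are invented purely for illustration:

    # Hypothetical vector of 24 hourly PM2.5 measurements (ug/m3) from one outdoor sensor
    hourly_pm <- c(28, 31, 35, 40, 33, 29, 27, 30, 36, 41, 38, 34,
                   32, 30, 29, 31, 37, 42, 39, 35, 33, 30, 28, 27)

    daily_avg <- mean(hourly_pm)   # observed data analytic output
    expected_max <- 35             # expectation based on the US 24-hour standard

    if (daily_avg > expected_max) {
      message("Unexpected: daily average of ", round(daily_avg, 1),
              " ug/m3 exceeds ", expected_max, " ug/m3")
    } else {
      message("As expected: daily average of ", round(daily_avg, 1), " ug/m3")
    }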
Now suppose we take our measurement and discover that the daily average is 65 µg/m3 , which
is far higher than what we would expect. In this situation, we might be highly motivated to ex-
plain why this deviation has occurred, and there may be several possible explanations: (a) Perhaps
our interpretation of the ambient air quality standards was wrong and values over 35 µg/m3 are
permissible; (b) there may be a bias in the sensor that causes values to generally read high; or
(c) we may have removed measurements below the detection limit before computing the mean,
biasing our calculation upwards. Each of these possible explanations represents an error in the
way we think about our expectation, our measurement technology, or our statistical computation,
respectively.
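Explanation (c) is easy to demonstrate with a small simulation; the distribution and detection limit below are invented for illustration:

    set.seed(1)
    # Simulate 24 hourly concentrations with a median near 30 ug/m3
    true_hourly <- rlnorm(24, meanlog = log(30), sdlog = 0.8)

    detection_limit <- 20
    kept <- true_hourly[true_hourly >= detection_limit]   # drop values below the limit

    mean(true_hourly)   # mean computed from all measurements
    mean(kept)          # mean after dropping low values is biased upward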
We could also consider the possibility that we observe a daily average that is 20 µg/m3 . Such
an observation would be well within our expectation, and we might be inclined to do nothing in
response, but should we? Even with results that are as-expected, it might be wise to ask: If
something had in fact gone wrong or failed, how would that occur? Perhaps the sensor
was placed in a manner that restricted airflow, reducing the number of particles it could detect. Or
perhaps our software for reading in the data erroneously eliminated large values as implausible.
In this example, regardless of whether the results are as-expected or unexpected, we need to
reconcile what we observe with our understanding of the systems that produced the data and our
expectations. Problems may lie in our knowledge of the domain, the technology we deploy to
collect data, and the analytic tools that we use to produce results.
2.1.2. Reconciling unexpected and as-expected results. Tukey noted the distinction between
as-expected and unexpected results and further commented that deciding where to draw the line
between the two was a matter of judgment. In a section describing the choice of a cutoff value for
identifying wild shots or spotty data, he writes,
The choice [of cutoff] is going to be a matter of judgment. And will ever remain so, as far as statistical
theory and probability calculations go. For it is not selected to deal with uncertainties due to sampling.
Instead it is to be used to classify situations in the real world in terms of how it is “a good risk” for us
to think about them. It is quite conceivable that empirical study of many actual situations could help us
to choose. . .but we must remember that the best [choice] would be different in different real worlds.
(Tukey 1962, p. 47; emphasis in original)
Determining what is expected from any data analysis and what is unexpected will generally be
a matter of judgment, which will change and evolve over time as experience is gained. Tukey ulti-
mately classifies the data points in his case study into three categories: in need of special attention,
in need of some special attention, and undistinguished (Tukey 1962).
Reconciling the observed output with the expected outcome is an aspect of what Grolemund &
Wickham (2014) call a sense-making process, where we update our schema for understanding the
world based on observed data. However, rather than take the data for granted and blindly update
our understanding of the world based on what we observe, a key part of the data scientist’s job is
to investigate the cause of any deviations from our expectations and to provide an explanation for
what was observed. Should the output be as-expected, it is equally important for a data scientist
to consider what might have gone wrong in order to identify any faulty assumptions or logic.
With either unexpected or as-expected output, the data scientist must interrogate the systems
that generated both the output and our expectations in order to provide useful explanations for
the observed results of the analysis.
Framing a data analysis as the development of systems is a useful generalization that has some interesting downstream implications.
This framing provides a rationale for applying design thinking principles to data analysis devel-
opment (Section 2.3) and offers a formal framework for interpreting results that are unexpected.
2.2.1. Data analysis systems development. The data analytic system is typically the one most
under the control of the data scientist. This system consists of a series of connected components,
methods, tools, and algorithms that together generate the data analytic output. The system may
have various subsystems that themselves branch off into methods, tools, or algorithms. For exam-
ple, a simple data cleaning subsystem might involve reading in a comma-separated value (CSV)
file, removing rows that contain not available (NA) values, converting text entries into numeric
values, and then returning the cleaned dataset. The input to this subsystem is a CSV file, and
the output is the cleaned dataset. A second summary statistics subsystem might take the cleaned
dataset and return a table containing the mean and standard deviation of each column.
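A minimal R sketch of these two subsystems, with a hypothetical input file and purely illustrative column handling, might look like this:

    # Data cleaning subsystem: CSV file in, cleaned data frame out
    clean_data <- function(path) {
      raw <- read.csv(path, stringsAsFactors = FALSE)
      no_na <- raw[complete.cases(raw), ]   # remove rows that contain NA values
      # coerce text entries to numeric values (non-numeric text becomes NA)
      cleaned <- as.data.frame(lapply(no_na, as.numeric))
      cleaned
    }

    # Summary statistics subsystem: cleaned data frame in, summary table out
    summarize_data <- function(cleaned) {
      data.frame(column = names(cleaned),
                 mean = sapply(cleaned, mean),
                 sd = sapply(cleaned, sd))
    }

    # cleaned <- clean_data("sensor_data.csv")   # hypothetical input file
    # summarize_data(cleaned)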
A system that operates in parallel with the data analytic system is what we refer to as the scien-
tific system. This system is built by those knowledgeable of the underlying science and summarizes
prior work or preliminary data relevant to the question at hand. The product of the scientific sys-
tem is some summary of evidence or background information that can be used in conjunction
with the design of the data analytic system to predict the output of the data analytic system. The
scientific system can be built by the analyst if the analyst is knowledgeable in the area. Otherwise,
a collaboration can be developed to build a scientific system whose output can be given to the
analyst.
The third system is the software implementation of the data analytic system. Here the analyst
chooses what specific software toolchains, programming languages, and other computational en-
vironments will be used to produce the output. Some components may be written by others (and
perhaps accessed via application programming interfaces), while others may need to be written by
the analyst from scratch. The development of the software system may be dictated by the work
environment of the analyst. For example, an organization may have previously decided to only use
an existing programming language, toolchain, or workflow, limiting the options available to the
analyst (Parker 2017).
One important issue that we have left out is the data generation process, which has its own
complex systems associated with it. Experimental design, quality assurance, and many other data
collection processes will affect data quality and could subsequently cause problems, unexpected
outcomes, or failures in the data analysis process. The data scientist will likely have some knowl-
edge of this process, and familiarity with the data generation will aid significantly in interpreting
the results of an analysis. However, the data scientist may not have much direct control over the
data generation, and therefore it may be more important to develop strong collaborations with
others who may manage the development of those systems. Any discussion of systems must draw
useful boundaries around the system, and so we have chosen to exclude the data generation process
from the discussion of data analysis. We return to this topic later, when discussing the diagnosis
of unexpected outcomes and the broader context in which a data analysis sits (Section 2.4).
2.2.2. Data analytic iteration. These three systems—data analytic, scientific, and software—are
the responsibility of the data scientist, regardless of whether the data scientist is entirely respon-
sible for building them. Fundamentally, the data scientist must understand the behavior of each
of these systems under typical conditions. In particular, considering how these three systems will
interact before running the actual data through them is an important aspect of the concept of
preregistration (Asendorpf et al. 2013). Developing the range of expected outcomes reflects our
current state of knowledge and informs our interpretation of the eventual data analytic output.
Ultimately, the data analytic output gives us information about each of these systems and how they operate.
2.2.3. Example: data cleaning. A common data checking task might be to first take the cleaned
dataset from the data cleaning subsystem and count the number of rows in the dataset. Continuing
the example from Section 2.1.1, we might be importing data produced by a remote pollution sensor
on a monthly basis in order to monitor environmental conditions. Such data might arrive in the
form of a CSV file. Figure 2 shows a sample diagram of how the data cleaning subsystem might
be organized and implemented.
If the analyst knew in advance roughly how many rows were in the original raw dataset (say
100), then the expectation for the number of rows in the cleaned dataset might be something close
to 90 (i.e., 10% of rows contained NA values). If the result of counting the rows was 85, then that
might be considered close enough and warrant moving on. However, if the number of rows was
10 or even 0, then that result would be unexpected.
How does an analyst come up with an expectation that approximately 10% of rows will contain
NA values? From Figure 2, we can see that the expected clean dataset is informed by all three
systems—data analytic, scientific, and software—and our understanding of how they operate.

Figure 2
Data cleaning subsystem (data analytic system): data collection produces a CSV file, which is imported; rows with NA values are filtered out; text is coerced to numeric values; and the resulting output is compared to the expected clean dataset.

One possible path is through knowledge of the sensor mechanism, which might be known to be unre-
liable at times, leading to about 10% of observations having some sort of problem. Another way
to develop an expectation is to have knowledge about the underlying scientific problem, which
might involve very difficult measurements to take, with 10% missing data standard for the field.
Finally (and perhaps least likely), it may be that the data collection process is fine, but it is known
that the software implementation of the data cleaning subsystem is unreliable and occasionally
corrupts about 10% of observations.
Now suppose that the observed result is that the cleaned dataset has 10 rows, which is unex-
pected. The analyst can track down the possible causes of this unexpected result by tracing back
through the three systems to see if there might be some problem with or misunderstanding about
how any of these systems operate. The analyst must evaluate which of these systems is most likely
at fault, pinpoint the root cause if possible, and implement any corrective action if the outcome is
undesirable. Perhaps an extra step should be added to the data cleaning subsystem that produces
an error message if the cleaned dataset has fewer than a certain number of rows. Or perhaps a call
should be made to the data collection team to see if there were any recent problems in the latest
batch of data. If so, then a protocol can be put in place in which the data collection team messages
the data scientist if a future batch of data has greater than expected NA observations.
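One way to encode such a guard, reusing the hypothetical clean_data() sketch from Section 2.2.1, is a check that fails loudly when the row count falls too far below the expectation:

    check_row_count <- function(cleaned, expected_rows, tolerance = 0.2) {
      # Fail if the cleaned dataset has far fewer rows than expected
      minimum <- ceiling(expected_rows * (1 - tolerance))
      if (nrow(cleaned) < minimum) {
        stop("Cleaned dataset has ", nrow(cleaned),
             " rows; expected at least ", minimum)
      }
      invisible(cleaned)
    }

    # cleaned <- clean_data("sensor_2024_06.csv")   # hypothetical monthly file
    # check_row_count(cleaned, expected_rows = 90)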
The purpose of laboring through this hypothetical example is (a) to demonstrate the complexity
and variety of knowledge that may be required in order to execute even a simple data checking
step and (b) to indicate that the source of an unexpected outcome can lie in systems beyond the
implementation in software. Using more complex systems, like statistical modeling or inference
systems, will require further knowledge of data analytic systems, science, and software.
work, many of which originate outside of the data or the data analysis. As a result, many conflicts
cannot be resolved explicitly by employing the tools of data analysis, but can nevertheless affect
the conduct of a data analysis or even cause the analysis to fail (Robinson & Nolis 2020, McGowan
et al. 2021). For better or worse, resolving conflicts is commonly part of the data scientist’s job.
It is useful to state explicitly that a data analysis is a technical product that must be created and
brought into existence. Without the presence of a data scientist, a data analysis would not naturally
occur. The design and production of technical systems is by no means a new concept or field of
study (Brooks 1995; Hirschorn 2007; Cross 2011, 2021). However, considering a data analysis as
a technical product to be designed, produced, maintained, and ultimately retired is not a common
perspective in the various data science–related subfields, including statistics (Parker 2017).
The need for design thinking in data analysis is driven by the presence of conflicts introduced
by various constraints on the analysis development. Constraints imposed by budget, time, ex-
pertise, personnel, the audience, or other stakeholders can fundamentally change how the data
analytic, scientific, and software systems are built and what the output looks like at the end. In-
deed, without such constraints, there is generally little need for design thinking. Analyses that
are done on a very short timescale can look different from otherwise similar analyses done on a
longer timescale. Analyses developed with large budgets will look different from analyses done on
a shoestring. Computational resources often affect what types of methods can be executed within
the available time. Analyses presented to company executives will need to look different from
analyses presented to the principal data scientist. The reality of data analysis development is that
data scientists must produce the best product within the numerous constraints imposed from the
outside (and this is before we consider the data themselves). If the constraints make developing an
analysis untenable, then the data scientist may need to negotiate with a stakeholder to make some
changes (Robinson & Nolis 2020).
2.3.1. Design principles. Limits placed by stakeholders and contextual factors are not the only
considerations for the data scientist, as there is a growing collection of design principles that are
being considered to guide the development of data analyses and data analytic products (Parker
2017, Woods 2019, McGowan et al. 2021). These principles serve to characterize a data analysis
and to distinguish properties of one analysis from another. As an analysis is developed, the data
scientist must choose how much emphasis will be placed on different principles.
For example, reproducibility is commonly cited as an attribute of a data analysis—that is, most
analyses should strive to be reproducible by others (Peng 2011). However, reproducibility does
not always make sense and is not always necessary or possible. Common data products such as
dashboards or interactive visualizations often do not produce reproducible analyses because the
audience for such products generally does not require it. Quick one-off analyses may not be re-
producible if the stakes are very low. Even large-scale analyses may not be reproducible to the
general public if the analyses use private or proprietary data (Peng et al. 2006).
The extent to which an analysis adheres to certain principles (such as reproducibility) can be
driven by numerous outside factors that the data scientist must negotiate before, during, and after
the analysis is completed. For example, in the United States, the Health Insurance Portability and
Accountability Act (HIPAA), whose Privacy Rule took effect in 2003, greatly limited the
reproducibility of data analyses using personal health data. Since then, data scientists using iden-
tifiable health data in analyses must sacrifice some reproducibility or else find a way to anonymize
the data. Many journals now require that data be deposited in third-party repositories so that
analyses have a chance at being reproduced if needed (Paltoo et al. 2014). Requirements regard-
ing reproducibility and privacy will likely come into conflict and may alter the nature of a data
analysis in order to balance certain tradeoffs. For example, analyses of aggregated data may be less
is any likely to be welcome. However, during the development of the underlying algorithm, some
skepticism might be useful when discussing the algorithm with other analysts or engineers. Here,
the audience for the analysis or analytic product plays a significant role in shaping the analysis and
how it is presented.
There may be other design principles that are valuable in guiding the development of a data
analysis, and the community of data scientists will have to formalize them as the field develops.
Most likely, these principles will evolve over time as technologies, methodologies, culture, and the
data science community continue to change. For example, reproducibility did not receive much
attention until computing and the Internet became fundamental aspects of data analysis and aca-
demic research (Schwab et al. 2000). New technologies like Git allow for analyses to be version
controlled and create more collaborative opportunities (Parker 2017). More generally, cloud-based
platforms allow for collaborative data sharing (Figshare, Open Science Framework), paper writ-
ing (Overleaf, Google Docs, arXiv), and coding (GitHub, GitLab). Similar to other areas where
design is an important consideration, the principles that guide the development of products must
keep up with the standards of the times (Cross 2011).
context from which the data were generated and to which the results will be presented. The risk of
ignoring this context is producing an analysis that is incorrect or not useful, at best, and unethical
or harmful, at worst.
Data science tools and software packages are written in a variety of languages and implemented on many platforms. The existence
of such a diversity of tools might lead one to conclude that there is little left to develop. Of course,
the vibrant and active developer communities organized around both the R and Python program-
ming languages, to name just two, are evidence that new tools need to be continuously developed
to solve new problems and handle new situations.
It is worth singling out the R programming language here in order to highlight its historical
origins with the S language. S was developed at Bell Labs to address an important and novel
problem: A language was needed to do interactive data analysis. In particular, exploratory data
analysis was a new concept that was difficult to execute in existing programming systems. Rick
Becker, one of the creators of the S language, writes in “A Brief History of S,”
We wanted to be able to interact with our data, using Exploratory Data Analysis (Tukey 1971) tech-
niques. . . . On the other hand, we did occasionally do simple computations. For example, suppose we
wanted to carry out a linear regression given 20 x, y data points. The idea of writing a Fortran program
that called library routines for something like this was unappealing. While the actual regression was
done in a single subroutine call, the program had to do its own input and output, and time spent in
I/O often dominated the actual computations. Even more importantly, the effort expended on programming
was out of proportion to the size of the problem. An interactive facility could make such work much easier.
(Becker 1994, pp. 81–82; emphasis added)
The designers of the S language wanted to build a language that differed from programming
languages at the time (i.e., Fortran), a language that would be designed for data analysis. It is worth
revisiting the idea that tools could be designed with data science problems specifically in mind.
In particular, it is worth considering what tools might look like if they were designed first to deal
with data analysis rather than to write more general purpose software.
A simple comparison can provide a demonstration of what we mean here. Consider both the
Fortran and R languages. Fortran, like many programming languages, is a compiled language
where code is written in text files, compiled into an executable, and then run on the computer.
R is an interpreted language where each expression is executed as it is typed into the computer.
R inherits S’s emphasis on interactive data analysis and the need for quick feedback in order to
explore the data. Designing R to be an interpreted language as opposed to a compiled language was
a choice driven by the intended use of the language for data analysis (Ihaka & Gentleman 1996).
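The contrast Becker describes is easy to see in modern R, where the regression he mentions reduces to a single interactive expression rather than a compiled program with its own input and output handling; the data here are simulated for illustration:

    # 20 hypothetical (x, y) data points
    x <- 1:20
    y <- 2.5 * x + rnorm(20, sd = 3)

    fit <- lm(y ~ x)   # fit the regression with a single interactive expression
    summary(fit)       # inspect the results immediately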
expectation was that the result should reproduce. However, if a result does not reproduce, then
having some detailed representation of the analysis is critical. This brings us to a second pur-
pose for data analysis representation, which is diagnosing the source of problems or unexpected
findings in the analysis.
A third reason for having access to the details of an analysis is to be able to build on the analysis
and to develop extensions. Extensions may come in the form of sensitivity analysis or the applica-
tion of alternate methods to the same data. A detailed representation would prevent others from
having to redo the entire analysis from scratch.
The current standard for providing the details of a data analysis is to share the literal code
that executed the steps of the analysis. This representation works in the sense that it typically
achieves reproducibility, it allows us to build on the analysis, and it allows us to identify the
source of potential unexpected results, to some extent. However, the computer code of an anal-
ysis is arguably incomplete, given that a complete data analysis can be composed of other sys-
tems of thinking, such as the data analytic and scientific systems described in Section 2.2. The
R code of a data analysis generally does not have any trace of these systems. One consequence
of this incompleteness is that when an unexpected result emerges from a data analysis, the code
alone is insufficient for diagnosing the cause. Literate programming techniques provide a par-
tial solution to this issue by providing tooling for mixing computer code with documentation
(Knuth 1984).
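For example, a literate, R Markdown-style fragment (the file and function names below are hypothetical, carried over from the earlier sketches) can record the scientific expectation in prose immediately next to the code that produces the corresponding output:

    We expect roughly 10% of rows to be dropped during cleaning, based on
    known sensor dropout rates.

    ```{r row-count}
    cleaned <- clean_data("sensor_2024_06.csv")
    nrow(cleaned)
    ```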
Data scientists can perhaps learn lessons from designers of traditional programming languages.
Over time, as computers have become more powerful and compilation technologies have ad-
vanced, programming languages have placed greater emphasis on being relevant and understand-
able to human beings. A purpose of high-level languages is to allow humans to reason about a
program and to gain some understanding of what critical activities are occurring at any given
time. Data scientists may benefit from considering to what extent current data science tools and
approaches to representing data analysis allow us to better reason about an analysis and potentially
diagnose any problems in design.
While it might be argued that reproducible analyses meet a minimum standard of quality (Peng
et al. 2006), such a standard is insufficient, if only because reproducible analyses can still be in-
correct or at least poor quality (Leek & Peng 2015). Reproducibility is valuable for diagnosing a
problem after it occurs, but what can be done to prevent problems before they are published or
disseminated widely? What tools can be built to improve the quality of data analysis at the source,
when the analysis is still in the data scientist’s hands?
The systems approach described in Section 2.2 naturally leads one to consider the concept of
explicitly testing expectations versus outputs. Here, we can potentially borrow ideas from software
testing that might help to improve the conduct of data analysis more generally (Wickham 2011,
Radcliffe 2015). There may be other ideas that can be adapted for the purpose of improving data
analysis, and data scientists should continue to work to build such tools.
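As a minimal sketch of this idea, continuing the hypothetical pollution example from Section 2.1.1, expectations can be written down as executable checks that run alongside the analysis:

    # Executable expectations for the daily average analysis (hourly_pm as in Section 2.1.1)
    daily_avg <- mean(hourly_pm)

    stopifnot(
      length(hourly_pm) == 24,   # a full day of measurements was used
      all(hourly_pm >= 0),       # concentrations cannot be negative
      daily_avg < 150            # values this high would suggest a data or sensor problem
    )
    # Packages such as testthat (Wickham 2011) provide richer versions of these checks.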
fields of study.
While for some users the spreadsheet is a poor tool for data storage and representation, for others it is the ideal (or perhaps the only) tool.
Broman & Woo (2018) offer reasonable guidelines for using spreadsheets while preserving the integrity
of the data and facilitating future data analyses.
Software obviously plays a practical role in data analysis, but perhaps paradoxically, it also plays an important theoretical role in summarizing and abstracting
common data analytic routines and practices. A prime example comes from the recent develop-
ment of the tidyverse collection of tools for the R programming language (Wickham et al. 2019),
and in particular the dplyr package with its associated verbs for managing data. This collection of
software packages has revolutionized the practice of data analysis in R by designing a set of tools
oriented around the theoretical framework of tidy data (Wickham 2014). This framework turns
out to be useful for many people by abstracting and generalizing a common way of thinking about
data analysis via the manipulation of data frames. It is perhaps one of the most valuable theoretical
constructs in data science today.
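A small, hypothetical example of these verbs illustrates how the tidy data framework structures such work; the data frame and column names here are invented:

    library(dplyr)

    # Hypothetical tidy data: one row per sensor per hour
    measurements <- data.frame(
      sensor = rep(c("A", "B"), each = 24),
      hour = rep(1:24, times = 2),
      pm25 = runif(48, min = 5, max = 60)
    )

    measurements %>%
      filter(!is.na(pm25)) %>%               # drop missing measurements
      group_by(sensor) %>%                   # one group per sensor
      summarize(daily_avg = mean(pm25)) %>%  # daily average for each sensor
      arrange(desc(daily_avg))               # highest first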
Yet, is it true that all data are tidy? No, reasonable counterexamples abound in every scientific
context. Is it even true that all data analyses can be accomplished within the tidy data framework?
No; for example, there are some analyses that are better suited to using matrices or arrays. Will
the tidyverse continue to be useful forever? Probably not, because new tools and frameworks will
likely be developed. Indeed, there are unlikely to be any universal truths that emerge from the
tidyverse software collection. But it is nevertheless undeniable that the tidyverse has provided a
useful structure for reasoning about and executing data analyses.
A key purpose of writing software is to codify and automate common practices. The concept
of “don’t repeat yourself” encapsulates the idea that in general, computers should be delegated
the role of executing repetitive tasks (Thomas & Hunt 2019). Identifying those repetitive tasks
requires careful study of a problem both in one’s own practice and in others’. As we aim to gen-
eralize and automate common tasks, we must simultaneously consider whether the task itself is
unethical or likely to cause harm. With the scale and speed of computing available today, lock-
ing in and automating a process that is biased can cause significant harm and may be difficult to
unravel in the future (O’Neil 2016).
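As a concrete, hypothetical instance of delegating repetition to the computer, the monthly import, clean, and check steps sketched earlier could be wrapped in a single function and applied to every file:

    # Hypothetical: apply the same cleaning and checking steps to every monthly file
    process_month <- function(path, expected_rows = 90) {
      cleaned <- clean_data(path)               # cleaning subsystem from Section 2.2.1
      check_row_count(cleaned, expected_rows)   # guard sketched in Section 2.2.3
      summarize_data(cleaned)
    }

    # files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
    # results <- lapply(files, process_month)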
Much of the communication and discussion about the process of doing data analysis occurs on blogs, social
media, and various other informal channels. While such accounts are often useful, there could be
a benefit to creating a more formal approach to documenting, summarizing, and communicating
lessons learned from data analysis (Waller 2018). Such an effort would give novices an obvious
place to learn about data analysis and would give researchers in the field a way to see across the
experiences of other data scientists in order to identify any common structures. If we are to treat
data science as any other scientific field, we need a common venue in which we can discuss lessons
learned and identify areas that we do not yet understand or where there are knowledge gaps.
There are usually many ways to approach any given case study, and given the limits on time and resources for most teachers, cases will usually be
presented using a single approach. However, if time were available to present multiple different
approaches to solve a problem, students might then be inclined to ask, “Which is the best way?”
In our experience teaching data science in the classroom at the graduate level, the structure
of a data science course is often best defined by what it is not. Typically, it is not whatever is
taught in the rest of the core statistics curriculum (Kross et al. 2020). While this approach can
sometimes result in a reasonable course syllabus, it is hardly a principled approach to defining the
core curriculum of an area of study. However, many guidelines for building data science programs
are indistinguishable from statistics programs (De Veaux et al. 2017). Until we have a clear vision
for what comprises the core of data science, there will likely not be any better alternative proposals
for building a coherent data science program. Moving forward, the danger of not having a focused
vision is that data science education will devolve into teaching a never-ending proliferation of
topics related to data.
An apprenticeship model is sometimes proposed as an ideal way to teach data science, particu-
larly in nondegree, bootcamp-style training programs.1 Such one-on-one attention with a mentor
or advisor is indeed valuable, but it is arguably the most inefficient approach and is difficult to scale
to satisfy all demands for data science education. As a way to teach highly specific skills needed in
a particular setting, apprenticeships may be the best approach to training. But as a general model
for teaching data science, it is likely not feasible. In addition, individualized instruction risks rein-
forcing the idea that data science is an individualized art and that the student only needs to learn
whatever the advisor happens to know. The resulting heterogeneity of education goes against the
idea that data science is, in fact, a unified field with a core set of ideas and knowledge.
Much like with other fields of study, the material that is most suitable for a classroom setting,
where all students are given the same information, is theory. Material that is best suited for individ-
ualized instruction is specific information required to solve specific problems. If data scientists are
able to build a theory, draw generalizable lessons, abstract out common practices, and summarize
previous experience across domains, then it would make sense to teach that in the classroom, much
like we would teach statistics or biochemistry. The classroom would also be the logical place to
indicate what remains unknown about data science and what questions could be answered through
further research.
6. SUMMARY
Sharpening the boundaries of the field of data science will take considerable time, effort, and
thought over the coming years. Continuing on the current path of merging together skills and
activities from other fields is valuable in the short term and has the benefit of allowing data scien-
tists to explore the space of what the field could be. But the sustainability of an independent data
science field is doubtful unless we can identify unique characteristics of the field and define what
it means to develop new knowledge.
In this review we have made an attempt at describing the core ideas about data science that
make data science different from other fields. A key potential target of further exploration is the
area of data analysis, which is an activity that continues to lack significant formal structure more
than 50 years after Tukey’s paper on the future of data analysis. Given the importance of data
1 Examples include those described by Craig (2020) and IBM (2020), as well as those offered by General
knowledge will be required. The frequently changing nature of tools, technologies, and software
platforms may preclude those elements from playing a significant role in any data science theory,
and practitioners will need to continuously adapt to the latest developments. Striking a balance
between the general and the specific is a challenge in any field and will be an important issue that
defines the future of data science.
DISCLOSURE STATEMENT
R.D.P. receives royalties from Coursera, Inc. for developing a course on data science.
LITERATURE CITED
Aldwell E, Schachter C, Cadwallader A. 2018. Harmony and Voice Leading. New York: Cengage Learn.
Am. Stat. Assoc. Undergrad. Guidel. Workgr. 2014. Curriculum guidelines for undergraduate programs in statistical
science. Rep., Am. Stat. Assoc., Alexandria, VA
Asendorpf JB, Conner M, De Fruyt F, De Houwer J, Denissen JJ, et al. 2013. Recommendations for increasing
replicability in psychology. Eur. J. Pers. 27(2):108–19
Baggerly K. 2010. Disclose all data in publications. Nature 467(7314):401
Baggerly KA, Coombes KR. 2009. Deriving chemosensitivity from cell lines: forensic bioinformatics and re-
producible research in high-throughput biology. Ann. Appl. Stat. 3(4):1309–34
Becker RA. 1994. A brief history of S. Comput. Stat. 1994:81–110
boyd d, Crawford K. 2012. Critical questions for big data: provocations for a cultural, technological, and
scholarly phenomenon. Inform. Commun. Soc. 15(5):662–79
Bressert E. 2012. SciPy and NumPy: An Overview for Developers. Sebastopol, CA: O’Reilly
Broman KW, Woo KH. 2018. Data organization in spreadsheets. Am. Stat. 72(1):2–10
Brooks FP Jr. 1995. The Mythical Man-Month: Essays on Software Engineering. London: Pearson
Carver R, Everson M, Gabrosek J, Horton N, Lock R, et al. 2016. Guidelines for assessment and instruction in
statistics education (GAISE) college report 2016. Rep., Am. Stat. Assoc., Alexandria, VA
Chatfield C. 1995. Problem Solving: A Statistician’s Guide. Boca Raton, FL: Chapman and Hall/CRC
Craig R. 2020. Why apprenticeships are the best way to learn data skills. Forbes, June 18. https://
www.forbes.com/sites/ryancraig/2020/06/18/sex-appeal-and-mystery-closing-the-data-skills-
gap/?sh=3e7c18a4566a
Cross N. 2011. Design Thinking: Understanding How Designers Think and Work. Oxford, UK: Berg
Cross N. 2021. Engineering Design Methods: Strategies for Product Design. Chichester, UK: Wiley. 5th ed.
Danks D, London AJ. 2017. Algorithmic bias in autonomous systems. In Proceedings of the Twenty-Sixth In-
ternational Joint Conference on Artificial Intelligence, Vol. 17, ed. C Sierra, pp. 4691–97. Red Hook, NY:
Curran
De Veaux RD, Agarwal M, Averett M, Baumer BS, Bray A, et al. 2017. Curriculum guidelines for undergraduate
programs in data science. Annu. Rev. Stat. Appl. 4:15–30
D’Ignazio C, Klein LF. 2020. Data Feminism. Cambridge, MA: MIT Press
Donoho D. 2017. 50 years of data science. J. Comput. Graph. Stat. 26(4):745–66
Goldberg P. 2014. Duke scientist: I hope NCI doesn’t get original data. Cancer Lett. 41(2):2
Goodyear MD, Krleza-Jeric K, Lemmens T. 2007. The declaration of Helsinki. Br. Med. J. 335(7621):624–25
Grolemund G, Wickham H. 2014. A cognitive interpretation of data analysis. Int. Stat. Rev. 82(2):184–204
Hardin J, Hoerl R, Horton NJ, Nolan D. 2015. Data science in statistics curricula: preparing students to “think
with data.” Am. Stat. 69:343–53
Hirschorn SR. 2007. NASA systems engineering handbook. Tech. Rep., Natl. Aeronaut. Space Admin.,
Washington, DC
IBM. 2020. The data science skills competency model. Rep., IBM Analytics, Armonk, NY. https://www.ibm.com/
downloads/cas/7109RLQM
Ihaka R, Gentleman R. 1996. R: A language for data analysis and graphics. J. Comput. Graph. Stat. 5(3):299–314
Ioannidis JP. 2005. Why most published research findings are false. PLOS Med. 2(8):e124
Jager LR, Leek JT. 2014. An estimate of the science-wise false discovery rate and application to the top medical
literature. Biostatistics 15(1):1–12
Knuth DE. 1984. Literate programming. Comput. J. 27(2):97–111
Kross S, Guo PJ. 2019. Practitioners teaching data science in industry and academia: expectations, workflows,
and challenges. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–14.
New York: ACM
Kross S, Peng RD, Caffo BS, Gooding I, Leek JT. 2020. The democratization of data science education. Am.
Stat. 74(1):1–7
Leek JT, Peng RD. 2015. Opinion: reproducible research can still be wrong: adopting a prevention approach.
PNAS 112(6):1645–46
Leonelli S, Lovell R, Wheeler B, Fleming L, Williams H. 2021. From FAIR data to fair data use: method-
ological data fairness in health-related social media research. Big Data Soc. https://doi.org/10.1177/
20539517211010310
Loukides M, Mason H, Patil D. 2018. Ethics and Data Science. Sebastopol, CA: O’Reilly
Lovett MC, Greenhouse JB. 2000. Applying cognitive theory to statistics instruction. Am. Stat. 54(3):196–206
McGowan LD, Peng RD, Hicks SC. 2021. Design principles for data analysis. arXiv:2103.05689 [stat.ME]
McKinney W. 2011. pandas: A foundational Python library for data analysis and statistics. Python High Perform.
Sci. Comput. 14(9):1–9
Natl. Acad. Sci. Eng. Med. 2018. Data science for undergraduates: opportunities and options. Rep., Natl. Acad. Press,
Washington, DC
Nolan D, Temple Lang D. 2010. Computing in the statistics curricula. Am. Stat. 64(2):97–107
O’Neil C. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New
York: Crown
Open Science Collab. 2015. Estimating the reproducibility of psychological science. Science 349(6251):aac4716
Paltoo DN, Rodriguez LL, Feolo M, Gillanders E, Ramos EM, et al. 2014. Data use under the NIH GWAS
data sharing policy and future directions. Nat. Genet. 46(9):934–38
Parker H. 2017. Opinionated analysis development. PeerJ Preprints 5:e3210v1
Patil P, Peng RD, Leek JT. 2016. What should researchers expect when they replicate studies? A statistical
view of replicability in psychological science. Perspect. Psychol. Sci. 11(4):539–44
Peng RD. 2011. Reproducible research in computational science. Science 334(6060):1226–27
Peng RD, Dominici F, Zeger SL. 2006. Reproducible epidemiologic research. Am. J. Epidemiol. 163(9):783–89
Peng RD, Hicks SC. 2021. Reproducible research: a retrospective. Annu. Rev. Public Health 42:79–93
R Core Team. 2021. R: A language and environment for statistical computing. Statistical Software, R Found.
Stat. Comput., Vienna
Radcliffe N. 2015. Why test-driven data analysis? TDDA Blog, Nov. 5. http://www.tdda.info/why-test-
driven-data-analysis
Robinson E, Nolis J. 2020. Build a Career in Data Science. Shelter Island, NY: Manning
Rosenblat A, Kneese T, boyd d. 2014. Algorithmic accountability. Presented at The Social, Cultural & Ethical
Dimensions of “Big Data,” March 17, New York, NY
SAS Inst. 2015. Base SAS 9.4 procedures guide. Tech. Manual, SAS Inst., Cary, NC
Schoenberg A. 1983. Theory of Harmony. Berkeley: Univ. Calif. Press
Schwab M, Karrenbach N, Claerbout J. 2000. Making scientific computations reproducible. Comput. Sci. Eng.
2(6):61–67