Analyzing Health Data in R
for SAS Users
Monika Wahi
Peter Seebach
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
International Standard Book Number-13: 978-1-4987-9588-3 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC),
222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides
licenses and registration for a variety of users. For organizations that have been granted a photocopy
license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Library of Congress Cataloging‑in‑Publication Data
Names: Wahi, Monika, author. | Seebach, Peter, author
Title: Analyzing health data in R for SAS users / Monika Wahi and Peter
Seebach
Description: Boca Raton : CRC Press, 2017. | Includes bibliographical references.
Identifiers: LCCN 2017021131 | ISBN 9781498795883
Subjects: LCSH: Bioinformatics. | Medical informatics. | R (Computer program language) | SAS (Computer file)
Classification: LCC QH324.2 .W34 2017 | DDC 610.285--dc23
LC record available at https://lccn.loc.gov/2017021131
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface
Acknowledgments
Authors
1. Differences between SAS and R
    Structure of Program
        Installation of PC Version: SAS versus R
        Licensing Differences: SAS versus R
        SAS Components versus R Packages
        What is RStudio?
        Maintaining Current Versions in SAS versus R
        SAS versus R User Communities
        SAS versus R User Interfaces
        Code Documentation and Metadata: SAS versus R
    Handling of Data
        A Focus on SAS Data Handling
        Comparison to R Data Handling
        Basic Differences in Code Syntax: R versus SAS
        SAS Formats and Labels versus R Approaches
    SAS versus R—What to Choose?
        Why R Can Be a Difficult Choice for Public Health Efforts
        Considerations When Choosing R versus SAS
    Optional Exercises
        All Sections
            Questions
            Answers
2. Preparing Data for Analysis
    Reading Data into R
        Importing Data
        Checking the Dataset after Reading It in
    Checking Data in R
        Statistics on Continuous Data in R
        Visualizing Continuous Data in R
        Statistics on Categorical Data in R: One Variable
        Statistics on Categorical Data in R: Crosstabs
    Editing Data in R
        Trimming off Unneeded Variables
        Applying Qualification Criteria through Subsetting Datasets
        Creating Grouping Variables
        Creating Indicator Variables for Two-Level Categories
        Creating Indicator Variables for Three-Level Categories
        Creating Indicator Variables for Multilevel Ordinal Categories
        Creating Indicator Variables for Multilevel Nominal Categories
        Creating Missing Flags
        Preparing Binary Outcome Variable
        Planning a Survival Analysis Dataset with Time-to-Event Variables
        Developing the Survival Dataset
        Dealing with Dates
        Recoding and Classifying Continuous Variables
        Recoding a Continuous Outcome Variable
    Data Validation in R
        Bivariate Relationships between Continuous Variables
        Bivariate Relationships between Categorical and Continuous Variables
        Bivariate Relationships between Categorical Variables
        Power Calculations
        Write Out Analytic File
    Optional Exercises
        Section “Reading Data into R”
            Questions
            Answers
        Section “Checking Data in R”
            Questions
            Answers
        Section “Editing Data in R”
            Questions
            Answers
        Section “Data Validation in R”
            Questions
            Answers
3. Basic Descriptive Analysis
    Making “Table 1”—Categorical Outcome
        Structure of Categorical Table 1
        SAS Approaches to Categorical Table 1 Structure
        SAS Approaches to Table Presentation Using Excel
        SAS Bivariate Categorical Tests
        The Table Command in R
        R Approaches to Categorical Table 1
        Approaches to Automating Table Generation in R
        R Bivariate Statistical Tests
    Making “Table 1”—Continuous Outcome
        Structure of Continuous Table 1
        SAS Approaches to Continuous Table 1
        Continuous Bivariate Statistical Tests in SAS
        R Approaches to Continuous Table 1
        Continuous Bivariate Statistical Tests in R
    Descriptive Analysis of Survival Data
        Summary Statistics and Plots on Time Variable
        Generating and Plotting Survival Curves
        Bivariate Tests of Survival Curves
    Optional Exercises
        Section “Making ‘Table 1’—Categorical Outcome”
            Questions
            Answers
        Section “Making ‘Table 1’—Continuous Outcome”
            Questions
            Answers
        Section “Descriptive Analysis of Survival Data”
            Questions
            Answers
4. Basic Regression Analysis
    This Book’s Approach
        Selection of Modeling Approach
        Selection of Manual Approach
        Operationalizing the Stepwise Selection Process
        Prespecifying Hypotheses and Avoiding Fishing
    Linear Regression and ANOVA
        Preparing to Run Linear Regression
        Linear Regression Modeling and Model Fit Statistics
        Selecting the Final Linear Regression Model
        Considerations in Improving the Final Model
        Considering Collinearity
        Adding Interactions
        Goodness-of-Fit Statistics
        Linear Regression Model Presentation
        Plot to Assist Interpretation
    Logistic Regression
        Estimates Produced by Logistic Regression
        Logistic Regression Considerations
        Introduction to Logistic Regression Modeling
        Logistic Regression Modeling and Model Fitting
        Selecting the Final Logistic Regression Model
        Logistic Regression Model Presentation
        Plot to Assist Interpretation
    Survival Analysis Regression
        Selecting a Parametric Distribution in Survival Analysis
        Selecting a Semiparametric Distribution for Survival Analysis
        Introduction to Survival Analysis Regression Modeling
        Survival Analysis Regression Modeling and Model Fitting
        Parametric Survival Analysis
        Semiparametric Survival Analysis
        Issues to Consider in Survival Analysis
        Selecting the Final Survival Analysis Model
        Survival Analysis Model Presentation
    A Note about Macros
    Optional Exercises
        Section “This Book’s Approach”
            Questions
            Answers
        Section “Linear Regression and ANOVA”
            Questions
            Answers
        Section “Logistic Regression”
            Questions
            Answers
        Section “Survival Analysis Regression”
            Questions
            Answers
        Section “A Note about Macros”
            Questions
            Answers
References
Index
Preface
When I, Monika Wahi, gave a presentation at the Effective Applications of the
R Language (EARL) Conference in Boston in November 2015, I also attended
a panel discussion of the event leaders. It was a very open discussion,
with the audience enthusiastically participating. The goals of the
R Consortium were delineated, which led to the consideration of how to
promote the increased usage of the R language by statisticians. After all,
R is open source software, so it does not have a marketing department per se.
Someone in the audience brought up the specific topic of healthcare
analytics, asking the attendees why R is not used more often in healthcare.
After a pause, someone pointed out the domination of SAS in the industry,
which crowds out R. But then, another person speculated that the Food
and Drug Administration (FDA) would only accept analysis in SAS (a myth
that we debunk in Chapter 1). It became clear that there seemed to be
many barriers to promoting R among healthcare analysts.
On further consideration, this seemed ironic, because both R and SAS are
very extensive languages, but what we do in healthcare analytics on a
day-to-day basis is relatively straightforward. We are not modeling any spaceship
trajectories, or building complex economic models, or predicting weather or
the outcome of sports games. It occurred to me that as we generally do the
same things over and over again in healthcare analytics, and because most
people in healthcare analytics do these things in SAS, a book that focuses
only on explaining how to do in R what we normally do in SAS would help
healthcare analysts who may also want to use R.
Therefore, this book is aimed at the healthcare analyst who is a SAS user
looking to learn R. Chapter 1: “Differences between SAS and R” is written
mainly for those who are interested in the backstory of how SAS and R are
run differently. Readers who want to get immediately to coding should
skip to Chapter 2: “Preparing Data for Analysis”, which describes editing
and validating an analytic dataset in R rather than using SAS data steps.
Chapter 3: “Basic Descriptive Analysis” will explain how to use the analytic
dataset developed to produce a basic descriptive analysis (often presented
as Table 1 in manuscripts). Finally, Chapter 4: “Basic Regression Analysis”
covers linear and logistic regression and basic survival analysis. As R
produces such lovely plots, an insert with 16 color plots is included in the center
of this book.
We hope this book encourages you to try R for some healthcare analysis
tasks.
Monika Wahi and Peter Seebach
Acknowledgments
The number of people who have helped us with this book is large, and we
are eternally grateful. But after controlling for age, relevant skills, and
education, it turns out that specifically no one individual contributed significantly
more to help us with this book than another. Our conclusion is that you are
all amazing, and we thank you very much for your support!
Authors
Monika Wahi, MPH, CPH obtained a bachelor of science in costume design
and textiles and clothing (with a concentration in Journalism) from the College
of Human Ecology, University of Minnesota, St. Paul, Minnesota, and then
went on to complete her master’s in public health from the University of
Minnesota School of Public Health, Minneapolis, Minnesota. After serving
in several different scientific and administrative roles at Hennepin County
in Minnesota, a nonprofit Alzheimer’s research institute in Florida, and
the U.S. Army at a site in the Greater Boston area, she struck out on her own
to build her consulting business, DethWench Professional Services (DPS).
Since 2012, she has been serving as a lecturer at the Labouré College, Milton,
Massachusetts, teaching classes in the U.S. healthcare system and statistics.
Monika has also led the expansion of DPS to serve an international
clientele by providing online educational material and services as well
as public health research consulting.
Peter Seebach was raised by mathematicians, and never fully recovered.
He earned his bachelor’s degree in psychology from St. Olaf College,
Northfield, Minnesota, and then went on to apply it to a career as a software
developer and writer. He enjoys computers, writing, and writing about
computers. He is an outlier, and therefore tends to throw off the average.
1
Differences between SAS and R
This chapter is meant to help the SAS user conceptualize the important
differences between R and SAS that will affect the work of the healthcare
analyst who knows SAS but is looking to learn how to use R. The first section
describes important differences in the structures of the SAS and R programs.
This leads to the discussion in the second section, which focuses on how
these differences in structure affect the differences in data handling between
the two programs. The third section of this chapter contextualizes the choice
between R and SAS for a healthcare analytics project and provides
a guide for selecting which software to use. Optional practice exercises are
included in the fourth section.
Structure of Program
Perhaps the most important difference between SAS and R is the structure
of how the program is built and maintained. To begin to explain this
difference, we will start by considering how different the download and install
process is between PC SAS and R. Next, we will cover how these differences
are also reflected in the differences in the way licensing is maintained in the
two programs. Third, the differences between SAS components and their
parallel in R, called R packages, will be discussed, and after that, differences
between SAS and R in approach to maintaining the most current version in
a production environment will be explained. The differences between the
activities of SAS and R user communities will be described, followed by the
differences between the SAS and R user interfaces. Finally, some thoughts on
the principles of organizing code, metadata, and documentation in SAS and
R are presented.
Installation of PC Version: SAS versus R
When using a typical PC SAS license from a university, the data analyst
has to download and install the program on her Windows personal
computer (PC), as the current operating system (OS) for Macintosh is not
supported.* To do this, the analyst is provided access to either many CDs
that the installer will prompt her to put in or take out of her CD/DVD drive
during the installation process, or alternatively, an extremely large setup
file that takes a very long time to download (plan for hours, not minutes).
The analyst is also provided a small text file with unique license information
for her institution’s license.
Once the analyst has access to these setup files, a setup file is run on the
analyst’s PC, and this setup file takes a while to extract. The installation
must be monitored because the user has to click through many menus. This
is because these SAS institutional licenses are very extensive and include
many add-ons and components. This heavy-handed installation process is
designed to make sure the analyst has access to all the components to which
she is entitled under the license (which, at a university, tends to be a large
volume).
However, when working in SAS, the analyst may encounter the rare case
in which she is using an analytic function that is not included in the SAS
license components, and this throws up an error message. This is a confusing
situation because the error message does not usually point to a missing
component; it simply rejects the code as being incorrect in some way, so
troubleshooting can be challenging.
Because R is open source, meaning that the code and documentation for how
R runs are readily available on the Internet and that much of its development
is done by volunteers, there is no need for the advanced functions to control
licensing that SAS employs. This makes the way R is distributed different from
SAS. Anyone can go to the public website called “The Comprehensive
R Archive Network” (CRAN) (https://cran.r-project.org) [1] and download
the latest version of R for Mac or Windows and install the core
program. The user interface (UI) looks slightly different on Mac versus
Windows, but the differences are minor.† The R installation file is small
relative to the SAS installation file, and once the user downloads this file
and runs it, the setup wizard is clear and easy, making for a quick
installation process.
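Once the core program is installed, a quick check from the R console can confirm what was installed. The following is a minimal sketch that uses only base R; the variable names are illustrative, and the version string, platform, and package list it reports will differ from machine to machine.

```r
# Report the version of R that is currently running
r_version <- R.version.string
print(r_version)

# Show the platform this copy of R was built for
print(R.version$platform)

# List the packages that ship with the core installation (priority "base")
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)
```

The short length of the last list is a reminder of how small the core program is before any additional packages are added.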
* Per SAS, SAS versions developed since Macintosh OS X came out are not supported on OS
X. A modification of SAS called JMP (pronounced “jump”) has been developed for Mac and
Windows users alike, although it has greatly reduced functionality. SAS also has a web-based
version that can run in Mac and Windows browsers which is mainly used in teaching rather
than production environments.
† The examples given in this book refer to the Windows version of R.
Licensing Differences: SAS versus R
The exact components that make up the SAS institutional license are
communicated to SAS during the installation process at the step where the user is
asked by the setup wizard to reference the small text file provided with the
setup files. As stated earlier, the list of components included in a SAS
institutional license is usually large. When SAS negotiates enterprise licenses
with universities, they are set up such that a large volume of components is
included, and the student or faculty member pays only a negligible fee or nothing to
obtain this licensed version, provided they can prove their status with the
university. This is because SAS wants to promote use and learning of all its
components at universities.
Importantly, SAS prices its licenses for non-university businesses
differently. As an example, an independent nonprofit research institute on
the campus of a state university in Florida contacted SAS and asked if the
research institute could use the university license. SAS did not agree, and
instead, prepared a license for PC SAS strictly for the research institute. This
license only had one seat, and only had the base component of SAS (“base
SAS”) and the basic engine that runs regression functions (“SAS STAT”). This
one-seat license cost the nonprofit approximately $10,000 per year in 2007, so
the nonprofit could not add extra seats. Hence, researchers who learned SAS
at the university and then went on to be hired as scientists at the nonprofit
research institute were not able to use their extensive knowledge of all the
components of SAS due to the prohibitively expensive licensing approach.
Because R is open source, there is no licensing fee, so R is perfectly suited
for public health work in low- to middle-income (LME) countries,
nonprofits, and businesses with a low profit margin. This book can hopefully help
bridge the gap between SAS and R for SAS-experienced healthcare analysts
who are priced out of the market by this situation.
SAS Components versus R Packages
Base SAS, the core of SAS, is a rather large program. As mentioned before, the
main R program that is downloaded from CRAN is much smaller than the SAS
setup program, and therefore, downloading it is fast and installing it is very easy.
The drawback is that just about anything the user tries to do after installing R will
require an outside component not in the base program, called an R “package.”
Because R is not licensed for purchase, it is much more efficient for each
user to simply build their version of R by installing the packages needed.
Admittedly, this can be a daunting task, as some packages are based on other
packages, so these all need to be installed. For example, to make a
Kaplan–Meier plot in R, the user must install the “survival” package, as well as the
“KMsurv” package [2]. More recently, packages are designed to automatically
install packages on which they are dependent, so this problem does not
occur as frequently anymore.
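To give a sense of what a package adds, the following is a minimal sketch of a Kaplan–Meier plot. It assumes the survival package is already installed, and it uses the lung dataset bundled with that package purely for illustration; that dataset is an assumption of this sketch and is not taken from the text. The sketch relies only on functions from survival.

```r
# Load the survival package (assumed to be installed already)
library(survival)

# Fit a Kaplan-Meier estimate for the whole sample in the bundled lung data;
# time is follow-up in days and status distinguishes censoring from death
km_fit <- survfit(Surv(time, status) ~ 1, data = lung)

# Draw the estimated survival curve
plot(km_fit, xlab = "Days", ylab = "Proportion surviving")
```

None of these functions is available until the package is loaded, which is why package installation is usually the first step of an R analysis.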
Luckily, as with the base R program, all the packages are free and are easy
to install from within the program. In the native R UI, there is a menu with
only seven selections, and “packages” is one of them. If the user chooses the
“packages” menu and selects “load package,” the UI presents a list of pack-
ages that can be installed, and the user just needs to select the correct ones
and install them. Packages can also be installed using commands, but the
load package menu makes the process extremely easy.
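The command route mentioned above can be sketched as follows. This is a minimal illustration rather than a prescribed workflow: the object names `needed` and `missing_pkgs` are made up for the example, the package names are the two from the Kaplan–Meier example (spelled as they appear on CRAN), and the actual installation is skipped unless R is being run interactively.

```r
# Packages this analysis needs (names taken from the example in the text).
needed <- c("survival", "KMsurv")

# Which of them are not yet present in the local R library?
missing_pkgs <- setdiff(needed, rownames(installed.packages()))

# Install only what is missing. dependencies = TRUE also installs the
# packages on which these packages rely. The interactive() check keeps the
# sketch from downloading anything when it is run as an unattended script.
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs, dependencies = TRUE)
}

# Once a package is installed, library() makes it available in the session:
# library(survival)
```

Installing a package is done once per R installation, whereas `library()` must be called in every session that uses the package.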
What is RStudio?
This book gives guidance on using the PC version of R in Windows as it
appears with its native UI. R’s native UI is typically sufficient to be used by
healthcare analysts to develop statistical models. However, there is also an
integrated development environment (IDE) that can be used called RStudio.
This is supported by the R Consortium, which is a collaboration between the
R Foundation (which maintains CRAN), RStudio (a collection of R develop-
ers working on the IDE), and other big tech companies such as Microsoft and
Google [3]. Like R, RStudio is also open source and free to individual users.
RStudio differs from R in that it is an IDE and includes a source code
editor, build automation tools, and a debugger. RStudio can be run as a desk-
top or server version, so it is used at universities in programs that teach pro-
gramming in IDEs [4]. It is an excellent tool for deploying Shiny, which is
an R package that interfaces R with the web and turns R analyses into web
applications [5].
Generally, the capabilities afforded by RStudio are required for deploying
web applications, but for hypothesis-driven healthcare analytics, RStudio
can be overkill. For example, RStudio has several windows that are associ-
ated specifically with the IDE, and would not appear in R. Therefore, unless
the analyst needs an IDE, using R rather than RStudio is preferred.
Maintaining Current Versions in SAS versus R
When using a university SAS license, there is a month in the year that the
license expires because these are set up as yearly licenses. When the license
expires, the user can still open the SAS program, but it throws up an error
message indicating the user will need an updated text file with a new license,
and does not let the user unlock the program until this is loaded. At this
point, the user must obtain the new text file from the university, and from
within SAS, load this license, thus unlocking the program. Although SAS
does update its base program with a new version from time to time, it does not
do it often, so renewing the license is much more common than installing a
new version of SAS.
Part of the reason SAS does not update its base program often has to do
with how it determines what to put in updates. Independent SAS program-
mers typically develop macros (canned code procedures) in SAS macro
Differences between SAS and R 5
language and make these available on the Internet. These are not held in a
central repository but posted all over the Internet, and are also highlighted
in SAS white papers presented for regional as well as national and interna-
tional user groups [6–8], which are also not available in a central indexed
repository (although they are posted on SAS-sponsored websites and are
easy to find through a search on the web).
If a particular macro becomes popular with SAS users, some may choose
to write a peer-reviewed article about it [9], fostering discussion of the macro
in the SAS community. If SAS receives enough requests to include the macro
as a main function in SAS, and can verify the functioning of the macro, SAS
may choose to include it in its next build. This is a decision made by SAS
on a business level, not by the SAS user community, so opinions may differ
between the user community and the business as to what macros should be
included as procedures, or “PROCs,” in new versions of SAS.
R does not have the constraints SAS has with respect to innovation and
change management. First, its base program is very lean, which means it
is not difficult to update and disseminate. For this reason, new versions of
R can be released quite often (at least once per year). The penalty for the
R user is that if she wants to update her core version of R, she will have to
reload all the R packages she has been using to carry on with her analy-
sis. An advantage is that the newest version is always available on the
CRAN site for free, as are the packages, and this means that the only loss
in efficiency is through time required to update the R core program and
reload the packages. Also, R developers have created packages available
on CRAN that can be loaded and configured to run automated functions
to help R shops keep their base R and all its packages updated on a regular
basis.
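One way to soften the reloading penalty described above is to record the installed packages before upgrading and compare afterward. The following is only a sketch under stated assumptions: the file name `package_snapshot.csv` and the object names are hypothetical, and the temporary directory is used so the example runs anywhere (a real snapshot would be saved somewhere that survives the upgrade).

```r
# Before upgrading: record the packages in the current library.
ip <- installed.packages()
snapshot <- data.frame(Package = ip[, "Package"],
                       Version = ip[, "Version"],
                       row.names = NULL,
                       stringsAsFactors = FALSE)
snapshot_file <- file.path(tempdir(), "package_snapshot.csv")  # hypothetical location
write.csv(snapshot, snapshot_file, row.names = FALSE)

# After upgrading: compare the snapshot with what the new installation has.
old <- read.csv(snapshot_file, stringsAsFactors = FALSE)
to_reinstall <- setdiff(old$Package, installed.packages()[, "Package"])

# These two calls need an internet connection, so they are left commented out:
# install.packages(to_reinstall)   # reload the packages that are missing
# update.packages(ask = FALSE)     # bring installed packages up to date
```

Run within a single session, `to_reinstall` is empty; it only becomes informative when the second half is run in the newly installed version of R.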
SAS macros that are not available within the program are documented all
over the Internet by individual programmers, just like R packages that are
not official or are under development. However, macros built into SAS as
commands or “PROCs” are well-documented in SAS’s online and paper help
files. Similarly, R’s published packages are well-documented on the CRAN
website [10], as standardized comprehensive documentation is required to
be approved to have a package published to CRAN.
When SAS updates its program, it includes updates to its core, new mac-
ros built in as PROCs, and updates to old PROCs and functions as well. This
means that production operations dependent on SAS can effectively plan
for change management. Conversely, because R updates its core regularly,
and package authors update their packages on their own schedules,
maintaining the most up-to-date version of R is rather challenging
in a production environment. Independent tools, functions, and
routines have been developed to assist with this task [11], but
those running R in production must be proactive in
maintaining the most up-to-date versions of R and of the packages used in
production.
It must be noted that R has a different method than SAS in deciding what
packages to include on CRAN. As with SAS, R users can prepare and publish
unverified packages for R online [12,13], and adventurous R users can
test them. If the package is deemed worthy by many, the author may apply
to have the package included on the CRAN site (and thus have it show up
in the “load package” menu). But depending upon the package, installing
it may effectively require devoting 15 minutes or more to loading
the other packages on which it is based, and testing the
results to make sure the new package can actually run. When this happens,
the R user may nostalgically muse about how easy it is to run macros and
PROCs in SAS by comparison, but once the packages are sufficiently loaded
in R and running well, this nostalgia generally passes.
SAS versus R User Communities
SAS is known for its expansive support for its user community. SAS orga-
nizes “SUGs,” or SAS User Groups, and provides logistic and financial
support for conferences requested by user groups [6–8,14,15]. These confer-
ences produce an extensive set of non-peer-reviewed white papers that pro-
mote the use of all of SAS’s various components by helping the user deal
with the unnecessary complexity of SAS. These white papers are unfor-
tunately of varying quality; some are well-written and helpful, whereas
others are poorly written and confusing, and there seems to be little in the
way of consistent standards as to what is accepted as a SAS white paper.
Nevertheless, SAS users are provided a fertile opportunity to thrive aca-
demically with SAS given the many opportunities offered to publish white
papers and present them at meetings of SUGs, as well as the vast amount
of documentation available online that results from the publication of these
white papers, which effectively facilitates the adoption of SAS’s more com-
plex functions.
On the other hand, R, which is open-source software and therefore
does not have a conference or marketing department, has a different landscape
when it comes to user groups and community gatherings. Meetup.com,
an online platform designed with tools for building communities
through the scheduling of informal meetings or “Meetups,” has
been used extensively by the R community to gather old hands as well as
early adopters together informally for mutual presentations [16]. Unlike
with SUGs, these Meetups are community-initiated and supported, so they
are held in cheaply available spaces, and often garner little or no industry
financial support.
Happily, this trend is changing, as R is piquing interest with big play-
ers, most notably Microsoft, which purchased Revolution Analytics in 2015.
Revolution Analytics was an R shop that specialized in improving data handling
in R so as to enable it to process larger datasets [17]. In addition, larger
scale gatherings, such as the Effective Applications of the R Language (EARL)
conference, which is now held twice per year on two different continents, are
sponsored by various members of industry, including Mango Solutions, a
long-time big-ticket R analytics consulting group founded in 2002, as well as
Microsoft itself [18].
SAS versus R User Interfaces
The PC SAS interface provides a bird’s-eye view of many of SAS’s functions
all at once. On the left panel, the user can toggle between a list of results
(output) and an explorer window that will show the contents of “work” (the
temporary directory where SAS puts datasets while it is running) as well as
other libraries mapped with the LIBNAME statement. Through this panel,
the user can easily open and view the datasets (in *.sas7bdat format) in the
mapped libraries.
In the main panel, users can display multiple windows at once. The user
can open multiple code windows (which can be saved as *.sas files), display
the log window (which shows the log of executed statements as well as error
messages, and can be saved as a *.log file), as well as view an output file
that can be displayed in hypertext markup language (HTML) and saved
separately.
R’s interface runs differently. Like SAS, the R PC interface allows for multi-
ple windows to be opened at once, but the function of the windows is differ-
ent. When R is launched, only one window is opened automatically, and this
is called the R console. This window is where statements will be executed,
log statements will be recorded, and tabular and numeric output will appear.
Next, the user can open one or more code editing windows. R has a menu
with only five choices: File, Edit, Packages (described earlier), Windows, and
Help. Choosing File—New Script from the menu will open a new code editing
window. If the user saves a file from this window (choose File—Save
As when the code window is selected), it will be saved as an *.R file.* Please
note that it is easy to open *.R files in Notepad or another word processing
program without needing to load R.
Once the user has a code window open, she can prepare code and run it
in the console by either copy/pasting it from the code editing window into
the console and pressing “enter,” or by highlighting the code in the code
window, right-clicking and choosing “Run line or selection” (also Control-R),
and this will transfer the code to the console and run it. Or, she can simply
type code directly into the console and hit “enter,” and the code will run.
An important point to be made here is that although R has an extensive
array of statements that could be used to do the equivalent of mapping a
LIBNAME in SAS, probably the easiest way to set a single default directory
is to run the R program, select the console, and then, using the menu, choose
File—Change Dir. This allows the user to navigate to the directory on her
PC or server where most of her R project is stored, which makes loading
code from that directory much simpler throughout the session. Please note
that this designation ends when the R session ends. Also, please
note that if the code window is selected and the user chooses
File—Open Script, the browser will default to the last code directory used,
but if the console is selected, the same operation will bring the user to the
directory designated through File—Change Dir.
* Our experience is that occasionally, the *.log extension used in SAS is already designated to
default to another application in Windows (thus prompting a dialogue box during installation).
So far, we have not encountered extension conflicts in R.
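The menu choice for changing the directory has a command equivalent in `setwd()`. A minimal sketch, assuming a stand-in folder (`project_dir` is a hypothetical name, created here under the session's temporary directory so the example runs anywhere; a real project would point to its own path):

```r
# Stand-in for the analyst's project folder (hypothetical path for illustration).
project_dir <- file.path(tempdir(), "my_r_project")
dir.create(project_dir, showWarnings = FALSE)

# setwd() sets the default directory for the session; it returns the
# previous working directory invisibly, so that can be kept for later.
previous_dir <- setwd(project_dir)

getwd()        # shows the directory now in effect
list.files()   # lists the files in that directory

# Return to the earlier directory. As with the menu choice, none of this
# persists after the R session ends.
setwd(previous_dir)
```

Placing a `setwd()` call at the top of a code file is one way to avoid repeating the menu step in each session, at the cost of hard-coding a path into the code.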
The console is the place where the following appear: log messages, error
messages, and non-graphical output. For example, if the user requests a fre-
quency table, or conducts a calculation, or asks for the number of observa-
tions in a dataset, after the line of code is run in the console (displayed in red
text), this information will be reported subsequent to the code (displayed in
blue text). The console will continue to fill up with a history unless the user
chooses Edit—Clear Console. Clearing the console from time to time may be
helpful for the user involved in a large project, and might be seen as equiva-
lent to clearing the log file in SAS.
Because so many different types of messages appear in the console when
R code is run, saving “log files” is not as straightforward in R as it is in SAS.
(Please notice that the menus change depending upon whether the console
or the code window is highlighted.) Other options exist in the console’s set
of menus, but a user who wants to save elements of
the session (such as error messages for later troubleshooting, spontaneously
formed code in the console, or log messages) may be best served by selecting
the console window, choosing Edit—Select All, and then choosing Edit—Copy.* Next, the user
can switch to a word processing environment (such as Notepad or Microsoft
Word) and choose Paste or Control−V.
At this point, it is important for the user to edit the pasted text to remove
the elements of the output that she does not want to keep in the manually edited
“log file.”† Because the R log files require some hand-editing, there are both
pros and cons. An advantage in R, unlike in SAS, is that the user can easily
remove parts of the text that are unimportant to her (such as confirmation
of reading in a data file, or reports of successful execution of simple com-
mands). It is also easy to annotate these parts to provide insight into the doc-
umentation being kept. The disadvantage is that this effort constitutes work
that is generally not done when SAS log files are saved. In reality, however,
anyone who has dug through SAS log files to troubleshoot the execution of
a large batch of code might not consider this feature of R a disadvantage!
* Please note that the menu must be used to execute the Select All command, but after that,
Control-C can be used in lieu of Edit – Copy command.
† Users familiar with SPSS will consider this reminiscent of the output window in SPSS, which
is easy to save in an SPSS analysis, but hard to understand after the fact if the file is not some-
how reduced, rearranged, or annotated.
This is because actually organizing a log file into what is important and
annotating it builds in efficiency later when troubleshooting may be neces-
sary, especially when a programming team is involved. Otherwise, it may
not be helpful to save the log file in the first place.
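For analysts who would rather capture console output with code than by copying and pasting, base R provides `sink()`. The sketch below is an illustration, not a replacement for the hand-edited log described above: the file name and the `smoker` data are made up, and `sink()` as used here captures printed output only, not the code itself, warnings, or error messages.

```r
# Hypothetical log file location; tempdir() is used so the sketch runs anywhere.
log_file <- file.path(tempdir(), "session_log.txt")

# From here on, printed output is copied to the file as well as the console.
sink(log_file, split = TRUE)

smoker <- c("yes", "no", "no", "yes", "no")   # made-up data
print(table(smoker))                          # frequency table
print(length(smoker))                         # number of observations

# Stop copying output to the file.
sink()

# The file now holds the printed output, ready to be trimmed and annotated.
saved_lines <- readLines(log_file)
```

The resulting text file can then be edited and annotated in the same way as text pasted from the console.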
Finally, R handles graphical output files somewhat similarly to PC SAS in that
it opens and displays graphics in a separate window. The window automatically
opens when graphics requests are run in the console, and the window is titled
R Graphics. Once this window is open, the user can select it (which changes the
menu options) and choose Save As, where the user is presented a list of formats
to choose from, including the popular *.pdf, *.png, and *.jpg. If *.jpg is chosen, the
user is offered the selection between 50%, 75%, and 100% quality.
R’s handling of graphics represents a strong advantage over SAS, because in
SAS, the output function for graphics always involves the rigid and complex
“output delivery system,” or ODS, which produces files that generally need
to be post-processed in another program to be made even minimally pre-
sentable for journal publication or even presentation in an informal meeting.
In SAS, the alternative to this requires advanced expertise on the part of the
programmer to set options in SAS to modify the output as SAS commands
are executed so the resulting graph does not need extensive post-processing.
With R, this entire complexity is avoided, and even a basic graphics editor
such as Paint can be used to manually add detail to *.png and *.jpg graph-
ics files saved from R in this way. Further, as will be demonstrated later in
this book, but most specifically in Chapter 2, little expertise is required to
add elements through programming code to graphical output in R, such as
adding labels to x and y-axes, whereas these functions require much more
programming ability when done in SAS.
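As a small preview of the kind of code involved, the sketch below draws a histogram with a title and axis labels and saves it to a file through code instead of the Save As menu. The data are simulated and the file name is hypothetical; a PDF device is used here only because it is available in every R installation.

```r
set.seed(2019)                                  # makes the made-up data reproducible
age <- round(rnorm(200, mean = 52, sd = 11))    # hypothetical ages, for illustration only

plot_file <- file.path(tempdir(), "age_histogram.pdf")  # hypothetical file name

pdf(plot_file, width = 7, height = 5)   # open a PDF graphics device (size in inches)
hist(age,
     main = "Age at enrollment",        # plot title
     xlab = "Age (years)",              # x-axis label
     ylab = "Number of participants")   # y-axis label
dev.off()                               # close the device so the file is written
```

Running the `hist()` call on its own, without `pdf()` and `dev.off()`, displays the same plot in the R Graphics window.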
Code Documentation and Metadata: SAS versus R
A classic challenge in any statistical program, when working with large
datasets on a complex, long-term health data analytics project involving
an interdisciplinary team, a lot of data, a lot of code, and a lot of time, is
entropy.* A part of this entropy that can drive the programmer to distraction is
having trouble keeping track of the meaning, content, and method of generation
of both the native variables (the variables present in the original data file used in
the analysis) and the new variables the programmer creates through code that
edits data. For this reason, in both SAS and R as well as any other statistical program,
the following modern process for maintaining code files is recommended:
1. Code files should be relatively short and focused on only one func-
tion (e.g., reading in a dataset, running histograms, running fre-
quencies, running regressions).
* A close colleague has referred to this phenomenon as “biostatistics paper hell.”
2. Code files should have a numerical prefix that causes the code files
to line up in the order in which they should be run, followed by a
shorthand label indicating the function of that code (e.g., 100_Read
in data, 105_Run summary statistics, 110_Create age group variable).
3. Advanced users can designate logic and naming conventions with
respect to these prefixes (code starting with 00n indicates pre-
processing, code starting with 10n indicates code that edits data,
code starting with 20n indicates analytic code that does not edit
data, code starting with 70n indicates exploratory code).
4. Prefixes should not be named in increments of 1 to allow for insertion
of code in between code files at a later date (e.g., naming code “100_
read in data,” with the next code being named “105_create age group
variable” will allow for the possibility of insertion of code “103_*”
after the fact, in case minor post-processing is found to be required
after reading the data in).
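One practical payoff of the numbering scheme above is that the code files can be run in order with a few lines of R. The following is a minimal sketch under stated assumptions: the folder `demo_project`, the three tiny code files, and the variable names are all invented for the illustration, and file names use underscores in place of spaces.

```r
# Hypothetical project folder with three tiny code files, created here
# only so that the sketch is self-contained.
code_dir <- file.path(tempdir(), "demo_project")
dir.create(code_dir, showWarnings = FALSE)
writeLines("dat <- data.frame(age = c(34, 61, 47))",
           file.path(code_dir, "100_read_in_data.R"))
writeLines("print(summary(dat$age))",
           file.path(code_dir, "105_run_summary_statistics.R"))
writeLines("dat$age_group <- ifelse(dat$age >= 50, '50+', 'under 50')",
           file.path(code_dir, "110_create_age_group_variable.R"))

# Pick up every file that starts with a three-digit prefix and an underscore;
# sorting the names puts them in the order in which they should be run.
scripts <- sort(list.files(code_dir, pattern = "^[0-9]{3}_.*\\.R$",
                           full.names = TRUE))

# Run the whole sequence from the beginning in one shared environment.
run_env <- new.env()
for (script in scripts) {
  source(script, local = run_env)
}

run_env$dat   # the dataset as it stands after all code files have run
```

Because the files are found by their prefix, a file such as `103_*` added later is picked up in the right position without any change to this loop.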
An analogy can be made to making a movie. Although viewers perceive the
movie as happening in chronological order, the director and actors are aware
that the scenes are actually not shot in chronological order. As a data “director,”
the programmer may develop “scenes” (programming commands) for later
in the “movie” (such as regression code), only to find that to make the story
straight and clear, she has to redo scenes from earlier in the movie (such as
generating a new variable in the analytic dataset). This way, if the age group
variable she created has an error or needs to be changed, she can go back to
the code earlier in the “movie,” edit this code properly, then start from the
beginning and run the whole “movie” of code from the beginning to the pres-
ent state to make sure the story hangs together after the edit.
Along with this principle, the importance of prolific and well-organized
comments in code, in both SAS and R as well as any other statistical program,
cannot be overstated. In SAS, commenting is done in between bookends of
/* and */ (e.g., /*This is a comment*/) or by starting a statement with * and
ending it with a semicolon (e.g., *This is a comment;). R uses an approach
similar to the second one, marking a comment with # (e.g., #This is a comment),
and has no equivalent of the first. Therefore, although old
hands at SAS might be familiar with SAS banners that precede the code in
code files, these are not used in R.
This discarding of banners before code actually represents an innovation;
in modern practice, SAS banners (as well as banners in other statistical code) are
strongly discouraged in health data analysis. Instead, the information that
would have belonged in the SAS banners of the 1980s and 1990s programming
styles should now be placed in an outside program that is more accessible
(such as Microsoft Word or Excel) and follow an organized standard. Banners
contain information that is human-generated, such as notes on interpretation of
variables or variances from standards in programming, not information that can
be automatically generated, such as a list of field names and attributes.
Many long-term SAS programmers have not learned this new principle,
and this inhibits clear communication and discussion about the nature of
the data, code, and analysis, as non-programmer managers and subject mat-
ter experts in other fields in healthcare and science cannot easily view this
metadata reported in the SAS banners. Hence, today, SAS banners should
be avoided in code, and R code should also not include banners, and should
only include prolific, carefully-worded and carefully-placed comments.
While some programmers shun excessive commenting, given the modular
approach recommended to developing code files, many code files are short,
so the comments do not complicate comprehension of the code.
Additionally, datasets themselves should not be documented in actual
code anymore, and all dataset metadata should be developed and made
available in easily accessible word processing and spreadsheet applications
(such as Microsoft Word and Excel) or else in the less-recommended PDF
format.* This means that SAS “labels” and “formats” should no longer be
used, and instead, this information should appear in a data dictionary docu-
ment, preferably in a spreadsheet format. An excellent example of publicly
available metadata that is presented in optimal format is the set of metadata
developed for the US Military Health System Data Repository, the MDR [19].†
The MDR’s use of this approach to metadata ensures that those receiving
MDR data files can study them adequately before attempting to read them
into any statistical software. This excellent set of metadata also provides
researchers from all fields necessary documentation to aid them in develop-
ing intelligent and informed data requests prior to receiving the data.
SAS programmers have historically embedded the metadata of variable
meanings, meanings of various categories in categorical variables, or infor-
mation about complex code functions that require explanation in their actual
SAS files, not in external metadata. This means that the actual file of SAS
code that assigns labels to variables in a dataset, and SAS FORMAT code
that labels the meanings of various levels of a categorical variable, would
sometimes be the only documentation of what these variables mean. This
effectively limits the ability of an interdisciplinary team to use the metadata
and complicates the interdisciplinary communication required to complete a
health analytics project successfully. It also essentially quashes the prospect
of group troubleshooting if multiple programmers are using different types
of statistical software.
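The part of a data dictionary that can be generated automatically (field names and attributes) can be produced from R and then completed by hand in a spreadsheet. The sketch below is one possible starting point, not a standard: the dataset, the column names of the dictionary, and the file name are all hypothetical.

```r
# Made-up analytic dataset, for illustration only.
dat <- data.frame(
  id     = 1:3,
  age    = c(34, 61, 47),
  smoker = c("yes", "no", "no"),
  stringsAsFactors = FALSE
)

# The automatically generated part of the dictionary: field names and types.
dictionary <- data.frame(
  variable    = names(dat),
  type        = unname(vapply(dat, function(x) class(x)[1], character(1))),
  description = "",          # to be filled in by hand in the spreadsheet
  coding      = "",          # e.g., the meaning of each category level
  row.names   = NULL,
  stringsAsFactors = FALSE
)

# Save as a CSV file, which opens directly in spreadsheet applications.
dictionary_file <- file.path(tempdir(), "data_dictionary.csv")  # hypothetical name
write.csv(dictionary, dictionary_file, row.names = FALSE)
```

The human-generated columns (`description` and `coding` in this sketch) are where the information formerly kept in SAS labels and formats would go.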
On the other hand, using the methods described here of code organiza-
tion, naming conventions, and metadata maintained external to the statisti-
cal program, the programmer can not only reduce the time and complexity
associated with troubleshooting but also make sure that such a “movie” can
* PDF format is harder to navigate, copy, and paste, compared to Microsoft Word and Excel.
† For an excellent example, the reader is encouraged to download the M2 data dictionary from
the MDR web site, which is formatted in Microsoft Excel, and browse the different tabs to
better understand the optimal format of metadata described in this section.
JAN. 1, 1861, TEARS ENDED JUNE 30 — TO JUNE 30, 1870
1871 TO 1880 1881 TO 1890 1891 TO 1900 1901 TO 1910 1911 TO
1920 7.800 6,734 17,094 85,984 787,468 72,969 7,221 31,771
72,206 718,182 353,719 20,177 88,132 50,464 1,452,970 597,047
20,062 52,670 36,006 543,922 15,996 655,694 31,816 ( 95,265 \
230,679 693,703 ( 6,723 ( 23,010 14.559 33.149 2.562 2,145,266
41,635 65,285 73,379 341,498 167,519 2,045,877 48,262 190,505
249,534 1.597,306 27,935 69,149 53,008 34,922 118,202 896,342
33,746 41,983 61,897 143,945 184,201 11,728 9,102 109,298 4,536
8,493 55,759 16,541 211,245 52,254 9,893 307,309 53,701 568,362
265,088 6,535 1,109,524 43,718 66,395 95,074 921,957 68,611
89,732 13,311 23,286 28,293 81,988 23,091 77,098 668,128 38,768
435,778 460.479 87,564 436,871 657,488 149,869 655,482 271,094
60,053 403,496 11,186 388,017 120,469 339,065 17,464 249,944
78,601 145,937 13,107 1,042,674 984.914 1,462,839 745,829
865,015 487,589 210 656 10,318 4,370 1,719 18,350 2,064,407
2.261,904 4,721,602 3,703,061 8,136,016 4,376.564 153,871 2,191
96 9,043 1,396 383.269 5,362 210 13,957 928 392,802 1,913 462
29,042 2,304 2,631 746 1,183 35,040 3,059 179,226 49,642 8,112
107,548 17,280 742,185 219,004 17,159 123,424 41,899 166,597
403,726 426,523 42,659 361,808 1,143,671 3,446 10,056 15,798
64,301 123,201 61,711 23,166 26 26,855 8,398 28,370 20,605
4,713 129,797 77,393 11,059 21,278 2,082 83,837 79,389 308 622
6,669 5,973 64,609 123.823 68,380 86,815 243,567 192,559 221
312 15,232 10,913 229 1,540 12,574 437 1,299 8,793 1,343 1,749
12,973 7,368 33,654 13,427 8,443 1,147 2,314,824 2,812,191
5,246,613 3,844,420 8,795,386 5,735,811
Includes Serbia, Bulgaria, and Montenegro prior to 1920; included in "Europe, not
specified," prior to 1891-1900; also, after 1919, Czechoslovakia,
Poland, and the Kingdom of the Serbs, Croats, and Slovenes.
Not separately stated prior to 1891-1900.
Immigrants from British North America and Mexico were not reported from 1886 to 1895, inclusive.
Not separately enumerated prior to 1899.
162 AMERICAN INTELLIGENCE decades from 1820 to 1920.
My own Tables 9 and 10 give the distribution of the intelligence
scores on the combined scale for the nativity groups we are
studying. Anybody who disagrees with the estimates given in Table
33 may take these tables and split them according to any other
estimates he wishes to make. However, minor changes in the
proportions given in Table 33 would make very little difference in
the final results. The figures which follow are merely estimates
based on Table 33. I am not claiming that these figures are
absolutely reliable, but merely that they represent very much closer
approximations to the truth than would be obtained from the
Northern and Western, and Southern and Eastern classification. To
obtain an estimate of the proportion of Nordic, Alpine, and
Mediterranean blood in our immigration since 1840, the immigration
figures by countries, given in Table 34, have been cut according to
the proportions given in Table 33 and re-combined into percentage
estimates which are given in Table 35. These estimates show in
general an immigration prior to 1890 which ran 40% or 50% Nordic
blood. Since 1890, the proportion of Nordic blood has dropped to
20% or 25%, the Alpine stock now constituting about 50% of the
total and the Mediterranean 20% or 25%. The proportions given in
Table 35 are shown graphically in Figure 41. The percentage
estimates, given in Table 35 and shown graphically in Figure 41,
should be considered in connection with the total volume of
immigration for each decade given in Table 34 and shown graphically
in Figure 42.
Table No. 35 Estimate of the amount of Nordic, Alpine and
Mediterranean blood coming to this country from Europe in each
decade since 1840.

DECADE      TOTAL         PER CENT.  PER CENT.  PER CENT.       PER CENT.
            IMMIGRATION   NORDIC     ALPINE     MEDITERRANEAN   OTHERS AND
                          BLOOD      BLOOD      BLOOD           UNCLASSIFIED
1841-1850   1,713,251     40.5       19.0       36.2            4.3
1851-1860   2,598,214     42.3       25.5       28.9            3.3
1861-1870   2,314,824     50.6       26.0       19.2            4.2
1871-1880   2,812,191     48.8       28.5       16.7            6.0
1881-1890   5,246,613     46.1       35.2       16.0            2.7
1891-1900   3,844,420     30.2       43.8       22.5            3.5
1901-1910   8,795,386     19.8       51.3       24.3            4.6
1911-1920   5,735,811     22.6       44.0       23.7            9.7
In order to obtain an
estimate of the intelligence of the three European races in this
country, the distributions of the intelligence scores on the combined
scale given in Table 9 were cut according to the proportions given in
Table 33, and re-combined into Nordic, Alpine, and Mediterranean
groups. The final distributions are, of course, neither purely Nordic,
Alpine, nor Mediterranean, but the sample of individuals we have
thus selected as Nordic is undoubtedly more typical of the Nordic
race type than it is of the Alpine and Mediterranean types. In the
same way, the Alpine and Mediterranean groups are more typical of
each of these race types than they are of either of the other two.
With thus much of apology for the method, I will, in the following
pages, simply for brevity of expression, call these groups Nordic,
Alpine, and Mediterranean. The reader must bear in mind that the
distributions are only approximate samplings. The actual
distributions on the combined scale of the three race groups so
selected are given in Table 36, together with the proportions in each
thousand. The distribution curves of the three groups are shown in
Figure 43, in which the horizontal direction represents scores on the
combined scale, and the vertical direction proportions in each
thousand making each intelligence score. The differences found are
very marked. The difference between the Nordic and Alpine group is
1.61 ± 0.042, a difference which is 38.3 times the probable error
of the difference. The difference between the Nordic and
Mediterranean group is 1.85 ± 0.042, a difference which is 44
times the probable error of the difference. The Alpine and
Mediterranean groups are, on the other hand, very much closer
together, the difference being 0.24 ± 0.04, a difference which is 6
times the probable error of the difference. The easiest and most
obvious objection that can be made
Table No. 36 Analysis of the foreign born white draft by
races. Distributions of the intelligence scores of the Nordic, Alpine
and Mediterranean groups.

COMBINED      ACTUAL DISTRIBUTION                PROPORTION IN EACH THOUSAND
SCALE
INTERVALS     NORDIC   ALPINE   MEDITERRANEAN    NORDIC   ALPINE   MEDITERRANEAN
24.0-24.9     ....     ....     ....             ....     ....     ....
23.0-23.9     ....     ....     ....             ....     ....     ....
22.0-22.9     3        1        ....             1        ....     ....
21.0-21.9     8        5        2                2        ....     ....
20.0-20.9     19       11       5                5        2        2
19.0-19.9     37       22       11               11       5        3
18.0-18.9     71       47       26               21       10       6
17.0-17.9     135      90       55               39       19       13
16.0-16.9     238      155      103              69       32       24
15.0-15.9     357      246      180              103      51       43
14.0-14.9     469      372      296              136      78       71
13.0-13.9     566      544      408              164      114      111
12.0-12.9     528      650      591              153      136      141
11.0-11.9     371      628      590              107      132      140
10.0-10.9     260      595      509              75       125      136
9.0-9.9       184      546      523              53       115      125
8.0-8.9       112      403      376              32       85       90
7.0-7.9       59       248      223              17       52       53
6.0-6.9       26       124      108              8        26       26
5.0-5.9       9        52       47               3        11       11
4.0-4.9       3        19       16               1        4        4
3.0-3.9       1        6        5                ....     2        1
2.0-2.9       ....     2        1                ....     ....     ....
1.0-1.9       ....     ....     ....             ....     ....     ....
No. of cases  3456     4766     4196
Average       13.28    11.67    11.43
S.D.          2.70     2.87     2.70
to these findings is that the
superiority of the Nordic group is due to the fact that it contains so
many English speaking persons, and that lack of facility in the use
of English is a handicap to the non-English speaking foreign born in
the army tests. We have previously examined this hypothesis in
connection with the argument establishing the fact that each
succeeding five year period since 1902 shows a gradual deterioration
in the intelligence of the immigrants examined in the army, and have
definitely shown that the language factor does not distort the scores
of the years of residence groups. There is, however, a considerable
amount of wishful thinking on the subject of race, and it is well to
make assurance doubly sure by testing the hypothesis that the
superiority of the Nordic group is caused by the presence in the
group of English speaking populations. It is possible to split the
Nordic distribution in such a way that one group will contain
representatives from countries which are predominantly English
speaking (England, Scotland, Ireland and Canada), while the other
group will contain representatives from countries which are
predominantly non-English speaking (Holland, Denmark, Germany,
Sweden, Norway, Belgium, Austria, Russia, Italy and Poland). This
we have done, and the results are given in Table 37, the two
distributions being shown in Figure 44. The distributions of the
English speaking Nordic group and the non-English speaking Nordic
group show a difference of 0.87 ± 0.065, a difference which is 13.4
times the probable error of the difference. There are, of course,
cogent historical and sociological reasons accounting for the
inferiority of the non-English speaking Nordic group. On the other
hand, if one wishes to deny, in the teeth of the facts, the superiority
of the Nordic race on the ground that the language factor
mysteriously aids this group when tested,
Table No. 37. Analysis of the total Nordic sample into an
English speaking Nordic group and a non-English speaking Nordic
group.

COMBINED         ACTUAL DISTRIBUTION        PROPORTION IN EACH THOUSAND
SCALE          ENGLISH     NON-ENGLISH       ENGLISH     NON-ENGLISH
INTERVALS      SPEAKING    SPEAKING          SPEAKING    SPEAKING
               NORDIC      NORDIC            NORDIC      NORDIC
24.0-24.9        ....        ....              ....        ....
23.0-23.9        ....        ....              ....        ....
22.0-22.9           2        ....                 2        ....
21.0-21.9           7        ....                 6        ....
20.0-20.9          12           6                10           3
19.0-19.9          21          16                17           7
18.0-18.9          39          32                32          14
17.0-17.9          67          67                54          30
16.0-16.9         108         131                87          59
15.0-15.9         143         214               116          96
14.0-14.9         176         293               143         132
13.0-13.9         201         365               163         164
12.0-12.9         172         356               139         160
11.0-11.9         109         262                88         118
10.0-10.9          70         189                57          85
 9.0- 9.9          49         135                40          61
 8.0- 8.9          31          82                25          37
 7.0- 7.9          16          43                13          19
 6.0- 6.9           7          19                 6           9
 5.0- 5.9           2        ....                 2           3
 4.0- 4.9           1           2              ....        ....
 3.0- 3.9        ....        ....              ....        ....
 2.0- 2.9        ....        ....              ....        ....
 1.0- 1.9        ....        ....              ....        ....

No. of cases     1234        2222
Average         13.84       12.97
S.D.             2.79        2.60
he may cut out of the
Nordic distribution the English speaking Nordics, and still find a
marked superiority of the non-English speaking Nordics over the
Alpine and Mediterranean groups. The difference between the
non-English speaking Nordic group and the Alpine group is 1.30 ±
0.047, a difference which is 27.6 times the probable error of the
difference. The difference between the non-English speaking Nordic
group and the Mediterranean group is 1.54 ± 0.047, a difference
which is 31.3 times the probable error of the difference. The
distributions are shown graphically in Figure 45. Discarding the
English speaking Nordics entirely, we still find tremendous
differences between the non-English speaking Nordic group and the
Alpine and Mediterranean groups, a fact which clearly indicates that
the underlying cause of the nativity differences we have shown is
race, and not language. It may be convenient for some to interpret
the differences found between the representatives of the three
European races in this country in terms of the standards having
popular significance which were used in Section VI. The criteria of
the per cent. A and B, and the per cent. D, D- and E give the
following results:

                                  PER CENT.     PER CENT.
                                  A AND B       D, D- AND E
English speaking Nordic             12.3          19.9
Total Nordic                         8.1          25.8
Non-English speaking Nordic          5.7          29.1
Alpine                               3.8          50.3
Mediterranean                        2.5          53.6

The criteria of the per cent. at or above the average white
officer, and at or below the average of the negro draft give the
following results:
                                  PER CENT.        PER CENT.
                                  AT OR ABOVE      AT OR BELOW
                                  AVERAGE          AVERAGE OF
                                  WHITE            THE NEGRO
                                  OFFICER          DRAFT
English speaking Nordic              4.0             10.9
Total Nordic                         2.3             14.5
Non-English speaking Nordic          1.3             16.5
Alpine                               1.0             34.5
Mediterranean                        0.5             36.5

The criterion of the per cent. below an approximate "mental age" of
eight gives the following results:

                                  PER CENT. BELOW
                                  "MENTAL AGE" 8
English speaking Nordic              0.8
Total Nordic                         1.1
Non-English speaking Nordic          1.3
Alpine                               4.2
Mediterranean                        4.2
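
The probable errors attached to the differences in this section can be checked against the summary rows of Tables 36 and 37. The following is a minimal Python sketch. The text does not state its formula, so the sketch assumes the conventional one: the probable error of a difference between two means is 0.6745 times the square root of the sum of the squared standard errors. The function and variable names are illustrative and do not come from the source. The sketch checks the arithmetic only; it does not bear on how the groups were formed or on how the differences should be interpreted.

```python
import math

# Assumption: probable error = 0.6745 x standard error (normal distribution).
PE_FACTOR = 0.6745

# (average, S.D., number of cases) as printed in Tables 36 and 37.
GROUPS = {
    "Nordic": (13.28, 2.70, 3456),
    "Alpine": (11.67, 2.87, 4766),
    "Mediterranean": (11.43, 2.70, 4196),
    "English speaking Nordic": (13.84, 2.79, 1234),
    "Non-English speaking Nordic": (12.97, 2.60, 2222),
}


def probable_error_of_difference(sd1, n1, sd2, n2):
    """Probable error of the difference between two independent means."""
    return PE_FACTOR * math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def compare(a, b):
    """Return (difference of averages, probable error, ratio) for groups a, b."""
    mean_a, sd_a, n_a = GROUPS[a]
    mean_b, sd_b, n_b = GROUPS[b]
    diff = mean_a - mean_b
    pe = probable_error_of_difference(sd_a, n_a, sd_b, n_b)
    return diff, pe, diff / pe


PAIRS = [
    ("Nordic", "Alpine"),
    ("Nordic", "Mediterranean"),
    ("Alpine", "Mediterranean"),
    ("English speaking Nordic", "Non-English speaking Nordic"),
]

for a, b in PAIRS:
    diff, pe, ratio = compare(a, b)
    print(f"{a} vs {b}: {diff:.2f} +/- {pe:.3f} (ratio {ratio:.1f})")
```

Under this assumption the computed probable errors round to the printed values of 0.042, 0.042, 0.04 and 0.065. The ratios come out slightly different from the printed ones where the text divides by the already rounded probable error.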
SECTION IX

RE-EXAMINATION OF PREVIOUS CONCLUSIONS IN THE LIGHT OF THE RACE HYPOTHESIS

It is now
necessary to retrace our steps for a moment to examine some of our
previous conclusions in the light of this new hypothesis. The
hypothesis that the differences between the nativity groups found in
the army tests are due to the race factor may be used to re-test our
previous conclusions that each succeeding five year period of
immigration since 1902 has given us an increasingly inferior
selection of individuals (Section IV). The periods which we sample
by means of the army data, and the average score on the combined
scale of each sample are as follows:

PERIOD        NUMBER OF CASES    COMBINED SCALE AVERAGE
1887-1897            764                 13.82
1898-1902            771                 13.55
1903-1907           1897                 12.47
1908-1912           4287                 11.74
1913-1917           3576                 11.41

Table 35, which gives our estimates of the per cent. of
Nordic, Alpine and Mediterranean blood coming to this country,
shows that the big change in immigration came between the
decades 1881-1890 and 1891-1900, the percentage of Nordic blood
which formerly ran from 40% to 50% having dropped to 30% in the
decade 1891-1900, and to approximately 20% or 25% in the two
subsequent decades. On the other hand, the big drop in the
intelligence of immigrants arriving came after 1902. The change in
character of the immigration
would account for part of the decline in the average intelligence of
succeeding periods of immigration, but not for all of it. The decline
in intelligence is due to two factors, the change in the races
migrating to this country, and to the additional factor of the sending
of lower and lower representatives of each race. The only tendency
which would relieve this deplorable situation would be a current of
emigration strong enough to counteract the current of immigration.
Table 6 preceding shows the ratio between emigration and
immigration for each of the nativity groups involved in this study,
and we find in general between 1908 and 1917 a return current
approximately one third of the arriving current. Unfortunately, no
emigration statistics are available prior to 1908, and the figures after
1912 are distorted by the Balkan and European wars. The only
sample that we can take that is comparatively free from outside
influences is the sample 1908-1912. Taking the figures of arrivals
and departures for this period, and dividing them into Nordic, Alpine
and Mediterranean groups according to the method previously
outlined, we obtain the following percentage estimates:

                                      ALIEN         ALIEN
                                      IMMIGRANTS    EMIGRANTS     NET
                                      ADMITTED      DEPARTED      IMMIGRATION
Per cent. of Nordic blood               21.2          16.0          23.9
Per cent. of Alpine blood               50.4          50.6          50.2
Per cent. of Mediterranean blood        23.2          28.6          20.5
Per cent. others and unclassified        5.2           4.8           5.4
The sample from this five
year period shows a slight change (approximately 3%) in favor of
the Nordic type and against the Mediterranean type, the Alpine
immigration holding its own. There is therefore no relief from our
receding curve of intelligence from emigration, if this five year
period be taken as typical of the outward alien passenger movement
in other years. It will be remembered that the army authors
tentatively offered the hypothesis that the more intelligent
immigrants remained in this country, while the more stupid ones
went home, as a possible method of accounting for the increase of
intelligence scores with increasing years of residence. The gain of
3% in favor of the Nordic immigration would produce a very slight
tendency in this direction, but not enough to account for the actual
increase of intelligence scores found with increasing years of
residence, 11.41 (1913-1917) to 13.82 (1887-1897). It will also be
remembered that the army writers offered the hypothesis of the
better adaptation of the more thoroughly Americanized group to the
situation of the examination to account for the increases shown. The
factor of the adaptation to the situation of the examination cannot
be dissected out of the total scores of the test. If such a factor were
present, it would fall equally heavily on Nordic, Alpine and
Mediterranean alike, unless the change in the character of
immigration were so complete that the groups sampled at the two
extremes of the residence groups (1887-1897 and 1913-1917)
represented different race groups. But the difference between these
two years of residence groups (2.41 ± 0.0735) is so marked that it
would be necessary to assume (if our Nordic group were the more
thoroughly Americanized) that the 1887-1897 group was composed
entirely of English speaking Nordics or their equivalent in
intelligence, and that our 1913-1917 group was