ASSOCIATION MAPPING WITH GAPIT
Genome Association and Prediction Integrated Tool
We are going to use this package to carry out association mapping using the mixed model.
Even if you do not grasp all the detail at a first run through this tutorial, I hope there is
enough information here to help you run association mapping on your own data sets. For this
reason, I’ve given a lot of detail.
TriticeaeGenome http://www.triticeaegenome.eu was a European collaborative project which
developed a panel of 384 UK, French and German winter wheat varieties. We have used this
previously to illustrate the cross-locations analysis of phenotype data. Here we are going to
use the variety means from those analyses as phenotypes for association mapping.
GAPIT is R software from the Buckler Lab (http://www.maizegenetics.net/). It is an
alternative to TASSEL, which some of you may be familiar with. In my hands TASSEL does
not function well. In addition to association mapping using mixed effects modelling, GAPIT
will also carry out a form of genomic selection.
Unlike most packages in R, GAPIT doesn’t install or load simply. The first time you use it,
you must issue the following commands (cut and paste them into the R command window;
they are also given in the GAPIT manual,
http://zzlab.net/GAPIT/gapit_help_document.pdf):
source("http://www.bioconductor.org/biocLite.R")
biocLite("multtest")
install.packages("gplots")
install.packages("LDheatmap")
install.packages("genetics")
install.packages("EMMREML")
install.packages("scatterplot3d")
library(multtest)
library(gplots)
library(LDheatmap)
library(genetics)
library(EMMREML)
library(compiler) #this library is already installed in R
library(scatterplot3d)
source("http://zzlab.net/GAPIT/gapit_functions.txt")
source("http://zzlab.net/GAPIT/emma.txt")
In subsequent R sessions, you can omit the first seven lines unless you wish to update those
packages. If you cut-and-paste these commands from here and they do not work, it is worth
downloading the latest version of the GAPIT manual and checking that the location and
names of some of the files have not changed. This has happened before.
If you type:
ls()
you will see that you have now loaded an extensive number of functions into R with names
beginning with ‘GAPIT’ or ‘emma’. All of these functions are called by options within a
single R command “GAPIT().” All options are listed in the manual (though some appear to
be either not implemented or not working). The manual is not that clear, I find.
Before running GAPIT, we need to upload our data and create a kinship matrix. GAPIT will
create a kinship matrix on the fly, but this is not necessarily appropriate, as we shall see.
There is also some pre-processing required of the genotype data, but I have done most of this.
If you haven’t already done so, change directory to the “GAPIT” directory found in today’s
course folder. You will shortly be creating a large number of output files.
GAPIT only works with SNPs or binary markers. To analyse SSRs, each allele could be
tested for association independently, possibly with some amalgamation of the results at the
end, but this is an unnecessary complication for demonstration purposes.
I also deleted markers with a minor allele frequency (maf) <2%. Most association mapping
studies delete loci with low maf as they
are unlikely, on their own, to detect any association, even if they are functional. The
threshold is somewhat arbitrary. In the current case, <2% means less than 8 copies of the rare
allele. There is an option within GAPIT to set the maf threshold for each trait in turn: patterns
of missing data among the traits can alter the allele frequency for the phenotyped individuals.
Finally, I deleted markers with > 10% missing data (ie > 37 observations). These markers are
likely to be of poor quality.
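If you want to apply the same filters to your own data, a minimal sketch is given below. It
assumes a hypothetical data frame called geno.raw with variety names in column 1 and
markers coded 0/1, with NA for missing data; the object names and thresholds here are my
assumptions, not part of the course files.
scores <- geno.raw[, -1]
maf <- apply(scores, 2, function(x) {
  p <- mean(x, na.rm = TRUE)                 # frequency of the '1' allele
  min(p, 1 - p)                              # minor allele frequency
})
miss <- apply(scores, 2, function(x) mean(is.na(x)))   # proportion of missing data
keep <- maf >= 0.02 & miss <= 0.10
geno.filtered <- cbind(geno.raw[, 1, drop = FALSE], scores[, keep])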
Data are in the Excel file ‘TG data for course’ in the sheet ‘selected raw data’. The top two
rows are for information only, and give the map position, where known. Data have been
sorted on map position. The position of the mapped markers is given in the second sheet. All
the data required for this exercise have also been pasted into separate text files, which is less
confusing here than repeatedly copying data to the clipboard and reading it in from there.
There are two remaining stages of data preparation for the marker data.
Firstly, we need to select a subset of the markers to use for kinship calculation. Kinship
matrices, and any operations upon these matrices, such as PCO and tree construction, are
dominated by dense clusters of markers, if these exist. So a tree, for example, will not
necessarily reflect average relationships over the genome, but the relationships among
varieties in the genomic regions which have the highest number of markers. This is
particularly a problem if, as is often the case, the dense clusters of markers are also in high
LD. For this reason, it is good practice to thin the markers to give a more uniform
representation of relationships over the whole genome. One way of doing this is to calculate
the correlation coefficient among all markers and delete one marker of each pair in high LD
(say r>0.9). Another is to delete markers if they are too closely linked.
There is a fine balancing act in thinning markers. If markers are not uniformly distributed
over the genome, then marker-trait associations will be over-corrected in some parts of the
genome and under-corrected in others. If too few markers are left, then kinship will not be
accurately estimated and adjusted for. For our data, we shall use only the mapped markers
for kinship estimation, deleting all but one of any markers which map to the same position.
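The thinned set used below was prepared in advance, but if you need to thin markers for your
own data, here is a minimal sketch of both approaches. It assumes the map rows are in the
same order as the marker columns and that the map has columns called Chromosome and
Position; those column names, and the cut-off of r > 0.9, are my assumptions.
# (a) keep only one of any markers mapping to the same position
unique.pos <- !duplicated(geno.map[, c("Chromosome", "Position")])
kin.subset <- geno.imp.corrected[, c(TRUE, unique.pos)]   # TRUE keeps the name column
# (b) or drop one marker of each pair in high LD (r > 0.9)
r <- cor(geno.imp.corrected[, -1])
drop <- rep(FALSE, ncol(r))
for (i in 1:(ncol(r) - 1)) {
  if (drop[i]) next
  high <- which(abs(r[i, ]) > 0.9)
  drop[high[high > i]] <- TRUE
}
ld.thinned <- geno.imp.corrected[, c(TRUE, !drop)]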
The map for the 1,117 markers I selected as markers for kinship calculation is in the sheet
“kinship map.”
The second stage of data preparation for the markers is to impute the missing data: we fill in
the missing values with our best guess at what the alleles would have been if the genotyping
had been successful. There are two good reasons for imputing the missing data. Firstly
several operations on the markers, including kinship calculation, do not handle missing data
well or at all. We could delete any marker (or variety) with missing data, but we would usually
end up with very few varieties or markers left. Secondly, if we can make accurate estimations
of the missing values we gain power in our analyses by increasing the sample size.
There are packages specifically targeted at imputing marker data which incorporate LD
information among markers directly (“Beagle” is the standard). There are also general
packages for imputation of missing data, such as “impute” and “softImpute” in R. In practice,
especially for the kinship matrix, inserting missing data with the mean allele frequency (if
genotypes are coded 0 and 1) or two times the mean allele frequency (if genotypes are coded
0, 1 and 2) will work adequately.
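If you needed to do this yourself, mean imputation takes only a few lines. This is a minimal
sketch assuming genotypes coded 0/1 with NA for missing data, in a hypothetical data frame
geno.raw with variety names in column 1.
scores <- geno.raw[, -1]
for (j in seq_along(scores)) {
  m <- mean(scores[[j]], na.rm = TRUE)      # mean allele frequency for marker j
  scores[[j]][is.na(scores[[j]])] <- m      # fill missing values with that mean
}
geno.mean.imp <- cbind(geno.raw[, 1, drop = FALSE], scores)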
For this example, I have already imputed missing data, using the package Imputation (which
is no longer supported).
The raw genotype data are in “selected raw data.” However, we’ll use the imputed data in
the sheet “imputed data”. Note the data are coded as 0 and 1. Coding as 0, 1 and 2 would be
more common if any markers were coded as heterozygous, and is required for the calculation
of a kinship matrix by GAPIT. Data cannot be read in as A, C, G, T or other alphanumeric coding.
Read in the imputed data. If reading from the clipboard, do not copy the first two rows to the
clipboard; they are for information only. Easiest is to read in from the prepared file:
geno.imp.corrected<-read.table("all_markers.txt",header=T)
For the kinship calculation (but not for the association analysis itself) the data are expected to
be coded as 0, 1, 2 (or on a 0 to 2 scale for the imputed data).
length(names(geno.imp.corrected))   # check the number of columns: 2873
geno.imp.corrected[,2:2873] <- geno.imp.corrected[,2:2873]*2   # put the 0/1 scores on a 0-2 scale
The map for these markers is in “all markers map”. Note that the chromosomes must be
numbered rather than given names: don’t copy the column of chromosome names into R.
Because all markers must be given a map position, BioGemma SNP markers are positioned
on chromosome 22 (I am grateful to them for their permission to use them), unmapped DArT
markers on 23, and some specific gene markers on 24. Read in the map:
geno.map<-read.table("all_markers_map.txt",header=T)
Now we can finally use GAPIT. If provided with nothing but genotype data, it will create a
kinship matrix:
GAPIT(GD=geno.imp.corrected,GM=geno.map)
You may get an error line:
“some row.names duplicated: 2,3,5,6,8,9,10,11”
If you do, the fix is to ensure that the genotype and map data both have a first column of
names.
You should get a lot of output on the screen, and in your working directory you will find that
four files have been created, two with kinship information. Rename the kinship files: we want
to compare them with a kinship matrix constructed from the thinned markers, and that run
would otherwise overwrite these files.
The subset of markers selected for this purpose and the corresponding map are in “kinship
markers” and “kinship map”. Read them in:
kin.markers <-read.table("markers_for_kinship.txt",header=T)
kin.map<- read.table("kinship_markers_map.txt", header=T)
For the kinship calculation, the data are expected to be coded as 0, 1, 2 (or on a 0 to 2 scale
for the imputed data).
length(names(kin.markers))   # check the number of columns: 1118
kin.markers[,2:1118] <- kin.markers[,2:1118]*2   # put the 0/1 scores on a 0-2 scale
Now create a kinship matrix using just the thinned markers:
GAPIT(GD=kin.markers,GM=kin.map)
Compare the two pdf files that have been created. The effect isn’t large in this case, but the
kinship matrix from the thinned markers has lower average kinship and the heatmap is more
“washed out”. Sometimes the difference is much more extreme than seen here, and we shall
look at it in a more dramatic fashion shortly.
The kinship matrix itself will be exported to a csv file (opens in Excel) called
GAPIT.Kin.VanRaden.csv. You may note that the diagonals are not 2, as they should be for
fully inbred lines, but (a) something less for the all markers data set (b) something more for
the thinned (kinship) data set. (b) appears to be largely caused by imputation, with many
genotypes having values >0 and <2. We can discuss possible consequences and fixes for this.
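A quick way to look at the diagonals is to read the exported file back in and summarise them.
This sketch uses the default file name; substitute the renamed file to check the other matrix.
kin.check <- read.csv("GAPIT.Kin.VanRaden.csv", header = FALSE)
summary(diag(as.matrix(kin.check[, -1])))   # column 1 holds the variety names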
We can read the kinship matrix back into R directly from the created spreadsheet. This seems
perverse, but we want to use this matrix explicitly rather than the default (created from all the
markers) which GAPIT would otherwise use. First rename the kinship matrix created from the
thinned marker set to “GAPIT.Kin.VanRaden.kin_markers_only.csv”, then read it in:
kin.matrix<-read.csv("GAPIT.Kin.VanRaden.kin_markers_only.csv",header=F)
Note 1. The file was output as a csv file, so we’re using the special form of read.table() for
these files. We could also use read.table(....., sep = ",").
Note 2. We require header = F, though variety names are still expected in the first column.
Having gone to all this trouble, we can also compare relationships among lines by PCO,
using either the full data or the thinned marker set. Close down any open pdf and csv files
and rename them if you want to keep them.
GAPIT(GD=kin.markers,GM=kin.map,PCA.total=4)
As well as the four files from above, you will now get 6 more output files, all beginning with
GAPIT.PCA. Once more, before running the next line, rename the output files to prevent
them being overwritten, then run the PCA using all the markers.
GAPIT(GD=geno.imp.corrected,GM=geno.map,PCA.total=4)
The added option PCA.total=4 causes GAPIT to include the first four principal components
(those with the largest eigenvalues) in the association analysis. This follows the approach
used in ‘Structured Association’ and
‘Eigensoft’ aka ‘Eigenstrat’ for adjusting for population structure. Here, we aren’t carrying
out that analysis, but it does create a csv with all the principal components.
To compare the effect of the marker thinning, open the “GAPIT.PCA.pdf” file for each data
set and compare the PC1-PC2 plots against each other. You should see clear evidence of
population structure when using all markers, almost certainly originating from the 1B1R rye
translocation, but observe no such structure when using the thinned markers. The effect is
also just about visible in the 3D plot of the first 3 PCs. For this dataset, the PCA clustering
for all markers results from a clustering of markers in a single area of the genome and should
be removed from the kinship matrix. In other cases, if the clustering results from genome
wide differentiation, it will remain, correctly, after thinning of markers. An example would
be a spring versus winter split in European barley. These are differentiated breeding pools
and kinship differs genome-wide between the two, and this should be accounted for in any
association analysis.
Now we can do some association analysis. For this we need trait data: in the sheet of that
name.
pheno<-read.table("phens.txt",header=T)
To jump straight in, the following command will carry out a standard mixed model analysis
on the trait “awns”: presence or absence of awns on the ear. This is highly heritable and is
used for variety discrimination, so is a good place to start.
test<-GAPIT(Y=pheno[,c(1,13)],GD=geno.imp.corrected,GM=geno.map,KI=kin.matrix,
            group.from=376,group.to=376)
This may take a couple of minutes. It should exit saying "GAPIT accomplished
successfully...”. There will also be a notice about warnings which you can ignore.
Multiple output files will be created. It is worth moving these to a separate directory so the
current directory does not get cluttered, or subsequent runs overwrite files you want. Files
worth looking at are:
GAPIT.AWNS.Manhattan-Plot.Genomewise.pdf: a plot of the results on a –log10 scale. This
is generally the most interesting and informative plot.
GAPIT.AWNS.GWAS.Results.csv: results for each marker, ordered by p-value.
GAPIT.AWNS.Allelic_Effect_Estimates.csv: the allelic effects of each marker.
GAPIT.AWNS.QQ-Plot.pdf: a QQ plot of the p-values to assess if there is an excess of
(perhaps false) positive results.
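If you prefer to look at the results in R rather than Excel, a quick sketch using the results file
named above:
gwas <- read.csv("GAPIT.AWNS.GWAS.Results.csv")
head(gwas, 10)   # the file is already ordered by p-value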
To break the GAPIT command down:
Y=pheno[,c(1,13)]: analyse the 13th phenotype in the input file. You can analyse multiple
phenotypes (they will each be given different output file names) by selecting them in the
same way, for example Y=pheno[,c(1,13,15)]. “Y=pheno” will analyse all phenotypes. Note
you must have variety names in column 1.
GD=geno.imp.corrected: test the markers in geno.imp.corrected for association. You could
select a subset of markers in exactly the same way as for phenotypes.
GM=geno.map: the map. It must match the genotypes, so if you select specific markers, you
must select the appropriate rows here too.
KI=kin.matrix: the kinship matrix. Note, if KI is missing, then GAPIT will create its own
kinship matrix. If markers are clustered, this can cause problems. (Feel free to try it.)
group.from: in addition to the EMMA algorithm, GAPIT uses an implementation of the
mixed model in which individuals are clustered into more-or-less unrelated groups. The
kinship matrix can then set relationships between groups to zero, treat the groups as
independent and work with a much smaller number of relationships within groups. This
speeds up the analysis, which can be important if working with very large numbers of
individuals and markers, but it is not necessary in the crop datasets I have come across so
far. See the manual for details. By setting group.from = group.to = the number of individuals
to be analysed, GAPIT is obliged to use the basic EMMA mixed model, without adulteration,
which I think is best.
group.to: see above.
We can carry out an analysis unadjusted for kinship by setting group.from and group.to to 1:
test<-GAPIT(Y=pheno[,c(1,13)],GD=geno.imp.corrected,GM=geno.map,KI=kin.matrix,
            group.from=1,group.to=1)
We could also use lm() or lmer() for this analysis.
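For a single marker, the unadjusted test is essentially a regression of the phenotype on the
marker score. A minimal sketch, assuming the varieties appear in the same order in pheno and
geno.imp.corrected; "some_marker" is a placeholder for any marker column name.
d <- data.frame(awns = pheno[, 13],
                marker = geno.imp.corrected[, "some_marker"])
anova(lm(awns ~ marker, data = d))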
If you compare the QQ and Manhattan plots between the adjusted and unadjusted analyses,
you will see evidence of the increased frequency of small p-values, resulting from population
structure and kinship which have not been adjusted for.
Note that we have used the imputed genotype data in this analysis, rather than the raw data.
Which is best depends on the accuracy of the imputation for the marker under test.
Let’s look at height (trait 12). All you need to do is re-run the commands above, pointing at the
new phenotype.
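For example, the kinship-adjusted run for height would look like this (for the unadjusted run,
set group.from and group.to to 1 as before):
test<-GAPIT(Y=pheno[,c(1,12)],GD=geno.imp.corrected,GM=geno.map,KI=kin.matrix,
            group.from=376,group.to=376)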
Compare the QQ plots. You should see very many more significant results in the unadjusted
analysis. But if you look at the gene markers (chromosome 24), you will see one very
significant result with no adjustment, but two with adjustment. We can look in the GWAS
results file to see what is going on. I find it easier to do this if I first rename them – add ‘adj’
and ‘unadj’ to the names so it is clear which you are looking at.
In the unadjusted results, you should find Rht2 is the most significant marker. This accounts
for a very large proportion of the variation. It is the most important dwarfing gene employed
in Europe, so this is not surprising.
In the adjusted results, you will still find Rht2 on top, but its significance is much reduced, as is
the additional variation accounted for by Rht2. Height is a major contributor to population
structure in wheat: German varieties tend to rely less on dwarfing genes for example and so
are taller. Adjusting for kinship has the effect of leaving less variation to be accounted for by
genuine marker relationships. Note, however, that the second highest result is from ppdD1, a
gene for earliness which is used in some wheat varieties, particularly in Southern Europe,
where early maturation may allow escape from summer drought.
Perhaps Rht2 is dominating the results. If we include Rht2 in the analysis as a covariate, other
sources of reduced height may be revealed.
I’ve set up a file called ‘covs.txt’, which contains the covariates we shall use in this session.
The format is the same as for the phenotype data. Read all the data in.
covs<-read.table("covs.txt",header=T)
To add a covariate to the mixed model:
test<-GAPIT(Y=pheno[,c(1,12)],CV=covs[,c(1,5)],GD=geno.imp.corrected,
KI=kin.matrix,GM=geno.map,group.from=376,group.to=376)
Note the addition of the “CV=” option to include covariates. Rht2 is in the fifth column. You
can include more than one covariate: just select what you want as described for selecting
phenotypes. CV=covs would include all the covariates in the data frame.
Compare the Manhattan plots. If you look at the GWAS results file, you will notice that Rht2
is no longer the peak marker. This is correct: it has been included as a covariate so its effect
should disappear. PpdD1 is now promoted to top position. Rht1 has emerged too. Rht1 is
another dwarfing gene used in wheat breeding, though it is less prevalent in Europe than
Rht2. In the first analysis, its association with height was masked by Rht2; in the second, its
significance is increased.
It is possible to include multiple covariates in the model and select the subset which gives the
best-fitting model. For this, the option Model.selection = TRUE is added to the GAPIT
command. Goodness of fit is tested by BIC, the Bayesian Information Criterion. We won’t go
into this option here. We’ll have a look at yield next, which is in
column 2 of ‘pheno’.
If there is time:
We are going to adjust for population structure in four ways:
None
Kinship
Population membership
Kinship + population membership.
We have discussed adjustment by population membership already. In essence, covariates of
membership are included in the model first, so that the significance of any marker effect is
tested after adjustment by these covariates. Before the advent of the mixed model, this was
the approved method. Population membership could be estimated by STRUCTURE, or by
eigenvectors. STRUCTURE has the disadvantage that it takes a long time to run and for many
crops it is difficult to decide what value of k to use. Inclusion of eigenvectors is more popular
now. Many workers like to run a mixed model and include eigenvectors as covariates as well.
The merit of this approach is still debated. In part, at least, it can depend on the form of the
kinship matrix. For TriticeaeGenome, we are going to use country of origin as a surrogate for
population membership.
We can test whether country of origin is important for yield with a simple anova:
anova(lm(pheno$ALL~ covs$COUNTRY))
Unfortunately, GAPIT does not allow factors to be entered as covariates, they must be
numerical. We have three countries (one entry from the USA has been coded as unknown).
We can therefore have a column for each of the three countries, with 1 indicating origin in
that country, 0 for elsewhere. However, these three columns are not independent: they sum to
1 across rows, so if you know two columns, you also know the third. This is why COUNTRY
has only 2 df in the analysis of variance. Therefore we must drop one of these columns. I’ve
chosen to exclude GBR, leaving column 3 containing FRA and column 4 DEU. Repeat the
analysis of variance on these two columns:
anova(lm(pheno$ALL~ covs[,3]+covs[,4]))
This gives a very similar result to before: the partition of the sum of squares is very similar. It
is not identical because there are two varieties of unknown origin, coded NA originally. In
my recoding, I’ve given these values of 1/3 for each country.
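If you needed to build such columns yourself, a sketch is below. It assumes covs$COUNTRY
holds the codes GBR, FRA and DEU, with NA for the varieties of unknown origin; check your
own coding first.
fra <- ifelse(is.na(covs$COUNTRY), 1/3, as.numeric(covs$COUNTRY == "FRA"))
deu <- ifelse(is.na(covs$COUNTRY), 1/3, as.numeric(covs$COUNTRY == "DEU"))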
There is another covariate that has an important effect on yield: age of the variety. You can
see this either by plotting yield against age, or by adding age to the previous analysis of
variance. Three varieties had an unknown age of origin and I’ve given these the mean year of
origin, 1991.4.
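A quick look is sketched below. I have assumed the year of origin is the second column of
covs; check names(covs) to confirm, as the column position is a guess on my part.
plot(covs[, 2], pheno$ALL, xlab = "Year of origin", ylab = "Yield")
anova(lm(pheno$ALL ~ covs[, 3] + covs[, 4] + covs[, 2]))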
For the mixed model, with all covariates:
test<-GAPIT(Y=pheno[,c(1,2)],CV=covs[,1:4],GD=geno.imp.corrected,GM=geno.map,
            KI=kin.matrix,group.from=376,group.to=376)
This command can be edited to run the other models. Save the output of each run in a
separate directory. Compare the QQ plots, Manhattan plots and GWAS results. What do you
think? It is clear that some correction must be done, but which is best?
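As a sketch, the four runs might be set up as follows. As in the unadjusted awns run above,
the kinship matrix is supplied each time but is effectively ignored when the group number is
set to 1. I have used covs[,1:4] as the population membership covariates, as in the full model;
whether age belongs in “population membership” is a judgement call. Rename or move the
output files between runs.
# 1. no adjustment
GAPIT(Y=pheno[,c(1,2)],GD=geno.imp.corrected,GM=geno.map,KI=kin.matrix,
      group.from=1,group.to=1)
# 2. kinship only
GAPIT(Y=pheno[,c(1,2)],GD=geno.imp.corrected,GM=geno.map,KI=kin.matrix,
      group.from=376,group.to=376)
# 3. population membership only
GAPIT(Y=pheno[,c(1,2)],CV=covs[,1:4],GD=geno.imp.corrected,GM=geno.map,KI=kin.matrix,
      group.from=1,group.to=1)
# 4. kinship + population membership
GAPIT(Y=pheno[,c(1,2)],CV=covs[,1:4],GD=geno.imp.corrected,GM=geno.map,KI=kin.matrix,
      group.from=376,group.to=376)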
Things to consider:
Repeat using the raw genotype data
Repeat the mixed model with the major hit (Rht2 again) as a covariate.
Repeat the mixed model with automatic covariate selection
Repeat for all traits.
Try a different kinship matrix. Create one based on correlations across individuals on
marker scores standardised for each marker. This will give diagonal elements of 1
(two after doubling for inclusion). The code below will do this:
# standardise markers, then correlate across individuals
# scale() standardises the data by columns (ie markers)
# t() transposes the data
# cor() creates the correlation matrix across individuals
cor.kin<-cor(t(scale(geno.imp.corrected[,2:2873])))
write.table(cor.kin,file="cor.kin.txt")
rm(cor.kin)
This table could be edited in Excel (don’t forget to multiply by 2) and re-imported: this is
easier in Excel than in R, and you’ll have the matrix for future reference.
Conclusion & Recommendations:
There remains some art in carrying out association analysis: what kinship matrix to use,
whether and how to use imputed markers, and so on. The best approach is probably to simulate traits
from the marker data for your data set and test analysis methods on those. In simulations here
(on the AGOUEB barley data and on a much smaller set of UK wheat lines), we concluded
that the best all-round method is the mixed model, with a kinship matrix based on excess
allele sharing (ie the correlation matrix), and no additional covariates to account for
population structure. But this wasn’t always best for all traits and datasets. If one has time, it
is worth experimenting.