bioinformatics 1 -- lecture 7
Probability and conditional probability
 Random sequences and significance (real sequences are
 not random)
 Erdős & Rényi: theoretical basis for the significance of an
 alignment given its length and score
 Extreme value distribution, better than a Gaussian for
 estimating significance.
 E-values
How significant is that?
"Please give me a number for how likely the data would not have been the result of chance, as opposed to a specific inference. Thanks."
Dayhoff's randomization experiment
[Photo: Margaret Dayhoff]
Aligned scrambled Protein A versus scrambled Protein B
100 times (re-scrambling each time).
NOTE: scrambling does not change the AA composition!
Results: A Normal Distribution
The significance of a score is measured as the probability of getting this score in a random alignment.
Why do we need a “model” for significance?
Because random scores end at some point, non-random scores start at some point, and we care about the significance of scores in between. So we want to be able to calculate the significance.
Easy right? Just fit to a Gaussian (i.e. normal) distribution...?
   Lipman's randomization experiment
 Aligned Protein A to 100 natural sequences, not scrambled.
 Results: a wider normal distribution (std dev ≈ 3 times larger)
[Photo: David Lipman]
WHY? Because natural sequences are different from random ones.
Even unrelated sequences share local patterns and have uneven
amino acid composition.
Was the significance over-estimated using Dayhoff's method?
Lipman got a similar result if he randomized the sequences by
words instead of letters.
Why was Lipman’s distribution wider?
              low complexity
   complexity = sequence heterogeneity
  A low complexity sequence is homogeneous in its
  composition. For example:
  AAAAAAAAHAAAAAAAAKAAAAAEAA
  is a low-complexity sequence.
  Compared to other sequences, there are relatively
  few ways to make a 26-residue sequence that has 23
  A's, 1 H, 1 K and 1 E.
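A small worked check of that claim (a sketch, not from the lecture): the number of distinct arrangements of a sequence with a given composition is the multinomial coefficient, 26!/(23!·1!·1!·1!) for the example above.

```python
# Count distinct arrangements of a sequence with a fixed composition
# via the multinomial coefficient len! / (count_1! * count_2! * ...).
from math import factorial
from collections import Counter

def arrangements(seq):
    """Number of distinct sequences with the same composition as seq."""
    total = factorial(len(seq))
    for count in Counter(seq).values():
        total //= factorial(count)
    return total

print(arrangements("AAAAAAAAHAAAAAAAAKAAAAAEAA"))   # 15600: low complexity
print(arrangements("ACDEFGHIKLMNPQRSTVWYACDEFG"))   # ~6.3e24: high complexity
```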
  Why was Lipman’s distribution wider? (part 2)
                   Local patterns (words).
The three-letter sequence PGD occurs more often than
expected by chance because PGD occurs in beta-turns.
The amphipathic helix pattern pnnppnn (p = polar, n = non-polar) occurs
more often than expected because it folds into an alpha helix.
•Even sequences that are non-homologous (Lipman’s experiment)
have word matches. But scrambled sequences (Dayhoff’s experiment)
do not have recurrent words.
  Dayhoff & Lipman assumed a
  normal distribution for random
        alignment scores.
   Should it really be a normal
          distribution?
I guess we need a theoretical foundation
Probability
Three coins: S (Susan B. Anthony), Q (The Queen), I (Ireland).
"P(H)" means the probability of H, heads.      0 ≤ P ≤ 1
     Unconditional probabilities
Joint probability of a sequence of 3 flips, given any one (un)fair coin,
is the product.
             P(HHT) = P(H)*P(H)*P(T)
       Conditional probabilities
  If the coins are "unfair" (not 50:50), then P(H) depends on the
  coin you choose (S, Q, or I). P(H) is "conditional" on the choice
  of coin, which may have its own odds.
            P(S,H) = P(S)*P(H|S)
        Conditional probabilities
"P(A|B)" means the probability of A given B, where A is the
result or observation, B is the condition. (The condition
may be a result/observation of a previous condition.)
P(H|S) is the probability of H (heads) given that the coin
is S.
In general, the probability of two things together, A and B, is
P(A,B) = P(A|B)P(B) = P(B|A)P(A)
Divide by P(B) and you get Bayes' rule:
P(A|B) = P(B|A)P(A) / P(B)
To reverse the order of conditional probabilities, multiply by the ratio of the probabilities of the conditions.
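A minimal sketch of Bayes' rule with the three coins; the priors and the P(H|coin) values below are made up for illustration, not from the lecture.

```python
# Bayes' rule: P(coin|H) = P(H|coin) * P(coin) / P(H)
priors = {"S": 1/3, "Q": 1/3, "I": 1/3}     # P(coin): pick a coin at random
p_heads = {"S": 0.5, "Q": 0.7, "I": 0.2}    # P(H|coin): unfair coins

# Total probability of heads: P(H) = sum over coins of P(H|coin) * P(coin)
p_h = sum(p_heads[c] * priors[c] for c in priors)

for c in priors:
    print(c, round(p_heads[c] * priors[c] / p_h, 3))   # posterior P(coin|H)
```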
        Scoring alignments using P
For each aligned position (match), we get P(A|B), which is the
substitution probability. Ignoring all but the first letters, the
probability of these two sequences being homologs is:
P(s1[1]|s2[1])             (substitution of s2[1] for s1[1])
Ignoring all but the first two letters, it is:
P(s1[1]|s2[1]) × P(s1[2]|s2[2])
Counting all aligned positions:
Π_i P(s1[i]|s2[i])          Each position is treated as a different coin
                            (an independent stochastic process).
 Log space is more convenient
    log Π_i [P(s1[i]|s2[i]) / P(s1[i])] = Σ_i S(s1[i]|s2[i])
where S(A|B) = 2·log2( P(A|B)/P(A) ) = the BLOSUM score.
       This is the form of the substitution score:
       log-likelihood ratios (alias LLRs, log-odds, lods),
       usually “2 times log2 of the probability ratio” (i.e. “half-bits”).
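A sketch of that arithmetic (the probabilities below are hypothetical, chosen only to show how a BLOSUM-style entry is computed, not real BLOSUM frequencies):

```python
# Half-bit log-odds substitution score: S(A|B) = 2 * log2( P(A|B) / P(A) ).
from math import log2

def half_bit_score(p_a_given_b, p_a):
    """Substitution score in half-bits: 2 * log2 of the likelihood ratio."""
    return 2 * log2(p_a_given_b / p_a)

# A favorable pairing: substitution more likely than chance -> positive score
print(round(half_bit_score(p_a_given_b=0.15, p_a=0.08), 2))   #  1.81
# An unfavorable pairing: substitution less likely than chance -> negative
print(round(half_bit_score(p_a_given_b=0.02, p_a=0.08), 2))   # -4.0
```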
                    The point is...
• Alignment scores are like coin tosses.
• ...with 20 “unfair” coins...
• (...and each coin actually has 20 sides...)
  To establish the theoretical foundation, we first go to
                     the simple case.
              Theoretical distribution of coin
                     toss alignments.
      Consider a fair coin, tossed n=25 times. The sequence is, let’s say:
      HTHTHTTTHHHTHTTHHTHHHHHTH
      The longest row of H’s is 5 in this case.
      What is the expected length of the longest row of H's given n?
Erdős & Rényi equation: E(n) = log_1/p(n), where p is the P(H).
[Histogram: number of times each value occurred versus length of the longest run of H. Note that the score never goes below zero.]
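A simulation sketch of this slide (mine, not the lecture's): compare the average longest run of heads to the leading Erdős & Rényi term.

```python
# Longest run of heads in n fair-coin tosses, versus log_1/p(n).
# (The full formula adds constant correction terms, so the simulated
# mean sits a little below the leading-order value.)
import random
from math import log

def longest_run(flips, symbol="H"):
    best = run = 0
    for f in flips:
        run = run + 1 if f == symbol else 0
        best = max(best, run)
    return best

n, p = 25, 0.5
runs = [longest_run(random.choice("HT") for _ in range(n))
        for _ in range(10000)]
print("leading-order theory:", round(log(n) / log(1 / p), 2))   # ~4.64
print("simulated mean:      ", round(sum(runs) / len(runs), 2))
```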
           Heads = match, tails = mismatch
Similarly, we can define an expectation value, E(M), for the
longest row of matches in an alignment of length n. E(M) is
calculated the same way as for heads/tails, using the Erdős &
Rényi equation (p is the probability of a match, 1−p the
probability of a mismatch):
E(M) = log_1/p(n)           expectation given a single alignment of length n
But over all possible alignments of two sequences of length
n, the number is
E(M) = log_1/p(n·n) = 2·log_1/p(n)
If the two sequences are length n and length m, it is
E(M) = log_1/p(mn) [+ some constant terms that don’t depend on m and n]
   Heads/tails = match/mismatch
Theoretically derived equation for the expectation value for
M, the longest block of Matches:
E(M) = log_1/p(mn) + log_1/p(1−p) + 0.577·log_1/p(e) − 1/2
Note that we can define a number K such that log_1/p(K) equals the constant terms. Then
E(M) = ln(Kmn)/λ
...where λ = ln(1/p)
       Theoretical expectation value
        E(M) = log_1/p(mn) + log_1/p(1−p) + 0.577·log_1/p(e) − 1/2
        E(M) = ln(Kmn)/λ
       Solving, using p = 0.25, K = 0.6237, λ = ln(4) = 1.386, and m = n = 140, we get
        E(M) = ln(Kmn)/λ ≈ 6.8
This would be the most likely length of the longest run of matches in an
alignment of two random DNA sequences of length 140 and 140. You can
try this! Align randomly selected 140 bp DNA sequences using local DP
alignment (e.g. the LALIGN server: DNA, local, gap penalty = 0). Count the
longest string of matches. Plot the distribution of scores.
google: “LALIGN server” or http://www.ch.embnet.org/software/LALIGN_form.html
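A Monte Carlo sketch of this experiment (mine, not the lecture's): the longest run of matches over all ungapped alignments of two sequences is their longest common substring, so we can estimate E(M) directly.

```python
# Estimate the longest run of matches between random DNA sequences and
# compare to the theory E(M) = ln(K*m*n)/lambda with K and lambda from
# the lecture's DNA example.
import random
from math import log

def longest_common_substring(a, b):
    """Length of the longest exact match between a and b (DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

n = m = 140
p, K = 0.25, 0.6237           # match probability and K from the lecture
lam = log(1 / p)              # lambda = ln(1/p) = ln(4)
trials = [longest_common_substring(
              "".join(random.choice("ACGT") for _ in range(n)),
              "".join(random.choice("ACGT") for _ in range(m)))
          for _ in range(200)]
print("theory E(M) =", round(log(K * m * n) / lam, 2))      # ~6.8
print("simulated mean =", round(sum(trials) / len(trials), 2))
```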
                        P(S > x)
E(M) gives us the expected length of the longest run of matches.
But what we really want is the answer to this question:
How good is the score x? (i.e. how significant is it?)
So we need to model the whole distribution of chance scores,
then ask how likely it is that a score at least as good as ours
comes from that model.
[Plot: frequency versus score]
        Distribution Definitions
       Mean = average value.
       Mode = most probable value.
       Extremes = minimum and maximum values.
       Standard deviation = one type of decay function.
For a variable that follows an extreme value distribution, such
as random sequence alignment scores, a score must be greater
than a normal distribution would predict to achieve the same
level of significance.
           A Normal Distribution
Usually, we suppose that the likelihood of deviating from the mean
by x in the positive direction is the same as the likelihood of
deviating by x in the negative direction, and that the likelihood of
deviating by x falls off exponentially with x².
Why? Because multiplying many independent probabilities gives this
type of curve.
This is called a Normal, or Gaussian distribution.
Extreme value distribution: a distribution derived from extreme values
Get statistics from lots of optimal (best) scores.
[Figure: all possible scores for two sequences form a roughly normal distribution; keeping only the best score from each comparison produces a skewed distribution.]
   Extreme value distribution
   y = exp(−x − e^(−x))
The EVD has this shape, but the mode and decay parameters depend on the data.
  Empirical proof of the EVD
Suppose you had a Gaussian-distribution “dart board”. You
throw 1000 darts randomly. Score each dart according to the
number on the X-axis where it lands. What is the probability
distribution of the scores?
Answer: the same Gaussian distribution! (duh)
  Empirical proof of the EVD
What if we throw 10 darts at a time and remove all but the
highest-scoring dart? Do that 1000 times. What is the
distribution of the scores?
Empirical proof of the EVD
[Figure: the EVD gets sharper as we increase the number of darts thrown per round.]
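A simulation sketch of the dart experiment (mine, not the lecture's): keep the best of N Gaussian draws per round and watch the distribution of maxima shift right and sharpen.

```python
# Maxima of N Gaussian draws are not Gaussian; they follow a skewed
# extreme value (Gumbel-shaped) distribution that sharpens as N grows.
import random

def best_of(n_darts):
    return max(random.gauss(0.0, 1.0) for _ in range(n_darts))

for n_darts in (1, 10, 100):
    maxima = [best_of(n_darts) for _ in range(1000)]
    mu = sum(maxima) / len(maxima)
    sd = (sum((s - mu) ** 2 for s in maxima) / len(maxima)) ** 0.5
    print(f"{n_darts:>3} darts/round: mean {mu:.2f}, sd {sd:.2f}")
```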
    Extreme value distribution for sequence scores
The EVD with mode u and decay parameter λ:
   y = exp(−λ(x−u) − e^(−λ(x−u)))
The mode, from the Erdős & Rényi equation:
   u = ln(Kmn)/λ
Substituting gives:
   P(x) = exp(−λ(x − ln(Kmn)/λ) − e^(−λ(x − ln(Kmn)/λ)))
Integrating from x to infinity gives:
   P(S≥x) = 1 − exp(−Kmn·e^(−λx))
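A minimal sketch of this tail formula in Python, using the lecture's DNA parameters (K = 0.6237, λ = ln 4); here x is a match-run length as in the coin model.

```python
# EVD tail probability: P(S >= x) = 1 - exp(-K*m*n*e^(-lambda*x)).
from math import exp, log

def evd_pvalue(x, K, lam, m, n):
    """Probability of a chance local-alignment score of at least x."""
    return 1.0 - exp(-K * m * n * exp(-lam * x))

K, lam, m, n = 0.6237, log(4), 140, 140
for x in (5, 7, 9, 11):
    print(x, f"P(S>=x) = {evd_pvalue(x, K, lam, m, n):.4f}")
```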
       Theoretical value for the decay
                 function λ
     λ is calculated as the value of x that satisfies:
                Σ_ij p_i p_j e^(S_ij·x) = 1
     where the S_ij are the substitution matrix values.
S_ij is a scaled log-likelihood ratio, ln[P(i→j)/(p_i p_j)]/λ, so e^(S_ij·λ) is the
likelihood ratio, P(i→j)/(p_i p_j). Then Σ_ij p_i p_j e^(S_ij·λ) = Σ_ij P(i→j) = 1 by
definition, so x = λ solves the equation. In other words, λ is the scale factor
that converts the matrix scores back into natural log-odds.
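A numerical sketch (not from the lecture): find λ by bisection for a toy DNA model with uniform background frequencies and hypothetical +5/−4 match/mismatch scores.

```python
# Solve sum_ij p_i p_j e^(S_ij * x) = 1 for the nontrivial root x = lambda.
# (The equation also has a trivial root at x = 0; we bracket the positive one.)
from math import exp

p = {b: 0.25 for b in "ACGT"}                              # backgrounds
S = {(a, b): (5 if a == b else -4) for a in "ACGT" for b in "ACGT"}

def f(x):
    return sum(p[a] * p[b] * exp(S[a, b] * x)
               for a in "ACGT" for b in "ACGT") - 1.0

lo, hi = 1e-6, 2.0        # f(lo) < 0 and f(hi) > 0 for this matrix
for _ in range(60):       # bisection
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
print(f"lambda = {lo:.4f}")     # ~0.19 for +5/-4 with uniform backgrounds
```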
          voodoo mathematics
 For values of x greater than 1, we can make this approximation:
                  1 − exp(−e^(−x)) ≈ e^(−x)
So the integral of the EVD,
   P(S≥x) = 1 − exp(−Kmn·e^(−λx))
approximates to
   P(S≥x) ≈ Kmn·e^(−λx)
Taking the log of both sides, we get
   log(P(S≥x)) ≈ log(Kmn) − λx
which is linear.
   The linearized EVD may be used to find the parameters K and
               λ empirically (instead of theoretically).
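A quick numeric check of that approximation (not from the lecture):

```python
# 1 - exp(-e^-x) converges to e^-x as soon as e^-x is small.
from math import exp
for x in (0.5, 1.0, 2.0, 4.0):
    exact, approx = 1 - exp(-exp(-x)), exp(-x)
    print(f"x={x}: exact={exact:.5f} approx={approx:.5f}")
```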
              Finding the EVD parameters
The Linearized EVD                 log(P(S≥x)) ≈ log(Kmn) - λx
  To determine K and λ, we can plot log(P(S≥x)) versus x
      (x = false alignment score), and then fit it. (least squares)
  1. Generate a large number of known false alignment scores S.
     (all with the same query length m and database size n).
  2. Calculate P(S≥x) for all x.
  3. Plot log(P(S≥x)) versus x.
  4. Fit the data to a line. (Least squares)
  5. The slope is −λ.
  6. Intercept is log(Kmn). Solve for K.
  7. Use K and λ to calculate the e-value = exp( log(Kmn) - λx )
[Plot: log P(S≥x) versus false-alignment score x; the points fall on a line with slope −λ and intercept log(Kmn).]
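A sketch of steps 1-6 (my own simulation, not the lecture's materials): the false scores here are synthesized from a Gumbel distribution with known parameters so the fit can be checked; in practice they would come from aligning unrelated sequences.

```python
# Fit lambda and K by least squares on the linearized EVD tail.
import random
from math import exp, log

m = n = 140
K_true, lam_true = 0.6237, log(4)
u = log(K_true * m * n) / lam_true                     # the mode
scores = [u - log(-log(random.random())) / lam_true    # Gumbel samples
          for _ in range(5000)]

def tail_prob(x):
    return sum(s >= x for s in scores) / len(scores)   # empirical P(S>=x)

xs = sorted(set(round(s, 1) for s in scores))
pts = [(x, log(tail_prob(x))) for x in xs
       if 0 < tail_prob(x) <= 0.1]    # keep only the tail, where log P is linear

# Least-squares line fit: log P = a + b*x, so lambda = -b and K = e^a / (m*n).
N = len(pts)
sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
b = (N * sxy - sx * sy) / (N * sxx - sx * sx)
a = (sy - b * sx) / N
print(f"lambda = {-b:.3f} (true {lam_true:.3f})")
print(f"K = {exp(a) / (m * n):.3f} (true {K_true})")
```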
                 e-values in BLAST
•Every BLAST “hit” has a bit score, x, derived from the
substitution matrix.
•Parameters for the EVD have been calculated in advance for
m and n, the lengths of the database and query.
•Applying the EVD to x, we get P(S≥x), which is our “p-value”.
•To get the “e-value” (the expected number of times this score
would occur by chance over the whole database), we multiply
by the size of the database, m:
                    e-value(x) = P(S≥x)·m
  where x is the alignment score, m is the size of the database, and P is calculated from
  false alignment scores.
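A toy end-to-end sketch following this recipe (not BLAST's actual code; all parameter values below are illustrative assumptions):

```python
# p-value from the EVD, then e-value = p-value * database size.
from math import exp, log

def e_value(x, K, lam, m_query, n_subject, db_size):
    p = 1.0 - exp(-K * m_query * n_subject * exp(-lam * x))  # P(S >= x)
    return p * db_size        # expected number of chance hits this good

ev = e_value(12, K=0.6237, lam=log(4),
             m_query=140, n_subject=140, db_size=10**6)
print(f"e-value = {ev:.3g}")
```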
    Matrix bias in local alignment
In local alignment we take a MAX over zero (0) and three other
scores (diagonal, across, down). A matrix bias is added to all
match scores, so the average match score, and the extremes, can be
adjusted.
What happens if match scores are...?
all negative? : Best alignment is always no alignment.
all positive? : Best alignment is gapless, global-local.
average positive? : Best alignment is local (longer). Typical random alignment is local.
average negative? : Best alignment is local (shorter). Typical random alignment is no alignment.
              Altschul's Principle
For local DP alignment, the match (substitution) scores should be
       > zero for a match, and
       < zero for a mismatch,
on average! (Some mismatches may have a > 0 score)
Why?
  Matrix bias to control alignment length
Matrix bias (aka “bonus”) = a constant added to each match score.
If we add a constant to each value in the substitution matrix, it
favors matches over gaps. As we increase the matrix bias...
• Longer alignments are more common in random sets.
• Longer alignments are less significant.
[Figure panels: negative matrix bias, no matrix bias, positive matrix bias]
          History of significance
•Significance of a score is measured by the probability of
getting that score by chance.
•History of modeling “chance” in alignments
   •1970’s Dayhoff: Gaussian fit to scrambled alignments
   •1980’s Lipman: Gaussian fit to false alignments
   •1990’s Altschul: EVD fit to false alignments
         summary of significance
•The expected value for the maximum length of a match
between two sequences, lengths n and m, given the probability
of a match p, has a theoretical solution, log_1/p(nm): the
Erdős & Rényi equation.
•The score of a DP alignment is the maximum score, which is
roughly proportional to the length (for local alignments only!).
Therefore, the expected value of alignment scores follows the
same theoretical distribution.
                        Review
•What does the Extreme Value Distribution model?
•How are the parameters (λ,K) of the EVD determined?
•Once λ and K have been found, how do we calculate the e-
value?
•What is the meaning of the e-value?
•For a given score x and a given database, which method gives
the highest estimate and which gives the lowest estimate of the
significance of x?
   --Dayhoff’s method of scrambled sequences.
   --Lipman’s method of false alignments, using a Normal
   distribution.
   --False alignments using the EVD.
                    Pop-quiz
You did a BLAST search using a sequence that has
absolutely no homologs in the database. Absolutely
none.
The BLAST search gave you false “hits” with the top e-
values ranging from 0 to 20. You look at them and you
notice a pattern in the e-values.
How many of your hits have e-value ≤ 10?