Pairwise alignment Biology 162 Computational Genetics Todd Vision Fall 2004 31 Aug 2004 Preview General concepts in alignment How to read a dotplot Scoring matrices and gap penalties Basic dynamic programming algorithms Needleman-Wunsch Smith-Waterman Using more realistic gap penalties Homology Two sequences are homologous if they are descended from a common ancestral sequence. Homology is either all or nothing Only similarity can be quantitative An alignment is a model of positional homology between nucleotide or amino acid residues Some applications of sequence alignment Sequence/genome assembly Locating exons within genomic sequences Functional annotation by homology search against a database Identification of conserved signatures/motifs/domains Molecular evolution and phylogenetics Structural homology modeling Alignments classified by Span Global, encompassing full-length sequences Local, restricted to conserved segments

Number of sequences Pairwise, involving only two sequences Multiple, involving more than two Algorithm Optimality guarantee Heuristic Amino acids versus DNA DNA sequences give much worse alignments than amino acid sequences Fewer letters Less realistic scoring matrices Some applications can align codons How large would that scoring matrix be? If thats not possible Use aa alignment to guide DNA alignment of coding sequences Use conceptual translations (6 potential coding frames) for database searches Dotplots: phage cI vs. P22 c2 repressor B A Window size Stringency 1 1 C 11 7 25 15

Internal repeats: human LDL protein Window size = 23 bp Stringency = 7 bp Inversions Hanging ends y x x Overlap x y y Nesting Twilight Zone 100 90 80 70 60 50 40 30 20 10 0 Percent amino acid identity Scoring an alignment Possible relationships at a position Match (identity) Mismatch (substitution) Gap (insertion/deletion, or indel) A scoring matrix is used for matches and mismatches Typically binary for nucleic acids PAM, BLOSUM, & others for amino acids Gap penalties must be tacked on The alignment score is the sum of the scores at each position in the alignment, including gaps

LOD scores Let pab be the expected frequency of aligned residue pair a and b among all aligned residues Let qa be the frequency of individual amino acid a pab s(a,b) = log qa q b S = s(x i , y i ) i Point Accepted Mutation (PAM) matrices Trained on alignments of closely related proteins PAM1 implies 1 substitution per 100 amino acids PAM250 = (PAM1)250 Training set strongly biased toward globular proteins (more suitable matrices are available for more specific protein classes) Choosing the right PAM matrix Low PAM values discriminate among amino acids more dramatically As the exponent increases, values within a row converge on amino acid frequencies Choice of matrix typically depends on observed % identity Classic chicken and egg problem PAM250 corresponds to 20% identity Assumes substitution rate is equal among sites (Poisson model), which we know to be false BLOSUM scoring matrices Trained on ungapped alignments (blocks) of divergent sequences to capture long-term

substitution patterns Named BLOSUMx, where x (from 0 to 100) is the minimum percent identity of the sequences in the alignment. (The smaller the value of x, the more divergent the sequences). Note that numbers have opposite meanings for PAM and BLOSUM! BLOSUM62 is in wide use (eg it is the default in BLAST) BLOSUM62 A R N D C Q E G H I L K M F P S T W Y V B Z X * A 4 -1 -2 -2 0 -1

-1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1

-4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3

-1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3

5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4

-1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2

-1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4

-1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0

1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3

0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1

0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 BLOSUM62 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2

-2 0 -1 -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2

-1 -1 -2 -1 3 -3 -2 2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B

-2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1

-3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 * 4 4 4 4 -4 4 4 4

-4 4 4 4 4 4 -4 4 4 -4 4 4 4 4 4 1 Gap penalties Nave score Each gap position receives independent penalty of d (g) = dg Affine scores Score depends on length of contiguous gap Gap opening penalty d Gap extension (g) =penalty d (ge 1)e Dynamic programming A problem solving technique that employs recursion to solve a larger problem by solving a nested set of similar subproblems Application to pairwise alignment Imagine We know the score for the first i-1 and j-1 residues of sequences x and y

i-1 and j-1 are aligned in the optimal alignment There are three possibilities for the next position in the alignment A gap in sequence x A gap in sequence y A match or mismatch between i and j The maximum scoring alignment among these has to be in the optimal global alignment Overview of algorithm We can use this fact to recursively fill out a matrix containing the score F(i,j) of the optimal alignment for every pair of residues i and j We also store a pointer to one of three previously filled out cells in the matrix, forming a path graph The optimal global alignment must be a path within the path graph It can be found by performing a traceback from the final cell in the recursion Path Graph Diagonal moves represent matches and mismatches Horizontal and vertical moves represent gaps (indels) Needleman-Wunsch recursion F(i 1, j 1) + s(x i , y j ) F(i, j) = max F(i 1, j) d F(i, j 1) d

Let s(C,C)=1, s(C,G)=s(G,C)=-2, and d=-1 y A G A C x A 3 C 3 3 x: AC y: AC A 3 C 2 3 ?C ?- xi, yj match xi aligned to gap yj aligned to gap A G A 3 C 3 2 ??G NeedlemanWunsch 0

c -2 t -4 g -6 t -8 g -10 a -12 t -14 c g t -2 -4 -6 Initialize F(0,0) = 0 Use pointers to remember path

Match=+1, Mismatch=-1, Gap=-2 arbitrary order of precedence: , , g -8 c -10 g -12 t -14 Needleman-Wunsch: completed path graph c g t g c g t 0 -2 -4 -6 -8 -10

-12 -14 c -2 +1 -1 -3 -5 -7 -9 -11 t -4 -1 0 0 -2 -4 -6 -8 g -6

-3 0 -1 +1 -1 0 -2 t -8 -5 -2 +1 -1 0 g -10 -7 -4 -1 +2 0

+1 -1 a -12 -9 -4 -3 0 +1 -1 0 t -14 -11 -6 -3 -2 -1 0 0 -2 +1

Needleman-Wunsch: traceback c g t g c g t 0 -2 -4 -6 -8 -10 -12 -14 c -2 +1 -1 -3 -5 -7 -9

-11 t -4 -1 0 0 -2 -4 -6 -8 g -6 -3 0 -1 +1 -1 0 -2 t -8

-5 -2 +1 -1 0 g -10 -7 -4 -1 +2 0 +1 -1 a -12 -9 -4 -3 0 +1 -1

0 t -14 -11 -6 -3 -2 -1 0 0 optimal global alignment: cgtgcg-t | || | | c-tgtgat -2 +1 Complexity of algorithm For sequences of length m and n We consider 3 options at each cell We store mn scores and pointers We trace back no more than m+n steps 3mn +m+n in time, 3mn in memory

O(mn) in both time and memory If m=n, O(n2) Smith-Waterman algorithm Local pairwise alignment Cells with negative scores are set to zero Traceback from highest scoring cell Stop when 0 is encountered Also O(nm) Smith-Waterman recursion 0 F(i 1, j 1) + s(x i , y j ) F(i, j) = max F(i 1, j) d F(i, j 1) d score 0 xi, yj match xi aligned to gap yj aligned to gap Smith-Waterman algorithm c g 0 0 0 c

0 +1 t 0 g t g c g 0 0 0 0 0 0 0 0 0 0 0 0

0 +1 0 0 0 +1 0 0 +1 0 +2 0 0 0 t 0 0 0 +2 0 +1 g

0 0 +1 0 +3 +1 +2 0 a 0 0 0 0 +1 +2 0 +1 t 0 0 0

+1 0 0 cgtgcgt optimal local alignment: ||| ctgtgat 0 0 t +1 0 Guaranteeing a local alignment Use of SW algorithm alone does not guarantee local behavior Sensitive to the scoring function (should be negative for random sequences) Use of LOD scores help ensure this Gap penalties must also be chosen carefully If it is cheaper to insert a gap than to tolerate a mismatch, then gaps will be inserted where no alignment is possible More realistic gap penalties General gap function (g) F(i 1, j 1) + s(x i , y j ), F(i, j) = max F(i 1, j) (i k), k = 0,..,i 1 F(i, j 1) ( j k), k = 0,.., j 1

Requires O(n3) operations Affine gap penalties Gap score: (g)=-d-(g-1)e Can be done in O(mn) We need to keep track of three scores (and pointers) at each cell M(i 1, j 1) + s(x i , y j ) M(i, j) = max Ix (i 1, j 1) + s(x i , y j ) I (i 1, j 1) + s(x , y ) y i j M(i 1, j) d Ix (i, j) = max Ix (i 1, j) e M(i, j 1) d Iy (i, j) = max Iy (i, j 1) e A theme with variations Overlapping or nested sequences Do not penalize hanging ends Repeated sequences Asymmetric algorithm can find multiple local alignments of x in y or y in x The basic idea admits many variations Things to keep in mind about pairwise alignments There may be multiple optima Optimality is only guaranteed with respect to the scoring function the alignment may still be biologically wrong! O(mn) is still too big when n is the size of a major sequence database

Summary Dotplots are an excellent visual tool to decide whether and what kind of alignment is appropriate PAM and BLOSUM series matrices provide empirical LOD values for scoring alignments Different flavors of alignment are produced by variants on a basic dynamic programming algorithm Needleman-Wunsch for global alignments Smith-Waterman for local alignments Affine gap penalties balance biological realism with computational feasibility Reading assignment Nicholas et al. (2002) Strategies for multiple sequence alignment. Biotechniques 32, 572-591 (handout)