Sequence alignment using Longest Common Subsequence algorithm
The way of detecting the similarity of two or more
sequences is to find their longest common subsequence. Longest common
subsequence (LCS) of two sequences is a subsequence, of maximum possible
length, which is common to both the sequences.
Let's consider an example.
Suppose we have the following two DNA
sequences: TAGTCACG and AGACTGTC.
The LCS of the
two sequences is AGACG, which can be obtained from the following alignment.
TAGTCAC-G--
|| || |
-AG—ACTGTC
There are other possible common sequences of shorter
length, such as AGCG or AGTC. Although multiple LCS are possible in general,
there is only one LCS for this particular example, that is, there is no other
common subsequence of length 5 for these two sequences. The LCS also helps in
computing how similar the two sequences are: the longer the LCS, the higher the
similarity.
LCS algorithm
· Let
us suppose we have two sequences S1 and S2 of lengths m and n respectively,
where S1 = a1 a2 ... am and S2 = b1 b2 ... bn.
· We
will construct a matrix A where Ai,j denotes the length of the longest common
subsequence of a1 a2 ... ai and b1 b2 ... bj.
Example matrix for sequences S1 =
TAGTCACG and S2 = AGACTGTC.
· Sequence
S1 is written vertically, and S2 horizontally. Each letter representing the
rows is compared with each letter representing the columns. We'll proceed row
by row, column by column.
If ai = bj, then we have found a
match. We get a score of 1 for the current match, and Ai-1,j-1 from the rest of
the LCS we already obtained from substrings a1 ... ai-1 and b1 ... bj-1.
If ai ≠ bj, then we have a mismatch.
In this case, we need to consider two possibilities. The LCS a1 ... ai and b1
... bj-1 , and the LCS of a1 ... ai-1 and b1 ... bj .
Hence, we have
Complexity
It takes O(nm) time to fill in the m
by n matrix A. This approach is called dynamic programming.
Step by step example
·
The matrix is filled row by row,
column by column.
The first row R1 and column C1 are
filled with zero.
(mismatch example) The first letter
in the row R2 is compared with the column C2. The cells in this case contains
‘T’ and ‘A’. Since it is a mismatch, the score is taken from left or above,
i.e. (R1, C2) and (R2, C1). As both values are ‘0’, the total score becomes
max(0, 0) = 0 in this case.
We continue filling rest of R2.(match
example) When R3 is compared with C2, both the cell contain ‘A’. Since it is a
match, the score is given by 1 + the score at diagonally up and to the left,
i.e. (R2, C1), which is ‘0’ in this case. So the total score becomes 1.
We continue filling the rest of R3.(match
example) In a similar way, when R4 is compared with C3, it is again a match.
This time, the score is given by 1 + score from (R3, C2) which is 1. Hence,
total score becomes 2.
The entire matrix is filled in this
way. The final score is given by Am,n = 5.
Traceback
· The
actual subsequence is deduced by performing a traceback, i.e. tracing backwards
from the right bottom to the top left. When the length decreases is all
directions, the sequences must have had a match. Several paths are possible
when two arrows are possible in a cell (given in the box below). In such cases
either path can be chosen.
Comments
Post a Comment