Sequence alignment using Longest Common Subsequence algorithm

Sequence alignment using Longest Common Subsequence algorithm

The way of detecting the similarity of two or more sequences is to find their longest common subsequence. Longest common subsequence (LCS) of two sequences is a subsequence, of maximum possible length, which is common to both the sequences.

Let's consider an example.

Suppose we have the following two DNA sequences: TAGTCACG and AGACTGTC.

The LCS of the two sequences is AGACG, which can be obtained from the following alignment.

TAGTCAC-G--

|| || |

-AG—ACTGTC

There are other possible common sequences of shorter length, such as AGCG or AGTC. Although multiple LCS are possible in general, there is only one LCS for this particular example, that is, there is no other common subsequence of length 5 for these two sequences. The LCS also helps in computing how similar the two sequences are: the longer the LCS, the higher the similarity.

LCS algorithm

· Let us suppose we have two sequences S1 and S2 of lengths m and n respectively, where S1 = a1 a2 ... am and S2 = b1 b2 ... bn.

· We will construct a matrix A where Ai,j denotes the length of the longest common subsequence of a1 a2 ... ai and b1 b2 ... bj.

Example matrix for sequences S1 = TAGTCACG and S2 = AGACTGTC.

· Sequence S1 is written vertically, and S2 horizontally. Each letter representing the rows is compared with each letter representing the columns. We'll proceed row by row, column by column.

If ai = bj, then we have found a match. We get a score of 1 for the current match, and Ai-1,j-1 from the rest of the LCS we already obtained from substrings a1 ... ai-1 and b1 ... bj-1.

If ai ≠ bj, then we have a mismatch. In this case, we need to consider two possibilities. The LCS a1 ... ai and b1 ... bj-1 , and the LCS of a1 ... ai-1 and b1 ... bj .

Hence, we have

Complexity

It takes O(nm) time to fill in the m by n matrix A. This approach is called dynamic programming.

Step by step example

· The matrix is filled row by row, column by column.

The first row R1 and column C1 are filled with zero.

(mismatch example) The first letter in the row R2 is compared with the column C2. The cells in this case contains ‘T’ and ‘A’. Since it is a mismatch, the score is taken from left or above, i.e. (R1, C2) and (R2, C1). As both values are ‘0’, the total score becomes max(0, 0) = 0 in this case.

We continue filling rest of R2.(match example) When R3 is compared with C2, both the cell contain ‘A’. Since it is a match, the score is given by 1 + the score at diagonally up and to the left, i.e. (R2, C1), which is ‘0’ in this case. So the total score becomes 1.

We continue filling the rest of R3.(match example) In a similar way, when R4 is compared with C3, it is again a match. This time, the score is given by 1 + score from (R3, C2) which is 1. Hence, total score becomes 2.

The entire matrix is filled in this way. The final score is given by Am,n = 5.

Traceback

· The actual subsequence is deduced by performing a traceback, i.e. tracing backwards from the right bottom to the top left. When the length decreases is all directions, the sequences must have had a match. Several paths are possible when two arrows are possible in a cell (given in the box below). In such cases either path can be chosen.

Search This Blog

Sequence alignment using Longest Common Subsequence algorithm

Comments

Post a Comment