Largest common subsequence

This online calculator is designed to find the largest common subsequence of two sequences given as strings.

The calculator below solves the problem of finding the largest common subsequence. The problem is solved by dynamic programming, using an iterative algorithm, or bottom-up algorithm. The result is a matrix of solutions to the subproblem of finding the lengths of subsequences, which is then used to form the largest common subsequence. The calculator outputs the subsequence, its length and the solution matrix. You can read more about the problem and its solution algorithm under the calculator.

PLANETCALC, Largest common subsequence

Largest common subsequence

Largest common subsequence
 
Length of the longest common subsequence
 
Solution matrix
 
Numerical solution matrix
 

Most common subsequence

A subsequence is a sequence that can be obtained from an original sequence after removing some set of elements (including an empty one) from it. We should distinguish a subsequence from a substring. For example, if our original sequence is ABCDEF, then ACE is a subsequence obtained by removing elements B, D, and F, and ABC is a substring (which is also a subsequence). In general, all substrings are subsequences, but not all subsequences are substrings.

The problem of finding the largest common subsequence is the problem of finding the largest subsequence contained in all sequences of a given set. Often the largest common subsequence is searched for in only two sequences, for example, the largest common subsequence between a short string and a long string. That is, in fact, they want to find out whether the letters of the short string appear in the same order, but possibly at different distances, in the long string. The problem is a classical computer science problem, its algorithms belong to string algorithms, and practically it is used for data comparison, for example, in bioinformatics and computational linguistics.

Finding the largest common sequence

Consider an iterative algorithm for finding the largest common sequence between two strings. The easiest way to obtain such a sequence is to form a matrix of solutions to the problem of finding the length of a common subsequence.

Algorithm for filling the solution matrix

Note: We will assume that characters in the string are numbered from index 0, as in most programming languages.

  1. Form an empty matrix of size m+1 rows by n+1 columns, where m is the length of the first row, n is the length of the second row.
  2. Let's traverse all the elements of this matrix, starting from the lower right corner, moving first from right to left (index j), then, after reaching index 0, from bottom to top (index i).
  3. For each element with index i,j we calculate its value according to the following rule
    • if the current index of row i is m or the current index of column j is n, assign the element the value 0
    • if the symbol with index i of the first row is equal to the symbol with index j of the second row, assign the value 1 + the value of the element with index i+1, j+1 (the value of the element one position below and to the right of the current one) to the element.
    • if the symbols are not equal, assign to the element the maximum value of its neighbors to the right and bottom.

Having passed all elements in this way, in the element with index 0,0 we will get the length of the maximum common subsequence. Using the obtained matrix, we can form the subsequence itself.

Algorithm for generating the largest common subsequence from the solution matrix
  1. Let's start with the 0,0 element
  2. For the current element, check whether the symbol with index i of the first line is equal to the symbol with index j of the second line
    • if they are equal, add this symbol to the subsequence, increase both indices by 1, moving the current element immediately to the right and below.
    • if they are not equal, check the neighbors below and to the right and move the current element to the maximum of them, i.e. move to the right or below.
  3. stop when we reach the end of any row.

The calculator above allows you to output the solution matrix so that you can trace the workings of the algorithm.

It only remains to note that the computational complexity of such an algorithm is O(mn). There are more efficient algorithms whose speed depends on additional conditions, such as the number of matches between sequences or the power of the alphabet used in the sequences.

URL zum Clipboard kopiert
PLANETCALC, Largest common subsequence

Kommentare