class: center, title-slide
## CSCI-UA 480: APS
## Algorithmic Problem Solving
## Some String Algorithms

.author[
Instructor: Joanna Klukowska
] .license[ Copyright 2020 Joanna Klukowska. Unless noted otherwise all content is released under a
[Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
Background image by Stewart Weiss
]

---

layout:true
template: default
name: section
class: inverse, middle, center

---

layout:true
template: default
name: challenge
class: challenge

---

layout:true
template: default
name: poll
class: inverse, full-height, center, middle

---

layout:true
template: default
name: breakout
class: breakout

---

layout:true
template:default
name:slide
class: slide

.bottom-left[© Joanna Klukowska. CC-BY-SA.]

---

## String Definitions

- a __string__ `s` of length $n$ consists of characters `s[0]`, `s[1]`, ..., `s[n-1]`
- a __substring__ `s[a..b]` is a sequence of consecutive characters in a string that starts at position `a` and ends at position `b` (inclusive on both ends)
- a __prefix__ is a substring for which `a=0`
- a __suffix__ is a substring for which `b=n-1`
- a __subsequence__ is any sequence of characters in a string in their original order (not necessarily consecutive)

---

## Longest Common Subsequence

The __longest common subsequence__ (LCS) of two strings is the longest string that appears as a subsequence in both strings.

Examples:

- "floor" and "donor", the LCS is "oor"
- "caged" and "range", the LCS is "age"
- "capsule" and "recaps", the LCS is "caps"

--

__Solution__

given: two strings `x` and `y`

`lcs(i,j)` - returns the length of the longest common subsequence of the prefixes `x[0..i]` and `y[0..j]`

```
lcs(i,j) = lcs(i-1,j-1) + 1,             when x[i]=y[j]   //there is a match
           max(lcs(i-1,j), lcs(i,j-1)),  otherwise        //no match, use the previous LCS

lcs(-1, j) = lcs(i, -1) = 0              //base case
```

--
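The recurrence above can be filled in bottom-up. The following is a minimal Python sketch, not the course's reference solution; the function name `lcs_length` and the one-position shift of the table (so that the `-1` base cases fit into row 0 and column 0) are choices made for this example.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y (bottom-up DP)."""
    n, m = len(x), len(y)
    # table[i + 1][j + 1] plays the role of lcs(i, j) from the slide;
    # row 0 and column 0 hold the base cases lcs(-1, j) = lcs(i, -1) = 0.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if x[i] == y[j]:
                # match: extend the LCS of the two shorter prefixes
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                # no match: reuse the better of the two previous answers
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    return table[n][m]
```

For the examples on this slide, `lcs_length("floor", "donor")` evaluates to 3 and `lcs_length("capsule", "recaps")` to 4. The table takes $O(|x|\times|y|)$ time and space.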
[Visualization of the algorithm](https://www.cs.usfca.edu/~galles/visualization/DPLCS.html) (recursive complete search, top-down DP, bottom-up DP)

---

## Edit Distance

The __edit distance__ between two strings is defined as the minimum number of editing operations that transform one string into the other.

The allowed operations may vary, but are often

- insert a character, "ABC" -> "ABCA"
- remove a character, "ABC" -> "AC"
- replace a character, "ABC" -> "ADC" (this one can be thought of as two separate operations: a remove followed by an insert, or an insert followed by a remove)

--

__Solution__

given: two strings `x` and `y`

`edit(i,j)` - returns the edit distance between the prefixes `x[0..i]` and `y[0..j]`

--

```
edit(i,j) = min ( edit(i, j-1) + 1,            //insert character at the end of x
                  edit(i-1, j) + 1,            //remove character from the end of x
                  edit(i-1, j-1) + cost(i,j)   //replace the last char in x with the one from y
                )
```

where `cost(i,j) = 0` when `x[i]=y[j]` and `cost(i,j) = 1` otherwise.

--

_What should the base case be?_

--

```
edit(-1, j)  = j + 1    //empty prefix of x: insert all j+1 characters of y[0..j]
edit(i, -1)  = i + 1    //empty prefix of y: remove all i+1 characters of x[0..i]
edit(-1, -1) = 0        //both prefixes empty
```

---

## Pattern Matching

Given a pattern string `P` and a target string `T`, determine if and where the pattern `P` occurs in `T`.

--
Variations of the problem

- find all occurrences of `P` in `T`
- count the number of occurrences of `P` in `T`
- find the longest prefix of `P` that occurs in `T`

---
name: naive

## Naive Approach

Compare the pattern string to the target string starting at each position of the target string:

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:  ABCDABD                     // mismatch at D
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:   ABCDABD                    // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:    ABCDABD                   // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:     ABCDABD                  // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:      ABCDABD                 // mismatch at second D
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:       ABCDABD                // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:        ABCDABD               // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:         ABCDABD              // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:          ABCDABD             // mismatch at C
```

--
and so on ... --
```
for i in 0..|T|-|P|                 //last start position at which P still fits in T
    for j in 0..|P|-1
        if T[i+j] != P[j]
            mismatch found
            break from the inner loop
        if j == |P|-1
            found the match!        //all |P| characters matched, P occurs at index i
```

--
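The pseudocode above can be written as runnable Python. This is a minimal sketch: the function name `naive_find_all` and the decision to collect every match position (rather than stop at the first one) are assumptions made for illustration.

```python
def naive_find_all(pattern: str, target: str) -> list[int]:
    """Return every index of target at which pattern occurs (naive scan)."""
    matches = []
    # the last start position at which the pattern still fits in the target
    for i in range(len(target) - len(pattern) + 1):
        j = 0
        # compare character by character until a mismatch or the pattern ends
        while j < len(pattern) and target[i + j] == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i)
    return matches
```

On the running example, `naive_find_all("ABCDABD", "ABC ABCDAB ABCDABCDABDE")` returns `[15]`.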
This is $O(|T|\times|P|)$.

---

## Knuth-Morris-Pratt’s (KMP) Algorithm

Knuth-Morris-Pratt’s (KMP) algorithm for pattern matching in strings avoids repeated comparisons of the part of the pattern string for which we already have an answer.

--
name: kmp

It uses the observation that when a mismatch occurs, the pattern itself contains sufficient information to determine where the next match could begin. This allows us to skip re-examination of previously matched characters.

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:  ABCDABD
```

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:  ABCDABD
          1110
```

- The mismatch occurs at D at index 3 in the pattern.
- Since the pattern does not contain an A in the range of indexes [1..2] and all the characters before index 3 matched, we know that there is no point in trying to match the pattern starting anywhere before index 3 of the target string.

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:     ABCDABD
```

- We restart the search at index 3 (avoiding some comparisons).

--

- There is no match at index 3, so we move on to index 4.

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:      ABCDABD
```

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:      ABCDABD
              1111110
```

- From index 4 to 9 all characters match.
- The mismatch occurs at D at index 6 in the pattern and index 10 in the target string.

--

- Before the mismatch occurred we matched `AB` at indexes [4..5] in the pattern and [8..9] in the target string.
- This is a prefix of the pattern, so we can shift the pattern to align index 0 of the pattern with index 8 of the target string, and restart matching at index 10 (since we already know that indexes [8..9] match).
---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:          ABCDABD
                  110
```

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:            ABCDABD
                    0
```

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:             ABCDABD
                     1111110
```

--

- Again, the mismatch is at D at index 6 in the pattern and index 17 in the target string.
- As before, we shift the pattern to align index 0 of the pattern with index 15 of the target string, and start comparing at index 17.

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:                 ABCDABD
                         1111111   match found
```

--

- This gives us the match.

--
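The shifting behaviour traced above can be sketched in Python. This is an illustrative sketch, not the course's reference code: the names `build_fail` and `kmp_find_all` are hypothetical, and the helper table `fail` is allocated with one extra entry (`len(pattern) + 1`) so that the search can keep going after a full match and report every occurrence.

```python
def build_fail(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern that is
    also a suffix of pattern[0..i-1]; fail[0] = -1 by convention."""
    m = len(pattern)
    fail = [-1] * (m + 1)   # assumption: one extra slot, used after a full match
    i, j = 0, -1
    while i < m:
        # fall back to shorter prefixes until pattern[i] extends one of them
        while j >= 0 and pattern[i] != pattern[j]:
            j = fail[j]
        i += 1
        j += 1
        fail[i] = j
    return fail


def kmp_find_all(pattern: str, target: str) -> list[int]:
    """Return every index of target at which pattern occurs."""
    if not pattern:
        return []
    fail = build_fail(pattern)
    matches = []
    j = 0                   # number of pattern characters currently matched
    for i, ch in enumerate(target):
        # on a mismatch, realign the pattern instead of moving back in target
        while j >= 0 and ch != pattern[j]:
            j = fail[j]
        j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = fail[j]     # continue, allowing overlapping occurrences
    return matches
```

Each target character is examined once and `j` only falls back as far as it previously advanced, which is why the search runs in $O(|T|+|P|)$ rather than $O(|T|\times|P|)$.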
[Visualization of KMP](http://www.whocouldthat.be/visualizing-string-matching/)

[another visualization of KMP](https://people.ok.ubc.ca/ylucet/DS/KnuthMorrisPratt.html)

---

## KMP

### How do we realign the pattern?

- For each position `i` in the pattern string `P`, we want to know the length of the longest substring that ends at index `i-1` and matches a prefix of the pattern (excluding the trivial case in which the substring is that prefix itself).

--
name: fail

Example 1

```
index                                      0  1  2  3  4  5  6
P                                          A  B  C  D  A  B  D
length of substring that matches prefix   -1  0  0  0  0  1  2
```

------

--

Example 2

```
index                                      0  1  2  3  4  5  6  7  8
P                                          A  B  C  A  B  A  B  C  B
length of substring that matches prefix
```

---
template: fail

Example 2

```
index                                      0  1  2  3  4  5  6  7  8
P                                          A  B  C  A  B  A  B  C  B
length of substring that matches prefix   -1  0  0  0  1  2  1  2  3
```

---
template: fail

How do we compute this array?

--
name: fail-code

.left-column2[
`fail` is an array with length equal to |P|

```
i = 0          //position to be calculated
j = -1         //length of the match
fail[0] = -1
while i < |P| - 1                      //stop once fail[|P|-1] is filled
    while j >= 0 and P[i] != P[j]
        j = fail[j]
    i++
    j++
    fail[i] = j
```
]

---
template: fail-code

before the loop is entered:

```
i = 0
j = -1
fail = [-1, -, -, -, -, -, - ]
```

---
template: fail-code

after the first iteration of the outer loop

```
i = 1
j = 0
fail = [-1, 0, -, -, -, -, - ]
```

---
template: fail-code

during and after the second iteration of the outer loop

```
i = 1
j = -1
fail = [-1, 0, -, -, -, -, - ]

i = 2
j = 0
fail = [-1, 0, 0, -, -, -, - ]
```

---
template: fail-code

during and after the third iteration of the outer loop

```
i = 2
j = -1
fail = [-1, 0, 0, -, -, -, - ]

i = 3
j = 0
fail = [-1, 0, 0, 0, -, -, - ]
```

---
template: fail-code

during and after the fourth iteration of the outer loop

```
i = 3
j = -1
fail = [-1, 0, 0, 0, -, -, - ]

i = 4
j = 0
fail = [-1, 0, 0, 0, 0, -, - ]
```

---
template: fail-code

during and after the fifth iteration of the outer loop

```
i = 4
j = 0         //skip inner while, P[4] == P[0]
fail = [-1, 0, 0, 0, 0, -, - ]

i = 5
j = 1
fail = [-1, 0, 0, 0, 0, 1, - ]
```

---
template: fail-code

during and after the sixth iteration of the outer loop

```
i = 5
j = 1         //skip inner while, P[5] == P[1]
fail = [-1, 0, 0, 0, 0, 1, - ]

i = 6
j = 2
fail = [-1, 0, 0, 0, 0, 1, 2 ]
```

---

## KMP in problem solving

- Modify the naive pattern matching algorithm to use the `fail` array.
- Modify it further to
  - find all occurrences of `P` in `T`
  - count the number of occurrences of `P` in `T`
  - find the longest prefix of `P` that occurs in `T` (if `P` occurs in `T` then the answer is `|P|`)

---

## Z-Array / Z-Algorithm

__Z-array__

--

The Z-array for a string `s` of length `n` is an array of length `n` in which `Z[i]` stores the length of the longest substring starting at `s[i]` that is also a prefix of `s`.

--

.left-column2[
Example 1

`s = "ABCABCABAB"`

```
index  0  1  2  3  4  5  6  7  8  9
s      A  B  C  A  B  C  A  B  A  B
Z      -  0  0  5  0  0  2  0  2  0
```
]

.right-column2[
Example 2

`s = "aaaaaa"`

```
index  0  1  2  3  4  5
s      a  a  a  a  a  a
Z      -  5  4  3  2  1
```
]

--
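The definition of the Z-array can be turned into code directly. Below is a minimal Python sketch of the standard linear-time computation; the function name `z_array` and the choice to leave `Z[0]` as 0 (shown as `-` in the examples) are assumptions made for this example.

```python
def z_array(s: str) -> list[int]:
    """z[i] = length of the longest substring starting at s[i] that is also
    a prefix of s; z[0] is left as 0 (it is undefined in the examples)."""
    n = len(s)
    z = [0] * n
    # [left, right) is the rightmost window known to match a prefix of s
    left = right = 0
    for i in range(1, n):
        if i < right:
            # reuse the value computed for the mirrored position i - left,
            # capped so that we never claim anything beyond the window
            z[i] = min(right - i, z[i - left])
        # extend the match by direct comparison
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z
```

Dropping the `if i < right` shortcut gives the naive quadratic version; with it, `right` only ever moves forward, so the total work is linear in the length of `s`.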
__Z-Algorithm__

The Z-Algorithm computes the Z-array for a given string `s` of length $n$.

- naive implementation takes $O(n^2)$
- efficient implementation is $O(n)$

---
template: challenge

## Challenge: Pattern Matching

Find all locations of a pattern string `p` in a given string `s`.

--

__Solution__

- create a new string `p#s` in which `#` is a special character that occurs in neither `p` nor `s`
- create the z-array for the new string
- the locations in the z-array for which the value is equal to the length of the pattern string `p` are the locations of the pattern in `s` (adjust indexes by subtracting `length(p)+1`)

--

[Visualization of the solution](https://algorithm-visualizer.org/dynamic-programming/z-string-search)

---
template: challenge

## Challenge: Finding Borders

A __border__ of a string is a substring that is both a prefix and a suffix of that string (but not the entire string, i.e., a proper prefix and a proper suffix).

Example: `s = ABACABACABA`

the borders are

```
-                A
ABACABACABA
          -

---              ABA
ABACABACABA
        ---

-------          ABACABA
ABACABACABA
    -------
```

---

## Challenge: Finding Borders

__Solution__

- create the z-array for `s`
- borders are all suffixes `s[k..n-1]` such that `k+z[k]=n`

```
index  0  1  2  3  4  5  6  7  8  9 10
s      A  B  A  C  A  B  A  C  A  B  A
Z      -  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?
```

--

```
index  0  1  2  3  4  5  6  7  8  9 10
s      A  B  A  C  A  B  A  C  A  B  A
Z      -  0  1  0  7  0  1  0  3  0  1
                   |           |     |
```

---

## Another Pattern Matching: Rabin-Karp Algorithm

- Let `hash(s)` be a function that maps any string to an integer

```
match ( P, T )        //P = pattern, T = target
    hp = hash(P)
    for i = 0 .. |T| - |P|
        ht = hash(T[i..i+|P|-1])
        if hp == ht
            match found     //equal hashes can collide, confirm by comparing characters
            break
```

--

- What is the performance of this?

--

__$O(|T|\times|P|)$__

--

- Can we do better? Can we make the call to `hash()` in the loop $O(1)$?

---

## Polynomial Hash or Rolling Hash

- Consider a string $s = c_0 c_1 c_2 \ldots c_{n-1}$

--

- Assign unique values to all characters in the alphabet. For example, if the alphabet is $\Sigma = \lbrace a, b, c, \ldots, z \rbrace$ (all lowercase letters), then such an assignment could be `'a'= 1`, `'b'= 2`, ..., `'z'= 26`.

--

- Pick a constant `A` and a large prime number `M`

--

- Compute the `hash()` function as follows

`hash(s)` = $(c_0 A^0 + c_1 A^1 + \ldots + c_{n-1} A^{n-1}) \bmod M = \left(\sum_{i=0}^{n-1} c_i A^i\right) \bmod M$

--

Example

- $A = 2$, $M = 10^9+7$ (large enough that it does not really do anything in the computations below)
- `hash(abc)` = $(1\times 2^0 + 2\times 2^1 + 3\times 2^2) \bmod M$ = 17

--

- `hash(abcd)` = $(1\times 2^0 + 2\times 2^1 + 3\times 2^2 + 4\times 2^3) \bmod M$ = 49

--

 (or $17 + 4\times 2^3$)

--

- `hash(xxy)` = $(24\times2^0 + 24\times2^1 + 25\times2^2) \bmod M$ = 172

--

- `hash(abcxxy)` = $(1\times 2^0 + 2\times 2^1 + 3\times 2^2 + 24\times2^3 + 24\times2^4 + 25\times2^5) \bmod M$ = 1393

--

 (or $17 + 172\times 2^3$)

--

- `hash(bcxxy)` = $(2\times 2^0 + 3\times 2^1 + 24\times2^2 + 24\times2^3 + 25\times2^4) \bmod M$ = 696

--

 (or $1393 / 2$ - integer division, which drops the contribution $1\times 2^0$ of the leading `a`)

---
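## Polynomial Hash in Rabin-Karp: a sketch

The examples above suggest how to obtain the hash of any window of `T` in $O(1)$: precompute the hashes of all prefixes of `T`, then subtract. The Python sketch below is one possible way to do this, under stated assumptions: the alphabet is lowercase letters with `'a'= 1`, ..., `'z'= 26`, the constants are $A = 2$ and $M = 10^9+7$ as in the example, and the helper names (`value`, `poly_hash`, `prefix_hashes`, `rabin_karp`) are hypothetical. To avoid division modulo `M`, it compares `hash(P)` multiplied by $A^i$ against the difference of two prefix hashes, and it confirms each candidate by comparing characters, since equal hashes can collide.

```python
A = 2
M = 10**9 + 7


def value(ch: str) -> int:
    """Assumed character values: 'a' = 1, 'b' = 2, ..., 'z' = 26."""
    return ord(ch) - ord("a") + 1


def poly_hash(s: str) -> int:
    """hash(s) = (c_0*A^0 + c_1*A^1 + ... + c_{n-1}*A^{n-1}) mod M."""
    h, power = 0, 1
    for ch in s:
        h = (h + value(ch) * power) % M
        power = (power * A) % M
    return h


def prefix_hashes(s: str) -> tuple[list[int], list[int]]:
    """prefix[k] = hash of the first k characters of s; powers[k] = A^k mod M."""
    prefix = [0] * (len(s) + 1)
    powers = [1] * (len(s) + 1)
    for k, ch in enumerate(s):
        prefix[k + 1] = (prefix[k] + value(ch) * powers[k]) % M
        powers[k + 1] = (powers[k] * A) % M
    return prefix, powers


def rabin_karp(pattern: str, target: str) -> list[int]:
    """Return every index of target at which pattern occurs."""
    n, m = len(target), len(pattern)
    if m == 0 or m > n:
        return []
    prefix, powers = prefix_hashes(target)
    hp = poly_hash(pattern)
    matches = []
    for i in range(n - m + 1):
        # prefix[i+m] - prefix[i] equals hash(target[i..i+m-1]) * A^i (mod M)
        window = (prefix[i + m] - prefix[i]) % M
        if window == (hp * powers[i]) % M and target[i:i + m] == pattern:
            matches.append(i)
    return matches
```

With these definitions `poly_hash("abcxxy")` gives 1393 and `poly_hash("bcxxy")` gives 696, matching the example slide, and `rabin_karp("abc", "xabcabc")` returns `[1, 4]`.

---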