String Algorithms

## CSCI-UA 480: APS
## Algorithmic Problem Solving
 
## Some String Algorithms

.license[
Copyright 2020 Joanna Klukowska. Unless noted otherwise all content is released under a 
[Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/). 
Background image by Stewart Weiss ]

---
layout:true
template: default
name: section
class: inverse, middle, center

---
layout:true
template: default
name: challenge
class: challenge

---
layout:true
template: default
name: poll
class: inverse, full-height, center, middle

---
layout:true
template: default
name: breakout
class: breakout

---

layout:true
template:default
name:slide
class: slide

---
## String Definitions

a __string__ `s` of length $n$ consists of characters `s[0]`, `s[1]`, ..., `s[n-1]`

a __substring__ `s[a..b]` is a sequence of consecutive characters in a string
that starts at position `a` and ends at position `b` (inclusive on both ends)

a __prefix__ is a substring for which `a=0`

a __suffix__ is a substring for which `b=n-1`

a __subsequence__ is any sequence of characters in a string in their original
order (not necesserily consecutive)

---
## Longest Common Subsequence

The __longest common subsequence__ (LCS) of two strings is the longest string
that appears as a subsequence in both strings.

Examples:

- "floor" and "donor", the lcs is "oor"
- "caged" and "range", the lcs is "age"
- "capsule" and "recaps", the lcs is "caps"

__Solution__

given: two strings `x` and `y`

`lcs(i,j)` - returns length of the longest common
subsequence of the prefixes `x[0..i]` and `y[0..j]`

```
      lcs(i,j) =

lcs(i-1,j-1)+1,      when  x[i]=y[j]       //there as a match

max(lcs(i-1,j), lcs(i,j-1) ),  otherwise   //no match, use the previous LCS

lcs(-1, j) = lcs(i, -1) = 0                   //base case
```

[Visualization of the algorithm](https://www.cs.usfca.edu/~galles/visualization/DPLCS.html)
(recursive complete search, top-down DP, bottom-up DP)

---

## Edit Distance

The __edit distance__ between two strings is defined as the minimum number
of editing operations that transform one string into the other.

The allowed operations may vary, but are often
- insert a character, "ABC" -> "ABCA"
- remove a character, "ABC" -> "AC"
- replace a character, "ABC" -> "ADC"
  (this one can be thought of as two separate operations of a remove followed by an insert,
or an insert followed by a remove)

__Solution__

given: two strings `x` and `y`

`edit(i,j)` - returns the edit distance between
the prefixes `x[0..i]` and `y[0..j]`

```
      edit(i,j) =  min (
              edit(i, j-1) + 1,
              edit(i-1, j) + 1,
              edit(i-1, j-1) + cost(i,j)
            )
```

---

## Edit Distance

The __edit distance__ between two strings is defined as the minimum number
of editing operations that transform one string into the other.

__Solution__

given: two strings `x` and `y`

`edit(i,j)` - returns the edit distance between
the prefixes `x[0..i]` and `y[0..j]`

```
      edit(i,j) =  min (
              edit(i, j-1) + 1,           //insert character at the end of x
              edit(i-1, j) + 1,           //remove character from the end of x
              edit(i-1, j-1) + cost(i,j)  //replace the last char in x with the one from y
            )
```

where `cost(i,j) = 0` when `x[i]=y[j]` and `cost(i,j) = 1`, otherwise.

_What should the base case be?_

```
      edit(-1,j) = edit(i, -1) = +INF
```
---

##  Pattern Matching

Given a pattern string `P` and a target string `T`, determine if and where the pattern
`P` occurs in `T`.

--

Variations of the problem
- find all occurrences of `P` in `T`
- count the number of occurrences of `P` in `T`
- find the longest prefix of `P` that occurs in `T`

---
name: naive
## Naive Approach

Compare the pattern string to the target string starting at each position of the target string:

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:  ABCDABD                   // mismatch at D
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:   ABCDABD                  // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:    ABCDABD                 // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:     ABCDABD                // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:      ABCDABD              // mismatch at second D
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:       ABCDABD              // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:        ABCDABD             // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:         ABCDABD            // mismatch at A
```

---
template: naive

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:          ABCDABD           // mismatch at C
```

and so on ...

```
for i in 0..|T|
  for j in 0..|P|
    if T[i+j] != P[j]
      mismatch found
      break from the inner loop
    if j == |P|-1   && T[i+j] == P[j]
      found the match!
```

This is $O(|T|\times|P|)$.

---
## Knuth-Morris-Pratt’s (KMP) Algorithm

Knuth-Morris-Pratt’s (KMP) algorithm for pattern matching in strings avoids
repeated comparisons of the part of the pattern string
for which we already have an answer.

--
name: kmp

It uses the observation that when a mismatch occurs,
the pattern itself contains sufficient information to
determine where the next match could begin. This allows to
skip re-examination of previously matched characters.

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:  ABCDABD
```

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:  ABCDABD
          1110
```

- The mismatch occurs at D at index 3 in the pattern.

- Since the pattern does not contain
an A in the range of indexes [1..2] and all the characters before index 3 matched,
we know that there is no point in trying to match the patters before index 3.

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:     ABCDABD

```

- We restart the search at index 3 (avoiding some comparisons).

- There is no match at index 3, so we move on to index 4.

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:      ABCDABD

```

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:      ABCDABD
              1111110
```

- From index 4 to 9 all characters match.

- The mismatch occurs at D at index 6 in the pattern and 10 in the target string.

- Before mismatch occurred we matched `AB` at indexes [4..6] in the pattern and [8..9]
in the target string.

- This is a prefix of the pattern, so we can shift the pattern to align index 0 of the pattern
with index 8 of the target string, and restart matching at index 10 (since we already know
that indexes [8..9] match).

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:          ABCDABD
                  110
```
---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:            ABCDABD
                    0
```
---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:             ABCDABD
                     1111110
```

- Again, mismatch at D at index 6 in the pattern and 17 in the target string.

- Similarly to before, we shift the pattern to align index 0 of the pattern with
index 15 of the target string, and start comparing at index 17.

---
template: kmp

Example of the idea:

```
                    1         2
index:    01234567890123456789012
string:   ABC ABCDAB ABCDABCDABDE
pattern:                 ABCDABD
                         1111111   match found
```

- This gives us the match.

[Visualization of KMP](http://www.whocouldthat.be/visualizing-string-matching/)

[another visualization of KMP](https://people.ok.ubc.ca/ylucet/DS/KnuthMorrisPratt.html)

---

## KMP
### How do we realign the pattern?

- For each position `i` in the pattern string `P`, we want to know the length of the longest
substring that ends at index `i-1` that matches the prefix of the pattern itself (and is not the
prefix itself).

--
name: fail

Example 1

```
                                  index     0 1 2 3 4 5 6
                                      P     A B C D A B D
length of substring that matches prefix    -1 0 0 0 0 1 2

```
------

Example 2

```
                                  index     0 1 2 3 4 5 6 7 8
                                      P     A B C A B A B C B
length of substring that matches prefix

```

---

Example 2

```
                                  index     0 1 2 3 4 5 6 7 8
                                      P     A B C A B A B C B
length of substring that matches prefix    -1 0 0 0 1 2 1 2 3

```

---

How do we compute this array?

--
name: fail-code
.left-column2[

`fail` is an array with length equal to |P|

```
i = 0       //position to be calculated
j = -1      //length of the match
fail[0] = -1

while i < |P|
 while j >= 0 and P[i] != P[j]
 j = fail[j]
 i++
 j++
 fail[i] = j

```
]
---
template: fail-code

before loop is entered:

```
i = 0
j = -1

fail = [-1, -, -, -, -, -, - ]
```
---
template: fail-code

after first iteration of the outer loop

```
i = 1
j = 0

fail = [-1, 0, -, -, -, -, - ]
```
---
template: fail-code

during and after second iteration of the outer loop

```
i = 1
j = -1

fail = [-1, 0, -, -, -, -, - ]

i = 2
j = 0

fail = [-1, 0, 0, -, -, -, - ]

```
---
template: fail-code

during and after third iteration of the outer loop

```
i = 2
j = -1

fail = [-1, 0, 0, -, -, -, - ]

i = 3
j = 0

fail = [-1, 0, 0, 0, -, -, - ]

```
---
template: fail-code

during and after fourth iteration of the outer loop

```
i = 3
j = -1

fail = [-1, 0, 0, 0, -, -, - ]

i = 4
j = 0

fail = [-1, 0, 0, 0, 0, -, - ]

```
---
template: fail-code

during and after fifth iteration of the outer loop

```
i = 4
j = 0     //skip inner while, P[4] == P[0]

fail = [-1, 0, 0, 0, 0, -, - ]

i = 5
j = 1

fail = [-1, 0, 0, 0, 0, 1, - ]

```
---
template: fail-code

during and after sixth iteration of the outer loop

```
i = 5
j = 1     //skip inner while, P[5] == P[1]

fail = [-1, 0, 0, 0, 0, 1, - ]

i = 6
j = 2

fail = [-1, 0, 0, 0, 0, 1, 2 ]

```

---

## KMP in problem solving

- Modify the naive pattern matching algorithm to use the `fail` array.

- Modify it further to
  - find all occurrences of `P` in `T`
  - count the number of occurrences of `P` in `T`
  - find the longest prefix of `P` that occurs in `T` (if `P` occurs in `T`
    then the answer is `|P|`)

---
## Z-Array / Z-Algorithm

__Z-array__

The Z-array for a string `s` of length `n` is an array of length `n`
in which `Z[i]` stores the length of the longest substring starting
at `s[i]` that is also a prefix of `s`.

--
.left-column2[

Example 1

`s = "ABCABCABAB"`

```
index   0  1  2  3  4  5  6  7  8  9
  s     A  B  C  A  B  C  A  B  A  B
  Z     -  0  0  5  0  0  2  0  2  0

```
]
.right-column2[

Example 2

`s = "aaaaaa"`

```
index   0  1  2  3  4  5
  s     a  a  a  a  a  a
  Z     -  5  4  3  2  1

```
]

__Z-Algorithm__

Z-Algorithm is an algorithm that computes the Z-Array for a given string `s`.

- naive implementation takes O(N^2)
- efficient implementation is O(N)

<!--
Z-Algorithm - Take One

```
s - string for which we want to compute the z-array

z compute(s)
 n = length of s
 create z array of length n initialized to zero
 for i in 1...n
 while ( i + z[i] < n) && ( s[z[i]] == s[i + z[i]] )
 increment z[i] by 1
 return z

```

works, but $O(n^2)$

if a substring repeats, we should be able to reuse previously calculated information

Z-Algorithm - Take Two

- observer that for a character at index `i` of `s`, the value of `z[i]` implies
that substring starting at `i` and ending at `i+z[i]-1` _matches_ the prefix of `s`

- define a __segment__ to be a substring that matches the previx of `s`

- we'll keep track of indexes `[l,r]` that delimit the rightmost known segment (i.e., the one that ends
rightmost)

- if current index `i > r`, then the position is outside of the region that
      we already processed and we compute `z[i]` comparing values one by one (i.e., using the _take one_ algorithm approach)

- if current index `i<=r`, then the position is inside a segment that was already processed and
 we should be able to initialize `z[i]` based on what was computed previously

- note that substrings `s[l...r]` and `s[0...r-l]` match, so we can initialize `z[i]`
 to be equal to the corresponding value from the previously processed segment, i.e., `z[i-l]` 
 (this may not be its final value)

- however, the vlaue of `z[i-l]` might be too large (it could exceed `r`, `i + z[i-l] > r` and
          we do not know anything about the string past the character at index `r`)

- so the initial estimate for `z[i]` can be made as `z0[i]= min(r-i+1, z[i-1])`

Z-Algorithm - Take Two

```
z compute( s )
 n = length of s
 create z array of length n initialized to zero
 l = 0, r = 0
 for i = 1...n
 if (i <= r)
 z[i] = min (r - i + 1, z[i - l]) // estimate z[i]
 while i + z[i] < n && s[z[i]] == s[i + z[i]])
 increment z[i]
 if i + z[i] - 1 > r
 l = i
 r = i + z[i] - 1
 return z
```

this is $O(n)$, for a proof see, for example, https://cp-algorithms.com/string/z-function.html

Visualization:
[http://www.utdallas.edu/~besp/demo/John2010/z-algorithm.htm](http://www.utdallas.edu/~besp/demo/John2010/z-algorithm.htm)

-->

---

## Challenge: Pattern Matching

Find all locations of a pattern string `p` in a given string `s`.

__Solution__

- create a new string `p#s` in which `#` is a special character that does not
occur in neither `p` nor `s`
- create the z-array for the new string
- the locations in the z-array for which the value is equal to the length
of the pattern string `p` are the location of the pattern in `s` (adjust indexes
by subtracting the `length(p)+1`)

[Visualization of the solution](https://algorithm-visualizer.org/dynamic-programming/z-string-search)

---

## Challenge: Finding Borders

A __border__ in a string is a substring that is both a prefix and a sufix of that
string (but not the entire string, i.e., proper prefix and proper suffix).

Example:

`s = ABACABACABA`

the borders are

```
          -
A         ABACABACABA
                    -

---
ABA       ABACABACABA
                  ---

-------
ABACABA   ABACABACABA
              -------

```

---

## Challenge: Finding Borders

__Solution__

- create the z-array for `s`
- boarders are all suffixes `s[k..n-1]`, such that `k+z[k]=n`

```
index   0  1  2  3  4  5  6  7  8  9 10
  s     A  B  A  C  A  B  A  C  A  B  A
  Z     -  ?  ?  ?  ?  ?  ?  ?  ?  ?  ?

```

```
index   0  1  2  3  4  5  6  7  8  9 10
  s     A  B  A  C  A  B  A  C  A  B  A
  Z     -  0  1  0  7  0  1  0  3  0  1

|           |     |

```

---

## Another Pattern Matching: Rabin-Karp Algorithm

- Let `hash(s)` be a function that maps any string to an integer

```
        match ( P, T )  //P = pattern, T = target
          hp = hash(P)

for i = 0 .. |T| - |P|
            ht = hash(T[i..i+|P|-1])
            if hp == ht
              match found
              break
```

- What is the performance of this?
--
    __$O(|T|\times|P|)$__

- Can we do better? can we make the call to `hash()` in the loop $O(1)$?

---

## Polynomial Hash or Rolling Hash

- Consider a string s = c0c1c2...cn-1

- Assign unique values to all characters in the alphabet. For example, if the
alphabet is $\Sigma = {a, b, c, ..., z}$ (all lowercase letters), then such
an assignment could be `'a'= 1`, `'b'= 2`, ..., `'z'= 26`.

- Pick a constant `A` and a large, prime number `M`

- Compute the `hash()` function as follows

`hash(s)` = (c0A0 + c1A1 + ... + cn-1An-1 ) % M = $\Sigma$i=0N-1ciAi % M

Example

- A = 2, M = 109+7 (large enough that it does not really do anything in the computations below)

- `hash(abc)` = $1\times 2^0 + 2\times 2^1 + 3\times 2^2 ) % M$ = 17

- `hash(abcc)` = $1\times 2^0 + 2\times 2^1 + 3\times 2^2 + 4\times 2^3) % M$ = 59
--
  (or $17 + 4\times 2^3$)

- `hash(xxy)` = $(24\times2^0 + 24\times2^1 + 25\times2^2) % M$ = 172

- `hash(abcxxy)` = $(1\times 2^0 + 2\times 2^1 + 3\times 2^2 + 24\times2^3 + 24\times2^4 + 25\times2^5) % M$ = 1393
--
  (or $17 + 172\times 2^3$)

- `hash(bcxxy)` = $(2\times 2^0 + 3\times 2^1 + 24\times2^2 + 24\times2^3 + 25\times2^4) % M$ = 696
--
  (or $1393 / 2$ - integer division)

---

</optgroup>