Revised: September 18, 2006
Published: September 29, 2006
Abstract: [Plain Text Version]
Edit distance is a fundamental measure of distance between strings, the extensive study of which has recently focused on computational problems such as nearest neighbor search, sketching and fast approximation. A very powerful paradigm is to map the metric space induced by the edit distance into a normed space (e.g., $\ell_1$) with small distortion, and then use the rich algorithmic toolkit known for normed spaces. Although the minimum distortion required to embed edit distance into $\ell_1$ has received a lot of attention lately, there is a large gap between known upper and lower bounds. We make progress on this question by considering large, well-structured submetrics of the edit distance metric space.
Our main technical result is that the Ulam metric, namely, the edit distance on permutations of length at most $n$, embeds into $\ell_1$ with distortion $O(\log n)$. This immediately leads to sketching algorithms with constant size sketches, and to efficient approximate nearest neighbor search algorithms, with approximation factor $O(\log n)$. The embedding and its algorithmic consequences present a big improvement over those previously known for the Ulam metric, and they are significantly better than the state of the art for edit distance in general. Further, we extend these results for the Ulam metric to edit distance on strings that are (locally) non-repetitive, i.e., strings where (close by) substrings are distinct.