The package is registered in the `General` registry and so can be installed at the REPL with `] add StringDistances`.

Distances are defined over iterators that define `length` (this includes `AbstractStrings`, but also `GraphemeIterators` or `AbstractVectors`).

The available distances are:

- Edit Distances
  - Hamming Distance `Hamming()`
  - Jaro and Jaro-Winkler Distance `Jaro()` `JaroWinkler()`
  - Levenshtein Distance `Levenshtein()`
  - Damerau-Levenshtein Distance `DamerauLevenshtein()`
  - RatcliffObershelp Distance `RatcliffObershelp()`
- Q-gram distances compare the set of all substrings of length `q` in each string.
  - QGram Distance `QGram(q::Int)`
  - Cosine Distance `Cosine(q::Int)`
  - Jaccard Distance `Jaccard(q::Int)`
  - Overlap Distance `Overlap(q::Int)`
  - Sorensen-Dice Distance `SorensenDice(q::Int)`
  - MorisitaOverlap Distance `MorisitaOverlap(q::Int)`
  - Normalized Multiset Distance `NMD(q::Int)`
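To make the q-gram idea concrete, here is a minimal hand-rolled sketch of a set-based Jaccard distance on q-grams. It is only an illustration of the definition, not the package's implementation; the names `qgram_set` and `jaccard_sketch` are hypothetical, and returning `0.0` when both strings are shorter than `q` is an assumption.

```julia
# Illustrative sketch only; not the code used by StringDistances.

# Set of all substrings of length q, indexed by character rather than by byte.
function qgram_set(s::AbstractString, q::Int)
    chars = collect(s)
    n = length(chars) - q + 1            # number of q-grams (may be <= 0)
    return Set{String}(String(chars[i:i+q-1]) for i in 1:n)
end

# Jaccard distance: 1 - |A ∩ B| / |A ∪ B| over the two q-gram sets.
function jaccard_sketch(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    # Assumed convention: two strings with no q-grams are at distance 0.
    isempty(a) && isempty(b) && return 0.0
    return 1 - length(intersect(a, b)) / length(union(a, b))
end

# "martha" and "marhta" share the bigrams "ma" and "ar" out of 8 distinct ones.
jaccard_sketch("martha", "marhta", 2)    # 0.75
```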

You can compute any distance between two strings using either of the following forms:

```
evaluate(dist, s1, s2)
dist(s1, s2)
```

For instance, with the `Levenshtein` distance,

```
evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")
```
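For readers unfamiliar with what the Levenshtein distance measures, the following is a minimal dynamic-programming sketch with unit costs for insertion, deletion and substitution. It is an assumption-laden illustration (the name `levenshtein_sketch` is hypothetical) and is not how the package computes the distance.

```julia
# Illustrative two-row Wagner-Fischer sketch; not the package's implementation.
function levenshtein_sketch(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    # prev[j+1] = distance between the current prefix of `a` and the first j chars of `b`.
    prev = collect(0:length(b))
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i                          # delete the first i characters of `a`
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j+1] = min(prev[j+1] + 1,   # deletion
                            curr[j] + 1,     # insertion
                            prev[j] + cost)  # substitution (or match)
        end
        prev = curr
    end
    return prev[end]
end

# Swapping "t" and "h" costs two substitutions under plain Levenshtein.
levenshtein_sketch("martha", "marhta")       # 2
```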

`pairwise` returns the matrix of distances between two `AbstractVectors` of `AbstractStrings`:

```
pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
```

The function `pairwise` is particularly optimized for QGram-distances (each element is processed only once).
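The "processed only once" remark can be pictured as follows: the q-grams of every string are extracted a single time, and the matrix is then filled from those precomputed sets. This is a rough sketch of the idea under that assumption, using a set-based Jaccard distance; `qgrams` and `pairwise_jaccard_sketch` are hypothetical names, not package functions.

```julia
# Illustrative sketch of precomputing q-grams once per string; not the package's code.

# Set of all substrings of length q in s.
function qgrams(s::AbstractString, q::Int)
    c = collect(s)
    return Set{String}(String(c[i:i+q-1]) for i in 1:length(c)-q+1)
end

function pairwise_jaccard_sketch(xs, ys, q::Int)
    # Each string is turned into its q-gram set exactly once...
    qx = [qgrams(x, q) for x in xs]
    qy = [qgrams(y, q) for y in ys]
    # ...and R[i, j] is then the distance between xs[i] and ys[j].
    # (Inputs shorter than q would give 0/0 here; this sketch does not handle them.)
    return [1 - length(intersect(a, b)) / length(union(a, b)) for a in qx, b in qy]
end

R = pairwise_jaccard_sketch(["martha", "kitten"], ["marhta", "sitting"], 3)
size(R)    # (2, 2)
```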

The function `compare` returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns a `Float64`. A value of 0.0 means completely different and a value of 1.0 means maximally similar. For example, `Levenshtein()("martha", "martha")` gives `0.0`, whereas `compare("martha", "martha", Levenshtein())` gives `1.0`.
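The relationship between a distance and its similarity score can be sketched as below. The normalization used here (dividing by the length of the longer string) is an assumption chosen for illustration, and `hamming_sketch` and `similarity_sketch` are hypothetical stand-ins, not the package's functions.

```julia
# Illustrative sketch: similarity = 1 - normalized distance. Not the package's code.

# Stand-in distance: number of positions at which the two strings differ.
hamming_sketch(s1, s2) = count(((a, b),) -> a != b, zip(s1, s2))

function similarity_sketch(dist, s1::AbstractString, s2::AbstractString)
    n = max(length(s1), length(s2))
    n == 0 && return 1.0                 # assumed convention for two empty strings
    # Assumed normalization: divide by the longer length so the result lies in [0, 1].
    return 1 - dist(s1, s2) / n
end

similarity_sketch(hamming_sketch, "martha", "martha")    # 1.0
similarity_sketch(hamming_sketch, "martha", "marhta")    # 2 of 6 positions differ
```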

`findnearest` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is: `findnearest(s, itr, dist::StringDistance)`

`findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (defaults to 0.8). Its syntax is: `findall(s, itr, dist::StringDistance; min_score = 0.8)`

The functions `findnearest` and `findall` are particularly optimized for the `Levenshtein` and `DamerauLevenshtein` distances (these distances stop early if the distance is higher than a certain threshold).
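Conceptually, both functions reduce to scoring every candidate and then either taking the best one or filtering by a threshold. The sketch below shows that logic with a hypothetical stand-in similarity `sim`; it omits the early-stopping optimization, assumes `itr` is a non-empty indexable collection, and reads "higher than" as a strict comparison.

```julia
# Illustrative sketch of the search logic; not the package's implementation.

# Stand-in similarity in [0, 1]: share of positions holding the same character.
function sim(s1, s2)
    n = max(length(s1), length(s2))
    n == 0 && return 1.0
    return count(((a, b),) -> a == b, zip(s1, s2)) / n
end

# Value and index of the candidate with the highest score.
function findnearest_sketch(s, itr)
    scores = [sim(s, x) for x in itr]
    i = argmax(scores)                   # errors on an empty collection
    return itr[i], i
end

# Indices of all candidates scoring strictly above min_score (assumed strict).
findall_sketch(s, itr; min_score = 0.8) = findall(x -> sim(s, x) > min_score, itr)

candidates = ["martha", "marhta", "kitten"]
findnearest_sketch("marthe", candidates)             # ("martha", 1)
findall_sketch("marthe", candidates)                 # [1]
findall_sketch("marthe", candidates; min_score = 0.4)  # [1, 2]
```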

The package also defines Distance "modifiers" that can be applied to any distance.

- Partial returns the minimum of the distance between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word orders by returning the distance of the two strings, after re-ordering words alphabetically.
- TokenSet adjusts for differences in word orders and word numbers by returning the distance between the intersection of two strings with each string.
- TokenMax normalizes the distance and combines the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in fuzzywuzzy.
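
The first two modifiers above can be pictured with a short sketch. It uses a hypothetical stand-in `base_dist` in place of a real distance, and it assumes that `Partial` compares the shorter string against windows of the same length in the longer one; none of these names or choices come from the package itself.

```julia
# Illustrative sketch of the TokenSort and Partial ideas; not the package's code.

# Stand-in base distance: differing positions plus the difference in length.
base_dist(s1, s2) = count(((a, b),) -> a != b, zip(s1, s2)) + abs(length(s1) - length(s2))

# TokenSort idea: reorder the words alphabetically, then compare.
token_sort(s) = join(sort(split(s)), " ")
tokensort_sketch(s1, s2) = base_dist(token_sort(s1), token_sort(s2))

# Partial idea: best match of the shorter string against windows of the longer one.
function partial_sketch(s1, s2)
    a, b = collect(s1), collect(s2)
    short, long = length(a) <= length(b) ? (a, b) : (b, a)
    n = length(short)
    # Assumed: only windows with the same length as the shorter string are tried.
    return minimum(base_dist(short, long[i:i+n-1]) for i in 1:length(long)-n+1)
end

tokensort_sketch("new york mets", "mets new york")   # 0: same words, different order
partial_sketch("york", "new york mets")              # 0: "york" occurs as a substring
```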

References:

- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy
