This Julia package computes various distances between `AbstractString`s.

The package is registered in the `General` registry and so can be installed at the REPL with `] add StringDistances`.

The function `compare` returns a similarity score between two strings. The score is always between 0 and 1, where 0 means completely different and 1 means identical. Its syntax is:

```
compare(s1::AbstractString, s2::AbstractString, dist::StringDistance)
```

The available distances are:

- Edit Distances
    - Jaro Distance `Jaro()`
    - Levenshtein Distance `Levenshtein()`
    - Damerau-Levenshtein Distance `DamerauLevenshtein()`
    - Ratcliff-Obershelp Distance `RatcliffObershelp()`
- Q-gram distances compare the set of all substrings of length `q` in each string.
    - QGram Distance `QGram(q::Int)`
    - Cosine Distance `Cosine(q::Int)`
    - Jaccard Distance `Jaccard(q::Int)`
    - Overlap Distance `Overlap(q::Int)`
    - Sorensen-Dice Distance `SorensenDice(q::Int)`
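
To make the q-gram idea concrete, here is a minimal, self-contained sketch of a Jaccard distance computed on sets of q-grams. It does not use the package and is not its implementation; the helper names `qgrams` and `jaccard_distance` are hypothetical, and the handling of strings shorter than `q` is an assumption.

```julia
# Illustrative sketch only; `qgrams` and `jaccard_distance` are hypothetical names.

# Collect the set of all length-q substrings of `s`, counted by character.
function qgrams(s::AbstractString, q::Int)
    chars = collect(s)
    return Set{String}(String(chars[i:i + q - 1]) for i in 1:(length(chars) - q + 1))
end

# Jaccard distance: one minus the share of q-grams common to both strings.
function jaccard_distance(s1::AbstractString, s2::AbstractString, q::Int)
    a, b = qgrams(s1, q), qgrams(s2, q)
    # Assumption: two strings with no q-grams at all are treated as identical.
    isempty(a) && isempty(b) && return 0.0
    return 1.0 - length(intersect(a, b)) / length(union(a, b))
end

qgrams("martha", 2)                      # the set "ma", "ar", "rt", "th", "ha"
jaccard_distance("martha", "marhta", 2)  # 1 - 2/8 = 0.75
```

Only "ma" and "ar" are shared between the two strings, out of eight distinct 2-grams overall, which gives the 0.75 above.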
The package includes distance "modifiers" that can be applied to any distance.

- Winkler boosts the similarity score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.
- Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word order by reordering words alphabetically.
- TokenSet adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
- TokenMax combines scores using the base distance, the `Partial`, `TokenSort` and `TokenSet` modifiers, with penalty terms depending on string lengths.
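
The ideas behind two of these modifiers can be sketched in a few lines of plain Julia. This is a rough illustration under stated assumptions, not the package's code: `token_sort`, `partial_score` and `exact` are hypothetical names, `sim` stands for any function returning a score between 0 and 1, and the package may select substrings differently for some distances.

```julia
# Illustrative sketch only; all names below are hypothetical.

# TokenSort idea: split on whitespace, sort the words, rejoin, then compare.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

# Partial idea: slide a window as long as the shorter string over the longer
# string and keep the best score found.
function partial_score(sim, s1::AbstractString, s2::AbstractString)
    short, long = length(s1) <= length(s2) ? (s1, s2) : (s2, s1)
    cl = collect(long)
    n = length(short)
    return maximum(sim(short, String(cl[i:i + n - 1])) for i in 1:(length(cl) - n + 1))
end

# A deliberately crude similarity, used only to keep the example self-contained.
exact(a, b) = a == b ? 1.0 : 0.0

token_sort("mariners vs angels")                       # "angels mariners vs"
exact("Yankees", "New York Yankees")                   # 0.0
partial_score(exact, "Yankees", "New York Yankees")    # 1.0
```

After sorting, "mariners vs angels" and "angels vs mariners" become the same string, so any base distance would then rate them as identical.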

Some examples:

```
using StringDistances

compare("martha", "marhta", Jaro())
compare("martha", "marhta", Winkler(Jaro()))
compare("martha", "marhta", QGram(2))
compare("martha", "marhta", Winkler(QGram(2)))
compare("martha", "marhta", Levenshtein())
compare("martha", "marhta", Partial(Levenshtein()))
compare("martha", "marhta", Jaro())
compare("martha", "marhta", TokenSet(Jaro()))
compare("martha", "marhta", TokenMax(RatcliffObershelp()))
```

A good distance to match strings composed of multiple words (like addresses) is `TokenMax(Levenshtein())` (see fuzzywuzzy).

`findmax` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is: `findmax(s::AbstractString, itr, dist::StringDistance; min_score = 0.0)`

`findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (defaults to 0.8). Its syntax is: `findall(s::AbstractString, itr, dist::StringDistance; min_score = 0.8)`
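
The behavior described for `findmax` and `findall` can be sketched as a plain scan over the collection. The helpers below (`best_match`, `all_matches`, `prefix_sim`) are hypothetical and unoptimized, and they take an arbitrary similarity function instead of a `StringDistance`. Treating `min_score` as an inclusive bound and returning `(nothing, nothing)` when no element qualifies are both assumptions.

```julia
# Illustrative sketch only; all names below are hypothetical.

# Return (score, index) of the best-scoring element, ignoring scores below `min_score`.
function best_match(sim, s, itr; min_score = 0.0)
    best_score, best_index = nothing, nothing
    for (i, x) in pairs(itr)
        score = sim(s, x)
        if score >= min_score && (best_score === nothing || score > best_score)
            best_score, best_index = score, i
        end
    end
    return best_score, best_index
end

# Return the indices of all elements scoring at least `min_score`.
all_matches(sim, s, itr; min_score = 0.8) =
    [i for (i, x) in pairs(itr) if sim(s, x) >= min_score]

# Toy similarity: length of the common prefix divided by the longer length.
function prefix_sim(a, b)
    n = max(length(a), length(b))
    n == 0 && return 1.0
    common = 0
    for (x, y) in zip(a, b)
        x == y || break
        common += 1
    end
    return common / n
end

names = ["New York", "Newark", "Boston"]
best_match(prefix_sim, "New York", names)                     # (1.0, 1)
all_matches(prefix_sim, "New York", names)                    # [1]
all_matches(prefix_sim, "New York", names; min_score = 0.3)   # [1, 2]
```

A minimum score lets a scan discard weak candidates, which is one plausible reason the real functions accept `min_score`.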

The functions `findmax` and `findall` are particularly optimized for `Levenshtein` and `DamerauLevenshtein` distances (as well as their modifications via `Partial`, `TokenSort`, `TokenSet`, or `TokenMax`).

The function `compare` returns a similarity score: a value of 0 means completely different and a value of 1 means identical. In contrast, the function `evaluate` returns the literal distance between two strings, with a value of 0 meaning identical. Some distances are bounded between 0 and 1, while others are unbounded.

```
compare("New York", "New York", Levenshtein())
#> 1.0
evaluate(Levenshtein(), "New York", "New York")
#> 0
```
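
To illustrate how an unbounded distance can be mapped to a score between 0 and 1, here is a self-contained sketch of the Levenshtein distance together with one possible normalization. The names `lev_distance` and `lev_similarity` are hypothetical, and dividing by the length of the longer string is an assumption made for this example, not a statement about the package's internals.

```julia
# Illustrative sketch only; `lev_distance` and `lev_similarity` are hypothetical names.

# Two-row dynamic-programming Levenshtein distance, computed over characters.
function lev_distance(s1::AbstractString, s2::AbstractString)
    a, b = collect(s1), collect(s2)
    prev = collect(0:length(b))          # distances from the empty prefix of `a`
    for i in 1:length(a)
        curr = similar(prev)
        curr[1] = i
        for j in 1:length(b)
            cost = a[i] == b[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end]
end

# Assumed normalization: one minus the distance divided by the longer length.
function lev_similarity(s1::AbstractString, s2::AbstractString)
    len = max(length(s1), length(s2))
    len == 0 && return 1.0
    return 1.0 - lev_distance(s1, s2) / len
end

lev_distance("New York", "New York")     # 0
lev_similarity("New York", "New York")   # 1.0
lev_distance("martha", "marhta")         # 2
```

Under this normalization a distance of 0 always maps to a score of 1, matching the pair of results shown above for "New York".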

- The stringdist Package for Approximate String Matching, Mark P.J. van der Loo
- fuzzywuzzy
