Sometimes we are faced with tabular data that has at least one text-based column, such as a name or an address. These columns almost always need to be cleaned, because they are often filled in by people and therefore carry a higher risk of errors.
This is where fuzzy string matching comes in. It is a collection of methods used to find the most suitable match between two sets of strings.
We use PolyFuzz. It supports fuzzy string matching and string grouping, and it includes extensive evaluation functions. PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework. Currently, the techniques include a variety of edit distance measures, character-based n-gram TF-IDF, word embedding methods like FastText and GloVe, and transformer embeddings.
The philosophy of PolyFuzz is to be easy to use yet highly customizable. It is a string matching tool that requires only a few lines of code but still lets you customize and create your own models.
To install PolyFuzz with the base dependencies, run pip install polyfuzz
To use transformers, install the Flair dependency with pip install polyfuzz[flair]
To install all the additional dependencies, run pip install polyfuzz[all]
First, we need to create two lists of strings, one to map from and one to map to. Here, we use two small lists.
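The original lists are not reproduced here, so the two lists below are only an illustrative assumption: a handful of short words with deliberate typos to map from, and a smaller set of clean words to map to. The names from_list and to_list follow the naming used later in the text.

```python
# Illustrative input data (an assumption, not the article's original lists).
# from_list holds the possibly misspelled strings we want to map.
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]

# to_list holds the reference strings we want to map to.
to_list = ["apple", "apples", "mouse"]
```

Any two lists of strings would work in the same way; the lists do not need to have the same length.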
Now, we compare the similarity of the strings using the Levenshtein edit distance. It is a technique for comparing strings that counts the minimum number of single-character changes (insertions, deletions, or substitutions) needed to turn one string into another.
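To make the idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the standard dynamic-programming approach. The function names are hypothetical, and the normalization to a 0 to 1 score (dividing by the longer string's length) is an assumption chosen for illustration; it is not necessarily the scoring that PolyFuzz applies internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Scale the distance to a score between 0 and 1 (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest
```

For instance, levenshtein("appl", "apple") is 1 because a single insertion is enough, which gives a normalized similarity of 0.8 under this assumed scaling.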
The resulting matches can be obtained with model.get_matches().
The Similarity column indicates how similar the strings are to each other. The score lies between 0 and 1, which makes the results easier to evaluate.
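The matching step can be sketched as a small helper. The helper name match_with_polyfuzz is hypothetical; the calls inside it (PolyFuzz("EditDistance"), match, and get_matches) are the PolyFuzz calls this walkthrough relies on, and the sketch assumes the polyfuzz package is installed.

```python
def match_with_polyfuzz(from_list, to_list):
    """Match every string in from_list to its closest string in to_list.

    Hypothetical helper: returns the table produced by model.get_matches().
    """
    # Imported inside the function so this sketch can be loaded even on a
    # machine where polyfuzz has not been installed yet.
    from polyfuzz import PolyFuzz

    model = PolyFuzz("EditDistance")  # edit-distance-based matcher
    model.match(from_list, to_list)   # compute the best match per input string
    return model.get_matches()        # table with From, To, and Similarity
```

Calling match_with_polyfuzz(["appl", "recal"], ["apple", "mouse"]) would return one row per input string, with the best candidate in the To column and its score in the Similarity column.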
Here, we calculate the similarity between the strings in to_list and use single linkage to group the strings that have a high similarity.
The output then contains an additional column, Group, which shows the group that each To match was assigned to.
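The grouping idea can be illustrated with a small standard-library sketch. This is not PolyFuzz's implementation: it uses difflib.SequenceMatcher as a stand-in similarity measure, and the function names and the 0.75 threshold are assumptions. Under single linkage, two strings end up in the same group as soon as a chain of sufficiently similar pairs connects them.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score between 0 and 1 (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def single_linkage_groups(strings, min_similarity=0.75):
    """Map each string to a group representative using single linkage."""
    # Union-find structure: every string starts as its own group.
    parent = {s: s for s in strings}

    def find(s):
        # Follow parent links to the representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    unique = list(parent)
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            # One similar pair is enough to merge two groups.
            if similarity(a, b) >= min_similarity:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    parent[root_b] = root_a
    return {s: find(s) for s in unique}


groups = single_linkage_groups(["apple", "apples", "mouse"])
```

With these three strings, "apple" and "apples" share a group because their similarity exceeds the threshold, while "mouse" stays on its own.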
The results can be shown as precision and recall. Here, precision is defined as the minimum similarity score required before a match is considered correct, and recall is the percentage of matches found at that minimum similarity score.
To create the visualization, use model.visualize_precision_recall().
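Under the definitions above, the curve can be sketched without any plotting: for each minimum similarity threshold, recall is the fraction of matches whose score reaches that threshold. The function name, the step count, and the example scores are assumptions made for illustration; this is not the library's own code.

```python
def precision_recall_curve(similarities, steps=10):
    """Return {threshold: recall} for evenly spaced thresholds in [0, 1].

    The threshold plays the role of "precision" as defined in the text:
    the minimum similarity a match needs before it is accepted.
    """
    total = len(similarities)
    curve = {}
    for i in range(steps + 1):
        threshold = i / steps
        if total == 0:
            curve[threshold] = 0.0  # no matches, so nothing can be recalled
        else:
            kept = sum(score >= threshold for score in similarities)
            curve[threshold] = kept / total
    return curve


# Hypothetical similarity scores, e.g. taken from a Similarity column.
curve = precision_recall_curve([1.0, 0.9, 0.8, 0.5])
```

Raising the threshold can only keep the same matches or drop some, so recall never increases as the required minimum similarity goes up.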
PolyFuzz has the following models
For more details click here.