Fuzzy String Matching

Posted By: Pradeep Farthyal | 8th December 2020

Sometimes we are faced with tabular data that has at least one text-based column, such as a name or an address. Such columns almost always need cleaning, because they are often filled in by people and are therefore at higher risk of errors.

This is where fuzzy string matching comes in. It is a collection of methods used to find the most suitable match between two sets of strings.
 

We use PolyFuzz. It performs fuzzy string matching and string grouping, and includes extensive evaluation functions. PolyFuzz is meant to bring fuzzy string matching techniques together within a single framework. Currently, the techniques include a variety of edit distance measures, character-based n-gram TF-IDF, word embedding methods like FastText and GloVe, and transformer embeddings.

The philosophy of PolyFuzz is to be easy to use yet highly customizable. It is a string matching tool that requires only a few lines of code but also lets you customize and create your own models.

 

Install PolyFuzz (install the base dependencies)

  • pip install polyfuzz

To use Transformers, install the Flair dependency

  • pip install polyfuzz[flair]

Install all the additional dependencies

  • pip install polyfuzz[all]

 

First, we need to create two lists of strings, one to map from and one to map to. Here, we use two small lists.
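As an illustration, the two lists could look like the sketch below. The values are assumed sample data chosen for this example, not necessarily the lists behind the results discussed in this post.

```python
# Assumed sample data: illustrative values only, not necessarily the
# lists behind the results discussed in this post.
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]  # strings to map from
to_list = ["apple", "apples", "mouse"]  # candidate strings to map to

# Every entry in from_list will be compared against every entry in to_list.
print(f"{len(from_list)} strings to map onto {len(to_list)} candidates")
```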

 

 

 

Now, we compare the similarity of the strings using Levenshtein edit distance, a technique for comparing strings that counts the number of single-character changes needed to turn one string into another.
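To make the idea concrete, here is a minimal pure-Python sketch of Levenshtein distance. The `similarity` helper is our own assumption: it rescales the distance by the length of the longer string to get a value between 0 and 1, which is one common normalization and may differ from the score PolyFuzz reports.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 when
    every character has to change."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein(a, b) / longest


for left, right in [("appl", "apple"), ("house", "mouse"), ("recal", "apples")]:
    print(left, right, levenshtein(left, right), round(similarity(left, right), 3))
```

Only two rows of the distance table are kept in memory at a time, which is enough because each cell depends on the current and previous row alone.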

The final matches are obtained with model.get_matches().

The similarity column indicates how similar the strings are to each other. Because the score lies between 0 and 1, the results are easier to evaluate.
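Put together, the matching step can be sketched as follows. This assumes the polyfuzz package is installed; the list values are assumed sample data, and `match_edit_distance` is a hypothetical helper name used only for this example.

```python
# Assumed sample data, for illustration only.
from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]


def match_edit_distance(from_strings, to_strings):
    """Hypothetical helper: match each string in from_strings to its most
    similar string in to_strings and return PolyFuzz's match table."""
    from polyfuzz import PolyFuzz  # third-party: pip install polyfuzz

    model = PolyFuzz("EditDistance")  # edit-distance based matcher
    model.match(from_strings, to_strings)
    return model.get_matches()  # table with the matched pairs and their similarity


if __name__ == "__main__":
    try:
        print(match_edit_distance(from_list, to_list))
    except ImportError:
        print("polyfuzz is not installed; run `pip install polyfuzz` first")
```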

 

Group Matches

 

Here we calculate the similarity between the strings in to_list and use single linkage to group the strings that have a high similarity.

The result has an additional column, Group, that shows the group each To match was assigned to.
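The grouping idea can be sketched in plain Python. With single linkage, two strings end up in the same group as soon as any chain of sufficiently similar pairs connects them. The sketch below is not PolyFuzz's implementation: the similarity is a normalized Levenshtein score of our own choosing, the 0.75 threshold is an assumed value, and the group label is simply whichever member ends up as the root. In PolyFuzz itself, the corresponding call is, as far as we know, model.group(link_min_similarity=0.75).

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization of the edit distance to the range 0..1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def group_single_linkage(strings, min_similarity=0.75):
    """Map each string to a group label using single linkage."""
    parent = {s: s for s in strings}  # union-find forest, one tree per group

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    for a, b in combinations(strings, 2):
        if similarity(a, b) >= min_similarity:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups
    return {s: find(s) for s in strings}


to_list = ["apple", "apples", "mouse"]  # assumed sample data
groups = group_single_linkage(to_list)
for string, group in groups.items():
    print(f"{string!r} -> group {group!r}")
```

A lower threshold merges more strings into fewer groups, while a higher one keeps near-duplicates apart, so the value is worth tuning on real data.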


 

 

Precision-Recall Curve

 

Here we express the results as precision and recall. Precision is defined as the minimum similarity score above which a match is considered correct, and recall as the percentage of matches found at that minimum similarity score.
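Under the definitions above, the points of such a curve can be computed directly from the similarity scores. The sketch below is our own illustration, not PolyFuzz's code: `curve_points` is a hypothetical helper, and the scores and thresholds are made-up values.

```python
def curve_points(similarities, thresholds):
    """For each minimum similarity score, return the share of matches
    whose similarity reaches that score."""
    total = len(similarities)
    points = []
    for minimum in thresholds:
        found = sum(1 for score in similarities if score >= minimum)
        share = found / total if total else 0.0
        points.append((minimum, share))
    return points


# Made-up similarity scores, one per matched pair.
scores = [1.0, 1.0, 0.8, 0.6, 0.8, 0.2]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

for minimum, share in curve_points(scores, thresholds):
    print(f"minimum similarity {minimum:.2f}: {share:.0%} of matches found")
```

Raising the minimum similarity keeps only the matches we trust most, at the cost of finding fewer of them, which is the trade-off the curve makes visible.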

To create the visualization, use model.visualize_precision_recall().

 

 

Models

 

PolyFuzz has the following models:

  1. RapidFuzz
  2. EditDistance (use any distance measure)
  3. TF-IDF
  4. FastText and GloVe
  5. Transformers

For more details, see the PolyFuzz documentation.
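To give a feel for how the TF-IDF model differs from edit distance, here is a minimal sketch of character n-gram TF-IDF with cosine similarity. It is a simplified stand-in written for this post, not PolyFuzz's implementation: the trigram size, the smoothed IDF formula, and all function names are assumptions.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3):
    """Split a string into overlapping character n-grams."""
    if len(text) < n:
        return [text]  # short strings become a single token
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n: int = 3):
    """Build one sparse TF-IDF vector (a dict) per string."""
    documents = [Counter(char_ngrams(s, n)) for s in strings]
    document_frequency = Counter(gram for doc in documents for gram in doc)
    total = len(documents)
    vectors = []
    for doc in documents:
        vectors.append({
            # term frequency times a smoothed inverse document frequency
            gram: count * (math.log((1 + total) / (1 + document_frequency[gram])) + 1)
            for gram, count in doc.items()
        })
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


strings = ["apple", "apples", "mouse"]  # assumed sample data
vectors = tfidf_vectors(strings)
print("apple vs apples:", round(cosine(vectors[0], vectors[1]), 3))
print("apple vs mouse:", round(cosine(vectors[0], vectors[2]), 3))
```

Because the vectors are built from shared n-grams rather than aligned characters, strings with no trigram in common score exactly zero, even if an edit distance would rate them as somewhat similar.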

 


About Author

Pradeep Farthyal

He is a hard-working, quick learner, and a result-oriented employee who is always ready to learn new things.
