thefuzz
thefuzz is a simple Python library for fuzzy string matching based on Levenshtein distance:
Python
from thefuzz import fuzz, process
# Returns a similarity score from 0 to 100 (a score, not a probability)
fuzz.ratio("this is a test", "this is a test!")
# 97
# Partial match: scores 100 when the shorter string is contained in the longer one
fuzz.partial_ratio("this is a test", "this is a test!")
# 100
# Different word order: token_sort_ratio returns 100 as long as both strings contain the same words
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear") # -> 91
fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear") # -> 100
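To make the scores above less of a black box, here is a minimal pure-Python sketch of how such a 0-100 score can be computed. It is not thefuzz's actual implementation: the function names `indel_similarity` and `token_sort_similarity` are made up for illustration, and the sketch assumes an insertion/deletion-only variant of edit distance (derived from the longest common subsequence) rather than full Levenshtein distance with substitutions.

```python
def indel_similarity(a: str, b: str) -> int:
    """Hypothetical helper: 0-100 score based on the longest common subsequence."""
    if not a and not b:
        return 100
    # Classic dynamic programming for the LCS length, keeping one row at a time.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    # Characters outside the LCS must be inserted or deleted, so the
    # normalized similarity is 2 * lcs / (len(a) + len(b)).
    return round(100 * 2 * lcs / (len(a) + len(b)))


def token_sort_similarity(a: str, b: str) -> int:
    """Hypothetical helper: sort the words first so word order no longer matters."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return indel_similarity(normalize(a), normalize(b))


print(indel_similarity("this is a test", "this is a test!"))
print(token_sort_similarity("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
```

The point of the second helper is that sorting the tokens before comparing makes two strings with the same words in a different order identical, which is why the token-sort score reaches 100 while the plain score does not.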
The examples above all compare one string against another. To find the best-matching strings among multiple candidates, use process:
Python
from thefuzz import process
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
process.extract("new york jets", choices, limit=2)
# [('New York Jets', 100), ('New York Giants', 78)]
process.extractOne("cowboys", choices)
# ("Dallas Cowboys", 90)
This is a very typical use case: tasks such as spell checking and file path matching can be built on process.
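Conceptually, this kind of lookup amounts to scoring every candidate against the query and keeping the best ones. The sketch below illustrates that idea with the standard library only. It is an assumption-laden stand-in, not thefuzz's code: `rank_choices` and `best_match` are hypothetical names, and `difflib.SequenceMatcher` is used as the scorer, so the scores will not necessarily equal the ones thefuzz reports.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    """Stand-in scorer (assumption): case-insensitive difflib ratio scaled to 0-100."""
    return round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio())


def rank_choices(query, choices, limit=5):
    """Hypothetical helper: score every choice and return the top `limit` pairs."""
    scored = [(choice, score(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def best_match(query, choices, cutoff=0):
    """Hypothetical helper: best (choice, score) pair, or None if below `cutoff`."""
    ranked = rank_choices(query, choices, limit=1)
    if ranked and ranked[0][1] >= cutoff:
        return ranked[0]
    return None


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(rank_choices("new york jets", choices, limit=2))
# A misspelled query still resolves to the closest candidate.
print(best_match("dalas cowbys", choices, cutoff=60))
```

The cutoff is what makes this usable for something like spell checking: a query that resembles none of the candidates returns `None` instead of a misleading low-score suggestion.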