pandas-dedupe
The Dedupe library made easy with pandas.
Installation
pip install pandas-dedupe
Video Tutorial
Basic Usage
Deduplication
import pandas as pd
import pandas_dedupe
#load dataframe
df = pd.read_csv('test_names.csv')
#initiate deduplication
df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])
#send output to csv
df_final.to_csv('deduplication_output.csv')
#------------------------------additional details------------------------------
#A training file and a settings file will be created while running Dedupe.
#Keeping these files will eliminate the need to retrain your model in the future.
#If you would like to retrain your model, just delete the settings and training files.
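As a hypothetical illustration of working with the output (assuming the deduplicated dataframe carries 'cluster id' and 'confidence' columns, as pandas-dedupe appends), you can collapse each duplicate cluster to a single representative row with plain pandas:

```python
import pandas as pd

# Mock deduplication output; the 'cluster id' and 'confidence'
# column names are assumed to match what pandas-dedupe produces.
df_final = pd.DataFrame({
    'first_name': ['John', 'Jon', 'Mary'],
    'last_name':  ['Smith', 'Smith', 'Jones'],
    'cluster id': [0, 0, 1],
    'confidence': [0.95, 0.90, 1.0],
})

# Keep the highest-confidence row from each duplicate cluster.
unique_rows = (
    df_final.sort_values('confidence', ascending=False)
            .drop_duplicates('cluster id')
)
```

Here two rows were flagged as the same entity (cluster 0), so `unique_rows` retains only two distinct records.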
Matching / Record Linkage
import pandas as pd
import pandas_dedupe
#load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')
#initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
#send output to csv
df_final.to_csv('linkage_output.csv')
#------------------------------additional details------------------------------
#Use identical field names when linking dataframes.
#Record linkage should only be used on dataframes that have been deduplicated.
#A training file and a settings file will be created while running Dedupe.
#Keeping these files will eliminate the need to retrain your model in the future.
#If you would like to retrain your model, just delete the settings and training files.
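Since the compared fields must share names across both dataframes, a quick rename before linking keeps the call above valid. A minimal sketch, with illustrative column names:

```python
import pandas as pd

# file_b uses different column names than file_a (hypothetical names),
# so align them before passing both frames to link_dataframes.
dfb = pd.DataFrame({'fname': ['Ann'], 'surname': ['Lee']})
dfb = dfb.rename(columns={'fname': 'field_1', 'surname': 'field_2'})
```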
Advanced Usage
Canonicalize Fields
pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)
#------------------------------additional details------------------------------
#Creates a standardized version of every element by field and cluster id. For instance,
#if you had the field "first_name", and the first cluster id had 3 items, "John",
#"John", and "Johnny", the canonicalized version would have "John" listed for all
#three in a new field called "first_name - canonical".
#If you prefer to canonicalize only a few of your fields, you can set the parameter
#to a list of the fields you want a canonical version for. In the example above, you
#could have written canonicalize=['first_name', 'last_name'], and you would get
#a canonical version for first_name and last_name, but not for payment_type.
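The behaviour described above can be sketched in plain pandas. This is a simplified stand-in that picks the most frequent value per cluster; Dedupe's own canonical representative may be chosen differently:

```python
import pandas as pd

df = pd.DataFrame({
    'cluster id': [0, 0, 0, 1],
    'first_name': ['John', 'John', 'Johnny', 'Mary'],
})

# Simplified canonicalization: broadcast the most common value
# within each cluster into a new "<field> - canonical" column.
df['first_name - canonical'] = (
    df.groupby('cluster id')['first_name']
      .transform(lambda s: s.mode().iloc[0])
)
```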
Specifying Types
# Price Example
pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])
# has missing Example
pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])
# crf Example
pandas_dedupe.dedupe_dataframe(df, [('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
#------------------------------additional details------------------------------
#If a type is not explicitly listed, String will be used.
#A tuple (parentheses) is required to declare all other types. If you prefer to use a tuple
#for String as well, ('first_name', 'String'), that's fine.
#If you want to specify either a 'crf' or 'has missing' parameter, a tuple with three elements
#must be used. ('first_name', 'String', 'crf') works, ('first_name', 'crf') does not work.
Types
Dedupe supports a variety of datatypes; a full list with documentation can be found here.
pandas-dedupe officially supports the following datatypes:
- String - Standard string comparison using a string distance metric. This is the default type.
- Text - Comparison for sentences or paragraphs of text. Uses a cosine similarity metric.
- Price - For comparing positive, non-zero numerical values.
- DateTime - For comparing dates.
- LatLong - (39.990334, 70.012) won't match (40.01, 69.98) using a string distance metric, even though the points are geographically similar. The LatLong type resolves this by calculating the haversine distance between compared coordinates. LatLong requires the field to be in the format (Lat, Lng). The value can be a string, a tuple containing two strings, a tuple containing two floats, or a tuple containing two integers. If the format is unprocessable, you will get a traceback.
- Exact - Tests whether the fields are an exact match.
Additional supported parameters are:
- has missing - Can be used if one of the data fields contains null values.
- crf - Uses conditional random fields for comparison, rather than a distance metric. May be more accurate in some cases, but runs much slower. Works with the String and ShortString types.
Credits
Many thanks to the folks at DataMade for making the Dedupe library publicly available. People interested in a code-free implementation of the Dedupe library can find a link here: Dedupe.io.