pandas-dedupe
The Dedupe library made easy with pandas.
Installation
pip install pandas-dedupe
Video Tutorial
Basic Usage
Deduplication
import pandas as pd
import pandas_dedupe
#load dataframe
df = pd.read_csv('test_names.csv')
#initiate deduplication
df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])
#send output to csv
df_final.to_csv('deduplication_output.csv')
#------------------------------additional details------------------------------
#A training file and a settings file will be created while running Dedupe.
#Keeping these files will eliminate the need to retrain your model in the future.
#If you would like to retrain your model, just delete the settings and training files.
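As a hypothetical illustration of working with the output (assuming the deduplicated dataframe carries 'cluster id' and 'confidence' columns, as pandas-dedupe appends), you can collapse each duplicate cluster to a single representative row with plain pandas:

```python
import pandas as pd

# Mock deduplication output; the 'cluster id' and 'confidence'
# column names are assumed to match what pandas-dedupe produces.
df_final = pd.DataFrame({
    'first_name': ['John', 'Jon', 'Mary'],
    'last_name':  ['Smith', 'Smith', 'Jones'],
    'cluster id': [0, 0, 1],
    'confidence': [0.95, 0.90, 1.0],
})

# Keep the highest-confidence row from each duplicate cluster.
unique_rows = (
    df_final.sort_values('confidence', ascending=False)
            .drop_duplicates('cluster id')
)
```

Here two rows were flagged as the same entity (cluster 0), so `unique_rows` retains only two distinct records.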
Matching / Record Linkage
import pandas as pd
import pandas_dedupe
#load dataframes
dfa = pd.read_csv('file_a.csv')
dfb = pd.read_csv('file_b.csv')
#initiate matching
df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])
#send output to csv
df_final.to_csv('linkage_output.csv')
#------------------------------additional details------------------------------
#Use identical field names when linking dataframes.
#Record linkage should only be used on dataframes that have been deduplicated.
#A training file and a settings file will be created while running Dedupe.
#Keeping these files will eliminate the need to retrain your model in the future.
#If you would like to retrain your model, just delete the settings and training files.
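Since the compared fields must share names across both dataframes, a quick rename before linking keeps the call above valid. A minimal sketch, with illustrative column names:

```python
import pandas as pd

# file_b uses different column names than file_a (hypothetical names),
# so align them before passing both frames to link_dataframes.
dfb = pd.DataFrame({'fname': ['Ann'], 'surname': ['Lee']})
dfb = dfb.rename(columns={'fname': 'field_1', 'surname': 'field_2'})
```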
Advanced Usage
Canonicalize Fields
pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)
#------------------------------additional details------------------------------
#Creates a standardized version of every element by field and cluster id. For instance,
#if you had the field "first_name", and the first cluster id had 3 items, "John",
#"John", and "Johnny", the canonicalized version would have "John" listed for all
#three in a new field called "first_name - canonical".
#If you prefer to canonicalize only a few of your fields, you can set the parameter
#to a list of the fields you want a canonical version for. In the example above, you
#could have written canonicalize=['first_name', 'last_name'], and you would get
#a canonical version for first_name and last_name, but not for payment_type.
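The behaviour described above can be sketched in plain pandas. This is a simplified stand-in that picks the most frequent value per cluster; Dedupe's own canonical representative may be chosen differently:

```python
import pandas as pd

df = pd.DataFrame({
    'cluster id': [0, 0, 0, 1],
    'first_name': ['John', 'John', 'Johnny', 'Mary'],
})

# Simplified canonicalization: broadcast the most common value
# within each cluster into a new "<field> - canonical" column.
df['first_name - canonical'] = (
    df.groupby('cluster id')['first_name']
      .transform(lambda s: s.mode().iloc[0])
)
```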
Specifying Types
# Price Example
pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])
# has missing Example
pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])
# crf Example
pandas_dedupe.dedupe_dataframe(df, [('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])
#------------------------------additional details------------------------------
#If a type is not explicitly listed, String will be used.
#A tuple (parentheses) is required to declare all other types. If you prefer to use a tuple
#for String as well, ('first_name', 'String'), that's fine.
#If you want to specify either a 'crf' or 'has missing' parameter, a tuple with three elements
#must be used. ('first_name', 'String', 'crf') works, ('first_name', 'crf') does not work.
Types
Dedupe supports a variety of datatypes; a full list with documentation can be found here.
pandas-dedupe officially supports the following datatypes:
- String - Standard string comparison using a string distance metric. This is the default type.
- Text - Comparison for sentences or paragraphs of text. Uses a cosine similarity metric.
- Price - For comparing positive, non-zero numerical values.
- DateTime - For comparing dates.
- LatLong - (39.990334, 70.012) won't match (40.01, 69.98) using a string distance metric, even though the points are geographically similar. The LatLong type resolves this by calculating the haversine distance between compared coordinates. LatLong requires the field to be in the format (Lat, Lng). The value can be a string, a tuple containing two strings, a tuple containing two floats, or a tuple containing two integers. If the format is unprocessable, you will get a traceback.
- Exact - Tests whether the fields are an exact match.
Additional supported parameters are:
- has missing - Can be used if one of the data fields contains null values.
- crf - Uses conditional random fields for comparison, rather than a distance metric. May be more accurate in some cases, but runs much slower. Works with the String and ShortString types.
Credits
Many thanks to the folks at DataMade for making the Dedupe library publicly available. People interested in a code-free implementation of the Dedupe library can find a link here: Dedupe.io.