Deduplicating against a single new row

Posted 2024-04-23 17:56:06


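I'm using the dedupe Python library.

Any of its code examples will do, for example this one.

Suppose I have a trained deduper and have used it to successfully deduplicate a dataset.

Now I add one new row to the dataset.

I want to check whether this new row is a duplicate.

Is there a way to do this in dedupe without re-classifying the entire dataset?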

Update: based on the suggestion, here is my code:

import csv
import exampleIO
import dedupe

def canonicalImport(filename):
    # Read the CSV into {row_index: cleaned_row}; also return the header.
    preProcess = exampleIO.preProcess
    data_d = {}
    with open(filename) as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            # Clean every field of the row.
            clean_row = {k: preProcess(v) for k, v in row.items()}
            data_d[i] = clean_row
    return data_d, reader.fieldnames

raw_data = 'tests/datasets/restaurant-nophone-training.csv'

data_d, header = canonicalImport(raw_data)

training_pairs = dedupe.trainingDataDedupe(data_d, 'unique_id', 5000)

fields = [{'field': 'name', 'type': 'String'},
          {'field': 'name', 'type': 'Exact'},
          {'field': 'address', 'type': 'String'},
          {'field': 'cuisine', 'type': 'ShortString',
           'has missing': True},
          {'field': 'city', 'type': 'ShortString'}
          ]

deduper = dedupe.Gazetteer(fields, num_cores=5)
deduper.sample(data_d, 10000)
deduper.markPairs(training_pairs)
deduper.train(index_predicates=False)

alpha = deduper.threshold(data_d, 1)

# Hold out the first record to play the role of the new row.
data_d_test = {}
data_d_test[0] = data_d[0]
del data_d[0]

clustered_dupes = deduper.match(data_d, threshold=alpha)
clustered_dupes2 = deduper.match(data_d_test, threshold=alpha)  # <- exception here


1 answer

Answer #1 · Posted 2024-04-23 17:56:06

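You can check a new row against the existing [...].
However, if you have already deduplicated the dataset, you can use [...] to add more unique data and then call [...] again.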
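
The pattern the answer describes (keep an index of the already-deduplicated records, match only the new row against it, and add the row to the index if nothing matches) can be sketched with the standard library alone. This is not the dedupe API: `TinyGazetteer`, `add_or_match`, the three-character blocking rule, the 0.8 threshold, and the sample rows are all hypothetical and chosen only for illustration. A trained dedupe model would learn its blocking rules and scoring from labelled pairs instead.

```python
from difflib import SequenceMatcher


class TinyGazetteer:
    """Toy index of canonical records. NOT the dedupe library's Gazetteer."""

    def __init__(self, fields, threshold=0.8):
        self.fields = fields          # fields compared when scoring a pair
        self.threshold = threshold    # assumed cut-off, not a learned value
        self.records = {}             # record_id -> record
        self.blocks = {}              # block key -> set of record ids

    def _block_key(self, record):
        # Crude blocking rule (an assumption): first 3 characters of the first field.
        return record[self.fields[0]][:3].lower()

    def index(self, data):
        """Add canonical records without touching those already indexed."""
        for record_id, record in data.items():
            self.records[record_id] = record
            self.blocks.setdefault(self._block_key(record), set()).add(record_id)

    def _score(self, a, b):
        # Mean string similarity across the compared fields.
        ratios = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
                  for f in self.fields]
        return sum(ratios) / len(ratios)

    def match(self, record):
        """Return (record_id, score) of the best match at or above the threshold, else None."""
        best = None
        # Only records sharing the block key are compared, not the whole dataset.
        for record_id in self.blocks.get(self._block_key(record), ()):
            score = self._score(record, self.records[record_id])
            if score >= self.threshold and (best is None or score > best[1]):
                best = (record_id, score)
        return best


def add_or_match(gazetteer, new_id, new_row):
    """Check one new row; index it only when it is not a duplicate."""
    found = gazetteer.match(new_row)
    if found is None:
        gazetteer.index({new_id: new_row})
    return found


# Made-up rows standing in for an already-deduplicated dataset.
canonical = {
    1: {"name": "arnie morton's of chicago", "address": "435 s. la cienega blvd."},
    2: {"name": "art's delicatessen", "address": "12224 ventura blvd."},
}
gaz = TinyGazetteer(fields=["name", "address"])
gaz.index(canonical)

dup = add_or_match(gaz, 3, {"name": "arnie morton's of chicago",
                            "address": "435 s. la cienega blv."})
new = add_or_match(gaz, 4, {"name": "campanile", "address": "624 s. la brea ave."})
print(dup)
print(new)
```

Because the new row is only compared with records in its own block, the cost of the check depends on the block size rather than on the size of the full dataset, which is the property the question asks for.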
