Removing duplicate rows from a DataFrame in Python

    for i in range(0,27950):
        for j in range(1,27950):
            a = data_sorted['title'].iloc[i].split()
            b = data_sorted['title'].iloc[j].split()
            if len(a)-len(b)<=2:
                data_sorted.drop(b)
                j=j
            else:
                j+=1
        i+=1

1 answer

User

#1 · Posted on 2024-04-20 07:15:34

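I suggest the following approach:

Build a difference matrix for the titles, where element (i, j) is the number of words that differ between the i-th and j-th titles.

Something like this: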

    import numpy as np
    from itertools import product

    # Titles as a plain Python list
    titles = list(data_sorted['title'])

    def diff_words(text_1, text_2):
        # Return the number of words that differ between two texts:
        # the longer word count minus the number of distinct shared words
        words_1 = text_1.split()
        words_2 = text_2.split()
        diff = max(len(words_1), len(words_2)) - len(np.intersect1d(words_1, words_2))
        return diff


    differences = [diff_words(i, j) for i, j in product(titles, titles)]
    # differences: a flattened (row-major) matrix of integers; the entry at
    # index i * len(titles) + j is the word difference between titles i and j
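
The answer stops at the difference values and does not show how to drop the near-duplicate rows. A minimal sketch of one way to do it follows, assuming a title counts as a duplicate when it differs by at most 2 words from a title already kept (the threshold used in the question). `keep_indices` and the sample titles are hypothetical, and a set intersection is swapped in for `np.intersect1d` so the block has no dependencies. It compares each title only against the ones already kept rather than building the full matrix, though it is still quadratic in the worst case.

```python
def diff_words(text_1, text_2):
    # Same measure as diff_words above, with a set intersection standing in
    # for np.intersect1d (both count each shared word once)
    words_1 = text_1.split()
    words_2 = text_2.split()
    return max(len(words_1), len(words_2)) - len(set(words_1) & set(words_2))


def keep_indices(titles, threshold=2):
    # Hypothetical helper: keep position i only if its title differs by more
    # than `threshold` words from every title already kept
    kept = []
    for i, title in enumerate(titles):
        if all(diff_words(title, titles[k]) > threshold for k in kept):
            kept.append(i)
    return kept


# Made-up titles, for illustration only
sample_titles = [
    "python remove duplicate rows from dataframe",
    "python remove duplicate rows from a dataframe",
    "java call a method from a service",
]
kept = keep_indices(sample_titles)
print(kept)
```

To apply this to the DataFrame, something like `data_sorted.iloc[keep_indices(list(data_sorted['title']))]` should select the surviving rows, since `iloc` accepts a list of integer positions.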
