Replace items in a column with the most frequent similar item in that column

1 answer

User

Answer 1 · Posted 2024-06-08 02:42:49

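Here is another attempt:

This solution has 5 steps:

Use pd.Series.value_counts().reset_index() to get only the unique titles, sorted by descending frequency
Compute the distances between these unique titles using the Levenshtein distance metric
Use a threshold on the Levenshtein distance to find the indices of the words closest to each word
Merge duplicated nodes to avoid repetition (i.e., if IDs 1, 2, and 5 are duplicates, we only need one entry for them rather than [1, 2, 5], [2, 1, 5], and [5, 1, 2])
Finally, consolidate the information from the df.title.value_counts() Series into a dictionary and use it to replace the values in the original DataFrame

Code based on the previously shared CSV file: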

# Load required libraries
import pandas as pd
import numpy as np
import Levenshtein
from collections import defaultdict

Step 1: Load the data (it is already in the format produced by value_counts())

df = pd.read_csv("https://raw.githubusercontent.com/skwolvie/jobprofile_sample/main/sample_jobprofiles.csv", 
        index_col=False)
df.columns = ['title', "frequency"]

Step 2: Compute the distances

def levenshtein_matrix(titles):
    """
    Fill a matrix with the Levenshtein ratio between each pair of titles.

    Since Levenshtein.ratio(w1, w2) == Levenshtein.ratio(w2, w1), the inner
    loop starts at the current row index, so each pair is computed only once
    and mirrored across the diagonal. Note that the ratio is a similarity
    score in [0, 1] (1.0 means identical), not a raw distance.
    """
    size = len(titles)
    final = np.zeros((size, size))
    for i, w1 in enumerate(titles):
        for j, w2 in enumerate(titles[i:], i):
            lev = Levenshtein.ratio(w1, w2)
            final[i, j] = lev
            final[j, i] = lev

    return final

titles = df.title

lev_matrix = levenshtein_matrix(titles)  # takes about 30 seconds on my machine with 7k+ items
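To make the metric behind Step 2 more concrete, here is a minimal pure-Python sketch of the classic edit distance. The names `levenshtein_distance` and `normalized_similarity` are hypothetical helpers written for illustration; they are not part of the `Levenshtein` package, and the normalization used here is only one simple choice, so the scores may differ from what `Levenshtein.ratio` returns.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] is the distance between the already processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("software engineer", "software engineer ii"))  # 3
```

Under this simple normalization the pair above scores 1 - 3/20 = 0.85, which shows how sensitive the grouping is to the threshold chosen in Step 3.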

Step 3: Loop over each row of lev_matrix to find the IDs of similar entries

# Create function
def get_similar_nodes(distance_matrix, threshold=.9):
    """
    Takes a matrix of similarity scores and returns a generator that yields,
    for each row, the column indices whose score is higher than threshold.
    """
    # Iterate over the argument rather than the global lev_matrix
    for row in distance_matrix:
        yield np.where(row > threshold)[0].tolist()

similar_nodes = get_similar_nodes(lev_matrix)
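As a small illustration of what Step 3 produces, the sketch below applies the same thresholding to a toy 3x3 similarity matrix using plain lists instead of NumPy. The function name `similar_indices` and the matrix values are made up for this example.

```python
def similar_indices(similarity_matrix, threshold=0.9):
    """Yield, for each row, the column indices whose score exceeds the threshold."""
    for row in similarity_matrix:
        yield [j for j, score in enumerate(row) if score > threshold]


# Toy symmetric matrix: items 0 and 1 are near-duplicates, item 2 stands alone
toy_matrix = [
    [1.00, 0.95, 0.20],
    [0.95, 1.00, 0.30],
    [0.20, 0.30, 1.00],
]

groups = list(similar_indices(toy_matrix))
print(groups)  # [[0, 1], [0, 1], [2]]
```

Every row contains its own index because an item is always fully similar to itself, and the group [0, 1] shows up once per member. That repetition is what the merging step that follows removes.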

Step 4: Merge all lists that share at least one item into a single list

def connected_components(lists):
    """
    Yield one sorted list per connected component, where two of the given
    lists belong to the same component if they share at least one item.
    """
    neighbors = defaultdict(set)
    seen = set()
    for each in lists:
        for item in each:
            neighbors[item].update(each)
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        nodes = set([node])
        next_node = nodes.pop
        while nodes:
            node = next_node()
            see(node)
            nodes |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield sorted(component(node))

connected_nodes = list(connected_components(similar_nodes))
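The merging behaviour can be checked on a toy input. The sketch below is a self-contained variant written for illustration (the name `merge_overlapping` is hypothetical, and it uses an explicit stack rather than the nested generator above), but it groups lists by the same rule: lists that share an item end up in one component.

```python
from collections import defaultdict


def merge_overlapping(lists):
    """Merge lists that share at least one item into sorted components."""
    neighbors = defaultdict(set)
    for group in lists:
        for item in group:
            neighbors[item].update(group)

    seen = set()
    merged = []
    for start in neighbors:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbors[node] - seen)
        merged.append(sorted(component))
    return merged


print(merge_overlapping([[0, 1], [1, 2], [3], [4, 5], [5, 4]]))
# [[0, 1, 2], [3], [4, 5]]
```

Note that 0 and 2 land in the same group even though they never appear in the same input list. Merging is transitive, so with a loose threshold a chain of pairwise-similar titles can pull fairly different titles into one group.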

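To update the values, you need to create a dictionary that maps every title to the most frequent title in its group, and then apply it to the DataFrame

Note that using nodes[0] as the most frequent title in each group works because the DataFrame is sorted in descending order of frequency, since it was created with .value_counts()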

# Copy the DataFrame for comparison
df_test = df.copy()

dict_most_popular_names = {}
for nodes in connected_nodes:
    # update() behaves like |= here but also works on Python versions before 3.9
    dict_most_popular_names.update({key: titles[nodes[0]] for key in titles[nodes]})

# Check the dictionary
titles[connected_nodes[0]][:3]
# >>> 0         'software engineer'
# >>> 20     'software qa engineer'
# >>> 23     'software engineer ii'
# >>> Name: title, dtype: object

dict_most_popular_names["software engineer qa"]
# >>> 'software engineer'
dict_most_popular_names["software engineer"]
# >>> 'software engineer'
dict_most_popular_names["software engineer ii"]
# >>> 'software engineer'

# Update the dataframe
df_test["clean_title"] = [dict_most_popular_names[x] for x in titles]
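To see why nodes[0] picks the most frequent title, here is a toy sketch without pandas. The titles, frequencies and groups are invented for illustration; the only assumption carried over from the answer is that titles are listed in descending order of frequency and that each group is a sorted list of positions.

```python
# Invented sample data, already sorted by descending frequency
titles = [
    "software engineer",
    "data analyst",
    "software engineer ii",
    "data analyst i",
    "software enginer",
]
frequency = [50, 30, 8, 4, 1]

# Invented groups of positions, each sorted in ascending order
groups = [[0, 2, 4], [1, 3]]

mapping = {}
for nodes in groups:
    # The smallest position in a group is its most frequent title
    canonical = titles[nodes[0]]
    for idx in nodes:
        mapping[titles[idx]] = canonical

raw = ["software enginer", "data analyst i", "software engineer", "unknown"]
# Titles missing from the mapping are left unchanged
cleaned = [mapping.get(title, title) for title in raw]
print(cleaned)
# ['software engineer', 'data analyst', 'software engineer', 'unknown']
```

If the titles were not sorted by frequency, nodes[0] would just be an arbitrary member of the group, and the canonical title would have to be chosen by comparing frequencies explicitly.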

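You can also use dict_most_popular_names to replace the data in the original DataFrame

For me, running the whole script takes about 30 seconds, which is essentially the time spent computing the Levenshtein distances. If you need further optimization, you will need to look at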
