How can I find similar terms in a dataframe and group them so that their values can be summed?

| Term                  | Value |
| --------------------- | ----- |
| [Apple, Appel]        | 100   |
| [Appel, Apple, Lapel] | 50    |
| [Banana, Banan]       | 200   |
| [Banan, Banana]       | 25    |
| [Orange]              | 140   |
| [Pear]                | 75    |
| [Lapel, Appel]        | 10    |

1 answer

User

#1 · Posted 2024-06-16 10:51:15

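A simple way to do this is to use the difflib module from the Python standard library, which provides helpers for computing deltas between sequences, as follows: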

from difflib import SequenceMatcher

import pandas as pd

# Toy dataframe
df = pd.DataFrame(
    {
        "Term": ["Apple", "Appel", "Banana", "Banan", "Orange", "Pear", "Lapel"],
        "Value": [100, 50, 200, 25, 140, 75, 10],
    }
)

KEY_TERMS = ("Apple", "Banana", "Orange", "Pear")

for i, row in df.copy().iterrows():
    # Get the similarity ratio between the value in the df "Term" column
    # (row["Term"]) and each term from KEY_TERMS, and store the pairs
    # "term: ratio" in a dict
    similarities = {
        term: SequenceMatcher(None, row["Term"], term).ratio() for term in KEY_TERMS
    }
    # Find the key term for which the similarity ratio is maximum
    # and use it to replace the original term in the dataframe
    df.loc[i, "Term"] = max(similarities, key=similarities.get)

# Group by term and sum values
df = df.groupby("Term").agg("sum").reset_index()

Then:

print(df)
# Outputs
     Term  Value
0   Apple    160
1  Banana    225
2  Orange    140
3    Pear     75
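
Note that the loop above always assigns every term to its best-scoring key term, however weak the match: "Lapel" scores only 0.4 against "Apple", yet its value of 10 is still added to the Apple total. If that is not what you want, one possible variation is to apply a minimum similarity with `difflib.get_close_matches`. The sketch below is only an illustration: the helper name `closest_key_term` and the 0.6 cutoff are assumptions, not part of the original answer, and the cutoff should be tuned to your data.

```python
from difflib import get_close_matches

import pandas as pd

KEY_TERMS = ("Apple", "Banana", "Orange", "Pear")


def closest_key_term(term, key_terms=KEY_TERMS, cutoff=0.6):
    """Hypothetical helper: return the best-matching key term,
    or the term unchanged if no key term reaches the cutoff."""
    matches = get_close_matches(term, key_terms, n=1, cutoff=cutoff)
    return matches[0] if matches else term


# Same toy dataframe as in the answer
df = pd.DataFrame(
    {
        "Term": ["Apple", "Appel", "Banana", "Banan", "Orange", "Pear", "Lapel"],
        "Value": [100, 50, 200, 25, 140, 75, 10],
    }
)

# Map each term to its key term (or keep it as-is), then group and sum
df["Term"] = df["Term"].map(closest_key_term)
grouped = df.groupby("Term", as_index=False)["Value"].sum()
print(grouped)
```

With this cutoff, "Appel" and "Banan" are still folded into "Apple" and "Banana", while "Lapel" stays in its own group, so the Apple total becomes 150 rather than 160.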
