How can I find similar items in a DataFrame and group them so that their values can be summed?

Posted 2024-06-16 10:51:15


I have data like this:

| Term     | Value|
| -------- | -----|
| Apple    | 100  |
| Appel    | 50   |
| Banana   | 200  |
| Banan    | 25   |
| Orange   | 140  |
| Pear     | 75   |
| Lapel    | 10   |

Currently, I am using the following code:

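import difflib

terms = df["Term"].tolist()  # assumed: `terms` holds the Term column as a list

matches = []
for term in terms:
    # Up to 5 candidates whose similarity ratio to `term` is at least 0.80
    tlist = difflib.get_close_matches(term, terms, cutoff=0.80, n=5)
    matches.append(tlist)

df["terms"] = matches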

The output looks like this:

| Term                  | Value|
| --------------------- | -----|
| [Apple, Appel]        | 100  |
| [Appel, Apple, Lapel] | 50   |
| [Banana, Banan]       | 200  |
| [Banan, Banana]       | 25   |
| [Orange]              | 140  |
| [Pear]                | 75   |
| [Lapel, Appel]        | 10   |

This code is not very useful. The output I want looks like this:

| Term     | Value|
| -------- | -----|
| Apple    | 150  |
| Banana   | 225  |
| Orange   | 140  |
| Pear     | 75   |
| Lapel    | 10   |

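The main problem is that the lists come out in different orders, and usually only one or two words overlap between them. For example, I might have

  • [Apple, Appel]
  • [Appel, Apple, Lapel]

Ideally, both of these would return "Apple", because it is the overlapping word with the highest value.

Is there a way to do this?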


1 answer
User
#1 · Posted 2024-06-16 10:51:15

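One simple way to achieve this is to use the difflib module from the Python standard library, which provides helpers for comparing sequences and computing deltas, like this: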

from difflib import SequenceMatcher

import pandas as pd

# Toy dataframe
df = pd.DataFrame(
    {
        "Term": ["Apple", "Appel", "Banana", "Banan", "Orange", "Pear", "Lapel"],
        "Value": [100, 50, 200, 25, 140, 75, 10],
    }
)

KEY_TERMS = ("Apple", "Banana", "Orange", "Pear")

for i, row in df.copy().iterrows():
    # Get the similarity ratio between this row's "Term" value and each
    # term in KEY_TERMS, and store the pairs "term: ratio" in a dict
    similarities = {
        term: SequenceMatcher(None, row["Term"], term).ratio() for term in KEY_TERMS
    }
    # Find the key term for which the similarity ratio is maximum
    # and use it to replace the original term in the dataframe
    df.loc[i, "Term"] = max(similarities, key=similarities.get)

# Group by term and sum values
df = df.groupby("Term").agg("sum").reset_index()

Then:

print(df)
# Outputs
     Term  Value
0   Apple    160
1  Banana    225
2  Orange    140
3    Pear     75
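
Note that in the output above, Lapel (10) has been folded into Apple, giving 160 rather than the 150 the question asks for, because every term is forced onto its nearest key term however weak the match. One possible adjustment, sketched here with plain (term, value) pairs instead of a DataFrame, is to leave a term unchanged when its best ratio falls below a minimum. The `MIN_RATIO` value of 0.6 and the `canonical` helper are assumptions for illustration, not part of the original answer.

```python
from difflib import SequenceMatcher

# Same toy data as the answer, as plain (term, value) pairs
rows = [
    ("Apple", 100), ("Appel", 50), ("Banana", 200), ("Banan", 25),
    ("Orange", 140), ("Pear", 75), ("Lapel", 10),
]
KEY_TERMS = ("Apple", "Banana", "Orange", "Pear")
MIN_RATIO = 0.6  # assumed threshold; tune it for your own data


def canonical(term, key_terms=KEY_TERMS, min_ratio=MIN_RATIO):
    """Return the most similar key term, or `term` itself if none is close enough."""
    best = max(key_terms, key=lambda k: SequenceMatcher(None, term, k).ratio())
    if SequenceMatcher(None, term, best).ratio() >= min_ratio:
        return best
    return term  # too dissimilar to every key term: keep it as its own group


# Sum the values per canonical term
totals = {}
for term, value in rows:
    key = canonical(term)
    totals[key] = totals.get(key, 0) + value

print(totals)
```

On this toy data, Appel and Banan should still be mapped to Apple and Banana, while Lapel stays in its own group. With a DataFrame, the same helper could probably be applied with `df["Term"] = df["Term"].map(canonical)` before the `groupby`.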

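
The question's preference for naming each group after its highest-valued term can be approximated without a hand-written key-term list. The sketch below is one greedy way to do that, not the answerer's method: terms are visited from highest to lowest value, and each one either joins the closest representative already chosen (via `difflib.get_close_matches` with the question's cutoff of 0.80) or becomes a new representative. The function name `group_by_top_term` is hypothetical.

```python
from difflib import get_close_matches

# Toy data from the question, as plain (term, value) pairs
rows = [
    ("Apple", 100), ("Appel", 50), ("Banana", 200), ("Banan", 25),
    ("Orange", 140), ("Pear", 75), ("Lapel", 10),
]
CUTOFF = 0.8  # same cutoff as in the question


def group_by_top_term(rows, cutoff=CUTOFF):
    """Greedy grouping: higher-valued terms become representatives first."""
    totals = {}  # representative term -> summed value
    for term, value in sorted(rows, key=lambda r: r[1], reverse=True):
        # Closest representative chosen so far, if any meets the cutoff
        match = get_close_matches(term, list(totals), n=1, cutoff=cutoff)
        key = match[0] if match else term
        totals[key] = totals.get(key, 0) + value
    return totals


result = group_by_top_term(rows)
print(result)
```

Because the pass is greedy, the grouping depends on the cutoff and on the value ordering, and a term is only compared against representatives, not against every member of a group. For larger or noisier vocabularies a proper clustering step may be a better fit.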