Wrong values in text comparison

     ID          Text                                                  Sim
13   fsad        amazing ...                                           fsd
14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...    fdsfdgte3e
18   gsd         wonderful                                             fast
21   dfsfs       i love this its incredible ...                        reds
23   gwe         wonderful end ever seen you ...                       add
...  ...         ...                                                   ...
261  add         wonderful                                             gwe
261  add         wonderful                                             gsd
261  add         wonderful                                             fdsdf
267  fdsfdgte3e  best match ever its a masterpiece                     fdsdf
277  hgdfgre     terrible destroys everything ...                      tm28

from fuzzywuzzy import fuzz

def sim (nm, df):
    # this function finds matches between texts based on a threshold, which is 100.
    # The logic is fuzzywuzzy, specifically partial ratio.
    # The output should be IDs whether texts match, based on the threshold.
    matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]

df['L_Text']=df['Text'].str.lower()
df['Sim']=df.apply(lambda row: sim(row['L_Text'], df), axis=1)
df=df.assign(
    Sim = df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
)

def tr (row):
    # this function assign a similarity score for each text applying partial_ratio similarity
    return (df.loc[:row.name-1, 'L_Text']
              .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))

t = (df.loc[1:].apply(tr, axis=1)
       .reindex(index=df.index, columns=df.index)
       .fillna(0)
       .add_prefix('txt')
    )
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))

     ID          Text                                                  Sim
13   fsad        amazing ...
14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...
18   gsd         wonderful                                             add
21   dfsfs       i love this its incredible ...
23   gwe         wonderful end ever seen you ...
...  ...         ...                                                   ...
261  add         wonderful                                             gsd
261  add         wonderful                                             gsd
261  add         wonderful                                             gsd
267  fdsfdgte3e  best match ever its a masterpiece
277  hgdfgre     terrible destroys everything ...

1 Answer

User

#1 · Posted 2024-05-19 01:46:36

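Initial assumptions

First of all, since your question is not 100% clear to me, I am assuming that you want a pairwise comparison of all rows and that, whenever the match score reaches the threshold of 100, you want to add the key of the matching row. Please correct me if that is not the case.

Syntax issues

The code above has several problems. First, if you simply copy and paste it, it cannot run at all because it is not syntactically valid. The sim() function should look like this: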

def sim (nm, df): 
    matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]

Note df instead of dataset, and == instead of =. I also removed the redundant parentheses for better readability.
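As a quick way to check this kind of problem without running anything against real data, the sketch below uses only the standard library to test whether a line of code parses. The helper name parses is made up for illustration. Note that it only catches the = versus == problem; the dataset name would surface later, as a NameError at run time.

```python
import ast


def parses(source):
    """Return True if `source` is syntactically valid Python (hypothetical helper)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# The line from the question: "=" inside an expression is not allowed.
broken = "matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)"
# The corrected line: "==" is a comparison, so the lambda body is a valid expression.
fixed = "matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)"

print(parses(broken))  # False
print(parses(fixed))   # True
```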

Semantic issues

If I then run your code and print t (which does not appear to be the final result), I get the following:

   txt0  txt1   txt2  txt3   txt4   txt5   txt6   txt7  txt8  txt9
0   1.0  27.0   12.0  45.0   45.0   12.0   12.0   12.0  27.0  64.0
1  27.0   1.0   33.0  33.0   42.0   33.0   33.0   33.0  52.0  44.0
2  12.0  33.0    1.0  22.0  100.0  100.0  100.0  100.0  22.0  33.0
3  45.0  33.0   22.0   1.0   41.0   22.0   22.0   22.0  40.0  30.0
4  45.0  42.0  100.0  41.0    1.0  100.0  100.0  100.0  35.0  47.0
5  12.0  33.0  100.0  22.0  100.0    1.0  100.0  100.0  22.0  33.0
6  12.0  33.0  100.0  22.0  100.0  100.0    1.0  100.0  22.0  33.0
7  12.0  33.0  100.0  22.0  100.0  100.0  100.0    1.0  22.0  33.0
8  27.0  52.0   22.0  40.0   35.0   22.0   22.0   22.0   1.0  34.0
9  64.0  44.0   33.0  30.0   47.0   33.0   33.0   33.0  34.0   1.0

This looks correct to me, because fuzz.partial_ratio("wonderful end ever seen you", "wonderful") returns 100 (a partial match already counts as a score of 100). For consistency, you could change

t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))

to

t += t.to_numpy().T + np.diag(np.ones(t.shape[0])) * 100

because every element should match itself exactly. So when you say

But my output says that add matches with gwe and this is not true.

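this is in fact expected: partial_ratio scores the text of add ("wonderful") against the text of gwe ("wonderful end ever seen you ...") as 100, so you may want to consider a stricter scoring function. It is also possible that something goes wrong when t is converted into the new Sim column, but the example you provided does not seem to include that code.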
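The difference between partial and whole-string scoring can be illustrated with the standard library alone. The sketch below is not fuzzywuzzy's implementation; full_ratio and windowed_partial_ratio are hypothetical helpers built on difflib.SequenceMatcher, and the windowed version is a simplification of the idea behind partial matching (slide the shorter string across the longer one and keep the best score).

```python
from difflib import SequenceMatcher


def full_ratio(a, b):
    """Whole-string similarity on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def windowed_partial_ratio(a, b):
    """Best score of the shorter string against any equally long slice of the longer one.

    Simplified stand-in for partial matching, not the library's algorithm.
    """
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    best = 0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, full_ratio(shorter, window))
    return best


short_text = "wonderful"
long_text = "wonderful end ever seen you"

print(windowed_partial_ratio(short_text, long_text))  # 100: the short text is contained in the long one
print(full_ratio(short_text, long_text))              # 50: the extra words now lower the score
print(full_ratio(short_text, short_text))             # 100: identical texts still match
```

With a whole-string scorer and a threshold of 100, only rows whose texts are identical would match. fuzzywuzzy's fuzz.ratio compares whole strings in this sense, so it is one candidate to try in place of fuzz.partial_ratio.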

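Alternative implementation

In addition, as some of the comments suggested, it sometimes helps to refactor your code so that it is easier for people to help you. Here is an example of what that could look like: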

import re

import pandas as pd
from fuzzywuzzy import fuzz

data = """
13   fsad        amazing ...                                           fsd
14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...    fdsfdgte3e
18   gsd         wonderful                                             fast 
21   dfsfs       i love this its incredible ...                        reds
23   gwe         wonderful end ever seen you ...                       add
261  add         wonderful                                             gwe
261  add         wonderful                                             gsd
261  add         wonderful                                             fdsdf
267  fdsfdgte3e  best match ever its a masterpiece                     fdsdf
277  hgdfgre     terrible destroys everything ...                      tm28
"""

rows = data.strip().split('\n')
records = [[element for element in re.split(r' {2,}', row) if element != ''] for row in rows]

df = pd.DataFrame.from_records(records, columns=['RowNumber', 'ID', 'Text', 'IncorrectSim'], index='RowNumber')
df = df.drop('IncorrectSim', axis=1)
df = df.drop_duplicates(subset=["ID", "Text"])  # Assuming that there is no point in keeping duplicate rows
df = df.set_index('ID')  # Assuming that the "ID" column holds a unique ID

comparison_df = df.copy()
comparison_df['Text'] = comparison_df["Text"].str.lower()
comparison_df['Tmp'] = 1
# This gives us all possible row combinations
comparison_df = comparison_df.reset_index().merge(comparison_df.reset_index(), on='Tmp').drop('Tmp', axis=1)
comparison_df = comparison_df[comparison_df['ID_x'] != comparison_df['ID_y']]  # We only want rows that do not match itself
comparison_df['matchScore'] = comparison_df.apply(lambda row: fuzz.partial_ratio(row['Text_x'], row['Text_y']), axis=1)
comparison_df = comparison_df[comparison_df['matchScore'] == 100]  # only keep perfect matches
comparison_df = comparison_df[['ID_x', 'ID_y']].rename(columns={'ID_x': 'ID', 'ID_y': 'Sim'}).set_index('ID')  # Cleanup

result = df.join(comparison_df, how='left').fillna('')
print(result.to_string())

This gives:

                                                         Text  Sim
ID                                                                
add                                                 wonderful  gsd
add                                                 wonderful  gwe
dfsfs                          i love this its incredible ...     
fdsdf       best sport everand the gane of the year❤️❤️❤️❤...     
fdsfdgte3e                  best match ever its a masterpiece     
fsad                                              amazing ...     
gsd                                                 wonderful  gwe
gsd                                                 wonderful  add
gwe                           wonderful end ever seen you ...  gsd
gwe                           wonderful end ever seen you ...  add
hgdfgre                      terrible destroys everything ...
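The core of the approach above (compare every row with every other row, then keep the pairs that reach the threshold) can also be sketched without pandas. This is only an illustration under stated assumptions: find_matches and full_ratio are hypothetical names, the scorer is pluggable, and the demo uses a whole-string difflib scorer as a stand-in rather than fuzzywuzzy.

```python
from difflib import SequenceMatcher
from itertools import permutations


def full_ratio(a, b):
    """Whole-string similarity on a 0-100 scale (stand-in scorer, not fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def find_matches(texts, scorer, threshold=100):
    """Map each ID to the IDs of all other rows whose score reaches the threshold."""
    matches = {row_id: [] for row_id in texts}
    # permutations(..., 2) yields every ordered pair of distinct rows,
    # i.e. the cross join with the self-pairs already removed.
    for (id_a, text_a), (id_b, text_b) in permutations(texts.items(), 2):
        if scorer(text_a.lower(), text_b.lower()) >= threshold:
            matches[id_a].append(id_b)
    return matches


# A few rows from the question, keyed by ID (duplicates already dropped).
texts = {
    "fsad": "amazing ...",
    "gsd": "wonderful",
    "gwe": "wonderful end ever seen you ...",
    "add": "wonderful",
}

result = find_matches(texts, full_ratio)
print(result)
# {'fsad': [], 'gsd': ['add'], 'gwe': [], 'add': ['gsd']}
```

Passing fuzz.partial_ratio as the scorer instead should reproduce the partial-match behaviour of the pandas version, where gwe also pairs with gsd and add.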
