Remove words from each row of a DataFrame column using a list of words from a column of another DataFrame

+----------+--------------------+
|  event_dt|           cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02| i need a line hold |
|2020-09-02| i have the 60 packs|
|2020-09-02|hello want you teach|
+----------+--------------------+

+----------+--------------------+
|  event_dt|           cust_text|
+----------+--------------------+
|2020-09-02|       hi fine i to |
|2020-09-02|        i line hold |
|2020-09-02|     i the 60 packs |
|2020-09-02|          you teach |
+----------+--------------------+

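2 answers

User

#1 · edited 2024-04-26 06:58:20

This solution is for pandas. If I understand your problem correctly, you want to remove from the cust_text column every word that appears in the second DataFrame. Let's name the two DataFrames df1 and df2. Here is how you could do it: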

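# Assumes df1 and df2 both use the default 0..n-1 index
for i in range(len(df1)):
    sentence = df1.loc[i, "cust_text"]
    for j in range(len(df2)):
        delete_word = df2.loc[j, "column1"]
        # Drop whole words only; str.replace would also strip the
        # word from inside longer words that merely contain it
        sentence = " ".join(w for w in sentence.split() if w != delete_word)
    df1.loc[i, "cust_text"] = sentence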

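I assigned some of the values from these DataFrames to variables (sentence and delete_word), but only to make the code easier to follow. Without them, you can easily condense this code to a few lines.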
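One possible condensed form is sketched below. It is not the only way to shorten the loop: it assumes the words to delete sit one per row in df2["column1"] (the column name used above), and the sample data, the delete_words set and the drop_words helper are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; "column1" is the assumed name of the word column
df1 = pd.DataFrame({
    "event_dt": ["2020-09-02"] * 4,
    "cust_text": [
        "hi fine i want to go",
        "i need a line hold",
        "i have the 60 packs",
        "hello want you teach",
    ],
})
df2 = pd.DataFrame({"column1": ["want", "go", "need", "a", "have", "hello"]})

# A set gives constant-time membership checks
delete_words = set(df2["column1"])


def drop_words(sentence: str) -> str:
    """Keep only the whitespace-separated words that are not in delete_words."""
    return " ".join(word for word in sentence.split() if word not in delete_words)


df1["cust_text"] = df1["cust_text"].apply(drop_words)
print(df1["cust_text"].tolist())
```

Because the sentence is split on whitespace and joined again with single spaces, this variant also normalizes runs of spaces as a side effect.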

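User

#2 · edited 2024-04-26 06:58:20

If you only want to remove the word in the corresponding row of df2, you can do it as follows. It will probably be slow for large datasets, though, because it can only partly use the fast C implementation: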

# define a helper function that removes the row's string from its text
def remove_string(ser_row):
    return ser_row['cust_text'].replace(ser_row['remove'], '')

# create a temporary column in the first dataframe holding the string to remove
# (rows are paired by index label, so df1 and df2 need matching indexes)
df1['remove'] = df2['column1']
# apply returns a new Series; keep it separate so df1 is not overwritten
cleaned = df1.apply(remove_string, axis='columns')
# drop the temporary column afterwards
df1.drop(columns=['remove'], inplace=True)
cleaned

The result looks like this:

Out[145]: 
0        hi fine i  to go
1    i need   lines hold 
2    i have the  60 packs
3           can you teach
dtype: object
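The row-by-row removal can also be written without the temporary column. The sketch below is one way to do it, assuming both DataFrames have the same number of rows in the same order; the sample data and the remove_word helper are hypothetical. Unlike plain str.replace, it only removes the word where it stands on its own.

```python
import re

import pandas as pd

# Hypothetical sample data; "column1" is the assumed name of the word column
df1 = pd.DataFrame({"cust_text": ["hi fine i want to go", "i need a line hold"]})
df2 = pd.DataFrame({"column1": ["want", "need"]})


def remove_word(text: str, word: str) -> str:
    """Remove `word` from `text` where it is a whole word, then tidy the spaces."""
    without = re.sub(r"\b" + re.escape(word) + r"\b", "", text)
    return " ".join(without.split())


# zip pairs the rows by position, not by index label
cleaned = pd.Series(
    [remove_word(text, word) for text, word in zip(df1["cust_text"], df2["column1"])],
    index=df1.index,
)
print(cleaned.tolist())
```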

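However, if you want to remove all the words in df2's column from every row of df1, you need a different approach. Unfortunately, str.replace with plain strings does not help here, unless you want to call it once for every row of the second DataFrame. So, if the second DataFrame is not too large, you can build a regular expression and still make use of str.replace: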

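import re
# \b restricts matches to whole words; re.escape protects any
# regex metacharacters that the words might contain
replace = re.compile(r'\b(' + '|'.join(map(re.escape, df2['column1'])) + r')\b')
# regex=True is required for a compiled pattern in pandas 2.x
df1['cust_text'].str.replace(replace, '', regex=True)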

The output is:

Out[184]: 
0      hi fine i  to 
1    i    lines hold 
2    i  the  60 packs
3       can you teach
Name: cust_text, dtype: object

If you don't like the repeated spaces that are left behind, you can do the following:

df1['cust_text'].str.replace(replace, '', regex=True).str.replace(r'\s{2,}', ' ', regex=True)
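As a variation on the two-step cleanup, the pattern itself can swallow the whitespace that follows each removed word, so no second pass is needed. This is only a sketch under the same assumptions as above; the remove_words helper and the sample data are hypothetical.

```python
import re

import pandas as pd


def remove_words(texts: pd.Series, words) -> pd.Series:
    """Remove every whole-word occurrence of `words` from each string in `texts`."""
    escaped = [re.escape(word) for word in words]
    if not escaped:
        # An empty alternation would match at word boundaries and eat the spaces
        return texts
    # \s* also consumes the whitespace after a removed word
    pattern = r"\b(?:" + "|".join(escaped) + r")\b\s*"
    return texts.str.replace(pattern, "", regex=True).str.strip()


# Hypothetical sample data modelled on the question
texts = pd.Series(["hi fine i want to go", "i need a line hold", "hello want you teach"])
cleaned = remove_words(texts, ["want", "go", "need", "a", "hello"])
print(cleaned.tolist())
```

The final str.strip() handles the case where the removed word was the last one in the row and left a trailing space behind.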

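Addendum: what if not only the text without the words matters, but the words themselves as well? How can we get the words that were replaced? Here is one attempt, which works if you can identify a character that will never appear in the text. Let's assume this character is @; then you can do the following (on the original column values, without the replacement):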

# enclose each keyword in @
ser_matched = df1['cust_text'].replace({replace: r'@\1@'}, regex=True)
# now remove the rest of the line, which is unmatched;
# this is the part of the string after the last occurrence
# of an @
ser_matched = ser_matched.replace({r'^(.*)@.*$': r'\1', '^@': ''}, regex=True)
# and if you would like your keywords in a list rather than a string,
# you can split the string at the end
ser_matched.str.split(r'@+')
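If no safe marker character is available, a different technique is to collect the matches directly with Series.str.findall, which returns one list of matched words per row. This is a sketch rather than a drop-in replacement for the code above; the sample data and the variable names are hypothetical.

```python
import re

import pandas as pd

# Hypothetical sample data modelled on the question
texts = pd.Series(["hi fine i want to go", "i need a line hold"], name="cust_text")
words = ["want", "go", "need", "a"]

# A non-capturing group makes findall return the whole match for each hit
pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
matched = texts.str.findall(pattern)
print(matched.tolist())
```

Rows without any match yield an empty list, so the result keeps the same index as the original column.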
