How do I remove words from one column that are contained in another column?

Posted 2024-05-14 07:29:07


  Audience              Ad
  Audience1     Audience4.Ad1.image
  Audience2     Audience1.Ad4.image
  Audience3     Audience7.Ad1.image
  Audience4     Audience2.Ad3.image
  Audience5     Audience9.Ad1.image
  Audience6     Audience4.Ad2.image
  Audience7     Audience5.Ad1.image
  Audience8     Audience7.Ad3.image
  Audience9     Audience8.Ad1.image
  Audience10    Audience9.Ad1.image

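This is some sample data. What I want to do is look at the Ad column and, if it contains any value from the Audience column, remove that part (replace it with nothing). The hardest part for me is that the left column might say Audience1 while the right column says Audience2 in the same row, so the two do not match row by row. If they did, I would know how to do this, but unfortunately they don't.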

So the expected result looks like this:

  Audience      Ad
  Audience1     Ad1.image
  Audience2     Ad4.image
  Audience3     Ad1.image  
  Audience4     Ad3.image
  Audience5     Ad1.image
  Audience6     Ad2.image
  Audience7     Ad1.image
  Audience8     Ad3.image
  Audience9     Ad1.image
  Audience10    Ad1.image

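The approach I had in mind is to iterate over the Audience column with a for loop and, if any element of the Audience column is contained in the Ad value, remove it.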

This is how I tried to solve it, but I'm stuck on what to put in the return statement (assuming the rest of the logic is correct, of course):

def replace(text):
    for i in df['Audience']:
        if i in text:
            return ???
df['Ad'] = df['Ad'].apply(replace)

Any help would be greatly appreciated.
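For reference, here is one way the `replace` function from the question could be completed. It is only a sketch, assuming that the Audience name, when present, is the first dot-separated token of Ad. The three-row frame and the name `audiences` are made up for illustration.

```python
import pandas as pd

# illustrative data, not the full table from the question
df = pd.DataFrame({
    'Audience': ['Audience1', 'Audience2', 'Audience10'],
    'Ad': ['Audience2.Ad1.image', 'Audience10.Ad4.image', 'Other.Ad2.image'],
})

audiences = set(df['Audience'])  # a set gives fast membership checks

def replace(text):
    # compare the first dot-separated token exactly, so that
    # 'Audience1' is never treated as a match for 'Audience10'
    head, sep, tail = text.partition('.')
    if sep and head in audiences:
        return tail
    return text

df['Ad'] = df['Ad'].apply(replace)
print(df['Ad'].tolist())
```

Comparing the whole token, rather than testing `i in text`, avoids partial matches between names that share a prefix.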


3 answers

You can use `str.contains` together with `str.replace`:

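# append an escaped dot to every name so 'Audience1' cannot match inside 'Audience11'
pattern = '|'.join(name + r'\.' for name in set(df['Audience']))
mask = df['Ad'].str.contains(pattern)
df.loc[mask,'Ad'] = df.loc[mask,'Ad'].str.replace(r'Audience\d+\.', '', regex=True)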
df
     Audience         Ad
0   Audience1  Ad1.image
1   Audience2  Ad4.image
2   Audience3  Ad1.image
3   Audience4  Ad3.image
4   Audience5  Ad1.image
5   Audience6  Ad2.image
6   Audience7  Ad1.image
7   Audience8  Ad3.image
8   Audience9  Ad1.image
9  Audience10  Ad1.image

Example with a non-matching value:

df
      Audience                     Ad
0    Audience1    Audience4.Ad1.image
1    Audience2    Audience1.Ad4.image
2    Audience3    Audience7.Ad1.image
3    Audience4    Audience2.Ad3.image
4    Audience5    Audience9.Ad1.image
5    Audience6    Audience4.Ad2.image
6    Audience7    Audience5.Ad1.image
7    Audience8    Audience7.Ad3.image
8    Audience9    Audience8.Ad1.image
9   Audience10    Audience9.Ad1.image
10  Audience12  Audience11.Ad11.image

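# append an escaped dot to every name so 'Audience1' cannot match inside 'Audience11'
pattern = '|'.join(name + r'\.' for name in set(df['Audience']))
mask = df['Ad'].str.contains(pattern)
df.loc[mask,'Ad'] = df.loc[mask,'Ad'].str.replace(r'Audience\d+\.', '', regex=True)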
df

      Audience                     Ad
0    Audience1              Ad1.image
1    Audience2              Ad4.image
2    Audience3              Ad1.image
3    Audience4              Ad3.image
4    Audience5              Ad1.image
5    Audience6              Ad2.image
6    Audience7              Ad1.image
7    Audience8              Ad3.image
8    Audience9              Ad1.image
9   Audience10              Ad1.image
10  Audience12  Audience11.Ad11.image # -> Audience11 not deleted as 'Audience11' is not in `df['Audience']`
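If the Audience values might contain regex metacharacters, a more defensive variant is to escape each name and remove only names that actually occur in the Audience column. The sketch below assumes the Audience name always sits at the start of Ad, which is why the pattern is anchored with `^`. The three-row frame is illustrative.

```python
import re
import pandas as pd

# illustrative data; 'Audience11' deliberately does not appear in Audience
df = pd.DataFrame({
    'Audience': ['Audience1', 'Audience2', 'Audience12'],
    'Ad': ['Audience2.Ad1.image', 'Audience1.Ad4.image', 'Audience11.Ad11.image'],
})

# re.escape protects any regex metacharacters inside the names;
# the trailing \. stops 'Audience1' from matching 'Audience11.'
names = set(df['Audience'])
pattern = r'^(?:' + '|'.join(re.escape(n) for n in names) + r')\.'

# regex=True is passed explicitly because the default differs between pandas versions
df['Ad'] = df['Ad'].str.replace(pattern, '', regex=True)
print(df['Ad'].tolist())
```

Because the pattern itself encodes which names are allowed, no separate mask is needed here.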

Use `str.split` together with `isin` and the `where` method:

s = df['Ad'].str.split('.')
m = s.str[0].isin(df['Audience'])
df['Ad'] = s.where(~m, s.str[1:]).str.join('.')

# print(df)

     Audience         Ad
0   Audience1  Ad1.image
1   Audience2  Ad4.image
2   Audience3  Ad1.image
3   Audience4  Ad3.image
4   Audience5  Ad1.image
5   Audience6  Ad2.image
6   Audience7  Ad1.image
7   Audience8  Ad3.image
8   Audience9  Ad1.image
9  Audience10  Ad1.image
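To see what each step of the split/isin/where approach produces, the intermediate values can be inspected on a small frame. This is a sketch on two made-up rows, and the name `trimmed` is introduced here only for readability.

```python
import pandas as pd

# illustrative data; 'Audience9' is not in the Audience column
df = pd.DataFrame({
    'Audience': ['Audience1', 'Audience2'],
    'Ad': ['Audience2.Ad1.image', 'Audience9.Ad4.image'],
})

s = df['Ad'].str.split('.')          # each row becomes a list of tokens
m = s.str[0].isin(df['Audience'])    # True where the first token is a known Audience
trimmed = s.str[1:]                  # the same lists without their first token

# Series.where keeps the original value where the condition is True,
# so ~m keeps unmatched rows and takes `trimmed` for matched rows
result = s.where(~m, trimmed).str.join('.')
print(m.tolist())
print(result.tolist())
```

Only the first token is tested, so an Audience name appearing later in the Ad string would be left in place.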
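  • Convert Audience to a `set` to ensure there are no duplicate values
  • `.split` Ad on '.'
  • Use a list comprehension to remove the Audience terms from the Ad list, then `.join` the remaining terms

    • [y for y in x if y not in aud] is a list comprehension
      • Each row is converted to a list with .split. The comprehension iterates over each value and checks whether it is in the aud set. If it is, the value is left out of the new list
      • '.'.join() creates a string from the elements of the list
  • Given a sample dataset of 10e6 (10 million) rows (df = pd.concat([pd.DataFrame(data)]*1000000)):

    • This answer: Wall time: 16.9 s
    • The answer from Shubham Sharma: Wall time: 27.7 s
    • The answer from {a8}: {}
      • This time depends on the number of unique words in df['Audience'], because those words are joined into a single string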
import pandas as pd

# data and dataframe
data = {'Audience': ['Audience1', 'Audience2', 'Audience3', 'Audience4', 'Audience5', 'Audience6', 'Audience7', 'Audience8', 'Audience9', 'Audience10'],
        'Ad': ['Audience4.Ad1.image', 'Audience1.Ad4.image', 'Audience7.Ad1.image', 'Audience2.Ad3.image', 'Audience9.Ad1.image', 'Audience4.Ad2.image', 'Audience5.Ad1.image', 'Audience7.Ad3.image', 'Audience8.Ad1.image', 'Audience9.Ad1.image']}

df = pd.DataFrame(data)

# create list of unique words from Audience
aud = set(df.Audience.str.lower())

# remove Audience words from Ad column
df.Ad = df.Ad.str.split('.').apply(lambda x: '.'.join([y for y in x if y.lower() not in aud]))

|    | Audience   | Ad        |
|---:|:-----------|:----------|
|  0 | Audience1  | Ad1.image |
|  1 | Audience2  | Ad4.image |
|  2 | Audience3  | Ad1.image |
|  3 | Audience4  | Ad3.image |
|  4 | Audience5  | Ad1.image |
|  5 | Audience6  | Ad2.image |
|  6 | Audience7  | Ad1.image |
|  7 | Audience8  | Ad3.image |
|  8 | Audience9  | Ad1.image |
|  9 | Audience10 | Ad1.image |
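The wall times quoted in this answer can be reproduced roughly with a small timing harness. The sketch below is an assumption about how one might measure it outside a notebook: it swaps in `time.perf_counter` and uses only 1,000 copies of a three-row frame so that it finishes quickly. Absolute numbers will vary by machine and pandas version.

```python
import time
import pandas as pd

# illustrative data, smaller than the frame used in the answer
data = {'Audience': ['Audience1', 'Audience2', 'Audience3'],
        'Ad': ['Audience3.Ad1.image', 'Audience1.Ad4.image', 'Audience2.Ad2.image']}

# 1,000 copies keeps the run short; the quoted figures used 1,000,000 copies
df = pd.concat([pd.DataFrame(data)] * 1000, ignore_index=True)

aud = set(df['Audience'].str.lower())

start = time.perf_counter()
cleaned = df['Ad'].str.split('.').apply(
    lambda parts: '.'.join([p for p in parts if p.lower() not in aud]))
elapsed = time.perf_counter() - start

print(f'{len(df)} rows in {elapsed:.4f} s')
```

Raising the multiplier gives a feel for how the approach scales with row count.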

Option 2:

  • Updated for the new data from the comments
data = {'Audience': ['Football.And.Basketball.Interests', 'Baseball.Interests', 'Cricket.Interests', 'Website.Visitors'],
        'Ad': ['Baseball.Interests.Ad1.image', 'Football.And.Basketball.Interests.Ad4.image', 'Cricket.Interests.Ad1.image', 'Website.Visitors.Ad3.image']}

df = pd.DataFrame(data)

                          Audience                                           Ad
 Football.And.Basketball.Interests                 Baseball.Interests.Ad1.image
                Baseball.Interests  Football.And.Basketball.Interests.Ad4.image
                 Cricket.Interests                  Cricket.Interests.Ad1.image
                  Website.Visitors                   Website.Visitors.Ad3.image

# if Audience contains multiple values
aud = set(df.Audience.str.split('.').explode().str.lower())

# remove Audience words from Ad column
df.Ad = df.Ad.str.split('.').apply(lambda x: '.'.join([y for y in x if y.lower() not in aud]))

                          Audience         Ad
 Football.And.Basketball.Interests  Ad1.image
                Baseball.Interests  Ad4.image
                 Cricket.Interests  Ad1.image
                  Website.Visitors  Ad3.image
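One thing to keep in mind with the token-by-token removal in Option 2 is that any Ad token equal to any Audience token is dropped, wherever it appears (for example a second 'Interests' later in the string). If only a whole Audience name at the start of Ad should be removed, a prefix check is an alternative. This is a sketch under that assumption: the helper `strip_prefix` and the modified second Ad value are hypothetical, and unlike the code above the comparison is case-sensitive.

```python
import pandas as pd

# illustrative data; the second Ad repeats 'Interests' on purpose
data = {'Audience': ['Football.And.Basketball.Interests', 'Baseball.Interests'],
        'Ad': ['Baseball.Interests.Ad1.image',
               'Football.And.Basketball.Interests.Interests.Ad4.image']}
df = pd.DataFrame(data)

# longest names first, so the most specific prefix wins if names overlap
names = sorted(set(df['Audience']), key=len, reverse=True)

def strip_prefix(ad):
    for name in names:
        prefix = name + '.'
        if ad.startswith(prefix):
            return ad[len(prefix):]  # drop the whole name once, keep the rest untouched
    return ad

df['Ad'] = df['Ad'].apply(strip_prefix)
print(df['Ad'].tolist())
```

Here the second row keeps its extra 'Interests' token, whereas the token-based version would remove it as well. Which behavior is wanted depends on the data.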
