Pandas: check whether a string contains at least two words from a list

list_words='foo ber haa'

df = pd.DataFrame({'A' : ['foo foor', 'bar bar', 'foo hoo', 'bar haa',
                         'foo bar', 'bar bur', 'foo fer', 'foo for']})

df
Out[113]:
          A
0  foo foor
1   bar bar
2   foo hoo
3   bar haa
4   foo bar
5   bar bur
6   foo fer
7   foo for

df.A.str.contains("|".join(list_words.split(" ")))
Out[114]:
0     True
1    False
2     True
3     True
4     True
5    False
6     True
7     True
Name: A, dtype: bool

3 answers

User

Answer 1 · edited 2024-06-11 05:07:44

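You can build one boolean column per word with a list comprehension over str.contains, join the columns with pd.concat, and count the matches in each row: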

#changed ber to bar
list_words='foo bar haa'

df = pd.DataFrame({'A' : ['foo foor', 'bar bar', 'foo hoo', 'bar haa',
                         'foo bar', 'bar bur', 'foo fer', 'foo for']})  

print (df)
          A
0  foo foor
1   bar bar
2   foo hoo
3   bar haa
4   foo bar
5   bar bur
6   foo fer
7   foo for

print((pd.concat([df.A.str.contains(word,regex=False) for word in list_words.split()],axis=1))
          .sum(1) > 1)

0    False
1    False
2    False
3     True
4     True
5    False
6    False
7    False
dtype: bool
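The expression above can be wrapped in a small helper so the threshold becomes a parameter. This is a minimal sketch, not part of the original answer: the name contains_at_least and its n argument are hypothetical, and it assumes the words in the list are unique.

```python
import pandas as pd


def contains_at_least(series, words, n=2):
    """Return True for rows where at least `n` of `words` occur as substrings."""
    # One boolean column per word; regex=False does a plain substring test.
    hits = pd.concat(
        [series.str.contains(word, regex=False) for word in words],
        axis=1,
    )
    # Summing booleans across the columns counts the matching words per row.
    return hits.sum(axis=1) >= n


df = pd.DataFrame({'A': ['foo foor', 'bar bar', 'foo hoo', 'bar haa',
                         'foo bar', 'bar bur', 'foo fer', 'foo for']})

mask = contains_at_least(df['A'], 'foo bar haa'.split(), n=2)
print(df[mask])  # keeps rows 3 ('bar haa') and 4 ('foo bar')
```

Because the result is an ordinary boolean Series, it can be used directly for filtering, as in df[mask].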

Timings:


In [292]: %timeit ((pd.concat([df.A.str.contains(word) for word in list_words.split()], axis=1)).sum(1) > 1)
100 loops, best of 3: 16 ms per loop

In [325]: %timeit (jon(df))
100 loops, best of 3: 8.97 ms per loop

In [294]: %timeit ((pd.concat([df.A.str.contains(word,regex=False) for word in list_words.split()], axis=1)).sum(1) > 1)
100 loops, best of 3: 8.13 ms per loop

In [295]: %timeit df['A'].map(lambda x: check(x, list_words))
100 loops, best of 3: 14.7 ms per loop
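The setup behind these numbers is not shown, so the following is only a sketch of how such a comparison could be reproduced with the standard timeit module. The repetition factor of 1000 and the function names concat_contains and set_intersection are assumptions made for illustration; absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

list_words = 'foo bar haa'
words = list_words.split()
set_words = set(words)

small = pd.DataFrame({'A': ['foo foor', 'bar bar', 'foo hoo', 'bar haa',
                            'foo bar', 'bar bur', 'foo fer', 'foo for']})
# Assumption: enlarge the frame by repetition so the timings are measurable.
df = pd.concat([small] * 1000, ignore_index=True)


def concat_contains(frame):
    # Substring test per word, then count the matches in each row.
    cols = [frame.A.str.contains(w, regex=False) for w in words]
    return pd.concat(cols, axis=1).sum(axis=1) > 1


def set_intersection(frame):
    # Whole-word test: intersect each row's words with the word set.
    return frame.A.apply(lambda text: len(set(text.split()) & set_words) > 1)


for func in (concat_contains, set_intersection):
    seconds = timeit.timeit(lambda: func(df), number=10) / 10
    print(f'{func.__name__}: {seconds * 1000:.2f} ms per call')
```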

User

Answer 2 · edited 2024-06-11 05:07:44

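Assuming ber was meant to be bar, you can use .apply with a set. Note that this matches whole words rather than substrings (for example, foo will not be found in foor).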

import pandas as pd

list_words='foo bar haa'
set_words = set(list_words.split())

df = pd.DataFrame({'A' : ['foo foor', 'bar bar', 'foo hoo', 'bar haa',
                         'foo bar', 'bar bur', 'foo fer', 'foo for']})

df.A.apply(lambda L: len(set(L.split()) & set_words) > 1)

This gives:

0    False
1    False
2    False
3     True
4     True
5    False
6    False
7    False
Name: A, dtype: bool
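The whole-word versus substring distinction noted above does not show up in the sample frame, where both tests happen to agree on every row. A small sketch with one made-up row, 'foo barn', makes the difference visible; the rows and variable names here are illustrative and not from the original question.

```python
import pandas as pd

words = ['foo', 'bar', 'haa']
word_set = set(words)
s = pd.Series(['foo foor', 'foo barn', 'bar haa'])  # hypothetical rows

# Substring test: 'bar' is found inside 'barn'.
substring_hits = pd.concat(
    [s.str.contains(w, regex=False) for w in words], axis=1
).sum(axis=1)

# Whole-word test: split on whitespace and intersect with the word set.
whole_word_hits = s.apply(lambda text: len(set(text.split()) & word_set))

print(pd.DataFrame({'text': s,
                    'substring': substring_hits,
                    'whole_word': whole_word_hits}))
```

For 'foo barn' the substring count is 2 while the whole-word count is 1, so the two approaches disagree on whether the row passes the "at least two words" test.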

User

Answer 3 · edited 2024-06-11 05:07:44

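I'm a beginner with pandas (and Python in general), so I treated this as a challenge rather than a way to earn upvotes :). I only used techniques I know, but they are much slower than what others have proposed.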

def check(row, string):
    #tokenize string
    string_list = string.split() 
    #tokenize row
    row_list = row.split()

    counter = 0
    used_words = []
    for word in row_list:
        used_words.append(word)
        if word in string_list and not(used_words.count(word) >1):
            counter += 1
    if counter >= 2:
        return True
    else:
        return False

df['check'] = df['A'].map(lambda x: check(x, list_words))

I'll take a look at the techniques others have proposed :)
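The loop in check counts each distinct word of the row once if it also appears in the word list, which amounts to the size of a set intersection. The following sketch compares the two on the sample rows; check_with_sets is a hypothetical name, and check is reproduced here in slightly condensed form so the block runs on its own.

```python
def check(row, string):
    # Loop-based version from the answer, lightly condensed.
    string_list = string.split()
    counter = 0
    used_words = []
    for word in row.split():
        used_words.append(word)
        # Count a word only the first time it is seen in the row.
        if word in string_list and used_words.count(word) == 1:
            counter += 1
    return counter >= 2


def check_with_sets(row, string):
    # Hypothetical rewrite: distinct words shared by the row and the list.
    return len(set(row.split()) & set(string.split())) >= 2


rows = ['foo foor', 'bar bar', 'foo hoo', 'bar haa',
        'foo bar', 'bar bur', 'foo fer', 'foo for']
for row in rows:
    assert check(row, 'foo bar haa') == check_with_sets(row, 'foo bar haa')
print([check_with_sets(row, 'foo bar haa') for row in rows])
```

The set version avoids the repeated used_words.count call, which rescans the list for every word in the row.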
