Comparing text from multiple files against a list using a function and a for loop

df:

    'headline'                                          'source'
    targets is making better stars in the bucks         target news
    more diamonds than rocks in saturn rings            wishful thinking
    diamond in the rough employees take too many naps   refresh sleep

data:

    'company'
    targets
    stars in the bucks
    wallymarty
    velocity global
    diamond in the rough

    ccompanies = data['company'].tolist()  #convert into list

    def find(x):  #function to compare df['headline'] against list of companies
        result = []
        companies = set(ccompanies)  #edit based on comment, saves time
        for i in companies:
            if i in x:
                result.append(x)
        return result

    matches = df['headline'].apply(find)

2 answers

User

#1 · edited 2024-05-29 03:05:04

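This can be done in pure pandas, without iterating and without converting to a list.

First, join data with df so that each headline is "duplicated" once for every company name it is compared against. A temporary column, key, is used to make this join possible. (The to_frame() call below assumes data is a Series; if data is already a DataFrame, as in the question, use it directly.)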

In [60]: data_df = data.to_frame()

In [61]: data_df['key'] = 1

In [63]: df['key'] = 1

In [65]: merged = pd.merge(df, data_df, how='outer', on='key').drop('key', axis=1)
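
On pandas 1.2 or later, the same pairing can probably be written with how='cross', which avoids the temporary key column. The following is a minimal sketch, not the answerer's code: it rebuilds the sample frames from the question and assumes data is a DataFrame with a company column.

```python
import pandas as pd

# Sample frames rebuilt from the question.
df = pd.DataFrame({
    "headline": [
        "targets is making better stars in the bucks",
        "more diamonds than rocks in saturn rings",
        "diamond in the rough employees take too many naps",
    ],
    "source": ["target news", "wishful thinking", "refresh sleep"],
})
data = pd.DataFrame({
    "company": [
        "targets",
        "stars in the bucks",
        "wallymarty",
        "velocity global",
        "diamond in the rough",
    ]
})

# Cross join: every headline is paired with every company (3 x 5 rows).
merged = pd.merge(df, data, how="cross")
print(merged.shape)
```

The row count is still headlines times companies, so the size concern is the same as with the key-based join.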

merged looks like this. As you can see, depending on the size of data, this method can produce a very large DataFrame.

In [66]: merged
Out[66]:
                                             headline            source               company
0         targets is making better stars in the bucks       target news               targets
1         targets is making better stars in the bucks       target news    stars in the bucks
2         targets is making better stars in the bucks       target news            wallymarty
3         targets is making better stars in the bucks       target news       velocity global
4         targets is making better stars in the bucks       target news  diamond in the rough
5            more diamonds than rocks in saturn rings  wishful thinking               targets
6            more diamonds than rocks in saturn rings  wishful thinking    stars in the bucks
7            more diamonds than rocks in saturn rings  wishful thinking            wallymarty
8            more diamonds than rocks in saturn rings  wishful thinking       velocity global
9            more diamonds than rocks in saturn rings  wishful thinking  diamond in the rough
10  diamond in the rough employees take too many naps     refresh sleep               targets
11  diamond in the rough employees take too many naps     refresh sleep    stars in the bucks
12  diamond in the rough employees take too many naps     refresh sleep            wallymarty
13  diamond in the rough employees take too many naps     refresh sleep       velocity global
14  diamond in the rough employees take too many naps     refresh sleep  diamond in the rough

Then look for the company text in the headline. If it is found, put True in a new found column, otherwise False.

In [67]: merged['found'] = merged.apply(lambda x: x['company'] in x['headline'], axis=1)

Then drop the rows where no match was found:

In [68]: found_df = merged.drop(merged[merged['found']==False].index)

In [69]: found_df
Out[69]:
                                             headline         source               company  found
0         targets is making better stars in the bucks    target news               targets   True
1         targets is making better stars in the bucks    target news    stars in the bucks   True
14  diamond in the rough employees take too many naps  refresh sleep  diamond in the rough   True
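
As an alternative to drop, the found column can be used directly as a boolean mask. This is a small sketch on a shortened stand-in for merged (three rows, chosen here for illustration), not the full frame from the transcript.

```python
import pandas as pd

# A shortened stand-in for the merged frame (illustrative rows only).
merged = pd.DataFrame({
    "headline": [
        "targets is making better stars in the bucks",
        "targets is making better stars in the bucks",
        "more diamonds than rocks in saturn rings",
    ],
    "company": ["targets", "wallymarty", "targets"],
})

# Row-wise substring test, as in step 67.
merged["found"] = merged.apply(lambda row: row["company"] in row["headline"], axis=1)

# Boolean indexing keeps only the rows where the mask is True.
found_df = merged[merged["found"]]
print(found_df)
```

The original index labels are kept, just as they are with drop.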

If needed, reduce the result to just the headline and company columns:

In [70]: found_df[['headline', 'company']]
Out[70]:
                                             headline               company
0         targets is making better stars in the bucks               targets
1         targets is making better stars in the bucks    stars in the bucks
14  diamond in the rough employees take too many naps  diamond in the rough

Shortcut: steps 67 through the end can be condensed into this single command:

merged.drop(merged[merged.apply(lambda x: x['company'] in x['headline'], axis=1) == False].index)[['headline', 'company']]

User

#2 · edited 2024-05-29 03:05:04

... should be using regex in this case or if a simple in statement is sufficient?

Using in is fine, since you have apparently already normalized the text with .lower() and removed punctuation.
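
The question does not show its preprocessing, so the following is only a sketch of what such normalization might look like. The helper name normalize is hypothetical, and it strips ASCII punctuation only.

```python
import string

def normalize(text):
    """Lowercase the text and drop ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

headline = normalize("Targets is making better Stars in the Bucks!")

# Plain substring test on the normalized headline.
print("stars in the bucks" in headline)
```

Applying the same helper to both the headlines and the company names keeps the two sides comparable.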

You should really try to use more meaningful identifiers. For example, the usual idiom would be for company in companies:, rather than i.
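
With clearer names, the lookup might read as in the sketch below. The name find_companies is hypothetical, and returning the matched company names (rather than appending the headline, as the question's version does) is an assumption about what is wanted.

```python
def find_companies(headline, companies):
    """Return the companies whose name occurs as a substring of the headline."""
    # Assumption: the caller wants the matching company names.
    return [company for company in companies if company in headline]

companies = ["targets", "stars in the bucks", "wallymarty"]
matches = find_companies("targets is making better stars in the bucks", companies)
print(matches)
```

With pandas this could be applied per row, for example df['headline'].apply(find_companies, companies=companies).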

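You know how to use .tolist(), good. But you really want to create a set rather than a list, to support efficient in tests. It is the difference between an O(1) hash lookup and a nested loop doing a linear scan of the list.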
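
A small illustration of the difference, using the company names from the question. Note that the hash lookup helps for exact-match membership; a substring test against a headline still has to visit every company, whichever container holds them.

```python
companies_list = ["targets", "stars in the bucks", "wallymarty",
                  "velocity global", "diamond in the rough"]
companies_set = set(companies_list)

# Exact-match membership: the list is scanned item by item,
# while the set does a hash lookup (constant time on average).
print("wallymarty" in companies_list)
print("wallymarty" in companies_set)

# Substring matching still iterates over every company.
headline = "diamond in the rough employees take too many naps"
hits = sorted(company for company in companies_set if company in headline)
print(hits)
```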

This makes little sense:

        for i in ccompanies:
            i = [x]

You start iterating, but then i essentially becomes a constant? It is unclear what you are trying to do.

If you take this project a bit further, you might consider matching companies with NLTK, or with the TF-IDF vectorizer from scikit-learn, or with https://pypi.org/project/fuzzywuzzy/
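
To give a feel for approximate matching without installing anything, the sketch below uses difflib from the Python standard library as a stand-in. It is not the fuzzywuzzy, NLTK, or scikit-learn API, and the misspelled query and the 0.8 cutoff are arbitrary choices for illustration.

```python
import difflib

companies = ["targets", "stars in the bucks", "wallymarty",
             "velocity global", "diamond in the rough"]

# Find the closest company name to a misspelled query.
# cutoff is a similarity threshold between 0 and 1 (0.8 is an arbitrary choice).
close = difflib.get_close_matches("wallymart", companies, n=1, cutoff=0.8)
print(close)
```

This compares whole strings, so it suits matching a candidate name against the company list rather than searching inside a full headline.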
