How do I count words in a DataFrame using a word list?

import pandas as pd

df = pd.DataFrame({
    "id": ["100", "200", "300"],
    "text": [
        "The best part of Zillow is you can search/view thousands of home within a click of a button without even stepping out of your door.At the comfort of your home you can get all the details such as the floor plan, tax history, neighborhood, mortgage calculator, school ratings etc. and also getting in touch with the contact realtor is just a click away and you are scheduled for the home tour!As a first time home buyer, this website greatly helped me to study the market before making the right choice.",
        "I love all of the features of the Zillow app, especially the filtering options and the feature that allows you to save customized searches.",
        "Data is not updated spontaneously. Listings are still shown as active while the Mls shows pending or closed.",
    ],
    "word": [
        "[best, word, door, subway, rain]",
        "[item, best, school, store, hospital]",
        "[gym, mall, pool, playground]",
    ],
})

| id  | word dict                                            |
| --- | ---------------------------------------------------- |
| 100 | {best: 1, word: 0, door: 1, subway: 0, rain: 0}      |
| 200 | {item: 0, best: 0, school: 0, store: 0, hospital: 0} |
| 300 | {gym: 0, mall: 0, pool: 0, playground: 0}            |

2 Answers

User

#1 · edited 2024-05-23 08:06:22

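We can use re to extract every word from the list string. Note that \w+ matches runs of letters, digits, and underscores, so it picks up each entry while skipping the brackets, commas, and spaces.

Then write a function that returns a dict with the count of each listed word in the text, and apply it row by row to create a new column in df. Keep in mind that str.count counts substring occurrences, so "best" would also be counted inside "bestseller".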

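import re

def count_words(row):
    # Extract the entries from a string such as "[best, word, door]".
    words = re.findall(r'\w+', row['word'])
    # Map each entry to the number of times it occurs in the row's text.
    return {word: row['text'].count(word) for word in words}

# axis=1 passes each row to count_words.
df['word_counts'] = df.apply(count_words, axis=1)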

Output

    id  ...                                        word_counts
0  100  ...  {'best': 1, 'word': 0, 'door': 1, 'subway': 0,...
1  200  ...  {'item': 0, 'best': 0, 'school': 0, 'store': 0...
2  300  ...  {'gym': 0, 'mall': 0, 'pool': 0, 'playground': 0}

[3 rows x 4 columns]
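
If whole-word matches are wanted rather than substring matches (so that "art" is not counted inside "part"), a minimal sketch using a word-boundary regex might look like the following. The helper name count_whole_words and the small sample frame are made up for illustration, and matching here is case-sensitive.

```python
import re

import pandas as pd


def count_whole_words(row):
    """Count whole-word occurrences of each listed word in row['text']."""
    # Pull the entries out of a string such as "[best, art, door]".
    words = re.findall(r"\w+", row["word"])
    # \b anchors each match at word boundaries; re.escape guards against
    # regex metacharacters in an entry.
    return {
        w: len(re.findall(rf"\b{re.escape(w)}\b", row["text"]))
        for w in words
    }


# Hypothetical frame, shaped like the one in the question.
sample = pd.DataFrame({
    "text": ["The best part of the door. Best door ever."],
    "word": ["[best, art, door]"],
})
sample["word_counts"] = sample.apply(count_whole_words, axis=1)
print(sample["word_counts"].iloc[0])
```

Passing flags=re.IGNORECASE to re.findall would also count "Best", if case should not matter.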

User

#2 · edited 2024-05-23 08:06:22

Since the word column holds strings, convert each one to a list first:

df['word'] = df['word'].str[1:-1].str.split(',')

Now you can use apply with axis=1 and a dict comprehension to count each word:

df[['text', 'word']].apply(lambda row: {item:row['text'].count(item) for item in row['word']}, axis=1)

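Output (every key after the first keeps a leading space, because split(',') does not remove the space that follows each comma, so a listed word is only counted where a space precedes it in the text):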

Out[32]: 
0    {'best': 1, ' word': 0, ' door': 1, ' subway':...
1    {'item': 0, ' best': 0, ' school': 0, ' store'...
2    {'gym': 0, ' mall': 0, ' pool': 0, ' playgroun...
dtype: object
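
A minimal sketch of the same approach with whitespace stripped from each list entry, so that keys come out as 'word' rather than ' word'. It assumes the word column still holds the original bracketed strings, and it uses a small made-up frame (sample) instead of the question's data.

```python
import pandas as pd

# Hypothetical frame in the same shape as the question's df.
sample = pd.DataFrame({
    "text": ["best door in town", "no match here"],
    "word": ["[best, word, door]", "[gym, mall]"],
})

# Drop the surrounding brackets, split on commas, then strip whitespace
# from every entry so that ' word' becomes 'word'.
sample["word"] = (
    sample["word"]
    .str.strip("[]")
    .str.split(",")
    .apply(lambda items: [item.strip() for item in items])
)

# Count each cleaned entry in the row's text (substring counting,
# as in the answer above).
counts = sample.apply(
    lambda row: {item: row["text"].count(item) for item in row["word"]},
    axis=1,
)
print(counts.tolist())
```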
