How do I count words in a DataFrame using a word list?

Posted 2024-05-23 08:06:22


I have a question about counting words with Python.

The DataFrame has three columns: id, text, and word.

First, here is a sample table:

[DataFrame]

import pandas as pd

df = pd.DataFrame({
    "id":[
        "100",
        "200",
        "300"
    ],
    "text":[
        "The best part of Zillow is you can search/view thousands of home within a click of a button without even stepping out of your door.At the comfort of your home you can get all the details such as the floor plan, tax history, neighborhood, mortgage calculator, school ratings etc. and also getting in touch with the contact realtor is just a click away and you are scheduled for the home tour!As a first time home buyer, this website greatly helped me to study the market before making the right choice.",
        "I love all of the features of the Zillow app, especially the filtering options and the feature that allows you to save customized searches.",
        "Data is not updated spontaneously. Listings are still shown as active while the Mls shows pending or closed."
    ],
    "word":[
        "[best, word, door, subway, rain]",
        "[item, best, school, store, hospital]",
        "[gym, mall, pool, playground]",
    ]
})

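I have already split the text into a dictionary.

So I want to check each row's word list against that row's text.

This is the result I want: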

| id  | word dict                                            |
| --- | ---------------------------------------------------- |
| 100 | {best: 1, word: 0, door: 1, subway: 0, rain: 0}      |
| 200 | {item: 0, best: 0, school: 0, store: 0, hospital: 0} |
| 300 | {gym: 0, mall: 0, pool: 0, playground: 0}            |

Please take a look at this question.


2 answers

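We can use re to extract all the words from the list string. Note that \w+ matches runs of letters, digits, and underscores, so it picks up each item in the list and skips the brackets, commas, and spaces.

Then we write a function that returns a dict with the count of each word in the list, and apply that function row by row to create a new column in df.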

import re

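def count_words(row):
    # Extract each word from the list-like string, e.g. "[best, word, door]"
    words = re.findall(r'\w+', row['word'])
    # str.count counts non-overlapping substring occurrences in the text
    return {word: row['text'].count(word) for word in words}

# Apply the function to every row and store the dicts in a new column
df['word_counts'] = df.apply(count_words, axis=1)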

Output


    id  ...                                        word_counts
0  100  ...  {'best': 1, 'word': 0, 'door': 1, 'subway': 0,...
1  200  ...  {'item': 0, 'best': 0, 'school': 0, 'store': 0...
2  300  ...  {'gym': 0, 'mall': 0, 'pool': 0, 'playground': 0}

[3 rows x 4 columns]
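
One caveat with the approach above: str.count counts substrings, so a listed word such as "other" would also be counted inside "Another" or "brother". If whole-word counts are wanted, a minimal sketch along the following lines might help. The function name count_whole_words and the choice to lowercase everything are assumptions made for illustration, not part of the original answer.

```python
import re
from collections import Counter

def count_whole_words(text, word_list_str):
    """Count whole-word, case-insensitive occurrences of each listed word."""
    # Extract the candidate words from a string such as "[best, word, door]".
    words = re.findall(r"\w+", word_list_str)
    # Tokenize the text once; \w+ matches runs of letters, digits, and underscores.
    token_counts = Counter(re.findall(r"\w+", text.lower()))
    # A Counter returns 0 for tokens that never appear in the text.
    return {word: token_counts[word.lower()] for word in words}

# "other" is not counted inside "Another" or "brother".
print(count_whole_words("Another brother found the best door.", "[other, best, door]"))
# prints {'other': 0, 'best': 1, 'door': 1}
```

With the DataFrame from the question, this could be applied as df.apply(lambda row: count_whole_words(row['text'], row['word']), axis=1).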

Since the word column is of type string, first convert it to a list:

df['word'] = df['word'].str[1:-1].str.split(',')

Now you can use apply with axis=1 and the logic below to count each word:

df[['text', 'word']].apply(lambda row: {item:row['text'].count(item) for item in row['word']}, axis=1)

Output (every key after the first keeps a leading space, because the string is split on ',' alone)

Out[32]: 
0    {'best': 1, ' word': 0, ' door': 1, ' subway':...
1    {'item': 0, ' best': 0, ' school': 0, ' store'...
2    {'gym': 0, ' mall': 0, ' pool': 0, ' playgroun...
dtype: object
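
Because the keys above carry a leading space, the counts are really for ' word', ' door', and so on. A small sketch of one way to avoid that is to strip each item while parsing. The helper names parse_word_list and count_listed_words are hypothetical and only illustrate the idea.

```python
def parse_word_list(word_list_str):
    """Turn a string such as "[best, word, door]" into ['best', 'word', 'door']."""
    # Drop surrounding whitespace and the enclosing brackets.
    inner = word_list_str.strip().strip("[]")
    # Split on commas, strip each item, and skip empty pieces.
    return [item.strip() for item in inner.split(",") if item.strip()]

def count_listed_words(text, words):
    """Count substring occurrences of each word in text, as in the answer above."""
    return {word: text.count(word) for word in words}

words = parse_word_list("[gym, mall, pool, playground]")
print(words)
# prints ['gym', 'mall', 'pool', 'playground']
print(count_listed_words("The pool is next to the gym.", words))
# prints {'gym': 1, 'mall': 0, 'pool': 1, 'playground': 0}
```

With the DataFrame from the question, df['word'].apply(parse_word_list) would replace the slicing and splitting step, and the same apply with axis=1 can then be used for the counts.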
