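Getting a list of unique strings after split() in pandas

3 votes
3 answers
4438 views
Asked 2025-04-18 03:29

I'm just starting to learn pandas. I have a larger table (a DataFrame) with one column of data that looks like this: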

0                  one two
1            two seven six
2           three one five
3    seven five five eight
4                 six four
5                    three
dtype: object

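I want to split these word sequences into their parts and then get either the set of unique words or a count for each word. The splitting step works fine: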

numbers.str.split(' ')

0                    [one, two]
1             [two, seven, six]
2            [three, one, five]
3    [seven, five, five, eight]
4                   [six, four]
5                       [three]
dtype: object

However, I'm not sure where to go from here. To be clear, the output I'm after is:

['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight']

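or the counts as a dictionary, or either of these two results as a Series/DataFrame.

So far, the best I've managed is combining apply() with a set to collect the unique words. From what I've seen, pandas is a very elegant tool, so I suspect this is easy for someone who knows it better than I do.

Thanks in advance!

3 Answers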

0

I recently worked on a similar task: counting the words in space-separated strings. You could handle your data like this:

import pandas as pd
data = [['one two'],['two seven six'],['three one five'],['seven five five eight'],['six four'],['three']]
numbers = pd.DataFrame(data)

uniq_groups = set(x for l in numbers[0].str.split(' ') for x in l)
#{'eight', 'five', 'four', 'one', 'seven', 'six', 'three', 'two'}

#add a dataframe column for count of each value
for gr in uniq_groups:
    numbers[gr] = numbers[0].map(lambda x: x.split(' ').count(gr))

#sum all columns
numbers.loc['Total'] = numbers.sum(axis=0,numeric_only=True)
#pandas display format without decimals
pd.options.display.float_format = '{:,.0f}'.format

The result is a table with one count column per word and a Total row holding the column sums.
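For comparison, the same per-row count table can be sketched without a Python loop. This is only an illustration, not part of the answer above: it assumes a pandas version that has Series.explode (0.25 or later), and the names words, per_row and totals are made up for the example.

```python
import pandas as pd

numbers = pd.Series([
    "one two", "two seven six", "three one five",
    "seven five five eight", "six four", "three",
])

# One row per word; the index still points back to the original row.
words = numbers.str.split().explode()

# Cross-tabulate original row label against word to get per-row counts.
per_row = pd.crosstab(words.index.to_numpy(), words.to_numpy())

# Column sums give the overall count of each word.
totals = per_row.sum(axis=0)

print(per_row)
print(totals)
```

Here pd.crosstab does the counting that the loop does column by column, so rows with a repeated word (such as row 3 with "five" twice) get a count of 2 in that column.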

1

This code builds a dictionary that records how many times each of your words occurs.

x = ['one two', 'two seven six', 'three one five', 'seven five five eight', 'six four', 'three']

#create list comprehension of all elements
x_list = [j for i in x for j in i.split()]
print(x_list)

# ['one', 'two', 'two', 'seven', 'six', 'three', 'one', 'five', 'seven', 'five', 'five', 'eight', 'six', 'four', 'three']

d = {}

#initialize keys
for e in set(x_list):
    d[e] = 0

#store counts in dict
for e in x_list:
    d[e] += 1

print(d)

The result is a dictionary mapping each word to its count:

{'seven': 2, 'six': 2, 'three': 2, 'two': 2, 'four': 1, 'five': 3, 'eight': 1, 'one': 2}
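The two loops above can also be collapsed with collections.Counter from the standard library. The following is a minimal sketch of that alternative, not the code from this answer; the variable name counts is chosen for illustration.

```python
from collections import Counter

x = ['one two', 'two seven six', 'three one five',
     'seven five five eight', 'six four', 'three']

# Count every word produced by splitting each string on whitespace.
counts = Counter(word for line in x for word in line.split())

# Counter is a dict subclass; dict() gives a plain dictionary.
print(dict(counts))
```

Counter initializes missing keys to zero on its own, so the separate key-initialization step is not needed.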

8

If I understand correctly, I think you can do this in pandas as follows. I'll start from your Series before the strings are split:

print(s)

0                  one two
1            two seven six
2           three one five
3    seven five five eight
4                 six four
5                    three

stacked = pd.DataFrame(s.str.split().tolist()).stack()
print(stacked)

0  0      one
   1      two
1  0      two
   1    seven
   2      six
2  0    three
   1      one
   2     five
3  0    seven
   1     five
   2     five
   3    eight
4  0      six
   1     four
5  0    three

Now just count how many times each value occurs in this Series:

print(stacked.value_counts())

five     3
one      2
three    2
six      2
two      2
seven    2
eight    1
four     1
dtype: int64
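On newer pandas versions the same idea can be written with Series.explode instead of building an intermediate DataFrame and stacking it. This is a sketch under that assumption (explode was added in pandas 0.25); it also shows how to get the unique list and the dictionary of counts the question asks for. The names words, unique_words and count_dict are illustrative.

```python
import pandas as pd

s = pd.Series([
    "one two", "two seven six", "three one five",
    "seven five five eight", "six four", "three",
])

# Split each string into a list, then give every word its own row.
words = s.str.split().explode()

# Count of each word as a Series, most frequent first.
counts = words.value_counts()

# Unique words in order of first appearance, as a plain list.
unique_words = words.unique().tolist()

# The same counts as a plain dictionary.
count_dict = counts.to_dict()

print(unique_words)
print(count_dict)
```

Note that unique() preserves the order in which words first appear, which differs from the numeric order shown in the question; sorting by meaning (one, two, three, ...) would need a separate mapping.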
