Getting a list of unique strings after split() in pandas

3 votes

3 answers

4438 views

Asked 2025-04-18 03:29

I'm just getting started with pandas. I have a larger table (a DataFrame) that contains a column of data like this:

0                  one two
1            two seven six
2           three one five
3    seven five five eight
4                 six four
5                    three
dtype: object

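I want to split these word sequences into their individual words and then get either the set of unique words or a count for each word. The split step itself works fine: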

numbers.str.split(' ')

0                    [one, two]
1             [two, seven, six]
2            [three, one, five]
3    [seven, five, five, eight]
4                   [six, four]
5                       [three]
dtype: object

However, I'm not sure what to do next. To restate, the output I want is:

['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight']

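or the counts as a dict, or either of those two results as a Series/DataFrame.

So far, the best I've managed is combining apply() with a set to collect the unique words. From what I've seen, pandas is a very elegant tool, so I suspect this is not a hard problem for someone more familiar with it than I am.

Thanks in advance!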

Tags: set operations, data processing, string operations, data analysis, pandas, unique values, apply, DataFrame

3 Answers

I recently worked on a similar task, counting space-separated strings. You could handle your data like this:

import pandas as pd
data = [['one two'],['two seven six'],['three one five'],['seven five five eight'],['six four'],['three']]
numbers = pd.DataFrame(data)

# Collect the unique words across all rows
uniq_groups = set(x for l in numbers[0].str.split(' ') for x in l)
# {'eight', 'five', 'four', 'one', 'seven', 'six', 'three', 'two'}

# Add one column per word holding its count in each row
for gr in uniq_groups:
    numbers[gr] = numbers[0].map(lambda x: x.split(' ').count(gr))

# Append a 'Total' row with the sum of each count column
numbers.loc['Total'] = numbers.sum(axis=0, numeric_only=True)
# Display floats without decimals
pd.options.display.float_format = '{:,.0f}'.format

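The result is a DataFrame with one count column per word and a final 'Total' row holding the sum of each column.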

Answered 2025-04-18 by Python大师


This code builds a dict that records how many times each of your words occurs.

x = ['one two', 'two seven six', 'three one five', 'seven five five eight', 'six four', 'three']

# Flatten all rows into a single list of words
x_list = [j for i in x for j in i.split()]
print(x_list)

# ['one', 'two', 'two', 'seven', 'six', 'three', 'one', 'five', 'seven', 'five', 'five', 'eight', 'six', 'four', 'three']

d = {}

# Initialize every key to zero
for e in set(x_list):
    d[e] = 0

# Store counts in the dict
for e in x_list:
    d[e] += 1

print(d)

The final result is a dict of words and their counts (the key order may differ on your machine):

{'seven': 2, 'six': 2, 'three': 2, 'two': 2, 'four': 1, 'five': 3, 'eight': 1, 'one': 2}
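The manual counting loop above can probably be shortened with the standard library's collections.Counter. This is only a sketch of that alternative, not part of the original answer; the names rows, counts, and unique_words are chosen here for illustration.

```python
from collections import Counter

# Same sample rows as in the answer above (variable name is illustrative).
rows = ['one two', 'two seven six', 'three one five',
        'seven five five eight', 'six four', 'three']

# Flatten every row into one stream of words and count them.
counts = Counter(word for row in rows for word in row.split())

# A Counter keeps keys in first-seen order (Python 3.7+),
# so listing it gives the unique words.
unique_words = list(counts)

print(dict(counts))
print(unique_words)
```

Since Counter is a dict subclass, counts can be used wherever the dict d above would be, and counts.most_common() returns the words sorted by frequency.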

Answered 2025-04-18 by Python大师


If I understand correctly, I think you can do this with pandas as follows. I'll start from the Series as it is before you split the strings:

print(s)

0                  one two
1            two seven six
2           three one five
3    seven five five eight
4                 six four
5                    three

stacked = pd.DataFrame(s.str.split().tolist()).stack()
print(stacked)

0  0      one
   1      two
1  0      two
   1    seven
   2      six
2  0    three
   1      one
   2     five
3  0    seven
   1     five
   2     five
   3    eight
4  0      six
   1     four
5  0    three

Now you just need to count how many times each value occurs in this Series:

print(stacked.value_counts())

five     3
one      2
three    2
six      2
two      2
seven    2
eight    1
four     1
dtype: int64
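On newer pandas versions the intermediate DataFrame and stack() can likely be skipped. The sketch below assumes pandas 0.25 or later, where Series.explode() is available; it is an alternative to the code above rather than the approach this answer originally used, and the names words, unique_words, and counts are illustrative.

```python
import pandas as pd

# Same sample data as the Series s shown above.
s = pd.Series(['one two', 'two seven six', 'three one five',
               'seven five five eight', 'six four', 'three'])

# Split each row into a list, then give every word its own row.
# Assumes pandas >= 0.25 for Series.explode().
words = s.str.split().explode()

# Unique words in order of first appearance.
unique_words = words.unique().tolist()

# Occurrences of each word, most frequent first.
counts = words.value_counts()

print(unique_words)
print(counts.to_dict())
```

Here unique_words covers the list the question asks for, and counts.to_dict() gives the dict form of the counts.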

Answered 2025-04-18 by Python大师

