Getting a list of unique strings after split() in pandas
I'm just getting started with pandas. I have a larger table (DataFrame) that contains a column like this:
0 one two
1 two seven six
2 three one five
3 seven five five eight
4 six four
5 three
dtype: object
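I want to split these word sequences into their parts and then get either the set of unique words or a count for each word. The splitting step works fine: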
numbers.str.split(' ')
0 [one, two]
1 [two, seven, six]
2 [three, one, five]
3 [seven, five, five, eight]
4 [six, four]
5 [three]
dtype: object
But I'm not sure where to go from there. To restate, the output I'm after is:
['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight']
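or the counts as a dictionary, or either of these results as a Series/DataFrame.
So far, all I've managed is to combine apply() with a set to get the unique words. From what I've seen, pandas is a very elegant tool, and I suspect this is not hard for someone more familiar with it than I am.
Thanks in advance!
3 Answers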
0
I recently worked on a similar task: counting strings separated by spaces. You could process your data like this:
import pandas as pd
data = [['one two'],['two seven six'],['three one five'],['seven five five eight'],['six four'],['three']]
numbers = pd.DataFrame(data)
uniq_groups = set(x for l in numbers[0].str.split(' ') for x in l)
#{'eight', 'five', 'four', 'one', 'seven', 'six', 'three', 'two'}
#add a dataframe column for count of each value
for gr in uniq_groups:
    numbers[gr] = numbers[0].map(lambda x: len([i for i in x.split(' ') if i == gr]))
#sum all columns
numbers.loc['Total'] = numbers.sum(axis=0,numeric_only=True)
#pandas display format without decimals
pd.options.display.float_format = '{:,.0f}'.format
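The result is the original column plus one count column per unique word, with a final 'Total' row holding the column sums.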
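For comparison, the same per-row counts can probably be built without an explicit loop. The following is only a sketch, assuming pandas 0.25 or later (for Series.explode); the names words, table and totals are illustrative and not part of the answer above.

```python
import pandas as pd

numbers = pd.DataFrame({0: ['one two', 'two seven six', 'three one five',
                            'seven five five eight', 'six four', 'three']})

# One row per word; the original row label is repeated for each word.
words = numbers[0].str.split().explode()

# Rows are the original row labels, columns are the unique words,
# and each cell counts how often that word occurs in that row.
# Plain arrays are passed so crosstab does not try to align indexes.
table = pd.crosstab(words.index.to_numpy(), words.to_numpy())

# Column sums give the overall count of each word.
totals = table.sum(axis=0)

print(table)
print(totals)
```

The loop in the answer above makes each step visible; the cross-tabulation mainly saves typing once the set of words grows.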
1
This code builds a dictionary that records how many times each word occurs.
x = ['one two', 'two seven six', 'three one five', 'seven five five eight', 'six four', 'three']
#create list comprehension of all elements
x_list = [j for i in x for j in i.split()]
print(x_list)
# ['one', 'two', 'two', 'seven', 'six', 'three', 'one', 'five', 'seven', 'five', 'five', 'eight', 'six', 'four', 'three']
d = {}
#initialize keys
for e in set(x_list):
    d[e] = 0
#store counts in dict
for e in x_list:
    d[e] += 1
print(d)
The final result is a dictionary mapping each word to its count:
{'seven': 2, 'six': 2, 'three': 2, 'two': 2, 'four': 1, 'five': 3, 'eight': 1, 'one': 2}
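The two loops above can likely be collapsed with collections.Counter from the standard library. This is a minimal sketch that reuses the same input list x; the name counts is illustrative.

```python
from collections import Counter

x = ['one two', 'two seven six', 'three one five',
     'seven five five eight', 'six four', 'three']

# Count every whitespace-separated word across all strings in one pass.
counts = Counter(word for line in x for word in line.split())

print(dict(counts))            # plain dict of word -> count
print(counts.most_common(1))   # [('five', 3)]
```

A Counter is a dict subclass, so it can be used wherever the hand-built dictionary d would be.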
8
If I understand correctly, I think you can do this in pandas as follows. I'll start from your Series as it was before you split the strings:
print(s)
0 one two
1 two seven six
2 three one five
3 seven five five eight
4 six four
5 three
stacked = pd.DataFrame(s.str.split().tolist()).stack()
print(stacked)
0 0 one
1 two
1 0 two
1 seven
2 six
2 0 three
1 one
2 five
3 0 seven
1 five
2 five
3 eight
4 0 six
1 four
5 0 three
Now just count how many times each value occurs in this Series:
print(stacked.value_counts())
five 3
one 2
three 2
six 2
two 2
seven 2
eight 1
four 1
dtype: int64
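On newer pandas versions the intermediate DataFrame and stack() step can probably be skipped. The sketch below assumes pandas 0.25 or later, where Series.explode is available, and shows both outputs the question asks for; words, unique_words and counts are illustrative names.

```python
import pandas as pd

s = pd.Series(['one two', 'two seven six', 'three one five',
               'seven five five eight', 'six four', 'three'])

# One row per word; the original index label is repeated for each word.
words = s.str.split().explode()

# Unique words, in order of first appearance.
unique_words = words.unique().tolist()

# Word -> number of occurrences, as a plain dictionary.
counts = words.value_counts().to_dict()

print(unique_words)
print(counts)
```

Note that unique() keeps the order in which words first appear, which differs from the numeric order shown in the question.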