groupby on a specified column of a DataFrame

Published 2024-05-15 14:05:49


I have the following DataFrame.

The columns are lable, body_text, sentTokenized, lowerCased, stopwordsRemoved, tokenized, lemmatized, bigrams and bigrams_flattern. Below is the bigrams_flattern column.

[(ive, searching), (searching, right), (right, word), (word, thank), (thank, breather), (i, promise), (promise, wont), (wont, take), (take, help), (help, granted), (granted, fulfil), (fulfil, promise), (you, wonderful), (wonderful, blessing), (blessing, time)]                                                              

[(free, entry), (entry, 2), (2, wkly), (wkly, comp), (comp, win), (win, fa), (fa, cup), (cup, final), (final, tkts), (tkts, 21st), (21st, may), (may, 2005), (text, fa), (fa, 87121), (87121, receive), (receive, entry), (entry, questionstd), (questionstd, txt), (txt, ratetcs), (ratetcs, apply), (apply, 08452810075over18s)]

[(nah, dont), (dont, think), (think, go), (go, usf), (usf, life), (life, around), (around, though)]                                                                                                                                                                                                                               

[(even, brother), (brother, like), (like, speak), (speak, me), (they, treat), (treat, like), (like, aid), (aid, patent)]                                                                                                                                                                                                          

[(i, date), (date, sunday), (sunday, will)] 

I want to group the rows by the value of the 'lable' column. The values are either 'spam' or 'ham'.

The output should be

     lable    corpuses
1    ham     [all the ham bigrams]
2    spam    [all the spam bigrams]

I referred to "pandas groupby and join lists", "Specifying column order following groupby aggregation" and http://pandas.pydata.org/pandas-docs/stable/groupby.html, and then tried this.

 fullCorpus['corpuses'] = fullCorpus.groupby('lable')

I get the error ValueError('Length of values does not match length of index').

Where am I going wrong? Do I have to apply some function after the groupby?

fullCorpus.head(5).to_dict()

{'lable': {0: 'ham', 1: 'spam', 2: 'ham', 3: 'ham', 4: 'ham'}, 'body_text': {0: "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", 1: "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 2: "Nah I don't think he goes to usf, he lives around here though", 3: 'Even my brother is not like to speak with me. They treat me like aids patent.', 4: 'I HAVE A DATE ON SUNDAY WITH WILL!!'}, 'sentTokenized': {0: ['Ive been searching for the right words to thank you for this breather', 'I promise i wont take your help for granted and will fulfil my promise', 'You have been wonderful and a blessing at all times'], 1: ['Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005', 'Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s'], 2: ['Nah I dont think he goes to usf he lives around here though'], 3: ['Even my brother is not like to speak with me', 'They treat me like aids patent'], 4: ['I HAVE A DATE ON SUNDAY WITH WILL', '']}, 'lowerCased': {0: ['ive been searching for the right words to thank you for this breather', 'i promise i wont take your help for granted and will fulfil my promise', 'you have been wonderful and a blessing at all times'], 1: ['free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005', 'text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s'], 2: ['nah i dont think he goes to usf he lives around here though'], 3: ['even my brother is not like to speak with me', 'they treat me like aids patent'], 4: ['i have a date on sunday with will', '']}, 'stopwordsRemoved': {0: ['ive searching right words thank breather', 'i promise wont take help granted fulfil promise', 'you wonderful blessing times'], 1: ['free entry 2 wkly comp win fa cup final tkts 21st may 2005', 'text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s'], 2: ['nah dont think goes usf lives around though'], 3: ['even brother like speak me', 'they treat like aids patent'], 4: ['i date sunday will', '']}, 'tokenized': {0: [['ive', 'searching', 'right', 'words', 'thank', 'breather'], ['i', 'promise', 'wont', 'take', 'help', 'granted', 'fulfil', 'promise'], ['you', 'wonderful', 'blessing', 'times']], 1: [['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005'], ['text', 'fa', '87121', 'receive', 'entry', 'questionstd', 'txt', 'ratetcs', 'apply', '08452810075over18s']], 2: [['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though']], 3: [['even', 'brother', 'like', 'speak', 'me'], ['they', 'treat', 'like', 'aids', 'patent']], 4: [['i', 'date', 'sunday', 'will'], []]}, 'lemmatized': {0: [['ive', 'searching', 'right', 'word', 'thank', 'breather'], ['i', 'promise', 'wont', 'take', 'help', 'granted', 'fulfil', 'promise'], ['you', 'wonderful', 'blessing', 'time']], 1: [['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005'], ['text', 'fa', '87121', 'receive', 'entry', 'questionstd', 'txt', 'ratetcs', 'apply', '08452810075over18s']], 2: [['nah', 'dont', 'think', 'go', 'usf', 'life', 'around', 'though']], 3: [['even', 'brother', 'like', 'speak', 'me'], ['they', 'treat', 'like', 'aid', 'patent']], 4: [['i', 'date', 'sunday', 'will'], []]}, 'bigrams': {0: [[('ive', 'searching'), ('searching', 
'right'), ('right', 'word'), ('word', 'thank'), ('thank', 'breather')], [('i', 'promise'), ('promise', 'wont'), ('wont', 'take'), ('take', 'help'), ('help', 'granted'), ('granted', 'fulfil'), ('fulfil', 'promise')], [('you', 'wonderful'), ('wonderful', 'blessing'), ('blessing', 'time')]], 1: [[('free', 'entry'), ('entry', '2'), ('2', 'wkly'), ('wkly', 'comp'), ('comp', 'win'), ('win', 'fa'), ('fa', 'cup'), ('cup', 'final'), ('final', 'tkts'), ('tkts', '21st'), ('21st', 'may'), ('may', '2005')], [('text', 'fa'), ('fa', '87121'), ('87121', 'receive'), ('receive', 'entry'), ('entry', 'questionstd'), ('questionstd', 'txt'), ('txt', 'ratetcs'), ('ratetcs', 'apply'), ('apply', '08452810075over18s')]], 2: [[('nah', 'dont'), ('dont', 'think'), ('think', 'go'), ('go', 'usf'), ('usf', 'life'), ('life', 'around'), ('around', 'though')]], 3: [[('even', 'brother'), ('brother', 'like'), ('like', 'speak'), ('speak', 'me')], [('they', 'treat'), ('treat', 'like'), ('like', 'aid'), ('aid', 'patent')]], 4: [[('i', 'date'), ('date', 'sunday'), ('sunday', 'will')], []]}, 'bigrams_flattern': {0: [('ive', 'searching'), ('searching', 'right'), ('right', 'word'), ('word', 'thank'), ('thank', 'breather'), ('i', 'promise'), ('promise', 'wont'), ('wont', 'take'), ('take', 'help'), ('help', 'granted'), ('granted', 'fulfil'), ('fulfil', 'promise'), ('you', 'wonderful'), ('wonderful', 'blessing'), ('blessing', 'time')], 1: [('free', 'entry'), ('entry', '2'), ('2', 'wkly'), ('wkly', 'comp'), ('comp', 'win'), ('win', 'fa'), ('fa', 'cup'), ('cup', 'final'), ('final', 'tkts'), ('tkts', '21st'), ('21st', 'may'), ('may', '2005'), ('text', 'fa'), ('fa', '87121'), ('87121', 'receive'), ('receive', 'entry'), ('entry', 'questionstd'), ('questionstd', 'txt'), ('txt', 'ratetcs'), ('ratetcs', 'apply'), ('apply', '08452810075over18s')], 2: [('nah', 'dont'), ('dont', 'think'), ('think', 'go'), ('go', 'usf'), ('usf', 'life'), ('life', 'around'), ('around', 'though')], 3: [('even', 'brother'), ('brother', 'like'), ('like', 'speak'), ('speak', 'me'), ('they', 'treat'), ('treat', 'like'), ('like', 'aid'), ('aid', 'patent')], 4: [('i', 'date'), ('date', 'sunday'), ('sunday', 'will')]}}

2 Answers

IIUC, you want to aggregate your bigrams by label. Using the dictionary you provided, you can do this with .sum() (or .agg(sum)):

df = pd.DataFrame(provided_dict)   # provided_dict = the head(5).to_dict() output shown above
df.groupby('lable').bigrams.sum()  # or .agg(sum)

which yields

lable
ham     [[(ive, searching), (searching, right), (right...
spam    [[(free, entry), (entry, 2), (2, wkly), (wkly,...
Name: bigrams, dtype: object

You can then assign it to a new column to store it in df:

df['corpuses'] = df.groupby('lable').bigrams.sum() 
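
Note that the grouped sum is indexed by the label values ('ham'/'spam') rather than by the original row index, so a direct assignment like the one above will generally not align with the rows of df. A minimal sketch of one way to broadcast the per-label corpus back onto each row, reusing df and the column names from the question and not part of the original answer, is to look each row's label up in the grouped Series:

# per-label corpora, indexed by 'ham'/'spam'
corpora = df.groupby('lable')['bigrams_flattern'].sum()

# broadcast back onto the original rows by looking up each row's label
df['corpuses'] = df['lable'].map(corpora)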

After a lot of searching, this gives me what I needed:

fullCorpusAgg = fullCorpus.groupby('lable').agg({'bigrams_flattern': 'sum'})
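
To match exactly the shape shown in the question (a 'lable' column next to a 'corpuses' column), one further step that should work, sketched here as an assumption rather than part of the accepted answer, is to reset the index and rename the aggregated column:

fullCorpusAgg = (
    fullCorpus.groupby('lable')
              .agg({'bigrams_flattern': 'sum'})     # concatenate each group's bigram lists
              .reset_index()                        # turn the 'lable' index back into a column
              .rename(columns={'bigrams_flattern': 'corpuses'})
)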
