Counting occurrences of a string and matching them to another column

Posted 2024-05-20 11:11:54


import pandas as pd

data = {'msg': ['i am so happy thank you',
                'sticker omitted',
                'sticker omitted',
                'thank you for your time!',
                'sticker omitted',
                'hello there'],
        # This column 'number_of_stickers' is what I am aiming to achieve.
        # Currently, I don't have this column.
        'number_of_stickers': ['2', '0', '0', '1', '0', '0']}

df = pd.DataFrame(data=data)

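The above is what I am trying to achieve. I don't currently have the 'number_of_stickers' column; producing it is my end goal.

I am trying to count the rows that contain 'sticker omitted' and write that count on the row directly above each run of 'sticker omitted' rows. The counts should go in a new column, 'number_of_stickers'.

For some background: I am analyzing WhatsApp text data, and I think it would be useful to see how many stickers are sent after each chat message. It also says something about the tone and sentiment of the conversation.

Update:

I have posted a solution (credit to @JacoSolari) that solves my problem. I added 1-2 lines (an if statement) to his code so that it does not run out of range at the end of the DataFrame.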


3 answers

Checking for the other value and using cumsum to identify blocks is a common technique:

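import numpy as np

# block id: increases by 1 at every row that is not 'sticker omitted'
omitted = df.msg.ne('sticker omitted').cumsum()

# the first row of each block gets (block size - 1); every other row gets 0
df['number_of_stickers'] = np.where(omitted.duplicated(), 0,
                                    omitted.groupby(omitted).transform('size')-1)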
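
To make the cumsum trick easier to follow, here is a minimal sketch that spells out each intermediate step on the sample data from the question. The names `is_message`, `block_id` and `block_size` are only illustrative and are not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'msg': ['i am so happy thank you',
                           'sticker omitted',
                           'sticker omitted',
                           'thank you for your time!',
                           'sticker omitted',
                           'hello there']})

# True on ordinary messages, False on sticker rows
is_message = df['msg'].ne('sticker omitted')

# Running total of messages: each sticker row inherits the id of the
# message that precedes it, so a message and its stickers share one id
block_id = is_message.cumsum()

# Number of rows in each block (the message plus the stickers after it)
block_size = block_id.groupby(block_id).transform('size')

# Only the first row of a block (the message itself) keeps the count
df['number_of_stickers'] = np.where(block_id.duplicated(), 0, block_size - 1)

print(df.assign(block_id=block_id))
```

One caveat with this approach: if the very first row of the chat is itself a sticker, it opens block 0 and is treated like a message, so it receives a non-zero count.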

You have most of it worked out already, and your data is well suited to a simple but powerful algorithm.

Here is a piece of code I wrote for this problem:

import pandas as pd

#ss
df = {'msg': ['i am so happy thank you',
              'sticker omitted',
              'sticker omitted',
              'thank you for your time!',
              'sticker omitted'],
      'number_of_stickers': ['2', '0', '0', '1', '0']}
newarr = []  # collects [msg, count] pairs for rows with a non-zero count
for msg, n in zip(df["msg"], df["number_of_stickers"]):
    if int(n) != 0:
        # for each stored pair, element 0 is the msg and element 1 is the count
        newarr.append([msg, int(n)])
#se
# feel free to do whatever you want after ss to se

pd.DataFrame(data=df)

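se marks the end of the code snippet and ss marks the start.

Hope this helps! If not, please leave a comment below.

Also, you will have to feed the new array back into the dict.

This code should do the job. I couldn't find a solution that uses only pandas functions (though one probably exists). In any case, I left some comments in the code to describe my approach.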

import pandas as pd

# create data
df_dict = {'msg': ['i am so happy thank you',
                   'sticker omitted',
                   'sticker omitted',
                   'thank you for your time!',
                   'sticker omitted']}

df = pd.DataFrame(data=df_dict)

# build column for sticker counts after message
sticker_counts = []
for index, row in df.iterrows():  # iterating over df rows (assumes the default 0..n-1 index)
    count = 0
    # when a sticker row is encountered, just put 0 in the count column
    # when a non-sticker row is encountered do the following
    if row['msg'] != 'sticker omitted':
        k = 1  # offset used to check the rows after the non-sticker row
        # count consecutive sticker rows; the bounds check stops the loop
        # at the end of the DataFrame, even when the last row is a message
        while index + k < len(df) and df.loc[index + k, 'msg'] == 'sticker omitted':
            count += 1
            k += 1
    sticker_counts.append(count)
df['sticker_counts'] = sticker_counts
print(df)
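
For comparison, the same counts can be produced in a single backward pass over the messages, which avoids the inner loop. This is only a sketch: the function name `count_following_stickers` and its `marker` parameter are hypothetical, and it works on a plain list rather than on the DataFrame itself.

```python
def count_following_stickers(messages, marker='sticker omitted'):
    """For each message, count the run of marker rows directly after it."""
    counts = [0] * len(messages)
    run = 0  # stickers seen since the last ordinary message, scanning backwards
    for i in range(len(messages) - 1, -1, -1):
        if messages[i] == marker:
            run += 1          # sticker rows themselves keep a count of 0
        else:
            counts[i] = run   # the message receives the stickers that follow it
            run = 0
    return counts


sample = ['i am so happy thank you',
          'sticker omitted',
          'sticker omitted',
          'thank you for your time!',
          'sticker omitted',
          'hello there']
print(count_following_stickers(sample))
```

With a DataFrame this could be applied as `df['sticker_counts'] = count_following_stickers(df['msg'].tolist())`. Stickers at the very start of the chat, before any message, are simply left at 0.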
