Using merge and groupby to bring a DF into a new scheme

Published 2024-04-18 11:15:41


I have a DF with a lot of entries; an excerpt is shown below.

DF_OLD =
...
sID   tID   NER      token           Prediction
274   79    U-Peop   khrushchev      Live_In-ARG2+B
274   79    O        's              Live_IN-ARG2+L
807   53    U-Loc    louisiana       Live_IN-ARG2+U
807   56    B-Peop   earl            Live_IN-ARG1+B
807   57    L-Peop   long            Live_IN-ARG1+L
807   13    B-Peop   dwight          Live_IN-ARG1+B
807   13    I-Peop   d.              Live_IN-ARG1+I
807   13    L-Peop   eisenhower      Live_IN-ARG1+L
...

sID separates the individual sentences. The Prediction column holds the output of a machine-learning classifier, and those predictions can be fairly nonsensical. My goal is to group all predicted labels into one scheme, for example:

DF_Expected =
...
sID   entity1              tID1    entity2           tID2   Relation
274   NaN                  NaN     khrushchev 's     79     Live_In 
807   earl long            56 57   louisiana         53     Live_In
807   dwight d. eisenhower 13      louisiana         53     Live_In
...

The "-ARGX" part tells you which entity column of the table an entity belongs in, while the part before the first "-" gives the relation. If one of the argument parts is missing, the corresponding cells should stay empty.
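Splitting such a label apart is mechanical; a minimal sketch with str.extract (the pattern and the column names Relation/Arg/Pos are my own, not from the data):

```python
import pandas as pd

preds = pd.Series(["Live_In-ARG2+B", "Live_IN-ARG1+L"])

# relation before the first "-", argument slot between "-" and "+",
# BILOU position after the "+"
parts = preds.str.extract(
    r"^(?P<Relation>[^-]+)-(?P<Arg>ARG\d)\+(?P<Pos>[BILOU])$")
```

Each named group becomes a column of the resulting frame, so the pieces can be used directly as groupby keys.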

Here is what I tried:

DF["Live_In_Predict_Split"] = DF["Prediction"].str.split("+").str[0]
DF["token2"] = DF["token"]
DF["tID2"] = DF["tID"]
DF["Live_In_Predict_Split2"] = DF["Live_In_Predict_Split"]
data_tokeni_map = DF.groupby(["Live_In_Predict_Split", "sID"], as_index=True, sort=False).agg(" ".join).reset_index()
s = data_tokeni_map.loc[:, ["sID", "token2", "tID2", "Live_In_Predict_Split2"]].merge(
        data_tokeni_map.loc[:, ["sID", "token", "tID", "Live_In_Predict_Split"]], on="sID")
s = s.loc[s.token2 != s.token].drop_duplicates()

What I am missing is some kind of counter to tell the different "-ARGX" groups apart, plus a suitable GroupBy key (grouping by tID is no good, because it produces wrong results). As it stands, my new DF comes out wrong:

DF_EDITED =
...
sID   entity1                         tID1      entity2                        tID2   ...
807   dwight d eisenhower earl long   13 56 57  louisiana                      53
807   louisiana                       13 56 57  dwight d eisenhower earl long  53

Edit:

My code has changed a bit. All useless predictions are now dropped, but all similar predictions are still lumped together. I need some kind of preprocessing step that brings the data into the form below, i.e. I need to count all predictions per sID and number them:

DF_OLD_Edit =
...
sID   tID   NER      token           Prediction
274   79    U-Peop   khrushchev      Live_IN-ARG2+B_1
274   79    O        's              Live_IN-ARG2+L_1
807   53    U-Loc    louisiana       Live_IN-ARG2+U_1
807   56    B-Peop   earl            Live_IN-ARG1+B_1
807   57    L-Peop   long            Live_IN-ARG1+L_1
807   13    B-Peop   dwight          Live_IN-ARG1+B_2
807   13    I-Peop   d.              Live_IN-ARG1+I_2
807   13    L-Peop   eisenhower      Live_IN-ARG1+L_2
...
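A counter of that shape does not have to be written by hand: B and U open a new entity span, so a cumulative sum of span openers per sID and argument slot yields exactly the _1/_2 suffixes above. A sketch (standard pandas calls, but the intermediate names are my own):

```python
import pandas as pd

df = pd.DataFrame({
    "sID": [274, 274, 807, 807, 807, 807, 807, 807],
    "tID": [79, 79, 53, 56, 57, 13, 13, 13],
    "token": ["khrushchev", "'s", "louisiana", "earl", "long",
              "dwight", "d.", "eisenhower"],
    "Prediction": ["Live_IN-ARG2+B", "Live_IN-ARG2+L", "Live_IN-ARG2+U",
                   "Live_IN-ARG1+B", "Live_IN-ARG1+L", "Live_IN-ARG1+B",
                   "Live_IN-ARG1+I", "Live_IN-ARG1+L"],
})

pos = df["Prediction"].str[-1]                        # B / I / L / U
arg = df["Prediction"].str.extract(r"-(ARG\d)\+")[0]  # ARG1 / ARG2
# every B or U starts a new entity, so counting them per (sID, arg)
# numbers the entity spans
counter = pos.isin(["B", "U"]).astype(int).groupby([df["sID"], arg]).cumsum()
df["Prediction"] = df["Prediction"] + "_" + counter.astype(str)
```

After this, Prediction carries the suffixed labels shown in DF_OLD_Edit.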

Tags: in, token, live, df, long, prediction, arg1, arg2
2 answers

You have to mix plain functions and DF operations. The approach is not efficient at all, but it works.

import re

def combine(some_list):
    current_group = 0
    g_size = 0
    for elem in some_list:
        g_size += 1
        if elem.endswith('U'):
            if g_size > 1:
                g_size = 1
                current_group += 1
        yield '{}{}'.format(current_group, elem)
        if elem.endswith(('L', 'U')):
            g_size = 0
            current_group += 1

def splitter(pred_group):
    # pull the leading group id (up to three digits) off a Pred_Group value
    return re.findall(r'^\d{1,3}', pred_group)

# Not very efficient
DF["entity2"] = DF["entity"]
DF["tID2"] = DF["tID"]
DF["Prediction2"] = DF["Prediction"]
DF["Pred_Group"] = list(combine(DF["Prediction"].tolist()))
DF["Jojo"] = DF["Pred_Group"].apply(splitter).apply(lambda x: " ".join(x))
dmap = DF.groupby(["Jojo", "sID"], as_index=True, sort=False).agg(" ".join).reset_index()
s = dmap.loc[:, ["sID", "entity2", "tID2", "Prediction2"]].merge(
        dmap.loc[:, ["sID", "entity", "tID", "Prediction"]], on="sID")
s = s.loc[s.entity2 != s.entity].drop_duplicates()
s = s[s["Prediction"].str.contains(r"-ARG2\+")]    # escape the literal "+"
DF = s[s["Prediction2"].str.contains(r"-ARG1\+")]
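For a quick sanity check of the grouping generator, running a copy of it on just the BILOU suffixes of the eight sample rows assigns one running id per entity span:

```python
def combine(some_list):
    # same generator as above: prefix each tag with a running group id
    current_group = 0
    g_size = 0
    for elem in some_list:
        g_size += 1
        if elem.endswith('U'):
            if g_size > 1:
                g_size = 1
                current_group += 1
        yield '{}{}'.format(current_group, elem)
        if elem.endswith(('L', 'U')):
            g_size = 0
            current_group += 1

tags = ['B', 'L', 'U', 'B', 'L', 'B', 'I', 'L']
groups = list(combine(tags))
# -> ['0B', '0L', '1U', '2B', '2L', '3B', '3I', '3L']
```

L and U close a span, so the four sample entities end up in groups 0 to 3.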

Data:

df

   sID  tID     NER       token        Prediction
0  274   79  U-Peop  khrushchev  Live_IN-ARG2+B_1
1  274   79       O          's  Live_IN-ARG2+L_1
2  807   53   U-Loc   louisiana  Live_IN-ARG2+U_1
3  807   56  B-Peop        earl  Live_IN-ARG1+B_1
4  807   57  L-Peop        long  Live_IN-ARG1+L_1
5  807   13  B-Peop      dwight  Live_IN-ARG1+B_2
6  807   13  I-Peop          d.  Live_IN-ARG1+I_2
7  807   13  L-Peop  eisenhower  Live_IN-ARG1+L_2

Code:

import numpy as np
import pandas as pd
import typing

# setting up some columns for groupby
df['arg'] = df.Prediction.apply(lambda x: x.split("_")[1].split("-")[1].split("+")[0])
df['Relation'] = df.Prediction.apply(lambda x: x.split("-")[0])
df['ingroup_id'] = df.Prediction.apply(lambda x: x.split("_")[-1])

# groupby and collect relevant tID and token
df1 = df.groupby(['sID', 'arg', 'ingroup_id']).tID.apply(list)
df2 = df.groupby(['sID', 'arg', 'ingroup_id']).token.apply(list)
df3 = pd.concat([df1, df2], axis=1).reset_index()
df3.tID = df3.tID.apply(lambda x: sorted(set(x)))  # dedupe; sort for a deterministic order

# setting up columns that we finally use
df3.loc[df3.arg == 'ARG1', 'tID1'] = df3.tID
df3.loc[df3.arg == 'ARG2', 'tID2'] = df3.tID
df3.loc[df3.arg == 'ARG1', 'entity1'] = df3.token
df3.loc[df3.arg == 'ARG2', 'entity2'] = df3.token

# sort values and then ffill/bfill within the group
df3 = df3.sort_values(['sID', 'arg']).reset_index(drop=True)
df3.tID1 = df3.groupby(['sID']).tID1.ffill()
df3.entity1 = df3.groupby(['sID']).entity1.ffill()
df3.tID2 = df3.groupby(['sID']).tID2.bfill()
df3.entity2 = df3.groupby(['sID']).entity2.bfill()
df3 = df3[['sID', 'entity1', 'tID1', 'entity2', 'tID2']].set_index('sID')

# converting lists in cells into strings (may be someone can make this as a one liner)
df3.entity1 = df3.entity1.apply(lambda x: ' '.join(x) if isinstance(x, typing.List) else np.nan)
df3.entity2 = df3.entity2.apply(lambda x: ' '.join(x) if isinstance(x, typing.List) else np.nan)
df3.tID1 = df3.tID1.apply(lambda x: ' '.join(str(y) for y in x) if isinstance(x, typing.List) else np.nan)
df3.tID2 = df3.tID2.apply(lambda x: ' '.join(str(y) for y in x) if isinstance(x, typing.List) else np.nan)
df3 = df3.drop_duplicates().reset_index()

df3 = df3.merge(df[['sID', 'Relation']].drop_duplicates(), on='sID', how='left')

Output:

   sID               entity1   tID1        entity2 tID2 Relation
0  274                   NaN    NaN  khrushchev 's   79  Live_IN
1  807             earl long  56 57      louisiana   53  Live_IN
2  807  dwight d. eisenhower     13      louisiana   53  Live_IN
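As an aside on the "one liner" wished for in the comments: the four list-to-string conversions can be done in a single pass over the affected columns. A sketch with toy values (DataFrame.apply plus Series.map; not taken from the answer):

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    "entity1": [np.nan, ["earl", "long"]],
    "tID1": [np.nan, [56, 57]],
})

def to_str(x):
    # join list cells with spaces, leave missing cells as NaN
    return " ".join(str(y) for y in x) if isinstance(x, list) else np.nan

cols = ["entity1", "tID1"]
df3[cols] = df3[cols].apply(lambda col: col.map(to_str))
```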

For lack of skill the code is long, but what it basically does is groupby and merge, as you suggested in the title. Hope this helps.
