Pandas: is searching really that hard?

Posted 2024-04-25 13:42:49


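Here I want to search the reference column for the values of the paper_title column. If the whole text is matched/found, get the _id of the reference row where it matched (not the _id of the paper_title row) and save that _id in the paper_title_in column.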

In[1]:

d ={
  "_id":
    [
      "Y100",
      "Y100",
      "Y100",
      "Y101",
      "Y101",
      "Y101",
      "Y102",
      "Y102",
      "Y102"
    ]
  ,
  "paper_title":
    [
      "translation using information on dialogue participants",
      "translation using information on dialogue participants",
      "translation using information on dialogue participants",
      "#emotional tweets",
      "#emotional tweets",
      "#emotional tweets",
      "#supportthecause: identifying motivations to participate in online health campaigns",
      "#supportthecause: identifying motivations to participate in online health campaigns",
      "#supportthecause: identifying motivations to participate in online health campaigns"
    ]
  ,
  "reference":
    [
      "beattie, gs (2005, november) #supportthecause: identifying motivations to participate in online health campaigns may 31, 2017, from",
      "burton, n (2012, june 5) depressive realism retrieved may 31, 2017, from",
      "gotlib, i h, 27 hammen, c l (1992) #supportthecause: identifying motivations to participate in online health campaigns new york: wiley",
      "paul ekman 1992 an argument for basic emotions cognition and emotion, 6(3):169200",
      "saif m mohammad 2012a #tagspace: semantic embeddings from hashtags in mail and books to appear in decision support systems",
      "robert plutchik 1985 on emotion: the chickenand-egg problem revisited motivation and emotion, 9(2):197200",
      "alastair iain johnston, rawi abdelal, yoshiko herrera, and rose mcdermott, editors 2009 translation using information on dialogue participants cambridge university press",
      "j richard landis and gary g koch 1977 the measurement of observer agreement for categorical data biometrics, 33(1):159174",
      "tomas mikolov, kai chen, greg corrado, and jeffrey dean 2013  #emotional tweets arxiv:13013781"
    ]

}

import pandas as pd
df=pd.DataFrame(d)

df

Output:

[image: dataframe]

Expected result:

[image: Expected Result]

And finally, the final result dataframe with unique values:

Note that here the paper_title_in column shows, as a list, all the _ids whose reference column contains the title.

[image: Final_dataframe]

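I tried the approach below, but in paper_presented_in it returns the _id of the paper_title row being searched, not the _id of the reference row where it matched. The expected-result dataframe gives a clearer idea; please take a look there.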

def return_id(paper_title,reference, _id):
    if (paper_title is None) or (reference is None):
        return None
    if paper_title in reference:
        return _id
    else:
        return None

df['paper_present_in'] = df.apply(lambda row: return_id(row['paper_title'], row['reference'], row['_id']), axis=1)

1 answer
User
#1 · Posted 2024-04-25 13:42:49

To solve your problem, you need two dictionaries and one list to temporarily store some values.

# A list to store unique paper titles
unique_paper_title = []


# A dict to store mapping of unique paper to unique ids
mapping_dict_paper_to_id = dict()

# A dict to store mapping unique idx to the ids
mapping_id_to_idx = dict()


# This gives us the unique paper title's list
unique_paper_title = df["paper_title"].unique()



# Storing values in the dict mapping_dict_paper_to_id

for value in unique_paper_title:
    mapping_dict_paper_to_id[value] = df["_id"][df["paper_title"]==value].unique()[0]



# Storing values in the dict mapping_id_to_idx

for value in unique_paper_title:

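    # This gives us the indexes of the rows whose reference contains the string, i.e. the paper_title.
    # regex=False matches the title literally instead of treating it as a regular expression.
    idx_list = df[df['reference'].str.contains(value, regex=False)].index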

    # Storing values in the dictionary
    for idx in idx_list:
        mapping_id_to_idx[idx] = mapping_dict_paper_to_id[value]


# This loop checks whether each index has a matched reference id and updates the paper_present_in field accordingly.
# df.loc[row, column] avoids chained assignment and creates the column if it does not exist yet.

for i in df.index:
    if i in mapping_id_to_idx:
        df.loc[i, 'paper_present_in'] = mapping_id_to_idx[i]
    else:
        df.loc[i, 'paper_present_in'] = "None"

The code above checks for the searched values and updates the dataframe accordingly.
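The question also asks for a final dataframe with one row per paper and a list of the _ids whose reference text contains that paper's title. A minimal sketch of that step follows. It assumes literal substring matching is what is wanted, and it uses a shortened stand-in dataset: the titles and references below are made up for illustration, and ids_citing is a hypothetical helper name, not something from the original post.

```python
import pandas as pd

# Shortened stand-in for the question's data (illustrative values only).
df = pd.DataFrame({
    "_id": ["Y100", "Y100", "Y101", "Y101", "Y102", "Y102"],
    "paper_title": ["paper a", "paper a", "paper b", "paper b", "paper c", "paper c"],
    "reference": [
        "smith 2005 paper c retrieved 2017",
        "burton 2012 something unrelated",
        "ekman 1992 basic emotions",
        "jones 2010 paper c new york",
        "johnston 2009 paper a cambridge",
        "mikolov 2013 paper b arxiv",
    ],
})


def ids_citing(frame, title):
    """Return the unique _id values of rows whose reference contains `title`."""
    # regex=False treats the title as a literal substring; na=False skips missing references.
    mask = frame["reference"].str.contains(title, regex=False, na=False)
    return frame.loc[mask, "_id"].unique().tolist()


# One row per paper, keeping the paper's own _id and title.
result = df[["_id", "paper_title"]].drop_duplicates().reset_index(drop=True)

# For each title, collect the _ids of the rows that cite it in their reference text.
result["paper_present_in"] = [ids_citing(df, t) for t in result["paper_title"]]

print(result)
```

Titles that are not found in any reference get an empty list here rather than the string "None"; whether that is preferable depends on how the column is used afterwards.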
