Iterating over a nested list of strings to get the first item

Posted 2024-05-15 22:01:00


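I am trying to extract items from the gen column of a DataFrame (example below). My goal is to go through each row of gen and build a new DataFrame column containing the items that match a predefined list, genre_code.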

import pandas as pd

df = pd.DataFrame({'id': [620, 843, 986], 'tit': ['AAA', 'BBB', 'CCC'], 'gen': [['Romance', 'Satire', 'Fiction'], ['Science Fiction', 'Novel'], ['Mystery', 'Novel']]})

genre_code = ['Science Fiction', 'Mystery', 'Non-fiction']

So far, this is what I have come up with:

new_gen = []
for i in df['gen']:
  for j in i:
    if j in genre_code:
      new_gen.append(j)
    else:
      new_gen.append('NA')
df['gen'] = new_gen

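It does iterate over the column, but the length of the resulting new_gen does not match the number of rows in the original DataFrame: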

/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
    746     if len(data) != len(index):
    747         raise ValueError(
--> 748             "Length of values "
    749             f"({len(data)}) "
    750             "does not match length of index "

ValueError: Length of values (30004) does not match length of index (12841)

I know this must be something very basic, but can anyone tell me what I am missing?


2 Answers

I would convert the lists to strings and then use Series.str.findall to return the matching genre codes:

df['new_gen'] = df['gen'].astype(str).str.findall('|'.join(genre_code))

print(df)

    id  tit                         gen            new_gen
0  620  AAA  [Romance, Satire, Fiction]                 []
1  843  BBB    [Science Fiction, Novel]  [Science Fiction]
2  986  CCC            [Mystery, Novel]          [Mystery]
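
If a single value per row is wanted instead of a list, one possible variation on the approach above is sketched here. It is not part of the original answer: it assumes the genres should be matched literally (hence `re.escape`), that the first match is the one to keep, and that rows without a match should read 'NA'. The column name `first_gen` is made up for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'id': [620, 843, 986],
    'tit': ['AAA', 'BBB', 'CCC'],
    'gen': [['Romance', 'Satire', 'Fiction'],
            ['Science Fiction', 'Novel'],
            ['Mystery', 'Novel']],
})
genre_code = ['Science Fiction', 'Mystery', 'Non-fiction']

# Escape each genre so regex metacharacters (e.g. '.' or '+') match literally.
pattern = '|'.join(re.escape(g) for g in genre_code)

# findall returns a list of matches for every row.
matches = df['gen'].astype(str).str.findall(pattern)

# .str[0] picks the first match per row; rows with no match become NaN,
# which fillna then replaces with the placeholder 'NA'.
df['first_gen'] = matches.str[0].fillna('NA')

print(df[['id', 'first_gen']])
```

Because the match runs on the string form of each list, a short code such as 'Fiction' would also match inside 'Science Fiction'; the membership test used in the other answer avoids that.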

If you want to filter the gen column against the list, you can do the following:

df["gen"] = df["gen"].apply(lambda x: [g for g in x if g in genre_code])
print(df)

Prints:

    id  tit                gen
0  620  AAA                 []
1  843  BBB  [Science Fiction]
2  986  CCC          [Mystery]

P.S. To speed this up, you can convert genre_code to a set() beforehand:

genre_code = set(["Science Fiction", "Mystery", "Non-fiction"])
df["gen"] = df["gen"].apply(lambda x: [g for g in x if g in genre_code])
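
For reference, the loop in the question appends one value per genre rather than one per row, which is why it produced 30004 values for 12841 rows. A minimal sketch that yields exactly one value per row is shown below, assuming the goal is the first matching genre with 'NA' as the fallback. The helper name `first_match` and its parameters are hypothetical, not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [620, 843, 986],
    'tit': ['AAA', 'BBB', 'CCC'],
    'gen': [['Romance', 'Satire', 'Fiction'],
            ['Science Fiction', 'Novel'],
            ['Mystery', 'Novel']],
})
genre_code = {'Science Fiction', 'Mystery', 'Non-fiction'}


def first_match(genres, allowed=genre_code, default='NA'):
    """Return the first genre found in `allowed`, or `default` if none match."""
    # next() stops at the first hit, so each row contributes exactly one value.
    return next((g for g in genres if g in allowed), default)


# apply calls first_match once per row, so the result has the same length as df.
df['new_gen'] = df['gen'].apply(first_match)

print(df[['id', 'new_gen']])
```

Writing to a new column (here `new_gen`) keeps the original lists in gen available for later checks.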
