How do I extract part of a string in a column and convert it to a list (in Python)?

Posted 2024-05-19 20:27:27


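I have just started web scraping and decided to try it out on the classic IMDb dataset. One of my columns ("actors") should contain the names of several actors. This is what it looks like right now: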

"Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"

My goal is to drop the director part and keep only the actors as a list (for some data analysis):

["Zooey Deschanel", "Joseph Gordon-Levitt", "Geoffrey Arend", "Chloë Grace Moretz"]

What is the best way to achieve this for every row using Python? Thanks, everyone!


3 Answers

Assuming you have an array of strings containing the data described in your question, you could consider doing something like this:

import pprint

imdb_actors_columns = [
        "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz",
        "Director: Alfred Hitchcock | Stars: James Stewart, Kim Novak, Barbara Bel Geddes",
        # ... etc...
]


def does_look_like_stars_col(possible_stars_column_val):
    if possible_stars_column_val:
        return possible_stars_column_val.strip().lower().startswith('stars:')
    return False


# Split apart strings into ['Director: ...', 'Stars: ...']
tokenized_columns = map(lambda s: s.split('|'), imdb_actors_columns)
# Run through generated lists of [
#   ['Director:...', 'Stars:...'], ['Director:...', 'Stars:...'],
#   ...
# ]
# And filter sublists so that we only retain the
#   [['Stars: ...'], ['Stars: ...']]
# Then use mapping to extract the first 'Stars:...' entries to top-level like:
#   ['Stars: ...', 'Stars: ...']
star_actor_columns = map(lambda a: a[0],
                         filter(bool,
                                map(lambda all_columns: list(filter(
                                    does_look_like_stars_col, all_columns)),
                                    tokenized_columns
                                    )
                                )
                         )
# Loop through all the "Stars: Name1, Name2, ..." strings, get the "Name"
#   portions, and then strip away any leading or trailing spaces so that the
#   final result is [['Name1', 'Name2', ...], ['OtherName1', 'OtherName2', ...]]
all_stars = [list(map(
    lambda s: s.strip(),
    raw_star_list.strip().replace('Stars:', '').split(',')
)) for raw_star_list in star_actor_columns]

pprint.pprint(all_stars)

Running this produces the following result:

[['Zooey Deschanel',
  'Joseph Gordon-Levitt',
  'Geoffrey Arend',
  'Chloë Grace Moretz'],
 ['James Stewart', 'Kim Novak', 'Barbara Bel Geddes']]

You can check this solution out on IDEOne.
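The nested map/filter pipeline above can also be written as a small helper plus a list comprehension, which some readers may find easier to follow. This is only a sketch under the same assumptions as the answer (segments separated by `|`, the cast introduced by `Stars:`). The names `find_stars_segment` and `rows` are hypothetical and not part of the original answer.

```python
def find_stars_segment(row):
    """Return the 'Stars: ...' segment of a row, or None if the row has none."""
    for segment in row.split('|'):
        if segment.strip().lower().startswith('stars:'):
            return segment.strip()
    return None


rows = [
    "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz",
    "Director: Alfred Hitchcock | Stars: James Stewart, Kim Novak, Barbara Bel Geddes",
]

segments = [find_stars_segment(row) for row in rows]

all_stars = [
    # Drop the 'Stars:' label, split on commas, trim each name
    [name.strip() for name in segment[len('Stars:'):].split(',')]
    for segment in segments
    if segment is not None  # rows without a 'Stars:' segment are skipped
]

print(all_stars)
```

As in the pipeline above, rows that have no `Stars:` segment are left out of the result rather than producing an empty list.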

You only need to split() the string:

data = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz"
actors = [x.strip() for x in data.split('|')[1].split(':')[1].split(',')]
print(actors)

Output:

['Zooey Deschanel', 'Joseph Gordon-Levitt', 'Geoffrey Arend', 'Chloë Grace Moretz']
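The one-liner indexes with `[1]`, so it raises `IndexError` for any row that has no `|` or no `:`. For running over every row of a scraped column, a slightly more defensive variant may be useful. The sketch below uses `str.partition`; the function name `extract_actors`, the list `column`, and the second row (which has no cast) are made up for illustration.

```python
def extract_actors(value):
    """Return the actor names found after 'Stars:' in value, or [] if absent."""
    # partition returns (before, separator, after);
    # the separator is '' when 'Stars:' does not occur in the string
    _, found, cast = value.partition('Stars:')
    if not found:
        return []
    # Keep only the text up to the next '|' in case other fields follow the cast
    cast = cast.split('|')[0]
    return [name.strip() for name in cast.split(',') if name.strip()]


column = [
    "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloë Grace Moretz",
    "Director: Alfred Hitchcock",  # hypothetical row with no cast listed
]

actors_per_row = [extract_actors(value) for value in column]
print(actors_per_row)
```

If the data lives in a pandas DataFrame (assumed here to be named `df`, with the column called `actors` as in the question), the same function could be applied to every row with `df['actors'].apply(extract_actors)`.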

Assuming your string is stored as s = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz", you can split it easily as follows.

Splitting the string into actors:

  1. my_list = s.split('|'): this splits the string at | and turns it into a list

Output: ['Director: Marc Webb ', ' Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz']

  2. my_list = my_list[1].split(':')

Output: [' Stars', ' Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz']

  3. actors = my_list[1].split(',')

Output: [' Zooey Deschanel', ' Joseph Gordon-Levitt', ' Geoffrey Arend', ' Chloe Grace Moretz']


You now have the string converted into the desired list format. Here is the code for the same:

s = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz"
my_list = s.split('|')   # separates the director part from the stars part

# Take the stars part, drop the "Stars" label, and split the names on commas
actors = my_list[1].split(':')[1].split(',')
# Each name still carries a leading space at this point, so trim the whitespace
actors = [name.strip() for name in actors]

print(actors)

The code above prints only the actors, as a list.
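The steps above rely on the stars being the second segment (index 1). If the order of the segments is not guaranteed, which is an assumption since the question only shows one layout, each `Label: value` segment can be parsed into a dictionary and the cast looked up by its label instead. A rough sketch, with `parse_fields` being a hypothetical helper name:

```python
def parse_fields(value):
    """Turn 'Label: text | Label: text' into a {label: text} dictionary."""
    fields = {}
    for part in value.split('|'):
        # Split each segment at its first colon only
        label, sep, content = part.partition(':')
        if sep:  # ignore segments that contain no colon
            fields[label.strip()] = content.strip()
    return fields


s = "Director: Marc Webb | Stars: Zooey Deschanel, Joseph Gordon-Levitt, Geoffrey Arend, Chloe Grace Moretz"

fields = parse_fields(s)
# Look the cast up by label; fall back to an empty string if it is missing
actors = [name.strip() for name in fields.get('Stars', '').split(',') if name.strip()]

print(actors)
```

This also keeps the director available as `fields['Director']` should it be needed later in the analysis.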
