How do I read data in paragraphs with Python/pandas?

1 vote
1 answer
986 views
Asked 2025-04-18 17:51

I have some data in the following format, with a blank line separating each year:

    2014
    #34 - Show Title
    Ensemble: ActorFirst1 ActorLast1, ActorFirst2 ActorLast2, ActorFirst3 ActorLast3, ActorFirst4 ActorLast4, ActorFirst5 ActorLast5, ActorFirst6 ActorLast6, and ActorFirst7 ActorLast7
    Director: DirectorFirst1 DirectorLast1
    Music Director: MDFirst1 MDFirst1
    Stage Manager: SMFirst1 SMFirst1
    Producer: ProducerFirst1 ProducerFirst1
    Opening Night: December 16, 2014

I would like to reshape this data into a table, like this:

Year, First1 Last1, Position, Show Title

I can't figure out how to do this and keep hitting dead ends.

1 Answer

2

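You haven't said exactly how your data is stored, so I'll assume you have a bunch of items like the one above, obtained by reading in a file and splitting it on the blank lines. Each item is then a single string that looks like your sample data.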
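For instance, if the whole file has been read into one string, the per-year blocks could be recovered roughly like this. It is only a minimal sketch, assuming the records are separated by one or more blank (or whitespace-only) lines; `split_blocks` is a made-up helper name, and the second record in `sample` is invented for illustration.

```python
import re


def split_blocks(text):
    """Split text into paragraph-like blocks separated by blank lines."""
    # A separator is a newline, optional whitespace, then another newline.
    blocks = re.split(r'\n\s*\n', text.strip())
    # Drop anything that is empty after stripping, just in case.
    return [b for b in blocks if b.strip()]


# Shortened sample in the question's format (second record is invented).
sample = """2014
#34 - Show Title
Director: DirectorFirst1 DirectorLast1
Opening Night: December 16, 2014

2013
#33 - Another Title
Director: DirectorFirst2 DirectorLast2
Opening Night: December 10, 2013
"""

blocks = split_blocks(sample)
print(len(blocks))  # one block per year
```

With a real file, `text` would come from something like `open(path).read()`, where `path` is whatever your file is called.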

Pass each such string to a function:

from itertools import takewhile

import pandas as pd


def read_year(f):
    f = iter(f.split('\n'))  # split the string into lines; we iterate over them

    year = next(f).strip()
    title = next(f).strip()  # maybe split at the '-'; not sure if the number is part of the title

    # the opening date isn't needed, so only take the lines before it
    people = takewhile(lambda x: not x.strip().startswith('Opening'), f)
    # turn each line into a (kind, persons) pair, splitting at the first ':' only
    people = (tuple(x.split(':', 1)) for x in people)

    both = []
    for kind, persons in people:
        kind_ = kind.strip()
        # note: splitting on spaces gives one row per name token,
        # so first and last names end up in separate rows
        for person in persons.strip().split(' '):
            if person != 'and':
                both.append((kind_, person.strip().strip(',')))

    df = pd.DataFrame(both, columns=['person_title', 'person'])
    df['year'] = int(year)
    df['movie_title'] = title
    return df

Basically, you call this function on each string to get a DataFrame like this:

In [153]: df = read_year(s)

In [154]: df.head()
Out[154]: 
  person_title       person  year       movie_title
0     Ensemble  ActorFirst1  2014  #34 - Show Title
1     Ensemble   ActorLast1  2014  #34 - Show Title
2     Ensemble  ActorFirst2  2014  #34 - Show Title
3     Ensemble   ActorLast2  2014  #34 - Show Title
4     Ensemble  ActorFirst3  2014  #34 - Show Title

Then combine them with pd.concat, setting ignore_index=True.
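As a rough end-to-end sketch of that last step: parse each block into its own DataFrame, then stack them. The `parse_block` helper below is hypothetical, a simplified stand-in for `read_year`. It splits on commas so that each full name stays in one row, which is closer to the table asked for in the question. The names and the second record are invented for illustration.

```python
import pandas as pd


def parse_block(block):
    """Hypothetical simplified parser: one row per (role, full name)."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    year, title = int(lines[0]), lines[1]
    rows = []
    for line in lines[2:]:
        role, _, names = line.partition(':')
        if role.startswith('Opening'):
            continue  # the opening date is not a person
        for name in names.split(','):
            name = name.strip()
            if name.startswith('and '):
                name = name[len('and '):]  # drop the "and" before the last name
            if name:
                rows.append((year, name, role.strip(), title))
    return pd.DataFrame(rows, columns=['year', 'person', 'person_title', 'show_title'])


# Invented sample blocks in the question's format.
blocks = [
    "2014\n#34 - Show Title\nEnsemble: A One, B Two, and C Three\n"
    "Director: D Four\nOpening Night: December 16, 2014",
    "2013\n#33 - Other Title\nDirector: E Five\nOpening Night: December 10, 2013",
]

frames = [parse_block(b) for b in blocks]
# ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Without `ignore_index=True`, every per-year frame would keep its own 0-based index, so the combined index would contain duplicates.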
