How do I read data in paragraph form in Python/pandas?
I have some data in the following format, with a blank line separating each year:
2014
#34 - Show Title
Ensemble: ActorFirst1 ActorLast1, ActorFirst2 ActorLast2, ActorFirst3 ActorLast3, ActorFirst4 ActorLast4, ActorFirst5 ActorLast5, ActorFirst6 ActorLast6, and ActorFirst7 ActorLast7
Director: DirectorFirst1 DirectorLast1
Music Director: MDFirst1 MDFirst1
Stage Manager: SMFirst1 SMFirst1
Producer: ProducerFirst1 ProducerFirst1
Opening Night: December 16, 2014
I want to reshape this data into a table like this:
Year, First1 Last1, Position, Show Title
I can't figure out how to do it and keep running into dead ends.
1 Answer
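You didn't say how your data is stored, so I'll assume you have a bunch of items like the one above, obtained by reading a file and splitting it on the blank lines between years. That gives you one string per year that looks like your sample data.
Pass each such string to a function: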
from itertools import takewhile

import pandas as pd


def read_year(f):
    # Split the string into lines so we can iterate over them.
    f = iter(f.split('\n'))
    year = next(f).strip()
    # Maybe split at the '-'; not sure whether the number is part of the title.
    title = next(f).strip()
    # The opening date isn't needed, so only take lines until we reach it.
    people = takewhile(lambda x: not x.strip().startswith('Opening'), f)
    # Turn each "Kind: names" line into a (kind, names) pair.
    people = (tuple(x.split(':', 1)) for x in people)
    both = []
    for kind, persons in people:
        kind_ = kind.strip()
        # Splitting on spaces yields one row per word, so first and
        # last names end up on separate rows.
        for person in persons.strip().split(' '):
            if person != 'and':
                both.append((kind_, person.strip().strip(',')))
    df = pd.DataFrame(both, columns=['person_title', 'person'])
    df['year'] = int(year)
    df['movie_title'] = title
    return df
Basically, you call this function on each string, which gives you:
In [153]: df = read_year(s)
In [154]: df.head()
Out[154]:
  person_title       person  year       movie_title
0     Ensemble  ActorFirst1  2014  #34 - Show Title
1     Ensemble   ActorLast1  2014  #34 - Show Title
2     Ensemble  ActorFirst2  2014  #34 - Show Title
3     Ensemble   ActorLast2  2014  #34 - Show Title
4     Ensemble  ActorFirst3  2014  #34 - Show Title
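Note that the output above puts first and last names on separate rows, because the list of people is split on spaces. If you want one row per full name, as in the table the question asks for, a minimal sketch is to split on commas instead and drop a leading "and". The function name `read_year_full_names` and the column names `role` and `show_title` are my own choices, and the sketch assumes every block follows the layout of the sample (names separated by commas, an "Opening Night" line at the end).

```python
from itertools import takewhile

import pandas as pd


def read_year_full_names(block):
    """Parse one year's block into one row per (role, full name)."""
    lines = iter(block.strip().split('\n'))
    year = int(next(lines).strip())
    title = next(lines).strip()
    # Stop at the opening date, which is not needed.
    credits = takewhile(lambda x: not x.strip().startswith('Opening'), lines)
    rows = []
    for line in credits:
        if ':' not in line:
            continue  # skip anything that is not a "Role: names" line
        role, names = line.split(':', 1)
        for name in names.split(','):
            name = name.strip()
            # The last entry of a list looks like "and First Last".
            if name.startswith('and '):
                name = name[len('and '):]
            if name:
                rows.append((year, name, role.strip(), title))
    return pd.DataFrame(rows, columns=['year', 'person', 'role', 'show_title'])


# Shortened version of the sample block from the question.
sample = """2014
#34 - Show Title
Ensemble: ActorFirst1 ActorLast1, ActorFirst2 ActorLast2, and ActorFirst3 ActorLast3
Director: DirectorFirst1 DirectorLast1
Opening Night: December 16, 2014"""

df_full = read_year_full_names(sample)
```

A list of exactly two names written without a comma ("A B and C D") would stay as a single entry here, so check your data for that case.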
Then combine them with pd.concat, passing ignore_index=True.
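To put the pieces together: since the years are separated by blank lines, one way to get the per-year strings is to split the file contents on blank lines, parse each block, and concatenate the results. This is only a sketch. `split_blocks`, `read_all`, and `year_and_title` are hypothetical helper names, the second sample block is made up, and `year_and_title` is a deliberately tiny stand-in parser so the snippet runs on its own. In practice you would pass `read_year` from above as the parser.

```python
import re

import pandas as pd


def split_blocks(text):
    """Split text into paragraphs separated by one or more blank lines."""
    return [b for b in re.split(r'\n\s*\n', text.strip()) if b.strip()]


def read_all(text, parse):
    """Apply `parse` to each block and stack the results into one frame."""
    frames = [parse(block) for block in split_blocks(text)]
    # ignore_index=True renumbers the rows 0..n-1 across all blocks.
    return pd.concat(frames, ignore_index=True)


def year_and_title(block):
    """Stand-in parser: keeps only the first two lines of a block."""
    lines = block.split('\n')
    return pd.DataFrame({'year': [int(lines[0].strip())],
                         'movie_title': [lines[1].strip()]})


# Two blocks separated by a blank line; the second one is invented.
text = """2014
#34 - Show Title
Director: DirectorFirst1 DirectorLast1

2013
#33 - Another Title
Director: DirectorFirst2 DirectorLast2
"""

combined = read_all(text, year_and_title)
```

For a real file you would read its whole contents into one string first (for example with `open(...).read()`) and pass that string as `text`.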