如何使用Python从文本表中抓取数据?
我有一段文本,想把里面的数据提取出来,然后保存到Excel里。请问在Python中有没有办法做到这一点?
text = """
ANNUAL COMPENSATION LONG-TERM COMPENSATION
--------------------------------------- -------------------------------------
AWARDS PAYOUTS
-------------------------- ----------
SECURITIES
OTHER RESTRICTED UNDERLYING ALL OTHER
NAME AND PRINCIPAL ANNUAL STOCK OPTIONS/ LTIP COPMPENSA-
POSITION YEAR SALARY ($) BONUS ($) COMPENSATION ($) AWARD(S) ($)(1) SAR'S (#) PAYOUTS($) TION($)(3)
------------------ ---- ---------- --------- ---------------- --------------- ---------- ---------- ----------
JOHN W. WOODS 1993 $595,000 $327,250 There is no $203,190.63 18,000 $ 29,295
Chairman, President, & 1992 $545,000 $245,250 compensation 166,287.50 18,825 (2) Not $ 29,123
Chief Executive Officer 1991 $515,000 $283,251 required to be 45,000 Applicable
of AmSouth & AmSouth disclosed in
Bank N.A. this column.
C. STANLEY BAILEY 1993 $266,667(4) $133,333 117,012.50 4,500 $ 11,648
Vice Chairman, AmSouth 1992 $210,000 $ 84,000 42,400.00 4,800 $ 12,400
& AmSouth Bank N.A. 1991 $186,750 $ 82,170 161,280.00 9,750
C. DOWD RITTER 1993 $266,667(4) $133,333 117,012.50 4,500 $ 13,566
Vice Chairman, AmSouth 1992 $210,000 $ 84,000 42,400.00 4,800 $ 12,920
& AmSouth Bank N.A. 1991 $188,625 $ 82,995 161,280.00 9,750
WILLIAM A. POWELL, JR. 1993 $211,335 $ 95,101 11,000 $124,548
President, AmSouth 1992 $330,000 $132,000 98,050.00 11,100 $ 22,225
and Vice Chairman, 1991 $308,000 $169,401 24,000
AmSouth Bank N.A.
Retired in 1993
A. FOX DEFUNIAK, III 1993 $217,000 $ 75,950 52,971.88 4,500 $ 11,122
Senior Executive Vice 1992 $200,000 $ 62,000 42,400.00 4,800 $ 11,240
President, Birmingham 1991 $177,500 $ 78,100 161,280.00 9,750
Banking Group,
AmSouth Bank N.A.
E. W. STEPHENSON, JR. 1993 $177,833 $ 71,133 52,971.88 3,400 $ 9,256
Senior Executive Vice 1992 $150,000 $ 45,000 27,825.00 3,150 $ 8,560
President, AmSouth 1991 $140,000 $ 52,488 107,520.00 6,750
and Chairman & Chief
Executive Officer,
AmSouth Bank of Florida
"""
现在,我只是想把数据整理成一种csv格式,用'|'符号来分隔数据项,然后再手动把这些数据提取到Excel里:
tmp = open('tmp.txt','w')
tmp.write(text)
tmp.close()
data1 = []
for line in open('tmp.txt'):
line = line.lower()
if 'SALARY' in line:
line = line.replace(' ','|')
line = line.replace('--', '')
line = line.replace('- -', '')
line = line.replace('- -', '')
line = line.replace('(1)', '')
line = line.replace('(2)', '')
line = line.replace('(3)', '')
line = line.replace('(4)', '')
line = line.replace('(5)', '')
line = line.replace('(6)', '')
line = line.replace('(7)', '')
line = line.replace('(8)', '')
line = line.replace('(9)', '')
line = line.replace('(10)', '')
line = line.replace('(11)', '')
line = line.replace('(S)', '')
line = line.replace('($)', '')
line = line.replace('(#)', '')
line = line.replace('$', '')
line = line.replace('-0-', '0')
line = line.replace(')', '|')
line = line.replace('(', '|-')
line = re.sub(r'\s(\d)', '|\\1', line)
line = line.replace(' ', '')
line = line.replace('||', '|')
data1.append(line)
data = ''.join(data1)
问题是我需要做这个工作上千次,逐个表格去提取需要的项目实在是太耗时间了。有办法创建一个字典,记录每个在最左边一列列出的个人的年份、薪水、奖金、其他年度补偿等信息吗?
3 个回答
1
把列分开很简单,因为它们的宽度是固定的,所以你可以这样做:
cells = [rowtext[0:24], rowtext[25:29], ...]
分开行就有点难了。看起来你可以单独处理标题,然后检查一下
cells[0] == cells[0].upper()
看看是否需要开始一个新的行块(也就是说,当行的第一个单元格是全大写字母时)。当然,我假设你成千上万的文件格式基本上都是一样的。
一旦你把数据整理成可以用的格式,在Python中整理这些数据就简单多了。你可以把它们放进一个字典里,或者如果数据太大,就把它写成一个大的CSV文件,或者存成一个sqlite数据库。
2
你需要写一系列的生成器,来对数据进行多次处理,以减少噪音和复杂性。
这个问题在任何编程语言中都不是容易解决的。
def strip_top( source_text ):
src= iter( source_text )
for line in src:
if line.rstrip().startswith("AWARDS"):
next( src )
break
for line in src:
yield line
def columnize( source_text ):
"""Assumes strip_top or similar to remove confusing extra headers"""
for line in src:
yield line[0:24], line[25:30], ... for each coumn
def collapse_headers( source_text ):
"""Assumes columnize( strip_top())."""
src= iter( source_text )
headings= [ [] for i in range(9) ]
for line in src:
if line[0] == "------------------":
break
for col in range(9):
headings[col].append(line[col].strip())
yield [ " ".join(h) for h in headings ]
for line in src:
yield line
etc.
然后,你的“主”程序会把这些转换组合成一个处理流程。
with open("some file","r") as text:
for line in collapse_headers( columnize( strip_top( text ) ) ):
# further cleanup?
# actual processing
这样你就可以单独调整每个转换步骤。
3
这里有一些代码可以帮助你入门:
text = """JOHN ...""" # text without the header
# These can be inferred if necessary
cols = [0, 24, 29, 39, 43, 52, 71, 84, 95, 109, 117]
db = []
row = []
for line in text.strip().split("\n"):
data = [line[cols[i]:cols[i+1]] for i in xrange((len(cols)-1))]
if data[0][0] != " ":
if row:
db.append(row)
row = map(lambda x: [x], data)
else:
for i, c in enumerate(data):
row[i].append(c)
print db
这段代码会生成一个数组,每个人对应一个元素。每个元素又是一个数组,里面包含了所有的列,而这些列又会包含所有的行。这样你就可以很方便地访问不同的年份,或者做一些操作,比如把一个人的称谓连接起来:
for person in db:
print "Name:", person[0][0]
print " ".join(s.strip() for s in person[0][1:])
print
结果将会是:
Name: JOHN W. WOODS
Chairman, President, & Chief Executive Officer of AmSouth & AmSouth Bank N.A.
Name: C. STANLEY ...