如何使用Python抓取在Excel文件中每x列后拆分列?
我正在尝试从一个网站上抓取数据,并把这些数据保存为xls格式。但是当我把数据保存为xls格式时,所有的数据都被放在了一行里。有没有人能帮我,教我怎么在每10个数据后分开一行?我的代码是
from bs4 import BeautifulSoup
import urllib2
pageSource=urllib2.urlopen('http://www.livescore.com/worldcup/tables/').read()
soup = BeautifulSoup(pageSource)
alltables = soup.findAll( "table", {"class":"league-wc table bh"} )
results=[]
for table in alltables:
rows = table.findAll('tr')
_table = []
for tr in rows[1:]:
_row = []
cols = tr.findAll('td')
for td in cols:
if td.findAll('a'):
text=td.a.renderContents().strip()
else:
text=td.renderContents().strip()
_row.append(text)
_table.append(_row)
results.append(_table)
# print results
index = 1
f=open('world.xls','w')
for table in results:
for row in table:
print ','.join([str(index)] + row[1:])
f.write("\t".join([str(index)] + row[1:]))
index += 1
f.close()
输出:
1,1,Australia,0,0,0,0,0,0,0,0
2,1,Chile,0,0,0,0,0,0,0,0
3,1,Netherlands,0,0,0,0,0,0,0,0
4,1,Spain,0,0,0,0,0,0,0,0
5,1,Colombia,0,0,0,0,0,0,0,0
6,1,Greece,0,0,0,0,0,0,0,0
7,1,Ivory Coast,0,0,0,0,0,0,0,0
8,1,Japan,0,0,0,0,0,0,0,0
9,1,Costa Rica,0,0,0,0,0,0,0,0
10,1,England,0,0,0,0,0,0,0,0
11,1,Italy,0,0,0,0,0,0,0,0
12,1,Uruguay,0,0,0,0,0,0,0,0
13,1,Ecuador,0,0,0,0,0,0,0,0
14,1,France,0,0,0,0,0,0,0,0
15,1,Honduras,0,0,0,0,0,0,0,0
16,1,Switzerland,0,0,0,0,0,0,0,0
17,1,Argentina,0,0,0,0,0,0,0,0
18,1,Bosnia-Herzegovina,0,0,0,0,0,0,0,0
19,1,Iran,0,0,0,0,0,0,0,0
20,1,Nigeria,0,0,0,0,0,0,0,0
21,1,Germany,0,0,0,0,0,0,0,0
22,1,Ghana,0,0,0,0,0,0,0,0
23,1,Portugal,0,0,0,0,0,0,0,0
24,1,USA,0,0,0,0,0,0,0,0
25,1,Algeria,0,0,0,0,0,0,0,0
26,1,Belgium,0,0,0,0,0,0,0,0
27,1,Russia,0,0,0,0,0,0,0,0
28,1,South Korea,0,0,0,0,0,0,0,0
29,1,Brazil,0,0,0,0,0,0,0,0
30,1,Cameroon,0,0,0,0,0,0,0,0
31,1,Croatia,0,0,0,0,0,0,0,0
32,1,Mexico,0,0,0,0,0,0,0,0
在Excel中的输出:
1 1 Australia 0 0 0 0 0 0 0 2 1 Chile 0 0 0 0 0 0 0 3 1 Netherlands 0 0 0 0 0 0 0 4 1 Spain 0 0 0 0 0 0 0 5 1 Colombia 0 0 0 0 0 0 0 6 1 Greece 0 0 0 0 0 0 0 7 1 Ivory Coast 0 0 0 0 0 0 0 8 1 Japan 0 0 0 0 0 0 0 9 1 Costa Rica 0 0 0 0 0 0 0 10 1 England 0 0 0 0 0 0 0 11 1 Italy 0 0 0 0 0 0 0 12 1 Uruguay 0 0 0 0 0 0 0 13 1 Ecuador 0 0 0 0 0 0 0 14 1 France 0 0 0 0 0 0 0 15 1 Honduras 0 0 0 0 0 0 0 16 1 Switzerland 0 0 0 0 0 0 0 17 1 Argentina 0 0 0 0 0 0 0 18 1 Bosnia-Herzegovina 0 0 0 0 0 0 0 19 1 Iran 0 0 0 0 0 0 0 20 1 Nigeria 0 0 0 0 0 0 0 21 1 Germany 0 0 0 0 0 0 0 22 1 Ghana 0 0 0 0 0 0 0 23 1 Portugal 0 0 0 0 0 0 0 24 1 USA 0 0 0 0 0 0 0 25 1 Algeria 0 0 0 0 0 0 0 26 1 Belgium 0 0 0 0 0 0 0 27 1 Russia 0 0 0 0 0 0 0 28 1 South Korea 0 0 0 0 0 0 0 29 1 Brazil 0 0 0 0 0 0 0 30 1 Cameroon 0 0 0 0 0 0 0 31 1 Croatia 0 0 0 0 0 0 0 32 1 Mexico 0 0 0 0 0 0 0 0
在Excel的输出中,所有上面显示的数据都存储在一行里,但我想在每10列后分开数据,比如在Excel输出中的“chile”之前。这样输出的结果是
1 1 Australia 0 0 0 0 0 0 0 0 2
Chile 0 0 0 0 0 0 0 3 1
后面的数据也是如此。
1 个回答
0
你只需要在每行的末尾加一个 "\n"
字符,这样就能让你的脚本正常工作。因为用 write()
写入文件时,它不会自动添加换行符。例如:
f.write("\t".join([str(index)] + row[1:]) + "\n")
不过,你可能会发现 csv
模块更好用。此外,你最好不要把你的制表符分隔值文件命名为 .xls
,因为它实际上并不是一个Excel文件,尽管Excel可以打开它。用 .tsv
或 .tab
这样的后缀会更合适,甚至可以用 .txt
。