如何使用Python抓取在Excel文件中每x列后拆分列?

1 投票
1 回答
543 浏览
提问于 2025-04-18 09:43

我正在尝试从一个网站上抓取数据,并把这些数据保存为xls格式。但是当我把数据保存为xls格式时,所有的数据都被放在了一行里。有没有人能帮我,教我怎么在每10个数据后分开一行?我的代码是

from bs4 import BeautifulSoup
import urllib2

pageSource=urllib2.urlopen('http://www.livescore.com/worldcup/tables/').read()

soup = BeautifulSoup(pageSource)
alltables = soup.findAll( "table", {"class":"league-wc table bh"} )

results=[]

for table in alltables:
    rows = table.findAll('tr')
    _table = []
    for tr in rows[1:]:
        _row = []
        cols = tr.findAll('td')
        for td in cols:
            if td.findAll('a'):
                text=td.a.renderContents().strip()
            else:
                text=td.renderContents().strip()
            _row.append(text)
        _table.append(_row)
    results.append(_table)


# print results

index = 1
f=open('world.xls','w')
for table in results:
    for row in table:
        print ','.join([str(index)] + row[1:])
        f.write("\t".join([str(index)] + row[1:]))
        index += 1

f.close()

输出:

1,1,Australia,0,0,0,0,0,0,0,0
2,1,Chile,0,0,0,0,0,0,0,0
3,1,Netherlands,0,0,0,0,0,0,0,0
4,1,Spain,0,0,0,0,0,0,0,0
5,1,Colombia,0,0,0,0,0,0,0,0
6,1,Greece,0,0,0,0,0,0,0,0
7,1,Ivory Coast,0,0,0,0,0,0,0,0
8,1,Japan,0,0,0,0,0,0,0,0
9,1,Costa Rica,0,0,0,0,0,0,0,0
10,1,England,0,0,0,0,0,0,0,0
11,1,Italy,0,0,0,0,0,0,0,0
12,1,Uruguay,0,0,0,0,0,0,0,0
13,1,Ecuador,0,0,0,0,0,0,0,0
14,1,France,0,0,0,0,0,0,0,0
15,1,Honduras,0,0,0,0,0,0,0,0
16,1,Switzerland,0,0,0,0,0,0,0,0
17,1,Argentina,0,0,0,0,0,0,0,0
18,1,Bosnia-Herzegovina,0,0,0,0,0,0,0,0
19,1,Iran,0,0,0,0,0,0,0,0
20,1,Nigeria,0,0,0,0,0,0,0,0
21,1,Germany,0,0,0,0,0,0,0,0
22,1,Ghana,0,0,0,0,0,0,0,0
23,1,Portugal,0,0,0,0,0,0,0,0
24,1,USA,0,0,0,0,0,0,0,0
25,1,Algeria,0,0,0,0,0,0,0,0
26,1,Belgium,0,0,0,0,0,0,0,0
27,1,Russia,0,0,0,0,0,0,0,0
28,1,South Korea,0,0,0,0,0,0,0,0
29,1,Brazil,0,0,0,0,0,0,0,0
30,1,Cameroon,0,0,0,0,0,0,0,0
31,1,Croatia,0,0,0,0,0,0,0,0
32,1,Mexico,0,0,0,0,0,0,0,0

在Excel中的输出:

1   1   Australia   0   0   0   0   0   0   0   2   1   Chile   0   0   0   0   0   0   0   3   1   Netherlands 0   0   0   0   0   0   0   4   1   Spain   0   0   0   0   0   0   0   5   1   Colombia    0   0   0   0   0   0   0   6   1   Greece  0   0   0   0   0   0   0   7   1   Ivory Coast 0   0   0   0   0   0   0   8   1   Japan   0   0   0   0   0   0   0   9   1   Costa Rica  0   0   0   0   0   0   0   10  1   England 0   0   0   0   0   0   0   11  1   Italy   0   0   0   0   0   0   0   12  1   Uruguay 0   0   0   0   0   0   0   13  1   Ecuador 0   0   0   0   0   0   0   14  1   France  0   0   0   0   0   0   0   15  1   Honduras    0   0   0   0   0   0   0   16  1   Switzerland 0   0   0   0   0   0   0   17  1   Argentina   0   0   0   0   0   0   0   18  1   Bosnia-Herzegovina  0   0   0   0   0   0   0   19  1   Iran    0   0   0   0   0   0   0   20  1   Nigeria 0   0   0   0   0   0   0   21  1   Germany 0   0   0   0   0   0   0   22  1   Ghana   0   0   0   0   0   0   0   23  1   Portugal    0   0   0   0   0   0   0   24  1   USA 0   0   0   0   0   0   0   25  1   Algeria 0   0   0   0   0   0   0   26  1   Belgium 0   0   0   0   0   0   0   27  1   Russia  0   0   0   0   0   0   0   28  1   South Korea 0   0   0   0   0   0   0   29  1   Brazil  0   0   0   0   0   0   0   30  1   Cameroon    0   0   0   0   0   0   0   31  1   Croatia 0   0   0   0   0   0   0   32  1   Mexico  0   0   0   0   0   0   0   0

在Excel的输出中,所有上面显示的数据都存储在一行里,但我想在每10列后分开数据,比如在Excel输出中的“chile”之前。这样输出的结果是

1   1   Australia 0 0   0 0 0   0 0  0 2 

Chile           0   0        0  0   0   0   0       3   1   

后面的数据也是如此。

1 个回答

0

你只需要在每行的末尾加一个 "\n" 字符,这样就能让你的脚本正常工作。因为用 write() 写入文件时,它不会自动添加换行符。例如:

f.write("\t".join([str(index)] + row[1:]) + "\n")

不过,你可能会发现 csv 模块更好用。此外,你最好不要把你的制表符分隔值文件命名为 .xls,因为它实际上并不是一个Excel文件,尽管Excel可以打开它。用 .tsv.tab 这样的后缀会更合适,甚至可以用 .txt

撰写回答