Scraping data with BeautifulSoup and storing it in a database
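I am using BeautifulSoup to parse a website.
My problem now is this: I want to write the information into a database (sqlite, for example) and also record the minute in which each goal was scored (I can get that from the links I collect). That is only possible when the score is not '? - ?', because that score means no goals have been scored.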
from pprint import pprint
import urllib2

from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.livescore.com/soccer/champions-league/'))

data = []
for match in soup.select('table.league-table tr'):
    try:
        team1, team2 = match.find_all('td', class_=['fh', 'fa'])
    except ValueError:  # helps to skip irrelevant rows
        continue

    score = match.find('a', class_='scorelink').text.strip()
    data.append({
        'team1': team1.text.strip(),
        'team2': team2.text.strip(),
        'score': score
    })

pprint(data)

href_tags = soup.find_all('a', {'class': "scorelink"})
links = []
for x in xrange(1, len(href_tags)):
    insert = href_tags[x].get("href")
    links.append(insert)
print links
1 Answer
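First of all, what good is a score without the teams that played the match?
The idea is to iterate over every row of every table that has the league-table class. For each row, get the team names and the score, then collect the results into a list of dictionaries: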
from pprint import pprint
import urllib2

from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.livescore.com/soccer/champions-league/'))

data = []
for match in soup.select('table.league-table tr'):
    try:
        team1, team2 = match.find_all('td', class_=['fh', 'fa'])
    except ValueError:  # helps to skip irrelevant rows
        continue

    score = match.find('a', class_='scorelink').text.strip()
    data.append({
        'team1': team1.text.strip(),
        'team2': team2.text.strip(),
        'score': score
    })

pprint(data)
This prints:
[
{'score': u'? - ?', 'team1': u'Atletico Madrid', 'team2': u'Malmo FF'},
{'score': u'? - ?', 'team1': u'Olympiakos', 'team2': u'Juventus'},
{'score': u'? - ?', 'team1': u'Liverpool', 'team2': u'Real Madrid'},
{'score': u'? - ?', 'team1': u'PFC Ludogorets Razgrad', 'team2': u'Basel'},
...
]
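Note that, as written, this adds every match, even ones that have not been played yet. If you only want to collect matches that have a score, simply check that score is not equal to '? - ?' before appending: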
if score != '? - ?':
    data.append({
        'team1': team1.text.strip(),
        'team2': team2.text.strip(),
        'score': score
    })
In that case, the output would be:
[{'score': u'2 - 2', 'team1': u'CSKA Moscow', 'team2': u'Manchester City'},
{'score': u'3 - 0', 'team1': u'Zenit St. Petersburg', 'team2': u'Standard Liege'},
{'score': u'4 - 0', 'team1': u'APOEL Nicosia', 'team2': u'AaB'},
{'score': u'3 - 0', 'team1': u'BATE Borisov', 'team2': u'Slovan Bratislava'},
{'score': u'0 - 1', 'team1': u'Celtic', 'team2': u'Maribor'},
{'score': u'2 - 0', 'team1': u'FC Porto', 'team2': u'Lille'},
{'score': u'1 - 0', 'team1': u'Arsenal', 'team2': u'Besiktas'},
{'score': u'3 - 1', 'team1': u'Athletic Bilbao', 'team2': u'SSC Napoli'},
{'score': u'4 - 0', 'team1': u'Bayer Leverkusen', 'team2': u'FC Koebenhavn'},
{'score': u'3 - 0', 'team1': u'Malmo FF', 'team2': u'Salzburg'},
{'score': u'1 - 0', 'team1': u'PFC Ludogorets Razgrad *', 'team2': u'Steaua Bucuresti'}]
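If the goal counts are needed as numbers, for example to store them in integer columns later, the score string can be split up before saving. The following is only a minimal sketch and not part of the scraping code above: parse_score and played_matches are hypothetical helper names, the sample rows merely mimic the shape of data, and it assumes that a finished or running match always has a score of the form '2 - 2'.

```python
import re

# Matches scores such as "2 - 2"; the placeholder "? - ?" does not match.
SCORE_RE = re.compile(r'^\s*(\d+)\s*-\s*(\d+)\s*$')


def parse_score(score):
    """Return (goals1, goals2) as ints, or None if there is no numeric score."""
    m = SCORE_RE.match(score)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def played_matches(rows):
    """Keep only rows with a numeric score and add integer goal counts."""
    result = []
    for row in rows:
        goals = parse_score(row['score'])
        if goals is None:
            continue  # match not played yet, skip it
        played = dict(row)  # copy, so the input rows stay untouched
        played['goals1'], played['goals2'] = goals
        result.append(played)
    return result


# Hypothetical sample rows in the same shape as `data`.
sample = [
    {'team1': u'Celtic', 'team2': u'Maribor', 'score': u'0 - 1'},
    {'team1': u'Liverpool', 'team2': u'Real Madrid', 'score': u'? - ?'},
]
played = played_matches(sample)
```

Using a regular expression instead of comparing against '? - ?' also skips any other non-numeric placeholder the site might show, at the cost of silently dropping rows whose score format is unexpected.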
As for the "writing to a database" part, you can use the sqlite3 module and its executemany() method with named parameters:
import sqlite3

conn = sqlite3.connect('data.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS matches (
        id integer primary key autoincrement not null,
        team1 text,
        team2 text,
        score text
    )""")

cursor = conn.cursor()
cursor.executemany("""
    INSERT INTO
        matches (team1, team2, score)
    VALUES
        (:team1, :team2, :score)""", data)
conn.commit()
conn.close()
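To check that the rows really ended up in the table, you can read them back with a SELECT. The sketch below repeats the insert against an in-memory database so that it runs on its own; the two rows in data are hypothetical stand-ins for the scraped list, and ':memory:' is used here only to avoid touching data.db.

```python
import sqlite3

# Hypothetical sample rows standing in for the scraped `data` list.
data = [
    {'team1': u'Celtic', 'team2': u'Maribor', 'score': u'0 - 1'},
    {'team1': u'Arsenal', 'team2': u'Besiktas', 'score': u'1 - 0'},
]

conn = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk
conn.execute("""
    CREATE TABLE IF NOT EXISTS matches (
        id integer primary key autoincrement not null,
        team1 text,
        team2 text,
        score text
    )""")
conn.executemany(
    "INSERT INTO matches (team1, team2, score) VALUES (:team1, :team2, :score)",
    data)
conn.commit()

# Read the rows back in insertion order; each row comes back as a tuple.
rows = conn.execute(
    "SELECT team1, team2, score FROM matches ORDER BY id").fetchall()
conn.close()

for row in rows:
    print(row)
```

Each dictionary key is matched to the :name placeholder of the same name, so the order of the keys in the dictionaries does not matter.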
Of course, there are other things that could be improved or discussed, but I think this is a good start for you.
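One such improvement concerns the goal minutes mentioned in the question. How to extract them from the linked pages is not covered here, since that depends on the markup of those pages. Assuming you already have a list of minutes per match, one possible way to store them is a second table that points back to matches. Everything in this sketch is an assumption made for illustration: the goals table, the save_match helper, the placeholder team names and the single placeholder minute are not taken from the site.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database for the sketch
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("""
    CREATE TABLE matches (
        id integer primary key autoincrement not null,
        team1 text,
        team2 text,
        score text
    )""")
# Hypothetical table: one row per goal, linked to its match.
conn.execute("""
    CREATE TABLE goals (
        id integer primary key autoincrement not null,
        match_id integer not null references matches(id),
        minute integer not null
    )""")


def save_match(conn, match, minutes):
    """Insert one match and its goal minutes, return the new match id."""
    cursor = conn.execute(
        "INSERT INTO matches (team1, team2, score) VALUES (:team1, :team2, :score)",
        match)
    match_id = cursor.lastrowid  # id generated for the row just inserted
    conn.executemany(
        "INSERT INTO goals (match_id, minute) VALUES (?, ?)",
        [(match_id, minute) for minute in minutes])
    return match_id


# Placeholder values only, not real match data.
match = {'team1': u'Team A', 'team2': u'Team B', 'score': u'1 - 0'}
match_id = save_match(conn, match, [42])
conn.commit()

stored = conn.execute(
    "SELECT minute FROM goals WHERE match_id = ? ORDER BY minute",
    (match_id,)).fetchall()
```

Inserting matches one at a time instead of with a single executemany() call is what makes the generated id available for the goals rows.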