How do I get the text of a specific tag attribute with BeautifulSoup in Python?

Posted 2024-04-29 03:28:14


I'm writing a small scraper in Python with BS4 to get MLB schedule data from ESPN.com.

I'm almost done, but I have one small problem:

Snippet:

<div class="teams" data-behavior="fix_broken_images"><a name="&amp;lpos=mlb:schedule:team" href="/mlb/team/_/name/kc"><img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/scoreboard/kc.png&amp;h=50" class="schedule-team-logo"></a></div><a name="&amp;lpos=mlb:schedule:team" class="team-name" href="/mlb/team/_/name/kc"><span>Kansas City</span> <abbr title="Kansas City Royals">KC</abbr></a>

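I can already read the <span> </span> content, but I want to get the full team name from <abbr title>.

I'm not sure what I'm missing, and I haven't figured out how to do it.

Thanks!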


1 answer
User
Answer 1 · Posted 2024-04-29 03:28:14

For your snippet, you want the title attribute of the abbr tag inside the anchor whose class is team-name:

from bs4 import BeautifulSoup

h = """<div class="teams" data-behavior="fix_broken_images"><a name="&amp;lpos=mlb:schedule:team" href="/mlb/team/_/name/kc"><img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/scoreboard/kc.png&amp;h=50" class="schedule-team-logo"></a></div><a name="&amp;lpos=mlb:schedule:team" class="team-name" href="/mlb/team/_/name/kc"><span>Kansas City</span> <abbr title="Kansas City Royals">KC</abbr></a>"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(h, "html.parser")

# CSS selector: the abbr tag nested inside the anchor with class team-name.
print(soup.select_one("a.team-name abbr")["title"])

This gives you:

Kansas City Royals

Or using find:

from bs4 import BeautifulSoup

h = """<div class="teams" data-behavior="fix_broken_images"><a name="&amp;lpos=mlb:schedule:team" href="/mlb/team/_/name/kc"><img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/mlb/500/scoreboard/kc.png&amp;h=50" class="schedule-team-logo"></a></div><a name="&amp;lpos=mlb:schedule:team" class="team-name" href="/mlb/team/_/name/kc"><span>Kansas City</span> <abbr title="Kansas City Royals">KC</abbr></a>"""

soup = BeautifulSoup(h, "html.parser")

# find() returns the first matching anchor; .abbr is its first abbr descendant.
print(soup.find("a", attrs={"class":"team-name"}).abbr["title"])
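Both select_one and find return None when nothing matches, so indexing the result directly raises an error if the markup differs from the snippet. A minimal defensive sketch follows; the helper name full_team_name and the shortened sample markup are made up for illustration and are not part of the original answer:

```python
from bs4 import BeautifulSoup


def full_team_name(html):
    """Return the title of the abbr inside a.team-name, or None if it is absent.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    abbr = soup.select_one("a.team-name abbr")
    if abbr is None:
        # No matching anchor/abbr pair in this markup.
        return None
    # .get() returns None instead of raising KeyError when title is missing.
    return abbr.get("title")


# Shortened version of the snippet from the question.
snippet = (
    '<a class="team-name" href="/mlb/team/_/name/kc">'
    '<span>Kansas City</span> <abbr title="Kansas City Royals">KC</abbr></a>'
)

print(full_team_name(snippet))
print(full_team_name("<p>no team here</p>"))
```

The first call prints the full team name and the second prints None rather than failing.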

This gets all the names from the site:

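from bs4 import BeautifulSoup
import requests

url = "http://espn.go.com/mlb/schedule"

# Fetch the schedule page and parse it with an explicitly named parser.
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Restrict the search to the schedule table.
table = soup.select_one("table.schedule.has-team-logos")

# Collect the title attribute of every abbr inside a team-name anchor.
print([a["title"] for a in table.select("a.team-name abbr")])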

Output:

['Detroit Tigers', 'Washington Nationals', 'Kansas City Royals', 'New York Yankees', 'Oakland Athletics', 'Boston Red Sox', 'Pittsburgh Pirates', 'Cincinnati Reds', 'Milwaukee Brewers', 'Miami Marlins', 'Chicago White Sox', 'Texas Rangers', 'San Diego Padres', 'Chicago Cubs', 'Baltimore Orioles', 'Minnesota Twins', 'Cleveland Indians', 'Houston Astros', 'Arizona Diamondbacks', 'Colorado Rockies', 'Tampa Bay Rays', 'Seattle Mariners', 'New York Mets', 'Los Angeles Dodgers', 'Toronto Blue Jays', 'San Francisco Giants']
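If you also want to keep the short code next to each full name, the same selector can feed a dictionary. The sketch below runs on a small inline sample rather than the live page, since the site's markup may have changed; the sample table, its second row, and the helper name team_names_by_abbr are assumptions modeled on the snippet in the question:

```python
from bs4 import BeautifulSoup


def team_names_by_abbr(html):
    """Map each abbr's text (e.g. "KC") to its title attribute.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # abbr[title] skips any abbr tag that has no title attribute.
    return {
        abbr.get_text(strip=True): abbr["title"]
        for abbr in soup.select("a.team-name abbr[title]")
    }


# Made-up sample markup in the shape of the question's snippet.
sample = """
<table class="schedule has-team-logos"><tr>
<td><a class="team-name" href="/mlb/team/_/name/kc"><span>Kansas City</span> <abbr title="Kansas City Royals">KC</abbr></a></td>
<td><a class="team-name" href="/mlb/team/_/name/det"><span>Detroit</span> <abbr title="Detroit Tigers">DET</abbr></a></td>
</tr></table>
"""

print(team_names_by_abbr(sample))
```

Note that a team appearing more than once in the schedule collapses to a single dictionary entry, unlike the list above.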
