How to extract URL links from the IGN website

import webbrowser, bs4, requests, re

webPage = requests.get("http://uk.ign.com/games/reviews", headers={'User-Agent': 'Mozilla/5.0'})
webPage.raise_for_status()
webPage = bs4.BeautifulSoup(webPage.text, "html.parser")

# Me trying different selections to try extract the right part of the page
webLinks = webPage.select(".item-title")
webLinks2 = webPage.select("h3")
webLinks3 = webPage.select("div item-title")
print(type(webLinks))
print(type(webLinks2))
print(type(webLinks3))
# I think this is where I've gone wrong. These all returning empty lists.
# What am I doing wrong?

lenLinks = min(5, len(webLinks))
for i in range(lenLinks):
    webbrowser.open('http://uk.ign.com/' + webLinks[i].get('href'))

1 answer

User

#1 · Posted on 2024-04-25 09:17:16

With BeautifulSoup from bs4 and the soup object it returns (your webPage), you can call:

webLinks = webPage.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

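find_all returns a list of elements matching the given tag name (in your case, a). These are HTML elements; to get the links themselves you need to go one step further. You can access an HTML element's attributes the same way you would access a dict (in this case, you want href):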

for a in webPage.find_all('a', href=True):
    print("Found the URL:", a['href'])
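
To make the steps above concrete, here is a minimal sketch that runs without hitting the network. The sample markup and the extract_links helper are hypothetical and only for illustration; the real IGN page may be structured quite differently. It also swaps the string concatenation from the question for urllib.parse.urljoin, on the assumption that some href values are relative and some are already absolute.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used only for illustration.
BASE_URL = "http://uk.ign.com/"
SAMPLE_HTML = """
<div class="item-title"><a href="/games/example-one">Example One</a></div>
<div class="item-title"><a href="http://example.com/two">Example Two</a></div>
<div class="item-title"><a>No link here</a></div>
"""


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


links = extract_links(SAMPLE_HTML, BASE_URL)
for link in links:
    print("Found the URL:", link)
```

With a live page, you would pass webPage.text from the requests call in place of SAMPLE_HTML. Filtering on href=True also avoids the None you would get from .get('href') on an element that has no such attribute.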

For details, see "BeautifulSoup getting href". And of course, there are the docs.

P.S. Python code is usually written in snake_case rather than CamelCase :)
