Unable to retrieve the href on this particular page using Beautiful Soup

# -*- coding: ascii -*-
# import libraries
from bs4 import BeautifulSoup
import urllib2
import re


def gethyperLinks(url):
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    hyperlinks = []
    for link in soup.findAll('div', attrs={'class': 'ess-product-desc'}):
        hyperlinks.append(link.get('href'))
    return hyperlinks


print(gethyperLinks("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1"))

<div class="ess-product-desc" ng-hide="currentView == 'detail' `&& deviceType=='mobile'" ui-sref="detail({itemId: 'BWK6400', uom: 'CT', cm_sp:'', merchPreference:''})" href="#/itemDetail?`itemId=BWK6400&uom=CT" aria-hidden="false"> <span>Center-Pull Hand Towels, 2-Ply, Perforated, 7 7/8 x 10, White, 600/RL, 6 RL/CT</span> </div>

2 Answers

User

Answer #1 · edited 2024-04-18 11:10:30

Perhaps you should use the "html5lib" parser instead of html.parser, like this:

from bs4 import BeautifulSoup
html="""
<div 
    class="ess-product-desc" ng-hide="currentView == 'detail' `&amp;&amp; deviceType=='mobile'" 
    ui-sref="detail({itemId: 'BWK6400', uom: 'CT', cm_sp:'', merchPreference:''})" 
    href="#/itemDetail?`itemId=BWK6400&amp;uom=CT" aria-hidden="false">
        <span>Center-Pull Hand Towels, 2-Ply, Perforated, 7 7/8 x 10, White, 600/RL, 6 RL/CT</span>
</div>
"""
soup = BeautifulSoup(html,"html5lib")
links = soup.findAll('div', attrs={'class': 'ess-product-desc'})
links[0].get("href")

You will get:

'#/itemDetail?`itemId=BWK6400&uom=CT'
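
As a cross-check that needs no third-party parser, the same lookup can be sketched with the standard library's html.parser.HTMLParser. The class name DivHrefCollector is made up for this illustration and is not part of Beautiful Soup. The sketch also assumes the div is actually present in the HTML being parsed, which holds for the pasted snippet but not necessarily for the raw response from the live page.

```python
from html.parser import HTMLParser


class DivHrefCollector(HTMLParser):
    """Hypothetical helper: collect href values from <div> tags with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if tag == "div" and self.css_class in classes and "href" in attr_map:
            self.hrefs.append(attr_map["href"])


# Same markup as in the answer above.
html = """
<div
    class="ess-product-desc" ng-hide="currentView == 'detail' `&amp;&amp; deviceType=='mobile'"
    ui-sref="detail({itemId: 'BWK6400', uom: 'CT', cm_sp:'', merchPreference:''})"
    href="#/itemDetail?`itemId=BWK6400&amp;uom=CT" aria-hidden="false">
        <span>Center-Pull Hand Towels, 2-Ply, Perforated, 7 7/8 x 10, White, 600/RL, 6 RL/CT</span>
</div>
"""

collector = DivHrefCollector("ess-product-desc")
collector.feed(html)
print(collector.hrefs)
```

If the list comes back empty for HTML fetched from the live site, that suggests the div is not in the downloaded markup at all, rather than a parser problem.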

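User

Answer #2 · edited 2024-04-18 11:10:30

The values on this page only appear after JavaScript has run. That should be clear if you inspect the response you get back (at least for a plain request). Below is an example using selenium, which gives the JavaScript time to run. You could turn it into a function that returns data from each page you navigate to during a scraping session.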

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Run Chrome without opening a visible window.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(options=chrome_options)
driver.get("http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1")
# Wait up to 10 seconds for the JavaScript-rendered product links to appear.
links = WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".ess-product-brand + [href]")))
results = [link.get_attribute('href') for link in links]
print(results)
driver.quit()
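
One way to see why a plain fetch of this URL comes back without the product links: everything after the # is a URL fragment, which is meant for the client (here, the page's JavaScript router) and is not part of what the server uses to build its response. A small standard-library sketch, with variable names chosen only for illustration, splits the URL to show what is actually requested and what is left for the script to interpret:

```python
from urllib.parse import urldefrag, urlsplit, parse_qs

page_url = "http://biggestbook.com/ui/catalog.html#/search?cr=1&rs=12&st=BM&category=1"

# Separate the part sent to the server from the client-side fragment.
requested, fragment = urldefrag(page_url)

# The fragment itself looks like a path plus query string used by the front end.
route = urlsplit(fragment)
route_params = parse_qs(route.query)

print(requested)     # http://biggestbook.com/ui/catalog.html
print(route.path)    # /search
print(route_params)  # {'cr': ['1'], 'rs': ['12'], 'st': ['BM'], 'category': ['1']}
```

So the server only ever sees a request for catalog.html, and the search itself happens in the browser afterwards.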

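There is an API that gets called, with query string parameters, and it returns the data in JSON format. You have to pass a referer and a token. If you can obtain the token, or pass one along within a session (while it is still valid), and can decipher the query string parameters, then this could be the basis for a requests-based approach. Not sure about urllib.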

https://api.essendant.com/digital/digitalservices/search/v1/search?cr=1&fc=1&listKey=I:D2F9CC81D2919D8712B61A3176A518622A2764B16287CA6576B9CF0C9B5&listKey=I:A81AAA8BD639792D923386B93AC32AC535673530AFBB7A25CAB5AB2E933EAD1&rs=12&st=BM&vc=n
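
To get started on deciphering the query string, the URL above can be taken apart and reassembled with the standard library. This is only a sketch: the request is built but never sent, the Referer value is a guess based on the page URL, and the header that carries the token is unknown, so token_header and the token value below are placeholders to be replaced with whatever the browser's network tab shows.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
from urllib.request import Request

api_url = (
    "https://api.essendant.com/digital/digitalservices/search/v1/search"
    "?cr=1&fc=1"
    "&listKey=I:D2F9CC81D2919D8712B61A3176A518622A2764B16287CA6576B9CF0C9B5"
    "&listKey=I:A81AAA8BD639792D923386B93AC32AC535673530AFBB7A25CAB5AB2E933EAD1"
    "&rs=12&st=BM&vc=n"
)

parts = urlsplit(api_url)
# parse_qsl keeps order and repeated keys (listKey appears twice).
params = parse_qsl(parts.query)
for key, value in params:
    print(key, "=", value)


def build_search_request(params, referer, token, token_header="Authorization"):
    """Build (but do not send) a request for the search endpoint.

    token_header is a placeholder: the real header name is an assumption.
    """
    query = urlencode(params, safe=":")  # keep ':' readable, as in the original URL
    url = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    return Request(url, headers={"Referer": referer, token_header: token})


req = build_search_request(
    params,
    referer="http://biggestbook.com/ui/catalog.html",  # assumed referer
    token="PLACEHOLDER-TOKEN",                          # not a real token
)
print(req.full_url == api_url)
```

Once the real header name and a valid token are known, the same parameters could be edited in the params list and the request passed to urllib.request.urlopen or rebuilt with requests.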
