Python: HTML table data not found when the script runs on the server

Posted 2024-06-11 14:00:14


Hi, my code doesn't work when it actually runs online: when I use `find` it returns None. How can I fix this?

Here is my code:

import time
import sys

import urllib
import re
from bs4 import BeautifulSoup, NavigableString

print "Initializing Python Script"

print "The passed arguments are "
urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/", "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/", "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798", "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]
i =0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
word = "tweakers"
alternate = "alternate"
while i<len(urls):

  dataraw = urllib.urlopen(urls[i])
  data = dataraw.read()
  soup = BeautifulSoup(data)
  table = soup.find("table", {"class" : "spec-detail"})
  print table
  i+=1

The output looks like this:

Initializing Python Script
The passed arguments are 
None
None
None
None


Script finalized

I've tried using findAll and other methods, but I can't figure out why it works on my command line and not on the server itself. Any help?

Edit:

Traceback (most recent call last):
  File "python_script.py", line 35, in <module>
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

1 Answer

Answered 2024-06-11 14:00:14

I suspect you are experiencing differences between parsers.

Specifying the parser explicitly works for me:

import urllib2
from bs4 import BeautifulSoup

urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

for url in urls:
    soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table

In this example I'm using html.parser, but feel free to experiment and specify lxml or html5lib instead.
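The parser choice matters because each backend can build a slightly different tree from the same (often imperfect) markup, and `find` returns None whenever the element isn't in the tree that was actually built. A minimal self-contained sketch of this pattern, using a made-up HTML snippet rather than the live tweakers.net page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real specs page.
html = "<table class='spec-detail'><tr><td>GPU clock</td><td>1046 MHz</td></tr></table>"

# Naming the parser explicitly pins down which tree you get,
# instead of letting bs4 pick whatever backend is installed.
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", {"class": "spec-detail"})
print(table.td.get_text())                        # GPU clock

# A class that is not present in the markup yields None,
# which is exactly the symptom described in the question.
print(soup.find("table", {"class": "missing"}))   # None
```

Without the second argument, BeautifulSoup silently falls back to the "best" parser available on that machine, which is why the same script can behave differently on your command line and on the server.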

Note that the third URL does not contain a table with class="spec-detail", so None is printed for it.

I've also made a few improvements:

  • removed unused imports
  • replaced the while loop with indexing by a nice for loop
  • removed extraneous code
  • replaced urllib with urllib2

You can also use the requests module and set an appropriate User-Agent header, pretending to be a real browser:

from bs4 import BeautifulSoup
import requests

urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}
for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table

Hope that helps.
