Python: HTML table data not found when the script runs on the server

Posted 2024-06-11 14:00:14


Hi, my code doesn't work when it actually runs online: when I use `find` it returns None. How can I fix this?

Here is my code:

import time
import sys

import urllib
import re
from bs4 import BeautifulSoup, NavigableString

print "Initializing Python Script"

print "The passed arguments are "
urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/", "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/", "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798", "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]
i =0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
word = "tweakers"
alternate = "alternate"
while i<len(urls):

  dataraw = urllib.urlopen(urls[i])
  data = dataraw.read()
  soup = BeautifulSoup(data)
  table = soup.find("table", {"class" : "spec-detail"})
  print table
  i+=1

The output looks like this:

Initializing Python Script
The passed arguments are 
None
None
None
None


Script finalized

I've tried using findAll and other methods, but I can't figure out why it works on my command line and not on the server itself. Any help?

Edit:

Traceback (most recent call last):
  File "python_script.py", line 35, in <module>
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

1 Answer

Answered 2024-06-11 14:00:14

I suspect you are experiencing differences between parsers.

Specifying the parser explicitly works for me:

import urllib2
from bs4 import BeautifulSoup

urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

for url in urls:
    soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table

In this example I'm using html.parser, but feel free to experiment and specify lxml or html5lib instead.
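The parser choice matters because each backend can build a slightly different tree from the same (often imperfect) markup, and `find` returns None whenever the element isn't in the tree that was actually built. A minimal self-contained sketch of this pattern, using a made-up HTML snippet rather than the live tweakers.net page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real specs page.
html = "<table class='spec-detail'><tr><td>GPU clock</td><td>1046 MHz</td></tr></table>"

# Naming the parser explicitly pins down which tree you get,
# instead of letting bs4 pick whatever backend is installed.
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", {"class": "spec-detail"})
print(table.td.get_text())                        # GPU clock

# A class that is not present in the markup yields None,
# which is exactly the symptom described in the question.
print(soup.find("table", {"class": "missing"}))   # None
```

Without the second argument, BeautifulSoup silently falls back to the "best" parser available on that machine, which is why the same script can behave differently on your command line and on the server.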

Note that the third URL does not contain a table with class="spec-detail", so None is printed for it.

I've also made a few improvements:

  • removed unused imports
  • replaced the while loop with indexing by a nice for loop
  • removed extraneous code
  • replaced urllib with urllib2

You can also use the requests module and set an appropriate User-Agent header, pretending to be a real browser:

from bs4 import BeautifulSoup
import requests

urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}
for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table

Hope that helps.
