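Unknown URL type error in urllib2
I searched Stack Overflow for similar questions but found none that exactly matches my case.
I am trying to download a video with Python 2.7.
This is the code I use to download it: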
import urllib2
from bs4 import BeautifulSoup as bs

with open('video.txt','r') as f:
    last_downloaded_video = f.read()

webpage = urllib2.urlopen('http://*.net/watch/**-'+last_downloaded_video)
soup = bs(webpage)

a = []
for link in soup.find_all('a'):
    if link.has_attr('data-video-id'):
        a.append(link)

#try just with first data-video-id
id = a[0]['data-video-id']
webpage2 = urllib2.urlopen('http://*/video/play/'+id)
soup = bs(webpage2)
string = str(soup.find_all('script')[2])
print string

url = string.split(': ')[1].split(',')[0]
url = url.replace('"','')
print url
print type(url)

video = urllib2.urlopen(url).read()
filename = "video.mp4"
with open(filename,'wb') as f:
    f.write(video)
This code fails with an unknown URL type error. The traceback is:
Traceback (most recent call last):
  File "naruto.py", line 26, in <module>
    video = urllib2.urlopen(url).read()
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 427, in _open
    'unknown_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1247, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: 'http>
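However, when I store the same URL in a variable and download it from the terminal, no error occurs. I am confused by this.
I saw a similar problem on the Python mailing list.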
1 Answer
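Without seeing the HTML of the page you are scraping, it is hard to tell what is going wrong. However, an extra ' (single quote) character at the start of the URL could be the cause. Note the quote in front of http in your error message. A leading quote produces the same error: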
>>> import urllib2
>>> urllib2.urlopen("'http://blah.com")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "urllib2.py", line 404, in open
    response = self._open(req, data)
  File "urllib2.py", line 427, in _open
    'unknown_open', req)
  File "urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "urllib2.py", line 1249, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: 'http>
So try cleaning up your URL and removing the stray quote.
Update based on the asker's feedback:
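One way to confirm a stray quote is to print the repr() of the string rather than the string itself, since repr() shows exactly which characters the string contains. The following is only a sketch: the value of url is a made-up stand-in for whatever your scraping code extracted, and the names shown and has_stray_quotes are mine.

```python
# Hypothetical value standing in for the string extracted from the page.
url = "'http://blah.com"

# repr() wraps the value in its own quotes, so a quote character that is
# part of the data becomes visible inside them.
shown = repr(url)
print(shown)

# Quick check for a quote character at either end of the string.
has_stray_quotes = url[:1] in ('"', "'") or url[-1:] in ('"', "'")
print(has_stray_quotes)
```

If the check reports a quote at either end, the URL needs cleaning before it is passed to urlopen().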
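The printed output shows a single quote character at both the start and the end of the URL. When you pass a URL to urlopen(), it must not be wrapped in quotes of any kind. You can strip quotes (both single and double) from the start and end of the URL string with: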
url = url.strip('\'"')
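If you want to catch this kind of problem earlier, you could wrap the cleanup in a small helper that also checks the scheme before calling urlopen(). This is a minimal sketch, not part of your original code: clean_url is a hypothetical helper name, the example URL is made up, and I am assuming you only expect http or https links.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7

def clean_url(raw):
    """Strip surrounding whitespace and quotes, then verify the scheme."""
    url = raw.strip().strip('\'"')
    scheme = urlparse(url).scheme
    # Assumption: only http and https URLs are expected here.
    if scheme not in ('http', 'https'):
        raise ValueError('unexpected URL scheme %r in %r' % (scheme, raw))
    return url

# Made-up example of a URL that still carries its quotes.
print(clean_url("'http://blah.com/video.mp4'"))
```

With this in place, a badly extracted string fails with a ValueError that shows the raw value, instead of a URLError raised deep inside urllib2.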