Python urllib: skipping links on HTTP error or bad URL

Posted 2024-04-24 14:04:12


How do I modify this script to skip a URL when the connection times out or the URL is invalid/returns a 404?

Python

#!/usr/bin/python

#parser.py: Downloads Bibles and parses all data within <article> tags.

__author__      = "Cody Bouche"
__copyright__   = "Copyright 2012 Digital Bible Society"

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    with open(full_path, 'wb') as out:
        out.write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")

2 Answers

Try putting your urlopen line inside a try/except statement. Look this up:

Section 8.3 of docs.python.org/tutorial/errors.html

Look at the different exceptions, and when you encounter one, restart the loop with the continue statement.
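
For instance, here is a minimal sketch of that approach, assuming Python 2's urllib2 as in the question's script; the urls list below is a hypothetical placeholder, and the parsing/saving logic is elided:

import urllib2
from urllib2 import HTTPError, URLError

urls = ['http://example.com/good', 'http://example.com/missing']  # placeholder list

for url in urls:
    try:
        # timeout (in seconds) guards against hanging connections
        f = urllib2.urlopen(url, timeout=10)
        s = f.read()
    except HTTPError as e:
        # 404 and any other HTTP error status lands here
        print('skipping %s: HTTP %d' % (url, e.code))
        continue  # restart the loop with the next URL
    except URLError as e:
        # timeouts, DNS failures, refused connections, etc.
        print('skipping %s: %s' % (url, e.reason))
        continue
    # ...parse and save s as in the original script...
    print('fetched %s (%d bytes)' % (url, len(s)))

Note that HTTPError is a subclass of URLError, so it must be caught first to get at the status code.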

To apply a timeout to the request, add the timeout argument to your call to urlopen. From the docs:

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

See the section on how to handle exceptions with urllib2 in this guide. Actually, I found the whole guide very useful.

The HTTP status code for a request timeout is 408. To sum up, if you want to handle the timeout exception, you would do:

from urllib2 import urlopen, URLError

try:
    response = urlopen(req, timeout=3)  # 3-second timeout; pass it by keyword, since the second positional argument of urlopen is data
except URLError as e:
    if hasattr(e, 'code'):  # only HTTPError instances carry an HTTP status code
        if e.code == 408:
            print 'Timeout ', e.code
        if e.code == 404:
            print 'File Not Found ', e.code
        # etc etc
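
One caveat: a client-side timeout in urllib2 typically surfaces as a URLError whose reason is a socket.timeout (or as a raw socket.timeout raised during read()), not as an HTTPError with code 408, so the hasattr(e, 'code') branch above will not see it. To skip those cases as well, also handle the branch where e has no code attribute, or catch socket.timeout explicitly.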
