How do I apply a "catch-all" exception clause to a complex Python web-scraping script?

0 votes
2 Answers
675 views
Asked 2025-04-15 14:50

I have a CSV-format list of 100 websites. The sites all have roughly the same layout: each contains one large table with 7 columns. I wrote a script to extract the data from the 7th column of each site and write it to a file. The script below partially works, but when I open the output file after a run I find that some data was skipped: only 98 records were written (and the script evidently hit a couple of exceptions along the way). Any advice on how to handle the "exceptions" in this situation would be much appreciated. Thanks!

import csv, urllib2, re
def replace(variab): return variab.replace(",", " ")

urls = csv.reader(open('input100.txt', 'rb'))  #access list of 100 URLs
for url in urls:
    html = urllib2.urlopen(url[0]).read()  #get HTML starting with the first URL
    col7 = re.findall('td7.*?td', html)  #use regex to get data from column 7
    string = str(col7)  #stringify data
    neat = re.findall('div3.*?div', string)  #use regex to get target text  
    result = map(replace, neat)  #apply function to remove','s from elements
    string2 = ", ".join(result)  #separate list elements with ', ' for export to csv
    output = open('output.csv', 'ab') #open file for writing 
    output.write(string2 + '\n') #append output to file and create new line
    output.close()

This returns:

Traceback (most recent call last):
 File "C:\Python26\supertest3.py", line 6, in <module>
  html = urllib2.urlopen(url[0]).read()
 File "C:\Python26\lib\urllib2.py", line 124, in urlopen
  return _opener.open(url, data, timeout)
 File "C:\Python26\lib\urllib2.py", line 383, in open
  response = self._open(req, data)
 File "C:\Python26\lib\urllib2.py", line 401, in _open
  '_open', req)
 File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
  result = func(*args)
 File "C:\Python26\lib\urllib2.py", line 1130, in http_open
  return self.do_open(httplib.HTTPConnection, req)
 File "C:\Python26\lib\urllib2.py", line 1103, in do_open
  r = h.getresponse()
 File "C:\Python26\lib\httplib.py", line 950, in getresponse
  response.begin()
 File "C:\Python26\lib\httplib.py", line 390, in begin
  version, status, reason = self._read_status()
 File "C:\Python26\lib\httplib.py", line 354, in _read_status
  raise BadStatusLine(line)
BadStatusLine
>>>

2 Answers

1

I suggest taking a look at Errors and Exceptions in the Python documentation, in particular section 8.3, Handling Exceptions.
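For reference, a minimal sketch of the try/except pattern that tutorial section describes: catch a specific exception so one failure doesn't abort the whole run. (`safe_divide` is a hypothetical example for illustration, not part of the question's script.)

```python
def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        # Handle the one failure we anticipate and keep going.
        return None

# One bad input no longer stops the loop over the rest.
results = [safe_divide(10.0, d) for d in (2, 0, 5)]
print(results)  # -> [5.0, None, 2.0]
```

The same shape applies to the scraping loop: wrap the risky operation, handle (or record) the failure, and let iteration continue.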

2

Change the body of your for loop to:

for url in urls:
  try:
    ...the body you have now...
  except Exception, e:
    print >>sys.stderr, "Url %r not processed: error (%s)" % (url, e)  # requires import sys

(Or, better, use logging.error from the standard library's logging module instead of that weird print>> — and if you're already using logging, you definitely should ;-))
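To illustrate the logging variant this answer suggests, here is a hedged sketch of the whole loop; `fetch` is a hypothetical stand-in for the `urllib2.urlopen(url).read()` call in the question, and the URLs are made up. (`except Exception as e` is the spelling that works in both Python 2.6+ and Python 3.)

```python
import logging

logging.basicConfig(level=logging.ERROR)

def fetch(url):
    # Hypothetical stand-in for urllib2.urlopen(url).read();
    # raises for URLs we pretend are unreachable.
    if "bad" in url:
        raise IOError("connection failed")
    return "<html>data for %s</html>" % url

urls = ["http://a.example", "http://bad.example", "http://b.example"]
processed = []
for url in urls:
    try:
        processed.append(fetch(url))
    except Exception as e:
        # Log the failure and move on to the next URL,
        # instead of letting one bad site kill the whole run.
        logging.error("Url %r not processed: error (%s)", url, e)

print(len(processed))  # two of the three URLs succeed
```

Applied to the original script, this explains the 98-of-100 result: two sites raised (e.g. `BadStatusLine`), and with the try/except in place they would be logged and skipped rather than crashing the run.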
