Help with error handling and searching for a string using BeautifulSoup/urllib

import urllib.request, time, unicodedata
from bs4 import BeautifulSoup

num = 0

def index():
    index = open('index.html', 'w')
    for x in range(len(titles)-1):
        index.write("<a href="+'"'+tocrawl[x]+'"'+" "+"target=" "blank"" >"+titles[x+1]+"</a></br>\n")
    index.close()
    return 'Index Created'

def crawl(args):
    page = urllib.request.urlopen(args).read()
    soup = BeautifulSoup(page)
    soup.prettify().encode('UTF-8')
    titles.append(str(soup.title.string.encode('utf-8'),encoding='utf-8'))
    for anchor in soup.findAll('a', href=True):
        if str(anchor['href']).startswith(https) or str(anchor['href']).startswith(http):
            if anchor['href'] not in tocrawl:
                if anchor['href'].endswith(searchfor):
                    print(anchor['href'])
                if not anchor['href'].endswith('.png') and not anchor['href'].endswith('.jpg'):
                    tocrawl.append(anchor['href'])

tocrawl, titles, descriptions, scripts, results = [], [], [], [], []
https = 'https://'
http = 'http://'
next = 3
crawl('http://google.com/')
while 1:
    crawl(tocrawl[num])
    num = num + 1
    if num==next:
        index()
        next = next + 3

1 Answer

Answer 1 · Posted 2024-04-20 10:55:57

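Handling error codes:
When you try to open a URL and the server responds with an error, urllib raises an HTTPError, which conveniently carries the HTTP status code and a reason (i.e., a short string). If you want to ignore such errors, wrap the call in a try / except block and do nothing in the handler: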

try:
    page = urllib.request.urlopen(args).read()
    # ...
except urllib.error.HTTPError as e:
    # we don't care about no stinking errors
    # ... but if we did, e.code would have the http status code...
    # ... and e.reason would have an explanation of the error (hopefully)
    pass
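
To make the pattern above concrete, here is a minimal sketch of a fetch helper that the crawler could call instead of urlopen directly. The name fetch and its opener parameter are hypothetical (they are not in the original code; opener exists only so the helper can be exercised without network access). It also catches URLError, the parent class of HTTPError, on the assumption that failures such as an unreachable host should be skipped too.

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request failed.

    `fetch` and `opener` are illustrative names, not part of the original code.
    """
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (404, 500, ...).
        print(f"skipping {url}: HTTP {e.code} ({e.reason})")
    except urllib.error.URLError as e:
        # No usable answer at all (DNS failure, refused connection, ...).
        # HTTPError is a subclass of URLError, so this clause must come second.
        print(f"skipping {url}: {e.reason}")
    return None
```

Inside crawl, the first line could then become page = fetch(args), followed by an early return when page is None, so one bad link does not stop the whole loop.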

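Searching for a string in the page:
BeautifulSoup is remarkably capable; its find method (and find_all method) supports a keyword argument text (named string in newer Beautiful Soup releases), which accepts a regular expression to match against the text in the page. In your case, since you only need to confirm that the text is present, it is probably enough to check that the find method returns a result.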

import re

if soup.find(text=re.compile('my search string')):
    pass  # do something

More details about the text argument can be found in the documentation.
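
As a self-contained illustration of the search described above, the following sketch wraps the check in a small function. The name page_contains and the sample HTML are made up for the example. It assumes a Beautiful Soup version that accepts the string keyword (the newer name for text) and uses the built-in "html.parser" backend. Note that this kind of search matches individual text nodes, so a phrase split across several tags would not be found.

```python
import re

from bs4 import BeautifulSoup


def page_contains(html, needle):
    """Return True if any single text node in `html` contains `needle`.

    `page_contains` is a hypothetical helper, not part of the original code.
    """
    soup = BeautifulSoup(html, "html.parser")
    # re.escape makes the needle match literally, even if it contains
    # characters such as '.' or '?' that are special in regular expressions.
    pattern = re.compile(re.escape(needle))
    return soup.find(string=pattern) is not None


sample = (
    "<html><head><title>Demo</title></head>"
    "<body><p>my search string is here</p></body></html>"
)

if page_contains(sample, "my search string"):
    print("found it")
```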
