Python script to scrape HTML output works sometimes and fails other times
I am trying to scrape links from Yahoo's search results with the Python code below. I use mechanize to emulate a browser instance and Beautiful Soup to parse the HTML.
The problem is that the script sometimes runs fine and sometimes fails with the following error:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Clearly the problem has to do with encoding/decoding or gzip compression, but why does it work sometimes and not others? And how can I make it work every time?
Here is the code. Run it 7 or 8 times and you will notice the problem.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import mechanize
import urllib
from bs4 import BeautifulSoup
import re
#mechanize emulates a Browser
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent','chrome')]
term = "stock market".replace(" ","+")
query = "https://search.yahoo.com/search?q=" + term
htmltext = br.open(query).read()
htm = str(htmltext)
soup = BeautifulSoup(htm)
#Since all results are located in the ol tag
search = soup.findAll('ol')
searchtext = str(search)
#Using BeautifulSoup to parse the HTML source
soup1 = BeautifulSoup(searchtext)
#Each search result is contained within div tag
list_items = soup1.findAll('div', attrs={'class':'res'})
#List of first search result
list_item = str(list_items)
for li in list_items:
    list_item = str(li)
    soup2 = BeautifulSoup(list_item)
    link = soup2.findAll('a')
    print link[0].get('href')
    print ""
Here is a screenshot of the output: http://pokit.org/get/img/1d47e0d0dc08342cce89bc32ae6b8e3c.jpg
1 Answer
I ran into encoding problems on a project, so I wrote a function to find the encoding of the page I was scraping. With it you can decode the page to unicode, which should help avoid these errors when running your function. As for the compression, you need to make your code handle compressed responses so that it still works when it receives one.
from bs4 import BeautifulSoup, UnicodeDammit
import chardet
import re
def get_encoding(soup):
    """
    This is a method to find the encoding of a document.
    It takes in a Beautiful Soup object and retrieves the values of that document's meta tags.
    It checks for a meta charset first. If that exists it returns it as the encoding.
    If charset doesn't exist it checks for content-type and then content to try and find it.
    """
    encod = soup.meta.get('charset')
    if encod == None:
        encod = soup.meta.get('content-type')
        if encod == None:
            content = soup.meta.get('content')
            match = re.search('charset=(.*)', content)
            if match:
                encod = match.group(1)
            else:
                dic_of_possible_encodings = chardet.detect(unicode(soup))
                encod = dic_of_possible_encodings['encoding']
    return encod
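For example, here is a minimal sketch of how you might use it, reusing the br and query objects from your question (the 'replace' error handler is just one choice; you could also use 'ignore' or let a failure raise):

raw = br.open(query).read()
soup = BeautifulSoup(raw)
encoding = get_encoding(soup)
if encoding:
    # re-parse from a properly decoded unicode string
    soup = BeautifulSoup(raw.decode(encoding, 'replace'))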
Here is a link about handling compressed data: http://www.diveintopython.net/http_web_services/gzip_compression.html
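Following the approach on that page, a minimal sketch of decompressing a gzipped response might look like this (it assumes the server sets the Content-Encoding header and that you reuse the br and query objects from your question):

import gzip
import StringIO

response = br.open(query)
data = response.read()
# if the body is gzip-compressed, decompress it before parsing
if response.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()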
And a link to this question: Checking if a gzip file exists in Python
import os

if any(os.path.isfile(f) for f in ['bob.asc', 'bob.asc.gz']):
    print 'yay'