Python script to scrape HTML output works sometimes and fails other times
I am trying to scrape links from Yahoo's search results with the Python code below. I use mechanize to emulate a browser instance and Beautiful Soup to parse the HTML.
The problem is that the script sometimes runs fine and sometimes fails with the following error:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Clearly the problem has to do with encoding/decoding or gzip compression, but why does it work sometimes and not others? And how can I make it work every time?
Here is the code. Run it 7 or 8 times and you will notice the problem.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import mechanize
import urllib
from bs4 import BeautifulSoup
import re
#mechanize emulates a Browser
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent','chrome')]
term = "stock market".replace(" ","+")
query = "https://search.yahoo.com/search?q=" + term
htmltext = br.open(query).read()
htm = str(htmltext)
soup = BeautifulSoup(htm)
#Since all results are located in the ol tag
search = soup.findAll('ol')
searchtext = str(search)
#Using BeautifulSoup to parse the HTML source
soup1 = BeautifulSoup(searchtext)
#Each search result is contained within div tag
list_items = soup1.findAll('div', attrs={'class':'res'})
#List of first search result
list_item = str(list_items)
for li in list_items:
    list_item = str(li)
    soup2 = BeautifulSoup(list_item)
    link = soup2.findAll('a')
    print link[0].get('href')
    print ""
Here is a screenshot of the output: http://pokit.org/get/img/1d47e0d0dc08342cce89bc32ae6b8e3c.jpg
1 Answer
I ran into encoding problems on a project, so I wrote a function to find the encoding of the page I was scraping. With it you can decode the page to unicode, which should help avoid these errors when running your function. As for the compression, you need to make your code handle compressed responses so that it still works when it receives one.
from bs4 import BeautifulSoup, UnicodeDammit
import chardet
import re
def get_encoding(soup):
    """
    This is a method to find the encoding of a document.
    It takes in a Beautiful Soup object and retrieves the values of that document's meta tags.
    It checks for a meta charset first. If that exists it returns it as the encoding.
    If charset doesn't exist it checks for content-type and then content to try and find it.
    """
    encod = soup.meta.get('charset')
    if encod == None:
        encod = soup.meta.get('content-type')
        if encod == None:
            content = soup.meta.get('content')
            match = re.search('charset=(.*)', content)
            if match:
                encod = match.group(1)
            else:
                dic_of_possible_encodings = chardet.detect(unicode(soup))
                encod = dic_of_possible_encodings['encoding']
    return encod
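For example, here is a minimal sketch of how you might use it, reusing the br and query objects from your question (the 'replace' error handler is just one choice; you could also use 'ignore' or let a failure raise):

raw = br.open(query).read()
soup = BeautifulSoup(raw)
encoding = get_encoding(soup)
if encoding:
    # re-parse from a properly decoded unicode string
    soup = BeautifulSoup(raw.decode(encoding, 'replace'))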
Here is a link about handling compressed data: http://www.diveintopython.net/http_web_services/gzip_compression.html
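Following the approach on that page, a minimal sketch of decompressing a gzipped response might look like this (it assumes the server sets the Content-Encoding header and that you reuse the br and query objects from your question):

import gzip
import StringIO

response = br.open(query)
data = response.read()
# if the body is gzip-compressed, decompress it before parsing
if response.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()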
And a link to this question: Checking if a gzip file exists in Python
import os

if any(os.path.isfile(f) for f in ['bob.asc', 'bob.asc.gz']):
    print 'yay'