Python HTMLParser: AttributeError

Posted 2024-06-16 11:33:16


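I am using HTMLParser (Python 2.7) to parse pages pulled down with urllib2, and I get an AttributeError exception when I try to store data in lists during the feed method. However, if I comment out the __init__ method, the exception goes away.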


In main.py

# -*- coding: utf-8 -*-
from HTMLParser import HTMLParser
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf-8')


class MyHTMLParser(HTMLParser):
    def __init__(self):
        self.terms = []
        self.definitions = []

    def handle_starttag(self, tag, attrs):
        # retrieve the terms
        if tag == 'div':
            for attribute, value in attrs:
                if value == 'word':
                    self.terms.append(attrs[1][1])
        # retrieve the definitions
                if value == 'desc':
                    if attrs[1][1]:
                        self.definitions.append(attrs[1][1])
                    else:
                        self.definitions.append(None)


parser = MyHTMLParser()
# open the page and retrieve its source
response = urllib2.urlopen('http://localhost/')
html = response.read().decode('utf-8')
response.close()

# extract the terms and definitions
parser.feed(html)

Output

[AttributeError traceback not preserved]

2 answers

I think you are not initializing HTMLParser correctly. Maybe you do not need to initialize it at all. This works for me:

# -*- coding: utf-8 -*-
from HTMLParser import HTMLParser
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf-8')


class MyHTMLParser(HTMLParser):  
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
        # retrieve the terms
        if tag == 'div':
            for attribute, value in attrs:
                if value == 'word':
                    self.terms.append(attrs[1][1])
        # retrieve the definitions
                if value == 'desc':
                    if attrs[1][1]:
                        self.definitions.append(attrs[1][1])
                    else:
                        self.definitions.append(None)


parser = MyHTMLParser()
# open the page and retrieve its source
response = urllib2.urlopen('http://localhost/')
html = response.read().decode('utf-8')
response.close()

# extract the terms and definitions
parser.feed(html)

Update

[updated code not preserved]

Output:

['center','left']

['center','left']
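
To see why the initialization matters, here is a minimal sketch. It assumes the standard-library parser sets up its internal buffer in the base-class `__init__` (via `reset()`), so a subclass that overrides `__init__` without calling the base version has no buffer when `feed` runs. The names `BrokenParser`, `InheritingParser`, and `feed_fails`, and the sample markup, are hypothetical and only for illustration. The import falls back to the Python 2.7 module used in the question.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as in the question


class BrokenParser(HTMLParser):
    def __init__(self):
        # The base-class __init__ is never called here, so the parser's
        # internal state (normally created by reset()) does not exist.
        self.terms = []


class InheritingParser(HTMLParser):
    # No __init__ override: the base-class __init__ runs as usual.
    pass


def feed_fails(parser_cls):
    """Return True if feeding a small snippet raises AttributeError."""
    try:
        parser_cls().feed('<div class="word" title="center"></div>')
    except AttributeError:
        return True
    return False


print(feed_fails(BrokenParser))      # True
print(feed_fails(InheritingParser))  # False
```

This matches the observation in the question: removing the `__init__` override makes the exception disappear, but it also removes the place where the result lists were created.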

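OK, I found the solution: super().__init__ does not work here (in Python 2.7, HTMLParser is an old-style class, which super() does not support), so the base class name has to be hard-coded: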

def __init__(self):
    HTMLParser.__init__(self)
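
Putting the fix together with the parser from the question might look like the sketch below. It calls the base-class `__init__` explicitly before creating the lists, and keeps the question's convention that the text of interest is the value of the second attribute (`attrs[1][1]`). The sample markup and its `title` attribute are assumptions made for illustration, not the page from the question, and the `len(attrs) < 2` guard is an addition.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as in the question


class MyHTMLParser(HTMLParser):
    def __init__(self):
        # Initialise the base class first so its internal state exists.
        HTMLParser.__init__(self)
        self.terms = []
        self.definitions = []

    def handle_starttag(self, tag, attrs):
        # Added guard: skip tags that cannot have a second attribute.
        if tag != 'div' or len(attrs) < 2:
            return
        second_value = attrs[1][1]
        for _name, value in attrs:
            if value == 'word':
                self.terms.append(second_value)
            elif value == 'desc':
                # Store None when the second attribute has no value.
                self.definitions.append(second_value or None)


# Hypothetical sample input standing in for the downloaded page.
sample = (
    '<div class="word" title="center"></div>'
    '<div class="desc" title="left"></div>'
)

parser = MyHTMLParser()
parser.feed(sample)
print(parser.terms)        # ['center']
print(parser.definitions)  # ['left']
```

On Python 3, where every class is new-style, `super().__init__()` would also work in place of the explicit `HTMLParser.__init__(self)` call.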
