Python HTML parser

Question

I'm parsing an HTML document with HTMLParser, and I want to print the content between the <p> and </p> tags.

Here is my code snippet:

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            print("TODO: print the contents")

Answer 1

I found that this approach didn't work in my code, so I defined tag_stack = [] outside the class, as a kind of global variable.

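from html.parser import HTMLParser

# Module-level stack of the tags that are currently open
tag_stack = []

class MONanalyseur(HTMLParser):

    def handle_starttag(self, tag, attrs):
        tag_stack.append(tag.lower())

    def handle_endtag(self, tag):
        # Guard against a stray end tag with nothing left to pop
        if tag_stack:
            tag_stack.pop()

    def handle_data(self, data):
        # Print only text whose innermost enclosing tag is <head>
        if tag_stack and tag_stack[-1] == 'head':
            print(data)

parser = MONanalyseur()
# feed() requires the HTML text; this sample string is a placeholder
parser.feed('<head>test</head>')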

Answer 2

Based on what @tauran posted, you probably want to do something like this:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def print_p_contents(self, html):
        self.tag_stack = []  # tags that are currently open
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag.lower())

    def handle_endtag(self, tag):
        if self.tag_stack:  # ignore a stray end tag
            self.tag_stack.pop()

    def handle_data(self, data):
        # Print only text whose innermost enclosing tag is <p>
        if self.tag_stack and self.tag_stack[-1] == 'p':
            print(data)

p = MyHTMLParser()
p.print_p_contents('<p>test</p>')

Now you might want to push all the <p> contents into a list and return that list as the result, or do something else along those lines.
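A minimal sketch of that list-returning variant, written for Python 3 (`html.parser`). The names `ParagraphCollector` and `get_p_contents` are made up for this example and are not part of the library. It keeps the same rule as the code above, so only text whose innermost open tag is <p> is collected.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element into a list (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tag_stack = []   # currently open tags, innermost last
        self.paragraphs = []  # one string per <p> element seen so far

    def handle_starttag(self, tag, attrs):
        self.tag_stack.append(tag)
        if tag == 'p':
            self.paragraphs.append('')  # start a new, empty paragraph

    def handle_endtag(self, tag):
        if self.tag_stack:
            self.tag_stack.pop()

    def handle_data(self, data):
        # Same rule as above: only text directly inside <p> is kept
        if self.tag_stack and self.tag_stack[-1] == 'p':
            self.paragraphs[-1] += data


def get_p_contents(html):
    """Parse the given HTML string and return the list of <p> texts."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()  # flush anything the parser is still buffering
    return collector.paragraphs


print(get_p_contents('<p>first</p><div>skip</div><p>second</p>'))
# ['first', 'second']
```

Returning a list instead of printing inside handle_data leaves it to the caller to decide what to do with the text.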

Today I learned: when you work with this kind of library, you need to think in terms of stacks!
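To make the stack idea concrete, here is a rough sketch of two cases where checking only the top of the stack falls short: text inside a nested inline tag such as <b>, and void elements such as <br>, which produce a start tag but never an end tag. The class name, the `paragraph_text_chunks` helper and the partial `VOID_TAGS` set are assumptions of this example, not part of `html.parser`.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed.
# This is a partial list chosen for the sketch.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class NestedParagraphParser(HTMLParser):
    """Keeps text found anywhere inside a <p>, including nested tags."""

    def __init__(self):
        super().__init__()
        self.tag_stack = []  # currently open tags, innermost last
        self.chunks = []     # text pieces found inside any <p>

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        # Pop only when the end tag matches the innermost open tag
        if self.tag_stack and self.tag_stack[-1] == tag:
            self.tag_stack.pop()

    def handle_data(self, data):
        # Look at the whole stack, not just the top, so that text
        # inside <p><b>...</b></p> still counts as paragraph text
        if 'p' in self.tag_stack:
            self.chunks.append(data)


def paragraph_text_chunks(html):
    """Parse the given HTML string and return the text pieces inside <p>."""
    parser = NestedParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(paragraph_text_chunks('<p>one <b>two</b> three<br>four</p>'))
# ['one ', 'two', ' three', 'four']
```

With a plain top-of-stack check and no special case for <br>, the same input would lose "two" (the top of the stack is b) and "four" (the top is br), and the closing </p> would pop br and leave p on the stack.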

Answer 3

I extended the example from the documentation:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print("Encountered the beginning of a %s tag" % tag)

    def handle_endtag(self, tag):
        print("Encountered the end of a %s tag" % tag)

    def handle_data(self, data):
        print("Encountered data %s" % data)

p = MyHTMLParser()
p.feed('<p>test</p>')

Output:

Encountered the beginning of a p tag
Encountered data test
Encountered the end of a p tag
