Aborting HTMLParser processing in Python

5 votes

3 answers

2733 views

Asked 2025-04-15 17:32

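When using Python's HTMLParser class, is there a way to abort processing from inside a handle_* method? I have all the data I need early in the document, so parsing the rest seems wasteful. Below is an example that extracts a document's meta description.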

from html.parser import HTMLParser  # Python 3; in Python 2 the module was HTMLParser

class MyParser(HTMLParser):

    # The hook is handle_starttag; a method named handle_start is never called
    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attr_map = dict(attrs)  # attrs is a list of (name, value) pairs
            if (attr_map.get('name') or '').lower() == 'description' and 'content' in attr_map:
                print(attr_map['content'])
                # Would like to tell the parser to stop now,
                # since I have all the data that I need


3 Answers

Building on @shylent's answer, here is my solution:

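from html.parser import HTMLParser

class MyParser(HTMLParser):

    boolean_flag = False

    def handle_starttag(self, tag, attrs):
        # for example: set the flag when the tag of interest opens
        self.boolean_flag = (tag == "sometag" and ("id", "someid") in attrs)

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        # raising here propagates out of feed() and ends parsing
        if self.boolean_flag:
            raise DataParsedException(data)


class DataParsedException(Exception):
    """Carries the extracted data out of the parser."""
    def __init__(self, data):
        super().__init__(data)
        self.data = data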

Usage:

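parser = MyParser()
results = []  # the original called this list `vars`, which shadows a built-in

try:
    # html is assumed to hold the page as bytes, so decode it before feeding
    parser.feed(html.decode())
except DataParsedException as data_parsed:
    results.append(data_parsed.data)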

This gets the job done.
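Applied to the meta description case from the question, the same idea might look like the sketch below. It assumes Python 3 (`html.parser`); the names `DescriptionFound`, `MetaDescriptionParser`, and `extract_description` are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class DescriptionFound(Exception):
    """Hypothetical signal exception carrying the extracted description."""

    def __init__(self, content):
        super().__init__(content)
        self.content = content


class MetaDescriptionParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)  # HTMLParser lowercases attribute names
        name = attr_map.get("name") or ""
        if name.lower() == "description" and attr_map.get("content") is not None:
            # Raising here unwinds out of feed(), so the rest is never parsed
            raise DescriptionFound(attr_map["content"])


def extract_description(html_text):
    """Return the meta description, or None if the document has none."""
    parser = MetaDescriptionParser()
    try:
        parser.feed(html_text)
        parser.close()
    except DescriptionFound as found:
        return found.content
    return None


sample = (
    '<html><head><meta charset="utf-8">'
    '<meta name="description" content="A short summary">'
    "</head><body><p>Lots of markup that is never parsed.</p></body></html>"
)
print(extract_description(sample))
```

Wrapping the try/except in a helper keeps the exception an internal detail, so callers just get a string or None.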

Answered 2025-04-15 by Python大师


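If you use pyparsing's scanString method, you get finer control over how much of the input string is actually processed. For your example, we create an expression that matches <meta> tags and attach a parse action so that only tags with name="description" match. This code assumes you have already read the page's HTML into the variable htmlsrc: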

from pyparsing import makeHTMLTags, withAttribute

# makeHTMLTags creates both open and closing tags, only care about the open tag
metaTag = makeHTMLTags("meta")[0]
metaTag.setParseAction(withAttribute(name="description"))

try:
    # scanString is a generator that returns each match as it is found
    # in the input; next() pulls only the first match
    tokens, startloc, endloc = next(metaTag.scanString(htmlsrc))

    # attributes can be accessed like object attributes if they are
    # valid Python names
    print(tokens.content)

    # if the attribute name clashes with a Python keyword, or is
    # otherwise unsuitable as an identifier, use dict-like access instead
    print(tokens["content"])

except StopIteration:
    print("no matching meta tag found")

Answered 2025-04-15 by Python大师


You can raise an exception and wrap your .feed() call in a try block.
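A minimal sketch of that pattern, assuming Python 3. `StopParsing` and `EarlyStopParser` are hypothetical names chosen for this example. The parser records every start tag it sees, so you can check that nothing after the stop point was handled:

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical exception used only to break out of feed()."""


class EarlyStopParser(HTMLParser):
    def __init__(self, stop_tag):
        super().__init__()
        self.stop_tag = stop_tag
        self.seen_tags = []

    def handle_starttag(self, tag, attrs):
        self.seen_tags.append(tag)
        if tag == self.stop_tag:
            # Propagates through feed() to the caller's try block
            raise StopParsing()


parser = EarlyStopParser(stop_tag="title")
try:
    parser.feed("<html><head><title>Hi</title></head><body><p>text</p></body></html>")
except StopParsing:
    pass  # expected: we asked the parser to stop

# Only tags up to and including the stop tag were handled
print(parser.seen_tags)
```

After an abort like this the instance may still hold buffered input, so it is probably safest to create a fresh parser, or call reset(), before parsing another document.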

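When you decide you are done, you could also call self.reset() (I haven't actually tried this, but the documentation says "Reset the instance. Loses all unprocessed data.", which is exactly what you need).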

Answered 2025-04-15 by Python大师

