How do I strip the content from an HTML page and keep only the HTML tags?

1 answer

User

#1 · Posted 2024-06-10 13:25:40

As Ivar commented, an HTML parser is the only reliable way to handle this kind of problem:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    """Print only the tags of a document, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.indent = -1  # becomes 0 at the first start tag

    def handle_starttag(self, tag, attrs):
        self.indent += 1
        print(2 * self.indent * ' ', end='')
        print(f'<{tag}', end='')
        for name, value in attrs:
            # Valueless attributes (e.g. "disabled") are reported with value None.
            print(f' {name}' if value is None else f' {name}="{value}"', end='')
        print('>')

    def handle_endtag(self, tag):
        print(2 * self.indent * ' ', end='')
        print(f'</{tag}>')
        self.indent -= 1

parser = MyHTMLParser()
parser.feed("""<html>
  <head>
    <title>Test</title>
  </head>
  <body>
    <h1>Heading!</h1>
    <p style="font-weight: bold; color: red;">
       Some text
       <BR/>
       Some more text
    </p>
    <ol>
       <li>Item 1</li>
       <li>Item 2</li>
     </ol>
  </body>
</html>
""")

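Output (tag names are lowercased, and the self-closing <BR/> is reported as both a start tag and an end tag):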

<html>
  <head>
    <title>
    </title>
  </head>
  <body>
    <h1>
    </h1>
    <p style="font-weight: bold; color: red;">
      <br>
      </br>
    </p>
    <ol>
      <li>
      </li>
      <li>
      </li>
    </ol>
  </body>
</html>
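
The parser above assumes every start tag has a matching end tag, so a bare <br> or <img> (no trailing slash) would push the indentation one level too deep for the rest of the document. The following is a minimal sketch of one way around that, not part of the original answer: TagSkeletonParser and VOID_ELEMENTS are hypothetical names, and the variant collects lines in a list instead of printing them so the result can be inspected or written elsewhere.

```python
from html.parser import HTMLParser

# HTML void elements: they never take a closing tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class TagSkeletonParser(HTMLParser):
    """Hypothetical variant of MyHTMLParser that collects lines instead of printing."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # A valueless attribute (e.g. "disabled") arrives with value None.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{value}"'
            for name, value in attrs
        )
        self.lines.append(f"{'  ' * self.depth}<{tag}{rendered}>")
        if tag not in VOID_ELEMENTS:
            self.depth += 1  # only real containers increase the nesting depth

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # also reached for <br/>; nothing was opened, so nothing to close
        self.depth = max(self.depth - 1, 0)
        self.lines.append(f"{'  ' * self.depth}</{tag}>")


parser = TagSkeletonParser()
parser.feed('<div><p class="a">text<br>more<input disabled></p></div>')
parser.close()
print("\n".join(parser.lines))
```

Note that this changes the output format slightly compared with the answer: void elements appear once, without a synthetic closing tag.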

Update

If the HTML file is not too large, it makes sense to read the whole file into memory and pass it to the parser in one call:

parser = MyHTMLParser()
with open('test.html') as f:
    html = f.read()
    parser.feed(html)

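If the input is a very large file, it may make more sense to feed the parser line by line or in chunks rather than reading the whole file into memory. HTMLParser buffers incomplete input, so a tag split across two feed() calls is still parsed correctly.

Line by line: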

parser = MyHTMLParser()
with open('test.html') as f:
    for line in f:
        parser.feed(line)

Or, more efficiently, read in chunks of 32K characters:

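CHUNK_SIZE = 32 * 1024  # characters per read (the file is opened in text mode)
parser = MyHTMLParser()
with open('test.html') as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:  # an empty string means end of file
            break
        parser.feed(chunk)
parser.close()  # process anything still buffered in the parser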

Of course, you can choose a larger chunk size.
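
To make the chunk size easy to tune, the loop can be wrapped in a small helper. This is only a sketch under assumptions: feed_file and TagCollector are hypothetical names introduced here, the file is assumed to be UTF-8, and TagCollector stands in for MyHTMLParser so that the parsed result can be compared across chunk sizes.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start-tag names so runs can be compared."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def feed_file(parser, path, chunk_size=32 * 1024, encoding="utf-8"):
    """Feed the file at `path` to `parser` in pieces of `chunk_size` characters."""
    with open(path, encoding=encoding) as f:
        # iter() with a sentinel keeps calling f.read(chunk_size) until it returns ''.
        for chunk in iter(lambda: f.read(chunk_size), ""):
            parser.feed(chunk)
    parser.close()  # process anything still buffered inside the parser
    return parser
```

For example, feed_file(MyHTMLParser(), 'test.html', chunk_size=1024 * 1024) would read in chunks of about one million characters. Because the parser buffers incomplete tags between calls, the chunk size should affect only memory use and speed, not the parsed result.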
