Extracting text from HTML markup?

2 answers

User

Answer 1 · edited 2024-04-26 11:02:09

Try looking into an HTML parser. If you only want the core content of the HTML page without the markup, you can use:

# Python 3; in Python 2 the import was `from HTMLParser import HTMLParser`
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []   # stack of currently open tags
        self.attrs = []  # attributes of each open tag, parallel to self.tags

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self.attrs.append(attrs)

    def handle_endtag(self, tag):
        if tag not in self.tags:
            return
        # Pop open tags until the one being closed has been removed
        while self.tags:
            self.attrs.pop()
            if self.tags.pop() == tag:
                return

    def handle_data(self, data):
        # Called for the text between tags
        print(data)

parser = MyHTMLParser()
with open("temp.html") as f:
    parser.feed(f.read())

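This parses the data in the HTML page: <div><h1>This is my webpage</h1><div></div></div> is printed as This is my webpage. You can modify any of the methods to show different parts, use different formats, and so on. Override whichever base-class methods you like; my code is only meant to get you started on the right path.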
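As one possible variation on the parser above, here is a minimal sketch that collects the text into a string instead of printing it, and skips the contents of <script> and <style> elements. The names TextExtractor, SKIPPED, and extract_text are made up for this illustration and are not part of the answer's code or of the standard library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # hypothetical choice of elements to ignore

    def __init__(self):
        super().__init__()
        self.chunks = []     # text fragments found so far
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside skipped elements
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


print(extract_text("<div><h1>This is my webpage</h1><div></div></div>"))
# prints: This is my webpage
```

Returning a string rather than printing inside handle_data makes the result easier to post-process, for example to join fragments with spaces instead of newlines.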

User

Answer 2 · edited 2024-04-26 11:02:09

It sounds like you want to render the HTML as text rather than extract the contents of individual tags.

If so, consider running one of the following from your Python code as a subprocess:

links -html-numbered-links 1 -html-images 1 -dump "file://$@"
lynx -force_html -dump "$@"
w3m -T text/html -F -dump "$@"
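A rough sketch of how one of the commands above might be called from Python, assuming at least one of the three programs is installed and on PATH. The shell's "$@" is replaced by a single file path argument, and the helper names build_command and render_html_file are invented for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(name, path):
    """Build the argument list for one of the text-mode browsers listed above."""
    if name == "links":
        # links is given a file:// URL, so the path is made absolute first
        url = Path(path).resolve().as_uri()
        return ["links", "-html-numbered-links", "1", "-html-images", "1", "-dump", url]
    if name == "lynx":
        return ["lynx", "-force_html", "-dump", path]
    if name == "w3m":
        return ["w3m", "-T", "text/html", "-F", "-dump", path]
    raise ValueError("unknown renderer: " + name)


def render_html_file(path):
    """Render an HTML file as text with the first browser found on PATH."""
    for name in ("links", "lynx", "w3m"):
        if shutil.which(name):
            result = subprocess.run(
                build_command(name, path),
                capture_output=True,  # collect stdout and stderr
                text=True,            # decode the output as str
                check=True,           # raise CalledProcessError on a non-zero exit
            )
            return result.stdout
    raise RuntimeError("no text-mode browser (links, lynx, w3m) found on PATH")
```

Calling render_html_file("temp.html") would then return the rendered page as a string. Passing the arguments as a list avoids going through a shell, so file names with spaces need no extra quoting.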
