Parsing nested HTML <blockquote> tags in Python?

Asked 2025-04-18 22:59

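I have a web app that reads from the Tumblr API and reformats how "reblog chains" are displayed.

On Tumblr, post comments are stored as HTML blockquotes. When a user replies to the comment above, one more level is added to the quote chain, so a long reblog chain ends up as many nested blockquotes.

Here is what a reblog chain looks like in plain HTML: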

<p><a class="tumblr_blog" href="http://chainsaw-police.tumblr.com/post/96158438802/example-tumblr-post">chainsaw-police</a>:</p><blockquote>

    <p><a class="tumblr_blog" href="http://example-blog-domain.tumblr.com/post/96158384215/example-tumblr-post">example-blog-domain</a>:</p><blockquote>
        <p>Here is an example of a Tumblr post.</p> <p>It can have multiple &lt;p&gt; elements sometimes. It may only have one, though, at other times.</p>
    </blockquote>

    <p>This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a &lt;blockquote&gt;.</p>
</blockquote>

<p>This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones being residing deeper in the nest of blockquotes.</p>

This is what it looks like when rendered.

I want to reformat the reblog chain so that it looks more like this:

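example-blog-domain: Here is an example of a Tumblr post.

It can have multiple <p> elements sometimes. It may only have one, though, at other times.

chainsaw-police: This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a <blockquote>.

example-blog-domain: This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones residing deeper in the nest of blockquotes.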

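I know this structure is very convoluted, which is why I want to write something that makes it more readable.

Is there a way to parse this HTML and split the reblogs into separate "comments"? For example, an array or dictionary holding the username and the comment text would be enough. I have been fighting with lxml and BeautifulSoup for months, though, and I am at my wits' end.

I doubt this can be done with CSS, but if it can, that would work too.

Thanks for your help!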

lxml html-parsing api-integration data-formatting beautifulsoup nested-tags user-comments reblog-chain

2 Answers

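I don't think CSS can do this. You need to parse the content into a structure with lxml and then render it from that structure, which is the simpler route. You could also build a regular-expression filter to weed out malformed HTML.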
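As a rough illustration of "parse into a structure, then render", here is a minimal sketch. It swaps in the standard library's html.parser instead of lxml so that it runs without extra installs. It assumes every comment sits in <p> elements and every attribution is an <a class="tumblr_blog"> link placed right before its <blockquote>, as in the sample from the question. The names ReblogChainParser, parse_reblog_chain, latest_user, and the alice/bob sample are made up for this example.

```python
from html.parser import HTMLParser


class ReblogChainParser(HTMLParser):
    """Collect one {"user", "comment"} entry per level of a reblog chain."""

    def __init__(self, latest_user="Latest"):
        super().__init__(convert_charrefs=True)
        self.comments = []            # finished comments, oldest first
        self.users = [latest_user]    # owner of each open nesting level
        self.levels = [[]]            # paragraphs collected per open level
        self.pending_user = ""        # name read from the last attribution link
        self.in_attribution = False   # inside <a class="tumblr_blog">
        self.skip_paragraph = False   # current <p> is a "name:" line
        self.buffer = []              # text pieces of the current <p>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "tumblr_blog" in classes:
            self.in_attribution = True
            self.skip_paragraph = True
            self.pending_user = ""
        elif tag == "blockquote":
            # Paragraphs directly inside this quote belong to the user
            # named just before it.
            self.users.append(self.pending_user or "unknown")
            self.levels.append([])
            self.pending_user = ""
        elif tag == "p":
            self.buffer = []
            self.skip_paragraph = False

    def handle_data(self, data):
        if self.in_attribution:
            self.pending_user += data
        else:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_attribution = False
        elif tag == "p":
            text = " ".join("".join(self.buffer).split())
            if text and not self.skip_paragraph:
                self.levels[-1].append(text)
            self.buffer = []
            self.skip_paragraph = False
        elif tag == "blockquote" and len(self.users) > 1:
            self._finish_level()

    def _finish_level(self):
        user, paragraphs = self.users.pop(), self.levels.pop()
        if paragraphs:
            self.comments.append({"user": user, "comment": paragraphs})


def parse_reblog_chain(html, latest_user="Latest"):
    """Return the chain as a list of dicts, oldest comment first."""
    parser = ReblogChainParser(latest_user)
    parser.feed(html)
    parser.close()
    while parser.users:  # flush the outermost (newest) comment
        parser._finish_level()
    return parser.comments


# Hypothetical two-step chain, just to show the call.
sample = (
    '<p><a class="tumblr_blog" href="http://a.example/">alice</a>:</p>'
    '<blockquote><p>First post.</p></blockquote>'
    '<p>A reply.</p>'
)
for entry in parse_reblog_chain(sample, latest_user="bob"):
    print(entry["user"] + ":", " ".join(entry["comment"]))
```

Because an inner blockquote always closes before the one around it, comments come out oldest first without any explicit sorting. The newest comment has no attribution link in the HTML, so its user name has to be passed in by the caller.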

Answered 2025-04-18 by Python大师


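Reddit user /u/joyeusenoelle answered my question on /r/LearnPython with a pile of complicated regular expressions. The result looks more like a magic incantation than a text-processing script.

After a lot of fiddling with regular expressions, I think I have this solved for comment chains of arbitrary depth.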
import re

# Read the whole reblog chain; the with block closes the file on exit.
with open("tcomment.txt", "r") as tf:
    text = tf.read()

# Collapse every run of whitespace (newlines included) into one space.
text = re.sub(r"\s+", " ", text)
# Drop paragraph tags so only names and comment text remain.
text = re.sub(r"</?p>", "", text)
# Turn every blockquote boundary and attribution link into a <chunk> marker.
text = re.sub(r"</?blockquote>", "<chunk>", text)
text = re.sub(r'<a class="tumblr_blog"[^>]+?>', "<chunk>", text)
text = text.replace("</a>", "")
text = re.sub(r"\s{2,}", " ", text)
# Two markers in a row (name followed by blockquote) count as one.
text = re.sub(r"<chunk>\s*<chunk>", "<chunk>", text)
bits = [bit.strip() for bit in text.split("<chunk>")]
bits[0] = "Latest:"

# Names sit at the front of bits and comments at the back, in mirrored
# order, so the i-th name pairs with the i-th comment from the end.
comments = []
for i in range(len(bits)):
    j = -(i + 1)
    if (len(bits) - i) > i:
        comments.append(bits[i] + " " + bits[j])

# Oldest comment first.
comments.reverse()
for comment in comments:
    print(comment)
    print()
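The line bits[0] = "Latest:" can be changed to whatever you want shown for the most recent comment, and you will probably also want to change how the text gets into the script.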
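To make the pairing step and the configurable label easier to see in isolation, here is a small sketch. The function name pair_chunks and the latest_label parameter are made up for this example, and it assumes the list has already been split on the chunk markers as in the script.

```python
def pair_chunks(bits, latest_label="Latest:"):
    """Pair each name at the front of bits with its comment at the back.

    bits[0] is whatever preceded the first marker; it is replaced by
    latest_label, which stands in for the author of the newest comment.
    """
    bits = [latest_label] + list(bits[1:])
    # With an odd count the middle chunk has no partner and is left out.
    half = len(bits) // 2
    pairs = [(bits[i], bits[-(i + 1)]) for i in range(half)]
    pairs.reverse()  # oldest comment first
    return pairs


chunks = ["", "chainsaw-police:", "example-blog-domain:",
          "first comment", "second comment", "third comment"]
for name, comment in pair_chunks(chunks, latest_label="my-blog:"):
    print(name, comment)
```

Returning (name, comment) tuples rather than printing directly keeps the result usable as the array of usernames and comments that the question asks for.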

With the text you provided, this gives me:
example-blog-domain: Here is an example of a Tumblr post. It can have multiple &lt;p&gt; elements sometimes. It may only have one, though, at other times.
chainsaw-police: This is an example of a user "reblogging" a post. As you can see, the previous comment is stored above as a &lt;blockquote&gt;.
Latest: This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones being residing deeper in the nest of blockquotes.

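One more note: this is written in Python 3, but apart from the print calls, the rest should work in Python 2 as well. I used text.split() wherever I could, because plain string operations are usually faster than regular expressions, though that may not be the best fit here. Finally, I probably did more work than necessary in the replacement steps, but I have been staring at this code for too long to tell whether it can be simplified.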

Answered 2025-04-18 by Python大师
