在Python中解释嵌套的HTML<blockquote>s?

2024-03-29 12:05:42 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个从Tumblr API读取并重新格式化“reblog链”格式的web应用程序。在

在Tumblr中,文章的注释存储为HTML块引号。当用户对上面的评论做出回应时,另一个层次被添加到blockquote链中,最终导致许多嵌套的重新登录链。在


下面是一个“重新登录链”在纯HTML中的外观示例:

<p><a class="tumblr_blog" href="http://chainsaw-police.tumblr.com/post/96158438802/example-tumblr-post">chainsaw-police</a>:</p><blockquote>

    <p><a class="tumblr_blog" href="http://example-blog-domain.tumblr.com/post/96158384215/example-tumblr-post">example-blog-domain</a>:</p><blockquote>
        <p>Here is an example of a Tumblr post.</p> <p>It can have multiple &lt;p&gt; elements sometimes. It may only have one, though, at other times.</p>
    </blockquote>

    <p>This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a &lt;blockquote&gt;.</p>
</blockquote>

<p>This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones being residing deeper in the nest of blockquotes.</p>

And this is what it looks like when rendered.


我希望能够重新格式化reblog链,使其看起来更像:

示例: 下面是一个Tumblr帖子的例子。在

有时它可以有多个<;p>;元素。不过,其他时候可能只有一个。在

电锯警察: 这是一个用户“重新记录”一篇文章的例子。如您所见,前面的注释作为<;blockquote>;存储在上面;。在

示例博客域: 这是另一个重播。如您所见,前面所有的注释都存储为blockquote,而早期的注释则位于blockquote嵌套的更深位置。在


所以我想让它变得更易懂。在

有什么方法可以解释HTML并将重新记录分成单独的“注释”吗?例如,有一个包含用户名和注释的数组或dict就足够了。然而,在和lxml和beauthoulsoup混了几个月之后,我真是束手无策了。在

如果有一种方法可以用CSS来实现,我非常怀疑,那就好了。在

提前谢谢大家!在


Tags: oftheltgt示例isexamplehtml
2条回答

reddit user /u/joyeusenoelle has answered my question over at /r/LearnPython使用大量复杂的正则表达式,这些正则表达式最终看起来更像是一个巫毒咒语,而不是文本操作脚本。在

Lots of regexes later, I think I've solved this for an arbitrarily-deep comment chain.

import re

with open("tcomment.txt","r") as tf:
    text = ""
    for line in tf:
        text += line
tf.close()
text = text.replace("\n","")
text = text.replace(">",">\n")
text = text.replace("<","\n<")
text = re.sub("</p>\s*<p>","<br><br>", text)
text = text.replace("<p>\n", "")
text = text.replace("</p>\n","\n")
text = re.sub("<[/]{0,1}blockquote>","<chunk>",text)
text = re.sub("<a class=\"tumblr_blog\"[^>]+?>","<chunk>",text)
text = text.replace("</a>","")
text = re.sub("\n+","", text)
text = re.sub("\s{2,}"," ", text)
text = re.sub("<chunk>\s*<chunk>","<chunk>",text)
bits = text.split("<chunk>")
bits[0] = "Latest:"
comments = []
for i in range(len(bits)):
    temp = ""
    j = 0 - (i+1)
    if (len(bits)-i) > i:
        temp = "<b>" + bits[i] + "</b> " + bits[j]
        comments.append(temp)

comments.reverse()
for comment in comments:
    print("<p>%s</p>" % (comment))
    print()

The line bits[0] = "Latest:" can be changed to whatever you want the most recent comment to display, and you'll probably want to change how the text comes into the script.

For the text you provided, this gives me:

<p><b>example-blog-domain:</b>  Here is an example of a Tumblr post.<br><br>It can have multiple &lt;p&gt; elements sometimes. It may

only have one, though, at other times.

<p><b>chainsaw-police:</b>  This is an example of a user "reblogging" a post. As you can see, the previous comment is stored

above as a <blockquote>.

<p><b>Latest:</b> This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones

being residing deeper in the nest of blockquotes.

e: Some thoughts: this is in Python 3, but everything but the print statements should work in Python 2, I think. I used text.split() whenever possible because direct string manipulation is typically faster than regular expressions are, but that may not be appropriate here. And finally, it's possible that I'm making more work for myself than I need to in the substitutions section, but at this point I've looked at the code too long to figure out if it could be slimmed down.

我猜CSS没有这样的功能。 你需要用lxml来解析一个结构。。。然后渲染它。这是更简单的方法。您还可以使用regexp创建一个过滤器,它不会传递错误的html代码项。在

相关问题 更多 >