在Python中解释嵌套的HTML<blockquote>s？

<a class="tumblr_blog" href="http://chainsaw-police.tumblr.com/post/96158438802/example-tumblr-post">chainsaw-police</a>:<blockquote> <a class="tumblr_blog" href="http://example-blog-domain.tumblr.com/post/96158384215/example-tumblr-post">example-blog-domain</a>:<blockquote> Here is an example of a Tumblr post. It can have multiple elements sometimes. It may only have one, though, at other times. </blockquote> This is an example of a user “reblogging” a post. As you can see, the previous comment is stored above as a <blockquote>. </blockquote> This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones being residing deeper in the nest of blockquotes.

2条回答

网友

1楼 · 编辑于 2024-05-13 23:27:29

reddit user /u/joyeusenoelle has answered my question over at /r/LearnPython使用大量复杂的正则表达式，这些正则表达式最终看起来更像是一个巫毒咒语，而不是文本操作脚本。在

Lots of regexes later, I think I've solved this for an arbitrarily-deep comment chain.
import re

with open("tcomment.txt","r") as tf:
 text = ""
 for line in tf:
 text += line
tf.close()
text = text.replace("\n","")
text = text.replace(">",">\n")
text = text.replace("<","\n<")
text = re.sub("\s*"," ", text)
text = text.replace("\n", "")
text = text.replace("\n","\n")
text = re.sub("<[/]{0,1}blockquote>","<chunk>",text)
text = re.sub("<a class=\"tumblr_blog\"[^>]+?>","<chunk>",text)
text = text.replace("</a>","")
text = re.sub("\n+","", text)
text = re.sub("\s{2,}"," ", text)
text = re.sub("<chunk>\s*<chunk>","<chunk>",text)
bits = text.split("<chunk>")
bits[0] = "Latest:"
comments = []
for i in range(len(bits)):
 temp = ""
 j = 0 - (i+1)
 if (len(bits)-i) > i:
 temp = "" + bits[i] + " " + bits[j]
 comments.append(temp)

comments.reverse()
for comment in comments:
 print("%s" % (comment))
 print()
The line bits[0] = "Latest:" can be changed to whatever you want the most recent comment to display, and you'll probably want to change how the text comes into the script.
For the text you provided, this gives me:
example-blog-domain: Here is an example of a Tumblr post. It can have multiple &lt;p&gt; elements sometimes. It may
only have one, though, at other times.
chainsaw-police: This is an example of a user "reblogging" a post. As you can see, the previous comment is stored
above as a <blockquote>.
Latest: This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones
being residing deeper in the nest of blockquotes.
e: Some thoughts: this is in Python 3, but everything but the print statements should work in Python 2, I think. I used text.split() whenever possible because direct string manipulation is typically faster than regular expressions are, but that may not be appropriate here. And finally, it's possible that I'm making more work for myself than I need to in the substitutions section, but at this point I've looked at the code too long to figure out if it could be slimmed down.

网友

2楼 · 编辑于 2024-05-13 23:27:29

我猜CSS没有这样的功能。你需要用lxml来解析一个结构。。。然后渲染它。这是更简单的方法。您还可以使用regexp创建一个过滤器，它不会传递错误的html代码项。在

相关问题更多 >

编程相关推荐

热门问题

热门文章