<p><a href="http://www.reddit.com/r/learnpython/comments/2ezvjz/interpreting_tumblr_commentary_in_python/ck4qsfr?context=2" rel="nofollow">Reddit user /u/joyeusenoelle has answered my question over at /r/LearnPython</a> using a lot of complex regular expressions, which end up looking more like a voodoo incantation than a text-manipulation script.</p>
<blockquote>
<p>Lots of regexes later, I think I've solved this for an
arbitrarily-deep comment chain.</p>
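<pre><code>import re

# Read the whole file at once; the with block closes it automatically.
with open("tcomment.txt", "r") as tf:
    text = tf.read()

# Put every tag on its own line so the substitutions below are predictable.
text = text.replace("\n", "")
text = text.replace(">", ">\n")
text = text.replace("<", "\n<")

# Join adjacent paragraphs with line breaks, then drop the remaining <p> tags.
text = re.sub(r"</p>\s*<p>", "<br><br>", text)
text = text.replace("<p>\n", "")
text = text.replace("</p>\n", "\n")

# Mark every blockquote boundary and every author link as a chunk separator.
text = re.sub(r"</?blockquote>", "<chunk>", text)
text = re.sub(r'<a class="tumblr_blog"[^>]+?>', "<chunk>", text)
text = text.replace("</a>", "")

# Collapse whitespace and merge separators that ended up next to each other.
text = re.sub(r"\n+", "", text)
text = re.sub(r"\s{2,}", " ", text)
text = re.sub(r"<chunk>\s*<chunk>", "<chunk>", text)

bits = text.split("<chunk>")
bits[0] = "Latest:"  # label shown for the most recent comment

# Names sit at the front of bits and comments at the back, in opposite order,
# so pair bits[i] with bits[-(i + 1)] until the two ends meet.
comments = []
for i in range(len(bits)):
    j = -(i + 1)
    if (len(bits) - i) > i:
        comments.append("<b>" + bits[i] + "</b> " + bits[j])

# Print the oldest comment first.
comments.reverse()
for comment in comments:
    print("<p>%s</p>" % comment)
    print()
</code></pre>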
<p>The line <code>bits[0] = "Latest:"</code> can be changed to whatever you want the
most recent comment to display, and you'll probably want to change how
the text comes into the script.</p>
<p>For the text you provided, this gives me:</p>
<pre><code><p><b>example-blog-domain:</b> Here is an example of a Tumblr post.<br><br>It can have multiple &lt;p&gt; elements sometimes. It may only have one, though, at other times. </p>

<p><b>chainsaw-police:</b> This is an example of a user "reblogging" a post. As you can see, the previous comment is stored above as a &lt;blockquote&gt;.</p>

<p><b>Latest:</b> This is another reblog. As you can see, all of the previous comments are stored as blockquotes, with earlier ones being residing deeper in the nest of blockquotes.</p>
</code></pre>
<p>e: Some thoughts: this is in Python 3, but everything but the print
statements should work in Python 2, I think. I used <code>text.replace()</code>
whenever possible because direct string manipulation is typically
faster than regular expressions are, but that may not be appropriate
here. And finally, it's possible that I'm making more work for myself
than I need to in the substitutions section, but at this point <em>I've</em>
looked at the code too long to figure out if it could be slimmed down.</p>
</blockquote>
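<p>The pairing step at the end of the quoted script is the least obvious part, so here it is isolated as a small function. This is only a minimal sketch, not the reddit user's code: <code>pair_names_with_comments</code> and its <code>latest_label</code> parameter are hypothetical names, the sample <code>bits</code> list is made up to mirror the output quoted above, and it assumes <code>bits</code> has an even number of entries (one name slot per comment), as in that example.</p>

```python
def pair_names_with_comments(bits, latest_label="Latest:"):
    """Pair the i-th entry from the front with the i-th entry from the back.

    Hypothetical helper: assumes bits looks like
    [placeholder, name_1, ..., name_n, comment_n, ..., comment_1, latest_comment]
    and has an even length.
    """
    # The first slot holds whatever preceded the first author link;
    # replace it with a label for the most recent comment.
    bits = [latest_label] + list(bits[1:])

    pairs = []
    for i in range(len(bits) // 2):
        pairs.append((bits[i], bits[-(i + 1)]))

    # The loop runs newest to oldest; flip it so the oldest comment comes first.
    pairs.reverse()
    return pairs


# Made-up sample data shaped like the split result for a two-level reblog chain.
sample_bits = [
    "",
    "chainsaw-police:",
    "example-blog-domain:",
    "first post",
    "second comment",
    "third comment",
]

for name, comment in pair_names_with_comments(sample_bits):
    print("<p><b>%s</b> %s</p>" % (name, comment))
```

<p>Returning <code>(name, comment)</code> tuples instead of printing inside the loop keeps the HTML formatting separate from the pairing logic, which makes the step easier to check on its own.</p>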