<p>I'm leaving my answer here because it is exactly what the OP asked for. The right way to do this is to use <code>python-wikitools</code>, as suggested in <a href="https://stackoverflow.com/a/10337343/1290420">the answer by @ChristophD</a> below.</p>
<hr/>
<p>I have slightly modified the code from your question to use <a href="http://www.crummy.com/software/BeautifulSoup/" rel="nofollow noreferrer">BeautifulSoup</a>. There are other options; you might also want to try <a href="http://lxml.de/" rel="nofollow noreferrer">lxml</a>.</p>
<pre><code>import urllib2
import re, sys
from HTMLParser import HTMLParser
# EDIT 1: import the package
from BeautifulSoup import BeautifulSoup
class MLStripper(HTMLParser):
    # Collects only the text nodes of the HTML fed to it
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def stripHTMLTags(html):
    html = re.sub(r'<{1}br{1}>', '\n', html)
    s = MLStripper()
    s.feed(html)
    text = s.get_data()
    # Drop everything from the "External links" heading onwards
    if "External links" in text:
        text, sep, tail = text.partition('External links')
    if "External Links" in text:
        text, sep, tail = text.partition('External Links')
    text = text.replace("See also","\n\n See Also - \n")
    text = text.replace("*","- ")
    text = text.replace(".", ". ")
    # Collapse the double spaces created by the previous replacement
    text = text.replace("  "," ")
    text = text.replace(""" /
/ ""","")
    return text

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
# EDIT 2: convert the page and extract text from the first <p> tag
soup = BeautifulSoup(page)
para = soup.findAll("p", limit=1)[0].text
print stripHTMLTags(para)
</code></pre>
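<p>The code above targets Python 2 and BeautifulSoup 3. As a rough sketch only, the same "first paragraph" idea can be written for Python 3 with nothing but the standard library. The names <code>FirstParagraphParser</code> and <code>first_paragraph</code> are hypothetical helpers, not part of any library, and skipping whitespace-only paragraphs is an assumption added here rather than something the original code does.</p>

```python
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Hypothetical helper: collects the text of the first non-empty <p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # greater than 0 while inside a <p> element
        self.done = False   # True once a non-empty paragraph has been captured
        self.chunks = []    # text fragments of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth and not self.done:
            self.depth -= 1
            if self.depth == 0:
                if "".join(self.chunks).strip():
                    self.done = True
                else:
                    # Assumption: whitespace-only paragraphs are skipped
                    self.chunks = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still inside the paragraph
        if self.depth and not self.done:
            self.chunks.append(data)


def first_paragraph(html):
    """Return the plain text of the first non-empty paragraph, or '' if none."""
    parser = FirstParagraphParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Small inline sample instead of a live Wikipedia request
sample = ('<div><p class="empty"> </p>'
          '<p>Albert <b>Einstein</b> was a physicist.</p>'
          '<p>Second.</p></div>')
print(first_paragraph(sample))
```

<p>To try this on a real page, the downloaded bytes would first need to be decoded to a string (for example with <code>page.decode("utf-8")</code>) before being passed to <code>first_paragraph</code>.</p>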