Removing HTML tags when scraping Wikipedia with Python's urllib2 and BeautifulSoup
I'm trying to scrape Wikipedia for text mining, using Python's urllib2 and the BeautifulSoup library. My question: is there an easy way to get rid of the unnecessary tags in the text I read (such as the 'a' tags of links, or 'span' tags)?
For this scenario:
import urllib2
from BeautifulSoup import *

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open("http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes")
pool = BeautifulSoup(infile.read())
res = pool.findAll('div', attrs={'class': 'mw-content-ltr'})  # to get to content directly
paragraphs = res[0].findAll("p")  # get all paragraphs
The paragraphs I get are full of reference tags, like this:
paragraphs[0] =
<p><b>Data mining</b> (the analysis step of the <b>knowledge discovery in databases</b> process,<sup id="cite_ref-Fayyad_0-0" class="reference"><a href="#cite_note-Fayyad-0"><span>[</span>1<span>]</span></a></sup> or KDD), a relatively young and interdisciplinary field of <a href="/wiki/Computer_science" title="Computer science">computer science</a><sup id="cite_ref-acm_1-0" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup><sup id="cite_ref-brittanica_2-0" class="reference"><a href="#cite_note-brittanica-2"><span>[</span>3<span>]</span></a></sup> is the process of discovering new patterns from large <a href="/wiki/Data_set" title="Data set">data sets</a> involving methods at the intersection of <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a>, <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a>, <a href="/wiki/Statistics" title="Statistics">statistics</a> and <a href="/wiki/Database_system" title="Database system">database systems</a>.<sup id="cite_ref-acm_1-1" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup> The goal of data mining is to extract knowledge from a data set in a human-understandable structure<sup id="cite_ref-acm_1-2" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup> and involves database and <a href="/wiki/Data_management" title="Data management">data management</a>, <a href="/wiki/Data_Pre-processing" title="Data Pre-processing">data preprocessing</a>, <a href="/wiki/Statistical_model" title="Statistical model">model</a> and <a href="/wiki/Statistical_inference" title="Statistical inference">inference</a> considerations, interestingness metrics, <a href="/wiki/Computational_complexity_theory" title="Computational complexity theory">complexity</a> considerations, post-processing of found structure, <a href="/wiki/Data_visualization" title="Data visualization">visualization</a> and <a href="/wiki/Online_algorithm" title="Online algorithm">online updating</a>.<sup id="cite_ref-acm_1-3" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup></p>
Any ideas how to get rid of these tags and keep only the plain text?
3 Answers
0
This code operates on tag nodes in Beautiful Soup. The parent node is modified in place so the matching tags are removed, and the removed tags are returned to the caller as a list. (These look like methods of a helper class; the class itself isn't shown in the answer.)
from bs4 import element

class Separators(object):  # hypothetical containing class; the answer shows only the methods

    @staticmethod
    def seperateCommentTags(parentNode):
        # collect every comment node below parentNode
        commentTags = []
        for descendant in parentNode.descendants:
            if isinstance(descendant, element.Comment):
                commentTags.append(descendant)
        # extract() detaches each comment from the tree
        for commentTag in commentTags:
            commentTag.extract()
        return commentTags

    @staticmethod
    def seperateScriptTags(parentNode):
        # extract() removes each <script> tag from the tree and returns it
        scripttags = parentNode.find_all('script')
        scripts = []
        for scripttag in scripttags:
            script = scripttag.extract()
            if script is not None:
                scripts.append(script)
        return scripts
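A minimal usage sketch, assuming BeautifulSoup 4 (the methods use find_all and descendants) and the hypothetical Separators wrapper above; the sample markup is made up for illustration:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><!-- note --><script>var x;</script><p>text</p></div>")
comments = Separators.seperateCommentTags(soup)  # list of detached Comment nodes
scripts = Separators.seperateScriptTags(soup)    # list of detached <script> tags
print soup  # <div><p>text</p></div> -- comments and scripts are gone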
3
for p in paragraphs:
    print "".join(p(text=True))  # p(text=True) == p.findAll(text=True): the tag's text nodes
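(In BeautifulSoup 4 the same idea is spelled get_text(), which concatenates the text nodes for you:)

for p in paragraphs:
    print p.get_text()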
Alternatively, you could use api.php instead of index.php:
#!/usr/bin/env python
import sys
import time
import urllib, urllib2
import xml.etree.cElementTree as etree

# prepare request
maxattempts = 5  # how many times to try the request before giving up
maxlag = 5  # seconds http://www.mediawiki.org/wiki/Manual:Maxlag_parameter
params = dict(action="query", format="xml", maxlag=maxlag,
              prop="revisions", rvprop="content", rvsection=0,
              titles="data_mining")
request = urllib2.Request(
    "http://en.wikipedia.org/w/api.php?" + urllib.urlencode(params),
    headers={"User-Agent": "WikiDownloader/1.2",
             "Referer": "http://stackoverflow.com/q/8044814"})

# make request
for _ in range(maxattempts):
    response = urllib2.urlopen(request)
    if response.headers.get('MediaWiki-API-Error') == 'maxlag':
        t = response.headers.get('Retry-After', 5)
        print "retrying in %s seconds" % (t,)
        time.sleep(float(t))
    else:
        break  # ready to read
else:  # exhausted all attempts
    sys.exit(1)

# download & parse xml
tree = etree.parse(response)

# find rev data
rev_data = tree.findtext('.//rev')
if not rev_data:
    print 'MediaWiki-API-Error:', response.headers.get('MediaWiki-API-Error')
    tree.write(sys.stdout)
    print
    sys.exit(1)

print(rev_data)
Output
{{Distinguish|analytics|information extraction|data analysis}}
'''Data mining''' (the analysis step of the '''knowledge discovery in databases..
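Note that this returns raw wiki markup rather than plain text. If plain text is the end goal, the TextExtracts extension (enabled on Wikipedia) can do the stripping server-side; a sketch of that variant, assuming prop=extracts is available on the wiki you query:

import urllib, urllib2
import xml.etree.cElementTree as etree

params = dict(action="query", format="xml", prop="extracts",
              explaintext=1, titles="data_mining")
request = urllib2.Request(
    "http://en.wikipedia.org/w/api.php?" + urllib.urlencode(params),
    headers={"User-Agent": "WikiDownloader/1.2"})
tree = etree.parse(urllib2.urlopen(request))
print tree.findtext('.//extract')  # plain text, no wiki markup or HTML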
3
Here's how you can do it with lxml (and the lovely requests library):
import requests
import lxml.html as lh
from BeautifulSoup import UnicodeDammit

URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
HEADERS = {'User-agent': 'Mozilla/5.0'}

def lhget(*args, **kwargs):
    r = requests.get(*args, **kwargs)
    # let UnicodeDammit guess the encoding before handing the markup to lxml
    html = UnicodeDammit(r.content).unicode
    tree = lh.fromstring(html)
    return tree

def remove(el):
    el.getparent().remove(el)

tree = lhget(URL, headers=HEADERS)
el = tree.xpath("//div[@class='mw-content-ltr']/p")[0]
for ref in el.xpath(".//sup[@class='reference']"):  # ".//" keeps the search inside this paragraph
    remove(ref)
print lh.tostring(el, pretty_print=True)
print el.text_content()
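text_content() drops all markup at once; if you would rather keep the paragraph's HTML but only unwrap particular tags (the 'a' and 'span' tags from the question), lxml's strip_tags removes the tags in place while keeping their text. A short sketch under that assumption:

import lxml.etree

lxml.etree.strip_tags(el, 'a', 'span')  # the tags disappear, their text stays
print lh.tostring(el, pretty_print=True)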