<p>Using <a href="http://codespeak.net/lxml/" rel="noreferrer">lxml</a>, you could try something like this:</p>
<pre><code>import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
# Note: newer lxml releases ship this module as the separate lxml_html_clean package
import lxml.html.clean as clean

url = "http://www.yahoo.com"
ignore_tags = ('script', 'noscript', 'style')

# Let a real browser render the page so JavaScript-generated markup is included
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source

# Strip scripts, styles, comments and similar non-content markup
cleaner = clean.Cleaner()
content = cleaner.clean_html(content)
with open('/tmp/source.html', 'w', encoding='utf-8') as f:
    f.write(content)

# Walk every element and write its text and tail, one entry per line
doc = LH.fromstring(content)
with open('/tmp/result.txt', 'w', encoding='utf-8') as f:
    for elt in doc.iterdescendants():
        if elt.tag in ignore_tags:
            continue
        text = elt.text or ''
        tail = elt.tail or ''
        words = ' '.join((text, tail)).strip()
        if words:
            f.write(words + '\n')
</code></pre>
<p>This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes over time (probably done with JavaScript and refresh).</p>
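<p>To see why the loop joins both <code>text</code> and <code>tail</code>, here is a minimal sketch that runs without a browser. It is an illustration under assumptions, not part of the approach above: it uses the standard library's <code>xml.etree.ElementTree</code> (whose <code>text</code>/<code>tail</code> model lxml follows) instead of <code>lxml.html</code>, and the <code>SNIPPET</code> string and <code>extract_words</code> helper are hypothetical names made up for the example.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet used only for illustration.
SNIPPET = (
    "<div>Hello <b>big</b> world"
    "<script>var x = 1;</script> again"
    "<p>Bye</p></div>"
)

IGNORE_TAGS = ('script', 'noscript', 'style')


def extract_words(root, ignore_tags=IGNORE_TAGS):
    """Collect text plus tail for every descendant, skipping ignored tags."""
    lines = []
    for elt in root.iter():
        if elt is root:
            continue  # mimic lxml's iterdescendants(): the root is not visited
        if elt.tag in ignore_tags:
            continue  # skips the element's own text AND its tail
        text = elt.text or ''   # text directly after the opening tag
        tail = elt.tail or ''   # text after the closing tag, up to the next sibling
        words = ' '.join((text, tail)).strip()
        if words:
            lines.append(words)
    return lines


root = ET.fromstring(SNIPPET)
result = extract_words(root)
for line in result:
    print(line)
```

<p>In this sketch, "world" is only reachable as the <code>tail</code> of the <code>b</code> element, which is why reading <code>text</code> alone would lose it. The word "again" is dropped because it is the tail of the skipped <code>script</code> element, and "Hello" is missed because it belongs to the root, which is not visited. The same kinds of gaps may account for some of the missing text.</p>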