<p>Here's a variation on <a href="https://stackoverflow.com/questions/7947579/getting-all-visible-text-from-a-webpage-using-selenium/7947811#7947811">@unutbu's answer</a>:</p>
<pre><code>#!/usr/bin/env python
import sys
from contextlib import closing
import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox # pip install selenium
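# pip install werkzeug (versions before 1.0; werkzeug.contrib was removed in 1.0,
# and the separate cachelib package provides FileSystemCache)
from werkzeug.contrib.cache import FileSystemCache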
cache = FileSystemCache('.cachedir', threshold=100000)
url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"
# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7) # week in seconds
# extract text
root = html.document_fromstring(page_source)
# remove flash, images, &lt;script&gt;, &lt;style&gt;, etc.
Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
print(root.text_content()) # extract text
</code></pre>
<p>I've split your task into two parts:</p>
<ul>
<li>get the page (including elements generated by JavaScript)</li>
<li>extract the text</li>
</ul>
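<p>The two parts are connected only through the cache. You can fetch pages in one process and extract the text in another, or defer extraction and do it later with a different algorithm.</p>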
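<p>To make that separation concrete, here is a minimal sketch of the two steps as independent functions that share nothing but a cache directory. It uses only the standard library: a plain file-per-URL cache stands in for <code>FileSystemCache</code>, and <code>html.parser</code> stands in for lxml. The names <code>cache_path</code>, <code>store_page</code>, <code>extract_text</code> and the <code>fetch</code> callable are hypothetical and not part of the code above.</p>

```python
import hashlib
import os
from html.parser import HTMLParser


def cache_path(cache_dir, url):
    """Map a URL to a file inside cache_dir (hypothetical cache layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".html")


def store_page(cache_dir, url, fetch):
    """Step 1: save the page returned by fetch(url), unless already cached.

    `fetch` is any callable returning the page source as a string,
    for example a wrapper around a Selenium browser session.
    """
    path = cache_path(cache_dir, url)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(fetch(url))
    return path


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping script, style and noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many skipped elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(cache_dir, url):
    """Step 2: read the cached page and return its text, or None if not cached."""
    path = cache_path(cache_dir, url)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        parser = _TextCollector()
        parser.feed(f.read())
        parser.close()
    return "\n".join(parser.chunks)
```

<p>One process could call <code>store_page('.cachedir', url, fetch)</code> for each URL, and a separate process could call <code>extract_text('.cachedir', url)</code> later. Since the only contract between them is the cached page source, the extraction step can be swapped for a different algorithm without fetching anything again.</p>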