<p>First of all, I would still stick with selenium, since this is a quite "javascript-heavy" website. Note that, if needed, you can use a headless browser (<a href="http://phantomjs.org/" rel="nofollow noreferrer">PhantomJS</a>, or a regular browser with a <a href="https://stackoverflow.com/questions/6183276/how-do-i-run-selenium-in-xvfb">virtual display</a>).</p>
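<p>The idea here is to paginate by 100 rows per page and click the "»" link until it is no longer present on the page, which means we have reached the last page and there are no more results to process. To make the solution reliable, we need to use <a href="https://selenium-python.readthedocs.org/waits.html#explicit-waits" rel="nofollow noreferrer">Explicit Waits</a>: every time we proceed to the next page, wait for the loading spinner to become invisible.</p>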
<p>Working implementation:</p>
<pre><code># -*- coding: utf-8 -*-
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.maximize_window()
driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie?execution=e1s1')
wait = WebDriverWait(driver, 30)
# paginate by 100
select = Select(driver.find_element(By.ID, "drhPageForm:drhPageTable:j_idt211:j_idt214:j_idt220"))
select.select_by_visible_text("100")
while True:
    # wait until there is no loading spinner
    wait.until(EC.invisibility_of_element_located((By.ID, "loadingPopup_content_scroller")))

    # .text is a string, so format it with %s rather than %d
    current_page = driver.find_element(By.CLASS_NAME, "rf-ds-act").text
    print("Current page: %s" % current_page)

    # TODO: collect the results

    # proceed to the next page; a missing link means this was the last page
    try:
        next_page = driver.find_element(By.LINK_TEXT, u"»")
        next_page.click()
    except NoSuchElementException:
        break
</code></pre>
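<p>To make the "wait, collect, click next" loop easier to reason about, here is a minimal sketch of the same control flow with the browser replaced by an in-memory stand-in. The names <code>scrape_all_pages</code> and <code>FakePager</code>, and the three callables, are hypothetical and not part of selenium; they only illustrate where the result collection from the TODO would plug in.</p>

```python
def scrape_all_pages(wait_until_loaded, collect_rows, click_next):
    """Run the "wait, collect, click next" loop until there is no next page.

    All three arguments are hypothetical callables supplied by the caller:
      wait_until_loaded() blocks until the current page has finished loading,
      collect_rows()      returns a list of rows found on the current page,
      click_next()        moves to the next page and returns True, or
                          returns False when there is no next-page link.
    """
    rows = []
    while True:
        wait_until_loaded()
        rows.extend(collect_rows())
        if not click_next():
            break
    return rows


class FakePager:
    """Stand-in for a browser session: pages are plain lists held in memory."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def wait(self):
        pass  # nothing to wait for in the fake

    def rows(self):
        return list(self.pages[self.index])

    def next(self):
        # Mimics the missing next-page link on the last page.
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True


pager = FakePager([["a", "b"], ["c"], ["d", "e"]])
all_rows = scrape_all_pages(pager.wait, pager.rows, pager.next)
print(all_rows)
```

<p>With a real driver, <code>wait_until_loaded</code> would presumably wrap the <code>wait.until(...)</code> call on the loading spinner, and <code>click_next</code> would wrap the <code>try</code>/<code>except NoSuchElementException</code> around the "»" link. What <code>collect_rows</code> looks like depends on the table markup of the page, which is not shown here.</p>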