Scraping JavaScript-rendered web pages with Python

3 answers

User

Answer 1 · edited 2024-06-06 19:05:00

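Edit 30 Dec 2017: This answer appears in the top results of Google searches, so I decided to update it. The old answer is still at the end.

dryscrape is no longer maintained, and the library its developers recommend supports Python 2 only. I have found that Selenium's Python library with PhantomJS as the web driver is fast enough and makes it easy to get the work done.

Once you have installed PhantomJS, make sure the phantomjs binary is available in the current path: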

phantomjs --version
# result:
2.1.1
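
The same check can be done from Python before starting a driver. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_binary` is made up for illustration and is not part of Selenium or PhantomJS.

```python
import shutil


def find_binary(name):
    """Return the full path of an executable found on PATH, or None if absent."""
    return shutil.which(name)


path = find_binary("phantomjs")
print(path if path else "phantomjs not found on PATH")
```

If this prints the "not found" message, the web driver will most likely fail to start for the same reason.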

Example

To give an example, I created a sample page with the following HTML code (link):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

Without javascript it says: No javascript support, and with javascript it says: Yay! Supports javascript

Scraping without JS support:

import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid a warning
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>

Scraping with JS support:

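from selenium import webdriver
# Requires an older Selenium release: webdriver.PhantomJS and the
# find_element_by_* helpers were removed in the Selenium 4 series.
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
Yay! Supports javascript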

You can also use the Python library dryscrape to scrape javascript-driven websites.

Scraping with JS support:

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")  # name the parser explicitly to avoid a warning
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>

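User

Answer 2 · edited 2024-06-06 19:05:00

We are not getting the correct results because any javascript-generated content needs to be rendered on the DOM. When we fetch an HTML page, we fetch the initial DOM, unmodified by javascript.

Therefore we need to render the javascript content before we crawl the page.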
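
To see this concretely without any network access, here is a small sketch that parses an HTML string with the standard library's `html.parser` (swapped in for a real HTTP fetch). The page content, the `TextById` class and the `static_text` helper are all made up for illustration. The script is visible to the parser only as text; nothing executes it, so the paragraph keeps its initial content.

```python
from html.parser import HTMLParser

# Hypothetical page: in a browser, the script would replace the paragraph text.
PAGE = """<!DOCTYPE html>
<html>
<body>
  <p id="intro-text">No javascript support</p>
  <script>
    document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>"""


class TextById(HTMLParser):
    """Collects the text inside the element with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.capturing = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.capturing = True

    def handle_endtag(self, tag):
        # The target here is a flat <p>, so the next end tag closes it.
        self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)


def static_text(html, element_id):
    """Return the element's text as a non-JavaScript client would see it."""
    parser = TextById(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


print(static_text(PAGE, "intro-text"))  # prints: No javascript support
```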

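As selenium has already been mentioned many times in this thread (along with how slow it can sometimes be), I will list two other possible solutions.

Solution 1: This is a very nice tutorial on how to use Scrapy to crawl javascript generated content, and we are going to follow just that.

What we will need:

Docker installed on our machine. This is a plus over the other solutions up to this point, as it gives us an OS-independent platform.
Install Splash following the instructions listed for our corresponding OS.
Quoting from the splash documentation: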
Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.
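Essentially, we are going to use Splash to render the Javascript generated content.
Run the splash server: sudo docker run -p 8050:8050 scrapinghub/splash.
Install the scrapy-splash plugin: pip install scrapy-splash
Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update the settings.py: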
Then go to your scrapy project’s settings.py and set these middlewares:
```
DOWNLOADER_MIDDLEWARES = {
      'scrapy_splash.SplashCookiesMiddleware': 723,
      'scrapy_splash.SplashMiddleware': 725,
      'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```
The URL of the Splash server(if you’re using Win or OSX this should be the URL of the docker machine: How to get a Docker container's IP address from the host?):
```
SPLASH_URL = 'http://localhost:8050'
```
And finally you need to set these values too:
```
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```
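
Before wiring Splash into Scrapy, it can help to confirm that the server started above really renders pages. Since Splash exposes an HTTP API, a rough sketch is to call its `render.html` endpoint directly with the standard library. The helper names `splash_render_url` and `fetch_rendered_html` are made up for illustration, and the sketch assumes Splash is listening on `http://localhost:8050` as in the settings above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(splash_url, page_url, wait=0.5):
    """Build a render.html request URL for a Splash server."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(splash_url, page_url, wait=0.5, timeout=30):
    """Ask Splash to load page_url, run its JavaScript, and return the HTML."""
    request_url = splash_render_url(splash_url, page_url, wait)
    with urlopen(request_url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds the URL; call fetch_rendered_html once the container is running.
    print(splash_render_url("http://localhost:8050", "http://quotes.toscrape.com/js/"))
```

If `fetch_rendered_html` returns markup containing the generated content, the `SPLASH_URL` value is reachable and rendering works.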

Finally, we can use a SplashRequest:

In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS generated data you have to use SplashRequest(or SplashFormRequest) to render the page. Here’s a simple example:
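import scrapy
from scrapy_splash import SplashRequest

# QuoteItem is assumed to be an Item with "author" and "quote" fields,
# defined in the project's items.py.

class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        # Route every start URL through Splash so its JavaScript is rendered.
        for url in self.start_urls:
            yield SplashRequest(
                url=url, callback=self.parse, endpoint='render.html'
            )

    def parse(self, response):
        # response holds the rendered HTML, so the quotes are present.
        for q in response.css("div.quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote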
SplashRequest renders the URL as html and returns the response which you can use in the callback(parse) method.

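Solution 2: Let's call this experimental at the moment (May 2018)...
This solution is for Python version 3.6 only (at the moment).

Do you know the requests module (well, who doesn't)?
It now has a web-crawling little sibling, requests-HTML: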

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

Install requests-html: pipenv install requests-html

Make a request to the page's URL:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(a_page_url)

Render the response to get the Javascript generated bits:
```
r.html.render()
```

Finally, the module seems to offer scraping capabilities.
Alternatively, we can try the well-documented way of using BeautifulSoup with the r.html object we just rendered.
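
Putting the request, render and extraction steps together might look like the sketch below. The function name `rendered_text` and its parameters are made up for illustration, and the sketch assumes requests-html is installed; the first call to `render()` downloads a Chromium build, so it can take a while.

```python
def rendered_text(url, selector, sleep=1):
    """Return the text of the first element matching selector after JS has run."""
    # Imported lazily so this module can be loaded without requests-html installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        r = session.get(url)
        # Execute the page's JavaScript; sleep gives scripts time to finish.
        r.html.render(sleep=sleep)
        element = r.html.find(selector, first=True)
        return element.text if element is not None else None
    finally:
        session.close()
```

A call such as `rendered_text(a_page_url, "#intro-text")` would then return the text as a browser shows it, or `None` when nothing matches the selector.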

User

Answer 3 · edited 2024-06-06 19:05:00

Maybe selenium can do it.

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get(url)
time.sleep(5)
htmlSource = driver.page_source
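
A fixed `time.sleep(5)` either wastes time or is too short on a slow page. One possible refinement, sketched here under the assumption of Selenium 4 with Firefox and geckodriver available, is an explicit wait that returns as soon as a given element exists. The function name `fetch_rendered_source` and the `element_id` parameter are made up for illustration; choose the id of an element that the page's JavaScript creates.

```python
def fetch_rendered_source(url, element_id, timeout=10):
    """Load url in headless Firefox and return the page source once element_id exists."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run without opening a browser window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Poll until the element is in the DOM; raises TimeoutException after timeout seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, element_id))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The `try`/`finally` makes sure the browser is closed even when the wait times out.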
