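How do I scrape a page with Python using the POST method?

2 Answers

User

#1 · edited 2024-05-23 17:59:49

This simulates clicking the next-page button. Put the code in a Scrapy spider (see the Scrapy docs).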

# -*- coding: utf-8 -*-
from io import StringIO

import pandas as pd
import scrapy
from scrapy.utils.response import open_in_browser  # handy for debugging


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['heavens-above.com']

    def start_requests(self):
        url = "https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=45.61&lng=15.312&loc=Somewhere&alt=0&tz=CET"
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # open_in_browser(response)  # uncomment to see the response
        table = response.xpath('//table[@class="standardTable"]').get()
        # read_html returns a list of DataFrames; take the first one
        df = pd.read_html(StringIO(table))[0]
        # do what you want with the df

        # go to the next page by posting the ASP.NET form back to its action URL
        to_post = response.urljoin(response.xpath('//form[@name="aspnetForm"]/@action').get())
        data = {
            '__EVENTTARGET': '',
            '__EVENTARGUMENT': '',
            '__LASTFOCUS': '',
            '__VIEWSTATE': response.xpath('//*[@id="__VIEWSTATE"]/@value').get(),
            '__VIEWSTATEGENERATOR': response.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value').get(),
            'utcOffset': response.xpath('//*[@id="utcOffset"]/@value').get(),
            'ctl00$ddlCulture': 'en',
            'ctl00$cph1$hidStartUtc': response.xpath('//*[@id="ctl00_cph1_hidStartUtc"]/@value').get(),
            # the original listed this key twice; only this last value took effect
            'ctl00$cph1$ddlLaunches': response.xpath('//option[@selected="selected"]/@value').getall()[-1],
            'ctl00$cph1$btnNext': '>',
        }
        yield scrapy.http.FormRequest(to_post, callback=self.parse, formdata=data)

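User

#2 · edited 2024-05-23 17:59:49

For a page like this, you don't need Scrapy or Selenium.

You can reach your goal with requests, bs4 and pandas.

Now, let's put the plan into action:

1. We check the Network Monitor in your browser to see what happens after changing the date.

As you can see, a POST request has been made to the host with several Form data fields.
Q: Why doesn't your call to the url return a response that reflects the POST data?
A: Because the host actually sets one specific date from the drop-down as the static default, which is 18 March 2020 12:16. You can see this as soon as you open the url.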
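The difference between opening the url and submitting the form can be sketched with the standard library alone. This is only an illustration and sends nothing over the network; the launch value `000000` is a made-up placeholder, not a value taken from the site.

```python
from urllib.parse import urlencode
from urllib.request import Request

# URL from the answer; the form value below is a hypothetical placeholder.
URL = "https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere"
form_data = {"ctl00$cph1$ddlLaunches": "000000"}

# No body: this is a GET, which is what simply opening the url does.
get_request = Request(URL)

# With a body: the same url becomes a POST and the form travels in the body.
post_request = Request(URL, data=urlencode(form_data).encode("ascii"))

print(get_request.get_method())
print(post_request.get_method())
print(post_request.data)
```

The url is identical in both cases, so anything that only sees the url (such as a plain GET) gets the default date.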

Notes:

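You don't need to parse the HTML and search for the table before reading it with Pandas, because you can do it in a single call! pandas has a function called read_html that parses the HTML and returns the tables as a list. You can move between them by indexing with [].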

import pandas as pd

df = pd.read_html(
    "https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere")[0]

print(df)

You don't need a raw string at all. A Python raw string treats the backslash (\) as a literal character, which is only needed in some cases when you have to pass it to the host.
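A small sketch of what the raw-string prefix changes, assuming the point is about writing the url as a string literal. The variable names are only for illustration.

```python
import re

# A raw string keeps the backslash as-is; a normal string needs it doubled.
pattern_raw = r"\d{6}"
pattern_escaped = "\\d{6}"
same_pattern = pattern_raw == pattern_escaped

# "\n" is a single newline character; r"\n" is a backslash followed by "n".
lengths = (len("\n"), len(r"\n"))

# The url contains no backslashes, so the r prefix makes no difference here.
url_plain = "https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere"
url_raw = r"https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere"
same_url = url_plain == url_raw

print(same_pattern, lengths, same_url)
print(bool(re.fullmatch(pattern_raw, "123456")))
```

Raw strings remain useful for regular expressions, such as the `\d{6}` pattern used in the final code.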

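2. We look at all the parameters in the Form data, discard the empty values "", and check which values are filled. If we now refresh the page, we notice that some of the values have changed. So we check the HTML source to see whether we can find those values there.

As you can see, we found the parameters and values in this part of the previous screenshot.

Here is the important part: the value of the drop-down option, which we need to pass to this parameter.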
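The filtering in step 2 can be sketched as follows. The helper names `split_form_data` and `changed_fields` and the `__VIEWSTATE` strings are hypothetical placeholders, standing in for whatever you copy out of the Network Monitor.

```python
def split_form_data(form_data):
    """Separate the filled form fields from the empty ones."""
    filled = {key: value for key, value in form_data.items() if value != ""}
    empty = [key for key, value in form_data.items() if value == ""]
    return filled, empty


def changed_fields(first, second):
    """Return the keys whose values differ between two captures."""
    return [key for key in first if first[key] != second.get(key)]


# Two captures of the same form, before and after a page refresh (made-up values).
capture_a = {
    "__EVENTTARGET": "ctl00$cph1$ddlLaunches",
    "__EVENTARGUMENT": "",
    "__LASTFOCUS": "",
    "__VIEWSTATE": "placeholder-1",
    "utcOffset": "0",
    "ctl00$ddlCulture": "en",
}
capture_b = dict(capture_a, __VIEWSTATE="placeholder-2")

filled, empty = split_form_data(capture_a)
dynamic = changed_fields(capture_a, capture_b)

print(empty)
print(dynamic)
```

Fields that show up in `dynamic` are the ones that have to be read from the HTML on every run instead of being hard-coded.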

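3. Now we make a GET request while keeping a session object alive, parse the url and collect all the required parameters, and then make the POST request. We then read the result with Pandas.

Q: Why don't we read the HTML table with Pandas directly?
A: Because Pandas has no option for passing Form data. So we use requests, pass the Form data via data=, and then read the content with read_html.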
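The parameter-collection part of step 3 can be sketched with the standard library only. This is an alternative illustration, not the bs4 approach used in the final code; the class name `HiddenInputCollector` and every value in `SAMPLE_HTML` are made up.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


# Hypothetical stand-in for the page returned by the GET request.
SAMPLE_HTML = """
<form name="aspnetForm" method="post" action="./StarlinkLaunchPasses.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="DEF456" />
  <input type="hidden" name="ctl00$cph1$hidStartUtc" id="ctl00_cph1_hidStartUtc" value="12345" />
  <input type="text" name="visible" value="ignored" />
</form>
"""

collector = HiddenInputCollector()
collector.feed(SAMPLE_HTML)
hidden_fields = collector.fields

print(hidden_fields)
```

The resulting dict can be merged with the drop-down value and passed as the data= argument of the POST request.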

Finally, we save each table to a csv file named after that table.

Final code

import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup


def main(url):
    with requests.Session() as session:
        # GET the page first to collect the hidden ASP.NET form fields
        r = session.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # drop-down options whose value contains six consecutive digits
        launches = [item.get("value") for item in soup.find_all(
            "option", value=re.compile(r"\d{6}"))]
        vs = soup.find("input", id="__VIEWSTATE").get("value")
        vsg = soup.find("input", id="__VIEWSTATEGENERATOR").get("value")
        ut = soup.find("input", id="ctl00_cph1_hidStartUtc").get("value")
        for launch in launches:
            data = {
                '__EVENTTARGET': 'ctl00$cph1$ddlLaunches',
                '__EVENTARGUMENT': '',
                '__LASTFOCUS': '',
                '__VIEWSTATE': vs,
                '__VIEWSTATEGENERATOR': vsg,
                'utcOffset': '0',
                'ctl00$ddlCulture': 'en',
                'ctl00$cph1$hidStartUtc': ut,
                'ctl00$cph1$ddlLaunches': launch
            }
            # POST within the same session, then let pandas parse the first table
            r = session.post(url, data=data)
            df = pd.read_html(StringIO(r.text))[0]
            df.to_csv(f"{launch}.csv", index=False)


main("https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere")
