How to scrape a page with Python using the POST method?



I want to combine several pages that report passes of the Starlink constellation. At the moment I have to visit each page manually and cannot filter by time and visibility.

The base page is <a href="https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere" rel="nofollow noreferrer">https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere</a>

<a href="https://stackoverflow.com/questions/57239651/scrape-peekyou-com-having-post-method">Scrape peekyou.com ( having POST METHOD)</a> gave me some hints, but not enough to get me going.

Here is the GET code that scrapes the first page (the latest Starlink launch):
<pre class="lang-py prettyprint-override"><code>import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get(r"https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=45.61&lng=15.312&loc=Somewhere&alt=0&tz=CET")
soup = BeautifulSoup(res.content, 'lxml')
table = str(soup.find_all("table", {"class": "standardTable"}))
df = pd.read_html(table)[0]
cols = "date satellite mag s_time s_altitude s_azimuth h_time h_altitude h_azimuth e_time e_altitude e_azimuth".split()
df.columns = cols
print(df)
</code></pre>
The other pages are requested via the POST method by selecting an entry from the drop-down list. This is where my (shallow) web scraping knowledge ends.

I can see that the returned <code>res.text</code> contains form data that I could use for the next request, but I do not know how to extract it:
<pre><code><form name="aspnetForm" method="post" action="/StarlinkLaunchPasses.aspx?lat=48.55&amp;lng=11.53&amp;loc=Somewhere&amp;alt=0&amp;tz=CET" id="aspnetForm">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="DprSo0lEG4wbQojWQ3ub7mILDflL+omP+KQ .../>
...
<input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="9E5B71D1" />
<input type="hidden" name="utcOffset" id="utcOffset" value="7200000" />
...
<input type="hidden" name="ctl00$cph1$hidStartUtc" id="ctl00_cph1_hidStartUtc" value="637211090517289358" />
...
# and here is the dropdown list:
<select name="ctl00$cph1$ddlLaunches" id="ctl00_cph1_ddlLaunches">
<option selected="selected" value="2020019">Starlink 5, 18 March 2020 12:16</option>
<option value="2020012">Starlink 4, 17 February 2020 15:06</option>
<option value="2020006">Starlink 3, 29 January 2020 14:07</option>
<option value="2020001">Starlink 2, 07 January 2020 02:19</option>
<option value="2019074">Starlink 1, 11 November 2019 14:56</option>
<option value="2019029">Starlink 0, 24 May 2019 02:30</option>
</select>
</code></pre>
Could you help me find a possible solution? Thanks in advance.


1 answer

Anonymous, 1 day ago

For a page like this you do not need <code>Scrapy</code> or <code>Selenium</code>. You can reach your goal with <code>requests</code>, <code>bs4</code> and <code>pandas</code>. Now let's put the plan into action:
<hr/>
1. We check the <a href="https://developer.mozilla.org/en-US/docs/Tools/Network_Monitor" rel="nofollow noreferrer">Network Monitor</a> in your <code>browser</code> to see what happens after the date is changed.
<a href="https://i.stack.imgur.com/4Ce2v.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/4Ce2v.png" alt="enter image description here"/></a>
<ul>
<li>As you can see, a <code>POST</code> request was made to the <a href="https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere" rel="nofollow noreferrer">host</a> with several <code>Form data</code> fields.
Q: Why did your call to the url return a response without passing any POST data?
A: Because the <code>host</code> sets one specific date from the <code>drop down</code> as the <code>static</code> default, which is <code>18 March 2020 12:16</code>, as you can see as soon as you open the url.</li>
</ul>
<blockquote> Notes: </blockquote>
<ol>
<li>You do not need to parse the <code>HTML</code> and search for the table before reading it with <code>Pandas</code>, because you can do it in one call. <code>pandas</code> has a function named <code>read_html</code> which parses the <code>HTML</code> and reads the <code>tables</code> into a list for you. You can move between them by indexing with <code>[]</code>.</li>
</ol>
<pre class="lang-py prettyprint-override"><code>import pandas as pd

df = pd.read_html(
    "https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere")[0]
print(df)
</code></pre>
<ol start="2">
<li>You do not need a <a href="https://www.journaldev.com/23598/python-raw-string" rel="nofollow noreferrer">raw string</a> here at all. A Python raw string treats the backslash (<code>\</code>) as a literal character, which only matters in the cases where a backslash has to be passed to the <code>host</code> unchanged.</li>
</ol>
<hr/>
2. We look at all the <code>parameters</code> in the <code>Form data</code>, discard the empty values <code>""</code>, and check which <code>values</code> are <code>filled</code>. If we now refresh the page, we notice that some of the <code>values</code> have changed. So we inspect the <code>HTML</code> source to see whether those <code>values</code> can be found there.
<a href="https://i.stack.imgur.com/q2HCj.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/q2HCj.png" alt="enter image description here"/></a>
As you can see, the <code>parameters</code> and their <code>values</code> appear in the part of the page source shown in the <code>screen-shot</code> above.
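The lookup described in step 2 can be tried offline before touching the site. The following is only a sketch under stated assumptions: it uses the standard library's <code>html.parser</code> instead of <code>bs4</code> so that it runs without extra packages, the markup is a trimmed copy of the form from the question with a placeholder <code>__VIEWSTATE</code> value, and <code>FormFieldParser</code> and <code>build_payload</code> are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect hidden <input> values and the <option> values of each <select>."""

    def __init__(self):
        super().__init__()
        self.hidden = {}     # input name -> value
        self.options = {}    # select name -> list of option values
        self._select = None  # name of the <select> currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value", "")
        elif tag == "select":
            self._select = attrs.get("name")
            self.options[self._select] = []
        elif tag == "option" and self._select is not None:
            self.options[self._select].append(attrs.get("value"))

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


# Trimmed sample based on the form in the question; the __VIEWSTATE value
# is a placeholder, not real data from the site.
SAMPLE_HTML = """
<form name="aspnetForm" method="post" id="aspnetForm">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="PLACEHOLDER_STATE" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="9E5B71D1" />
  <input type="hidden" name="utcOffset" id="utcOffset" value="7200000" />
  <select name="ctl00$cph1$ddlLaunches" id="ctl00_cph1_ddlLaunches">
    <option selected="selected" value="2020019">Starlink 5, 18 March 2020 12:16</option>
    <option value="2020012">Starlink 4, 17 February 2020 15:06</option>
  </select>
</form>
"""


def build_payload(hidden, launch):
    """Copy the hidden fields and select one launch from the drop-down."""
    payload = dict(hidden)
    payload["__EVENTTARGET"] = "ctl00$cph1$ddlLaunches"
    payload["ctl00$cph1$ddlLaunches"] = launch
    return payload


parser = FormFieldParser()
parser.feed(SAMPLE_HTML)
launches = parser.options["ctl00$cph1$ddlLaunches"]
payload = build_payload(parser.hidden, launches[1])
print(sorted(payload))
```

Collecting every hidden input, rather than naming each one, is a design choice made for this sketch; whether the site accepts the extra fields has not been checked here.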
Here are the values of the <code>drop-down</code> options. This is the <code>important</code> part, because we need to pass one of them as the value of the drop-down's <code>parameter</code>.
<h2><a href="https://i.stack.imgur.com/RUJjt.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/RUJjt.png" alt="enter image description here"/></a></h2>
3. Now we need to make a <code>GET</code> request while maintaining a <code>session</code> object, parse the page behind the <code>url</code> to collect all the required <code>parameters</code> and their values, and then make the <code>post</code> request. We read the result with <code>Pandas</code>.
<ul>
<li>Q: Why don't we use Pandas directly to read the HTML table?
A: Because <code>Pandas</code> has no option for passing <code>Form data</code>. So we use <code>requests</code>, pass the <code>Form data</code> via <code>data=</code>, and then read the <code>content</code> with <code>read_html</code>.</li>
</ul>
Finally, we save each table to a <code>csv</code> file named after its drop-down value.

Final code:
<pre class="lang-py prettyprint-override"><code>import io
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def main(url):
    # One session keeps the cookies between the initial GET and the POSTs.
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # The drop-down option values identify the launches, e.g. "2020019".
        launches = [item.get("value") for item in soup.find_all(
            "option", value=re.compile(r"\d{6}"))]
        # Hidden form fields that have to be sent back with the POST.
        vs = soup.find("input", id="__VIEWSTATE").get("value")
        vsg = soup.find("input", id="__VIEWSTATEGENERATOR").get("value")
        ut = soup.find("input", id="ctl00_cph1_hidStartUtc").get("value")
        for launch in launches:
            data = {
                '__EVENTTARGET': 'ctl00$cph1$ddlLaunches',
                '__EVENTARGUMENT': '',
                '__LASTFOCUS': '',
                '__VIEWSTATE': vs,
                '__VIEWSTATEGENERATOR': vsg,
                'utcOffset': '0',
                'ctl00$ddlCulture': 'en',
                'ctl00$cph1$hidStartUtc': ut,
                'ctl00$cph1$ddlLaunches': launch
            }
            r = req.post(url, data=data)
            # StringIO works across pandas versions; newer releases
            # deprecate passing literal HTML directly.
            df = pd.read_html(io.StringIO(r.text))[0]
            df.to_csv(f"{launch}.csv", index=False)


main("https://heavens-above.com/StarlinkLaunchPasses.aspx?lat=50&lng=12&loc=Somewhere")
</code></pre>
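The question also asks for one combined table that can be filtered by time and visibility. A possible follow-up step is sketched below. It is not part of the code above: <code>combine_passes</code> is a hypothetical helper, the rows are made-up sample data, and the column names <code>mag</code> and <code>s_time</code> are assumed from the column list in the question.

```python
import pandas as pd


def combine_passes(frames, max_mag=None, start_between=None):
    """Stack per-launch tables into one DataFrame with a `launch` column.

    frames        : dict mapping a launch id (drop-down value) to its table
    max_mag       : keep only passes at least this bright
                    (a smaller magnitude means a brighter pass)
    start_between : (earliest, latest) start times as "HH:MM:SS" strings
    """
    combined = pd.concat(
        [df.assign(launch=launch) for launch, df in frames.items()],
        ignore_index=True,
    )
    if max_mag is not None:
        combined = combined[combined["mag"] <= max_mag]
    if start_between is not None:
        earliest, latest = start_between
        # Zero-padded "HH:MM:SS" strings compare correctly as text.
        combined = combined[combined["s_time"].between(earliest, latest)]
    return combined.reset_index(drop=True)


# Made-up rows, for illustration only.
frames = {
    "2020019": pd.DataFrame({
        "date": ["21 Mar", "22 Mar"],
        "satellite": ["Starlink-A", "Starlink-B"],
        "mag": [2.1, 4.5],
        "s_time": ["20:15:03", "05:02:40"],
    }),
    "2020012": pd.DataFrame({
        "date": ["21 Mar"],
        "satellite": ["Starlink-C"],
        "mag": [3.0],
        "s_time": ["21:40:10"],
    }),
}

evening = combine_passes(frames, max_mag=3.5,
                         start_between=("19:00:00", "23:00:00"))
print(evening)
```

In the scraping loop, each <code>df</code> could be stored in such a dict under its drop-down value instead of, or in addition to, being written to a csv file.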
