Web scraping with Python sometimes returns results, sometimes not

Published 2024-03-28 17:52:54


I'm trying to fetch videos from a reddit page, using Python and Beautiful Soup to do the job. The code below sometimes returns results, and sometimes returns nothing when I re-run it. I'm not sure where I'm going wrong. Can anyone help? I'm new to Python, so please bear with me.

import requests
from bs4 import BeautifulSoup


page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

2 Answers

If you do print(page) right after page = requests.get('https:/.........'), you will see that you get a successful <Response [200]>.

But if you quickly run it again, you will get <Response [429]>.

"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ('rate limiting')." Source here

Also, if you look at the HTML source, you will see:

<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
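The wait-and-retry advice in that message can be sketched as a small helper. This is not from the original answer: retry_delay and get_with_retry are hypothetical names, and the exponential fallback schedule is an assumption for when no Retry-After header is sent.

```python
import time
import requests

def retry_delay(attempt, retry_after=None):
    # Honour a Retry-After header when the server sends one;
    # otherwise back off exponentially: 2, 4, 8, ... seconds.
    if retry_after is not None:
        return float(retry_after)
    return 2.0 * (2 ** attempt)

def get_with_retry(url, headers=None, max_attempts=4):
    # Hypothetical helper: re-issue the GET whenever reddit answers 429,
    # waiting between attempts; give up after max_attempts tries.
    for attempt in range(max_attempts):
        page = requests.get(url, headers=headers)
        if page.status_code != 429:
            return page
        time.sleep(retry_delay(attempt, page.headers.get("Retry-After")))
    return page
```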

To add headers and avoid the 429, add:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

Full code:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print(page)

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

Output:

<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]

And re-running it after waiting a second or two works without any problem.
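Reddit's one-request-every-two-seconds guideline quoted earlier can be applied when scraping several threads by pausing between requests. A sketch, not from the answer: scrape_sources is a made-up name, and the caller is assumed to supply the URL list and headers dict.

```python
import time
import requests
from bs4 import BeautifulSoup

REQUEST_INTERVAL = 2.0  # reddit's API wiki: at most one request every two seconds

def scrape_sources(urls, headers):
    # Hypothetical loop: fetch each thread page in turn, sleeping
    # between requests so we stay under the rate limit.
    results = {}
    for i, url in enumerate(urls):
        if i:
            time.sleep(REQUEST_INTERVAL)
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.text, "html.parser")
        results[url] = [tag.get("src") for tag in soup.find_all("source")]
    return results
```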

I tried the code below, which adds a 30-second timeout to each of my requests.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', timeout=30)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'lxml')
    source_tags = soup.find_all('source')
    print(source_tags)
else:
    print(page.status_code, page)
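An alternative to branching on page.status_code by hand is requests' own raise_for_status(), which turns any 4xx/5xx answer (including the 429 discussed above) into an exception. A sketch only; fetch_sources is a hypothetical name.

```python
import requests
from bs4 import BeautifulSoup

def fetch_sources(url, timeout=30):
    # Hypothetical variant: let requests raise requests.HTTPError on
    # any 4xx/5xx status instead of checking status_code manually.
    page = requests.get(url, timeout=timeout)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "lxml")
    return soup.find_all("source")
```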
