Web scraping with Python sometimes returns results, sometimes not

Published 2024-03-28 17:52:54


I'm trying to fetch videos from a reddit page, using Python and Beautiful Soup to do the job. The code below sometimes returns results, and sometimes returns nothing when I re-run it. I'm not sure where I'm going wrong. Can anyone help? I'm new to Python, so please bear with me.

import requests
from bs4 import BeautifulSoup


page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

2 Answers

If you do print(page) right after page = requests.get('https:/.........'), you will see that you get a successful <Response [200]>.

But if you quickly run it again, you will get <Response [429]>.

"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ('rate limiting')." Source here

Also, if you look at the HTML source, you will see:

<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
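The wait-and-retry advice in that message can be sketched as a small helper. This is not from the original answer: retry_delay and get_with_retry are hypothetical names, and the exponential fallback schedule is an assumption for when no Retry-After header is sent.

```python
import time
import requests

def retry_delay(attempt, retry_after=None):
    # Honour a Retry-After header when the server sends one;
    # otherwise back off exponentially: 2, 4, 8, ... seconds.
    if retry_after is not None:
        return float(retry_after)
    return 2.0 * (2 ** attempt)

def get_with_retry(url, headers=None, max_attempts=4):
    # Hypothetical helper: re-issue the GET whenever reddit answers 429,
    # waiting between attempts; give up after max_attempts tries.
    for attempt in range(max_attempts):
        page = requests.get(url, headers=headers)
        if page.status_code != 429:
            return page
        time.sleep(retry_delay(attempt, page.headers.get("Retry-After")))
    return page
```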

To add headers and avoid the 429, add:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

Full code:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print(page)

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

Output:

<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]

And re-running it after waiting a second or two works without any problem.
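Reddit's one-request-every-two-seconds guideline quoted earlier can be applied when scraping several threads by pausing between requests. A sketch, not from the answer: scrape_sources is a made-up name, and the caller is assumed to supply the URL list and headers dict.

```python
import time
import requests
from bs4 import BeautifulSoup

REQUEST_INTERVAL = 2.0  # reddit's API wiki: at most one request every two seconds

def scrape_sources(urls, headers):
    # Hypothetical loop: fetch each thread page in turn, sleeping
    # between requests so we stay under the rate limit.
    results = {}
    for i, url in enumerate(urls):
        if i:
            time.sleep(REQUEST_INTERVAL)
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.text, "html.parser")
        results[url] = [tag.get("src") for tag in soup.find_all("source")]
    return results
```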

I tried the code below, which adds a 30-second timeout to each of my requests.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', timeout=30)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'lxml')
    source_tags = soup.find_all('source')
    print(source_tags)
else:
    print(page.status_code, page)
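An alternative to branching on page.status_code by hand is requests' own raise_for_status(), which turns any 4xx/5xx answer (including the 429 discussed above) into an exception. A sketch only; fetch_sources is a hypothetical name.

```python
import requests
from bs4 import BeautifulSoup

def fetch_sources(url, timeout=30):
    # Hypothetical variant: let requests raise requests.HTTPError on
    # any 4xx/5xx status instead of checking status_code manually.
    page = requests.get(url, timeout=timeout)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "lxml")
    return soup.find_all("source")
```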
