Python requests.get(url) returns empty content in Colab
I'm scraping a website with requests. Although response.status_code returns 200, indicating the request succeeded, there is nothing useful in response.text or response.content.
The same code works fine on another website, and it also runs without problems in my local Jupyter environment, but in Colab the request to the URL below is blocked by a web firewall.
Can you give me some advice?
Problem URL: https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1
import requests
from bs4 import BeautifulSoup as bs
url = 'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Whale/3.25.232.19 Safari/537.36'}
response = requests.get(url, headers=headers, data={'buscar':100000})
soup = bs(response.content, "html.parser")
soup
<br/>
<br/>
<center>
<h2>
The request / response that are contrary to the Web firewall security policies have been blocked.
</h2>
<table>
<tr>
<td>Detect time</td>
<td>2024-03-12 21:52:05</td>
</tr>
<tr>
<td>Detect client IP</td>
<td>35.236.245.49</td>
</tr>
<tr>
<td>Detect URL</td>
<td>https://gall.dcinside.com/board/view/</td>
</tr>
</table>
</center>
<br/>
I've tried changing the User-Agent, switching from https to http, and other suggestions from similar questions, but none of them worked.
1 Answer
If you're running into problems when sending HTTP requests with the requests module in Google Colab, there are a few possible causes.
1. Firewall or network restrictions: Sometimes network or firewall restrictions prevent your notebook from reaching external resources. If you're behind a proxy or firewall, you may need to configure a proxy in the notebook.
You can set a proxy for the notebook with the following snippet:
import os
# Route all HTTP(S) traffic from this notebook through your proxy.
os.environ['HTTP_PROXY'] = 'http://your_proxy_address:your_proxy_port'
os.environ['HTTPS_PROXY'] = 'http://your_proxy_address:your_proxy_port'
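If you'd rather not change the whole environment, requests also accepts a proxies argument on individual calls; here is a minimal sketch (the proxy address is a placeholder you must replace):
import requests
# Placeholder proxy; substitute your own address and port.
proxies = {
    'http': 'http://your_proxy_address:your_proxy_port',
    'https': 'http://your_proxy_address:your_proxy_port',
}
response = requests.get('https://gall.dcinside.com/board/view/', proxies=proxies)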
2. Blocked website: If the site you're trying to reach blocks traffic from the Colab environment, your requests won't get through even though the same code works locally.
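A quick way to confirm you're in this case is to look for the firewall's block message in the body; a small check, assuming the firewall returns the same block page shown in the question:
import requests
url = 'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1'
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
# The firewall answers with HTTP 200 but serves its own block page,
# so status_code alone cannot tell you the request was blocked.
blocked = 'Web firewall security policies' in response.text
print(response.status_code, 'blocked' if blocked else 'ok')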
Also, send all the request headers a real browser would, to reduce the chance of being blocked. Here is a modified version of the code:
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse, parse_qs
import os
# Add your proxy address and port to route the request through a proxy.
# Note: this uses the ScrapeOps proxy; you can get a trial plan and replace api_key with a valid key.
api_key = "0565b10e-c1b5-418c-b15d-02d4ebd5d6a2"
proxy_value = f"http://scrapeops:{api_key}@proxy.scrapeops.io:5353"
os.environ['HTTP_PROXY'] = proxy_value
os.environ['HTTPS_PROXY'] = proxy_value
def get_response_by_passing_headers(url):
    # Parse the query parameters out of the URL so they can be passed to the request
    parsed_url = urlparse(url)
    query_params = parse_qs(parsed_url.query)
    params = {key: value[0] for key, value in query_params.items()}
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'en-GB,en;q=0.9',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'sec-ch-ua': '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Linux"',
    }
    # Make the request with all the browser-like headers and query parameters;
    # verify=False skips TLS certificate verification because the proxy intercepts HTTPS
    response = requests.get('https://gall.dcinside.com/board/view/', params=params, headers=headers, verify=False)
    return response
url = 'https://gall.dcinside.com/board/view/?id=piano&no=1&exception_mode=notice&page=1'
response = get_response_by_passing_headers(url)
soup = bs(response.content, "html.parser")
print(soup)
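For context, the detected client IP in the block page (35.236.245.49) is a Google Cloud address, and many sites block cloud provider ranges outright, which is why the same code works locally but not in Colab and why routing through a proxy helps.
One follow-up note: because the request is made with verify=False, urllib3 emits an InsecureRequestWarning on every call; if that noise bothers you, it can be silenced like this:
import urllib3
# Suppress the warning raised because verify=False disables TLS certificate checks.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)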