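Unable to scrape a Japanese website with BeautifulSoup

3 Answers

User

#1 · edited 2024-06-08 22:54:04

The problem you are running into is that the site identifies your request as coming from a bot, so it is blocking your request.

The usual trick is to attach the same headers (including cookies) that the browser sends with its request. If you go to Inspect > Network > Request > Copy > Copy as cURL, you can see all of the headers Chrome is sending.

When you run your script, you get the following result: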

You reached this page when attempting to access https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono from 152.172.223.133 on 2019-09-18 02:21:34 UTC.
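The header-copying trick described in this answer can be sketched as follows. This is only an illustration: `parse_raw_headers` is a hypothetical helper, and the two sample header lines are placeholders for whatever your own browser actually sends. It turns "Name: value" lines pasted from the developer tools into a dict that can be passed to requests.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(':'):
            continue
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(':')
        if sep:
            headers[name.strip()] = value.strip()
    return headers


# Hypothetical sample; paste the values your own browser sends instead.
RAW = """
accept-language: ja,en;q=0.8
user-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
"""

browser_headers = parse_raw_headers(RAW)
print(browser_headers)
```

The resulting dict could then be handed to `requests.get(url, headers=browser_headers)` or to `session.headers.update(browser_headers)`.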

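User

#2 · edited 2024-06-08 22:54:04

In the latest version of your code, once you decode the soup you can no longer use BeautifulSoup functions such as find and find_all. But we will come back to that later.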
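As a quick illustration of the decoding point, here is a minimal sketch. It assumes that "decoding" means calling something like `soup.decode()` (the asker's exact code is not shown here), and it uses a small inline HTML string instead of the real page.

```python
from bs4 import BeautifulSoup

html = '<dl class="data payments"><dt>賃料：</dt></dl>'
soup = BeautifulSoup(html, 'html.parser')

# decode() renders the tree to a plain str; the result is no longer a
# BeautifulSoup object.
as_text = soup.decode()
print(type(as_text).__name__)

# A str has no find_all at all. It does have find, but that is plain
# substring search returning an index, not a tag lookup.
print(hasattr(as_text, 'find_all'))

# The tree search only works on the soup itself.
print(soup.find('dt').get_text())
```

So any searching should happen on the soup object, before it is converted to text.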

First of all

After getting the soup, you can print it, and you will see the following (only the key part is shown):

<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="0" http-equiv="expires"/>
<meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
<meta content="10; url=/distil_r_captcha.html?requestId=2ac19293-8282-4602-8bf5-126d194a4827&amp;httpReferrer=%2Fchintai%2F1001303243%2F%3FDOWN%3D2%26BKLISTID%3D002LPC%26sref%3Dlist_simple%26bi%3Dtatemono" http-equiv="refresh"/>

This means you did not get the elements you expected: you were detected as a crawler.
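A check like this can be automated. The sketch below is a heuristic, not the site's actual rule: it assumes that a block page can be recognized by a meta refresh whose target mentions "captcha", as in the output above. `looks_blocked` is a hypothetical helper, and it uses only the standard library.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag in a document."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag == 'meta':
            self.metas.append({k: (v or '') for k, v in attrs})


def looks_blocked(html: str) -> bool:
    """Heuristic: a meta refresh that redirects to a captcha page."""
    collector = _MetaCollector()
    collector.feed(html)
    for meta in collector.metas:
        if (meta.get('http-equiv', '').lower() == 'refresh'
                and 'captcha' in meta.get('content', '').lower()):
            return True
    return False


# Shortened version of the meta tag shown above.
sample = ('<meta content="10; url=/distil_r_captcha.html?'
          'requestId=2ac19293-8282-4602-8bf5-126d194a4827" '
          'http-equiv="refresh"/>')
print(looks_blocked(sample))
```

Calling this on the response text before parsing makes it clear whether a missing element is a parsing problem or a blocked request.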

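So @KunduK's answer is missing something, and the problem has nothing to do with the find function.

Main part

First, you need to make your Python script look less like a crawler.

Headers

Headers are commonly used to detect crawlers. With the requests library, once you have created a session you can inspect its default headers like this: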

>>> s = requests.session()
>>> print(s.headers)
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

As you can see, these headers tell the server that you are a crawler, namely python-requests/2.22.0.

So you need to change the User-Agent by updating the headers.

s = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
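To confirm what the session will actually send without touching the network, one option is to prepare a request and inspect it. This is a minimal sketch that reuses the User-Agent string from above; it builds the request but never sends it.

```python
import requests

s = requests.session()
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
})

# Build the request without sending it; prepare_request merges the
# session-level headers into the outgoing request.
req = requests.Request('GET', 'https://www.athome.co.jp/')
prepared = s.prepare_request(req)

for name, value in prepared.headers.items():
    print(f'{name}: {value}')
```

If the printed User-Agent still starts with python-requests, the update was applied to a different session or headers dict than the one used for the request.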

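However, when I tested the crawler, it was still detected as a crawler. So we need to dig further into the headers. (The cause could also be something else, such as IP blocking or cookies. I will come back to that later.)

In Chrome, open the developer tools and then open the website. (To simulate a first visit, it is best to clear the cookies first.) After clearing the cookies, refresh the page. The Network tab of the developer tools now shows the many requests Chrome makes.

Click the first entry, https://www.athome.co.jp/, and a detail panel appears on the right. Its Request Headers section lists the headers Chrome generated for its request to the target site's server.

To make sure everything works, you can add everything from these Chrome headers to your crawler, and the site can no longer tell whether you are a real Chrome or a crawler. (This holds for most sites, but I have also found some sites with strange setups that require a special header in every request.)

I found that once accept-language is added, the site's anti-crawler check lets you through.

So, all in all, you need to update your headers like this.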

headers = {
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
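The digging described above (start from the full set of browser headers and find out which ones matter) can be sketched as a simple leave-one-out loop. Everything here is illustrative: `find_required_headers` is a hypothetical helper, and `fake_passes` is a stand-in that merely mimics the observation in this answer. A real check would send a request with the given headers and report whether the block page came back.

```python
from typing import Callable, Dict, List


def find_required_headers(
    full_headers: Dict[str, str],
    passes: Callable[[Dict[str, str]], bool],
) -> List[str]:
    """Return the headers whose removal makes the check fail."""
    required = []
    for name in full_headers:
        # Drop exactly one header and see whether the request still passes.
        trimmed = {k: v for k, v in full_headers.items() if k != name}
        if not passes(trimmed):
            required.append(name)
    return required


# Stand-in for a real network check (an assumption, not the site's logic).
def fake_passes(headers: Dict[str, str]) -> bool:
    return 'accept-language' in headers and 'User-Agent' in headers


candidate = {
    'accept': 'text/html',  # hypothetical extra header
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
print(find_required_headers(candidate, fake_passes))
```

With a real `passes` function, keep the request rate low; each candidate header costs one request.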

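Cookies

For an explanation of cookies, you can refer to the wiki. There is an easy way to get the cookie. First, initialize a session and update the headers as described above. Second, request the page https://www.athome.co.jp; once you have fetched it, you have the cookie issued by the server.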

s.get(url='https://www.athome.co.jp')

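The advantage of requests.session is that the session maintains the cookies for you, so your next request automatically uses this cookie.

You can check the cookie you obtained like this: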

print(s.cookies)

My result is:

<RequestsCookieJar[Cookie(version=0, name='athome_lab', value='ffba98ff.592d4d027d28b', port=None, port_specified=False, domain='www.athome.co.jp', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1884177606, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>

You do not need to parse this page, because you only need the cookie, not the content.
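The cookie handling above can be checked offline. In this sketch the cookie is set by hand as a stand-in for the one the homepage request would return (the value `demo-value` is made up), and the request is only prepared, never sent.

```python
import requests

s = requests.session()

# Stand-in for the cookie the homepage would issue; the value is made up.
s.cookies.set('athome_lab', 'demo-value',
              domain='www.athome.co.jp', path='/')

# A plain dict view is easier to test than the RequestsCookieJar repr.
cookies = requests.utils.dict_from_cookiejar(s.cookies)
if 'athome_lab' not in cookies:
    raise RuntimeError('no cookie received; the request may have been blocked')
print(cookies)

# Prepare (but do not send) a follow-up request to see the Cookie header
# the session would attach for this URL.
prepared = s.prepare_request(
    requests.Request('GET', 'https://www.athome.co.jp/chintai/1001303243/'))
print(prepared.headers.get('Cookie'))
```

In the real flow, the `s.cookies.set(...)` line is replaced by the `s.get(...)` call on the homepage, and the same check tells you whether the server issued a cookie.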

Getting the content

You can use the session you obtained to request the wiki page you mentioned.

wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)

Now the server sends you everything you wanted, and you can parse it with BeautifulSoup.

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

Once you have the content, you can use BeautifulSoup to get the target element.

soup.find('dl', attrs={'class': 'data payments'})

You will get:

<dl class="data payments">
<dt>賃料：</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>

You can extract the information you want from it.

target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()

Format it into a single line.

print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))

Everything is done.
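If the request is blocked, `soup.find(...)` returns None and the chained `.find('dt')` raises an AttributeError. A slightly more defensive version of the extraction step might look like the sketch below. `extract_payment` is a hypothetical helper, and the sample HTML is the element shown above, not a live response.

```python
from typing import Optional, Tuple

from bs4 import BeautifulSoup


def extract_payment(html: str) -> Optional[Tuple[str, str]]:
    """Return (label, value) from the payments block, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    block = soup.find('dl', attrs={'class': 'data payments'})
    if block is None:
        # Typical when the block page was served instead of the listing.
        return None
    dt = block.find('dt')
    dd = block.find('dd')
    if dt is None or dd is None:
        return None
    # Drop the trailing full-width colon from the label.
    label = dt.get_text(strip=True).rstrip('：')
    return label, dd.get_text(strip=True)


sample = ('<dl class="data payments"><dt>賃料：</dt>'
          '<dd><span class="num">7.3万円</span></dd></dl>')
print(extract_payment(sample))
print(extract_payment('<html><body>blocked</body></html>'))
```

Returning None instead of raising lets the caller decide whether to retry, refresh the cookies, or stop.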

Summary

I will paste the full code below.

# Import packages you want.
import requests
from bs4 import BeautifulSoup

# Initiate a session and update the headers.
s = requests.session()
headers = {
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)

# Get the homepage of the website and get cookies.
s.get(url='https://www.athome.co.jp')
"""
# You might need to use the following part to check if you have successfully obtained the cookies.
# If not, you might be blocked by the anti-crawler.
print(s.cookies)
"""
# Get the content from the page.
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)

# Parse the webpage for getting the elements.
soup = BeautifulSoup(page.content, 'html.parser')
target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()

# Print the result.
print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))

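There is still a long way to go in the field of web crawling.

You would do well to read up online and make full use of the developer tools in your browser.

You may need to find out whether the content is loaded by JavaScript or inside an iframe.

What's more, you may be detected as a crawler and blocked by the server. Getting past anti-crawler measures is something you can only learn by writing more code.

I suggest you start with a simpler website that has no anti-crawler measures.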
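For the JavaScript and iframe question raised above, a rough first check is to list the iframe and external script sources in the HTML you received. This sketch uses only the standard library. The sample markup and its paths are invented for illustration, and the presence of scripts alone does not prove that the content is rendered by JavaScript.

```python
from html.parser import HTMLParser


class FrameAndScriptFinder(HTMLParser):
    """Collect the src of every <iframe> and external <script>."""

    def __init__(self):
        super().__init__()
        self.iframes = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get('src')
        if tag == 'iframe' and src:
            self.iframes.append(src)
        elif tag == 'script' and src:
            self.scripts.append(src)


# Invented sample markup; feed page.text from a real response instead.
sample = '''
<html><body>
<script src="/js/app.js"></script>
<iframe src="/embed/listing.html"></iframe>
<div id="content"></div>
</body></html>
'''

finder = FrameAndScriptFinder()
finder.feed(sample)
print(finder.iframes)
print(finder.scripts)
```

If the element you want is missing from the raw HTML but an iframe is listed, requesting the iframe's src directly is often the next thing to try.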

User

#3 · edited 2024-06-08 22:54:04

Try the code below, which finds the element by tag and class name.

from bs4 import BeautifulSoup
import requests
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki,headers=headers)

soup = BeautifulSoup(page.content, 'lxml')

for i in soup.find_all("dl",class_="data payments"):
   print(i.find('dt').text)
   print(i.find('span').text)

Output:

賃料：
7.3万円

If you want to shape the output into the form you expect, try this.

from bs4 import BeautifulSoup
import requests
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki,headers=headers)

soup = BeautifulSoup(page.content, 'lxml')

for i in soup.find_all("dl",class_="data payments"):
   print("Payment: " + i.find('dt').text.split('：')[0] + " is " + i.find('span').text)

Output:

Payment: 賃料 is 7.3万円
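If the goal is to work with the rent as a number rather than as text, one possible follow-up is to convert the 万円 amount (units of 10,000 yen) into plain yen. `man_yen_to_yen` is a hypothetical helper and is not part of the answer above. It uses Decimal to avoid floating-point rounding.

```python
import re
from decimal import Decimal
from typing import Optional


def man_yen_to_yen(text: str) -> Optional[int]:
    """Convert a string such as '7.3万円' to an integer amount in yen."""
    match = re.search(r'(\d+(?:\.\d+)?)\s*万円', text)
    if match is None:
        return None
    # 1万 = 10,000; Decimal keeps 7.3 * 10000 exact.
    return int(Decimal(match.group(1)) * 10000)


print(man_yen_to_yen('7.3万円'))  # 73000
```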
