Unable to load the web page https://www.riachuelo.com.br/feminino/colecao-feminino using Selenium and Python

Posted 2024-05-15 09:03:44

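I have been trying to scrape this page (https://www.riachuelo.com.br/feminino/colecao-feminino) with Selenium, but I cannot get at the HTML because the page never loads. I have tried random user agents and other browsers, but the problem persists. Do you know why this happens?

Here is the code: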

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent

URL = "https://www.riachuelo.com.br/feminino/colecao-feminino"

# Send a random user-agent string with this session
options = Options()
ua = UserAgent()
userAgent = ua.random
options.add_argument(f'user-agent={userAgent}')

# Selenium 3 style call: chrome_options is the deprecated alias of options
driver = webdriver.Chrome(chrome_options=options,executable_path=r"C:\Program Files (x86)\chromedriver.exe")
driver.get(URL)


1 Answer
User
#1 · Posted 2024-05-15 09:03:44

I ran your use case with Selenium to load the web page https://www.riachuelo.com.br/feminino/colecao-feminino, as follows:

from selenium import webdriver

options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.riachuelo.com.br/feminino/colecao-feminino')
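
Newer Selenium 4 releases no longer accept executable_path as a keyword on webdriver.Chrome, so the snippet above may need adapting. The following is only a sketch of the same flags under the Service API; the helper name try_load, the 30 second default timeout, and the idea of treating a page-load timeout as "never loads" are assumptions added for illustration, not part of the original answer.

```python
from typing import Optional

# Same flags as in the snippet above.
CHROME_ARGUMENTS = ["start-maximized"]
EXPERIMENTAL_OPTIONS = {
    "excludeSwitches": ["enable-automation"],
    "useAutomationExtension": False,
}


def try_load(url: str, timeout_s: float = 30.0,
             driver_path: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: return the page title, or None on a load timeout."""
    # Imported lazily so this module can be imported without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for argument in CHROME_ARGUMENTS:
        options.add_argument(argument)
    for name, value in EXPERIMENTAL_OPTIONS.items():
        options.add_experimental_option(name, value)

    # With no explicit path, recent Selenium 4 releases try to locate a driver.
    service = Service(executable_path=driver_path) if driver_path else Service()
    driver = webdriver.Chrome(service=service, options=options)
    try:
        # Give up instead of waiting indefinitely for the page to finish loading.
        driver.set_page_load_timeout(timeout_s)
        driver.get(url)
        return driver.title
    except TimeoutException:
        return None
    finally:
        driver.quit()
```

Calling try_load('https://www.riachuelo.com.br/feminino/colecao-feminino') would then return None rather than hang if the page does not finish loading within the timeout.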

Like you, I hit the same roadblock: the web page never loaded.

[Screenshot: riachuelo]


Analysis

While inspecting the DOM tree of the web page, you will find that some of the <iframe> and <script> tags refer to the keyword dist. For example:

  • src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/../index.html#!/?token=c243ce95-db6c-4ab6-9f2b-bf60d69c2d3d&widget=true&top=40&text=Alguma%20d%C3%BAvida%3F&textcolor=ffffff&bgcolor=4E1D3A&from=bottomRigth"
  • <script id="dtbot-script" src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/dtbot.js?token=c243ce95-db6c-4ab6-9f2b-bf60d69c2d3d&amp;widget=true&amp;top=40&amp;text=Alguma%20d%C3%BAvida%3F&amp;textcolor=ffffff&amp;bgcolor=4E1D3A&amp;from=bottomRigth"></script>
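
The inspection described above can be sketched with the standard library alone. This is a minimal illustration, not the answerer's method: the class name SrcKeywordScanner, the function find_src_matches, and the sample markup (the two src values above with their query strings dropped, plus a made-up example.com script) are assumptions. In a real session the HTML would come from driver.page_source.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class SrcKeywordScanner(HTMLParser):
    """Collect (tag, src) pairs for <iframe>/<script> tags whose src has a keyword."""

    def __init__(self, keyword: str) -> None:
        super().__init__()
        self.keyword = keyword
        self.matches: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # Only <iframe> and <script> tags are of interest here.
        if tag not in ("iframe", "script"):
            return
        src = dict(attrs).get("src") or ""
        if self.keyword in src:
            self.matches.append((tag, src))


def find_src_matches(html: str, keyword: str = "dist") -> List[Tuple[str, str]]:
    """Return every (tag, src) pair whose src attribute contains the keyword."""
    scanner = SrcKeywordScanner(keyword)
    scanner.feed(html)
    scanner.close()
    return scanner.matches


# Shortened sample markup (assumption): query strings removed for brevity.
sample = (
    '<iframe src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/../index.html"></iframe>'
    '<script id="dtbot-script" '
    'src="https://dtbot.directtalk.com.br/1.0/staticbot/dist/js/dtbot.js"></script>'
    '<script src="https://example.com/app.js"></script>'
)

for tag, src in find_src_matches(sample):
    print(tag, src)
```

A plain substring match is deliberately crude: it flags any src containing dist, so each hit still needs to be read by hand.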

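This is a clear indication that the website is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver is detected and subsequently blocked.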


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further:

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".

