Python web crawler from thenewboston

import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    #requests.get('https://www.thenewboston.com/', verify = True)
    while page <= max_pages:
        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class' : 'item-name'}):
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)
        page += 1

creepy_crawly(1)

1 answer

Anonymous user

#1 · Posted 2024-04-27 09:58:28

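I wrote a web crawler with urllib. It can be faster and has no trouble fetching https pages, but note that urllib.urlopen (the Python 2 API used here) does not verify the server certificate. That makes it faster but also more dangerous, since it is vulnerable to MITM attacks. Here is an example of how to use the library: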

import urllib  # Python 2; in Python 3 the function is urllib.request.urlopen

link = 'https://www.stackoverflow.com'
html = urllib.urlopen(link).read()
print(html)

Fetching the HTML of a page takes only three lines. Simple, isn't it?

More about urllib: https://docs.python.org/2/library/urllib.html
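Since skipping certificate verification is the main risk mentioned above, here is a minimal sketch of the same fetch with verification turned on. It assumes Python 3, where the function is urllib.request.urlopen, and the helper names make_verifying_context and fetch_html are hypothetical, chosen only for illustration.

```python
import ssl
import urllib.request


def make_verifying_context():
    """Return an SSL context that checks certificates and hostnames."""
    # create_default_context() sets CERT_REQUIRED and enables hostname checks.
    return ssl.create_default_context()


def fetch_html(url, timeout=10):
    """Download a page over HTTPS with certificate verification enabled."""
    context = make_verifying_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        # Decode with the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html('https://www.stackoverflow.com') should then return the page as text, and a server presenting an invalid certificate should cause an error instead of being trusted silently.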

I also suggest running a regex over the HTML to extract further links. For example, using the re library, it could look like this:

[code example not preserved in the source]
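The regex idea could be sketched roughly as follows. This is not the answerer's original snippet: the pattern, the constant HREF_PATTERN, and the function extract_links are assumptions made for illustration, and the pattern is deliberately simple, so it will miss unquoted attributes and other unusual markup.

```python
import re
from urllib.parse import urljoin

# Captures the value of a quoted href attribute inside an <a ...> tag.
HREF_PATTERN = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute URLs for every quoted <a href> in html, in page order."""
    # urljoin resolves relative paths against base_url and leaves
    # absolute URLs unchanged.
    return [urljoin(base_url, href) for href in HREF_PATTERN.findall(html)]
```

Feeding each returned URL back into the fetch step gives a basic crawl loop. For messy real-world pages, an HTML parser such as the BeautifulSoup call in the question is usually more robust than a regex.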
