How do I make a Python web crawler run indefinitely and record links?

import requests
from bs4 import BeautifulSoup


def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = ''  # start URL (left empty in the question)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.find_all("a"):
            href = link.get("href")
            title = link.get("title")
            links = []
            # print(href)
            # print(title)
            try:
                # Follow each link found on the start page
                get_single_user_data(href)
            except Exception:
                # Skip links that are missing, relative, or fail to load
                pass
        page += 1


def get_single_user_data(user_url):
    source_code = requests.get(user_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # for item_name in soup.find_all('span', {'id': 'mm-saleDscPrc'}):
    #     print(item_name.string)
    for link in soup.find_all("a"):
        href = link.get("href")
        print(href)


spider(1)

1 answer

User

#1 · Posted 2024-04-19 23:26:07

I've tried to make it infinite, as in it will get every link on every link ever recorded

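Unless you have a decently sized data center, that is not going to happen. But for the sake of it: all you need is a larger starting pool of sites to crawl, ones that link out to other sites, and you will get plenty. Start with all the outbound links from Reddit or something similar.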
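A minimal sketch of that idea, assuming a queue ("frontier") seeded with several start URLs and a page budget so the loop always terminates. The names crawl, seed_urls, fetch_links and max_pages are hypothetical, not from the question; fetch_links stands for any function that returns the absolute links on a page, for example one built on requests.get and BeautifulSoup as in the question's code. Duplicate filtering is deliberately left out here.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl starting from a pool of seed URLs.

    fetch_links is a hypothetical callable: given a URL, it returns
    the list of absolute URLs found on that page.
    """
    frontier = deque(seed_urls)  # URLs waiting to be fetched
    fetched = []                 # URLs fetched so far, in order
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            links = fetch_links(url)
        except Exception:
            continue  # skip pages that fail to download or parse
        fetched.append(url)
        frontier.extend(links)   # new links join the back of the queue
    return fetched
```

Raising max_pages (or removing the check) lets the crawl keep going for as long as the frontier is non-empty; on the open web that is effectively forever, which is the point the answer makes about needing a data center.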

I also have a problem of recording the same link more than once?

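I suggest using a hash table to record the links of the sites you have already visited, and checking whether a link is already in it before you visit.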
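In Python, the built-in set is a hash table, so one possible sketch of this check looks like the following. The helper name new_links and its parameters are hypothetical; resolving relative links and dropping #fragments before the lookup is an added assumption, made so that the same page is not recorded under two spellings.

```python
from urllib.parse import urldefrag, urljoin


def new_links(page_url, hrefs, seen):
    """Return the links in hrefs not yet in seen, recording them as we go."""
    fresh = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        # Resolve relative links against the page and drop any #fragment
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if not absolute.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, and similar schemes
        if absolute not in seen:  # average O(1) hash lookup
            seen.add(absolute)
            fresh.append(absolute)
    return fresh
```

A crawler would keep one seen set for the whole run, pass every page's href values through new_links, and only queue or print what comes back.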
