How to incrementally scrape large sites in near real time

1 Answer

User

#1 · Posted 2024-06-06 10:05:54

For example: I have a site with 100 pages of 10 records each. I scrape page 1, then go to page 2. But on a fast-growing site, by the time I request page 2 there might be 10 new records, so I would get the same items again. I would still get all items in the end. BUT the next time I scrape this site, how would I know where to stop? I can't stop at the first record I already have in my database, because that record might suddenly be on the first page after receiving a new reply.

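Usually each record has a unique link (permalink). For example, the question above can be reached by entering just https://stackoverflow.com/questions/39805237/ and ignoring the text after it. Store the unique URL of every record, and on the next scrape skip the ones you already have.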
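A minimal sketch of that idea, assuming question URLs have the shape `https://<host>/questions/<id>/<optional-slug>`. The names `permalink` and `SeenStore` are made up for this illustration, and SQLite is just one convenient way to persist the set of known URLs between runs.

```python
import re
import sqlite3

# Assumed URL shape: https://<host>/questions/<id>/<optional-slug>
_QUESTION_RE = re.compile(r"^(https?://[^/]+/questions/\d+)(?:/.*)?$")


def permalink(url: str) -> str:
    """Drop fragment, query string, and slug, keeping only the stable part."""
    url = url.split("#", 1)[0].split("?", 1)[0]
    match = _QUESTION_RE.match(url)
    base = match.group(1) if match else url.rstrip("/")
    return base + "/"


class SeenStore:
    """Persistent set of permalinks that have already been scraped."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True only if it was not stored before."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO seen (url) VALUES (?)", (permalink(url),)
        )
        self.db.commit()
        return cur.rowcount == 1
```

Passing a file path instead of `":memory:"` keeps the store across runs, so a later scrape can call `add_if_new` on every link it finds and process only the ones that return `True`.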

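Take the python tag on Stack Overflow as an example: you can view its questions at https://stackoverflow.com/questions/tagged/python, but you cannot rely on the sort order to guarantee unique entries. One approach is to sort by newest questions and skip duplicates by URL.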

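You could have an algorithm that scrapes the first "n" pages every "x" minutes until it hits an existing record. The whole flow is somewhat site-specific, but as you scrape more sites your algorithm will become more generic and more robust at handling edge cases and new sites.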
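The loop described above could look roughly like the sketch below. Everything here is an assumption for illustration: `fetch_page` is a hypothetical callable that returns the record URLs on a given page of a newest-first listing, `seen` holds the URLs stored by earlier runs, and `stop_after_known` is an extra knob (not from the answer) for tolerating a few known records before stopping.

```python
from typing import Callable, Iterable, List, Set


def crawl_new(
    fetch_page: Callable[[int], Iterable[str]],  # hypothetical page fetcher
    seen: Set[str],
    max_pages: int = 5,          # the "n" pages to look at per run
    stop_after_known: int = 1,   # known records in a row before stopping
) -> List[str]:
    """Walk a newest-first listing and return URLs not stored before."""
    new_urls: List[str] = []
    this_run: Set[str] = set()
    known_streak = 0
    for page in range(1, max_pages + 1):
        for url in fetch_page(page):
            if url in this_run:
                # Listing shifted while paging: same item seen twice this run.
                continue
            if url in seen:
                known_streak += 1
                if known_streak >= stop_after_known:
                    return new_urls
                continue
            known_streak = 0
            this_run.add(url)
            seen.add(url)
            new_urls.append(url)
    return new_urls
```

Running `crawl_new` from a scheduler, or from a loop that sleeps "x" minutes between calls, gives the periodic behaviour. Tracking `this_run` separately keeps items that slide from page 1 to page 2 during a run from being mistaken for records from a previous run.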

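Another approach is not to run Scrapy yourself but to use a distributed spider service. These usually have multiple IPs and can crawl a large site within minutes. Just make sure you respect the site's robots.txt file and don't accidentally DDoS it.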
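If you do run your own crawler, one way to honour robots.txt is Python's standard `urllib.robotparser`. The sketch below parses robots.txt text that you have already downloaded. The helper names, the `example-bot` user agent, and the one-second fallback delay are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was fetched elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, user_agent: str = "example-bot") -> bool:
    """True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(
    parser: RobotFileParser, user_agent: str = "example-bot", default: float = 1.0
) -> float:
    """Seconds to wait between requests: Crawl-delay if given, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

Checking `allowed(...)` before each request and sleeping for `polite_delay(...)` between requests keeps the request rate bounded. Scrapy users can get the robots.txt check by enabling the `ROBOTSTXT_OBEY` setting instead.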
