Scrapy spider: don't crawl sites that are in a list

1 answer

User
#1 · posted 2024-04-25 18:56:01

SgmlLinkExtractor accepts a process_value callable:
a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
So something like this should help:
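    import re

    def process_value(value):
        # Drop links whose item ID is already known; already_crawled_site_ids
        # is assumed to be a collection of ID strings defined elsewhere.
        match = re.search(r"/item/(\d+)", value)
        if match and match.group(1) in already_crawled_site_ids:
            return None
        return value

    # process_value is an argument of SgmlLinkExtractor, not of Rule.
    rules = [Rule(SgmlLinkExtractor(allow=[r'/item/\d+'], process_value=process_value), 'parse_item')]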
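
To check the filtering logic on its own, without running a spider, here is a minimal sketch. The factory `make_process_value`, the sample IDs, and the example.com URLs are all hypothetical and only for illustration; the returned callable follows the contract quoted above (return the value to keep the link, return None to ignore it).

```python
import re

ITEM_ID_RE = re.compile(r"/item/(\d+)")


def make_process_value(already_crawled_ids):
    """Build a process_value-style callable bound to a set of ID strings."""
    def process_value(value):
        match = ITEM_ID_RE.search(value)
        # Links without an item ID are passed through unchanged.
        if match and match.group(1) in already_crawled_ids:
            return None  # signals "ignore this link"
        return value
    return process_value


# Hypothetical sample data: two item IDs that were crawled earlier.
process_value = make_process_value({"101", "205"})

links = [
    "http://example.com/item/101",
    "http://example.com/item/300",
    "http://example.com/about",
]
kept = [link for link in links if process_value(link) is not None]
print(kept)
```

Note that the regex group is a string, so the IDs in the set have to be strings too; a set of integers would never match and no link would be filtered out.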
