Python web scraping and Scrapy spid - Q&A - Python中文网


Posted 2024-04-27 00:52:36



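I am writing a simple spider with Scrapy, and I would like to add a mechanism for finding out what kind of content I am crawling.

For example, suppose I have a list of strings: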

The resource you are looking for has expired
The resource is not available

I have thousands of strings like these. I want to check whether the crawled content contains any of them. How can I do this in Python?

import pymssql

def process_item(self, item, spider):
    try:
        content = item['body']
        # ... how can I proceed from here?
    except pymssql.Error as e:
        print("error")

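The variable content holds the crawled data.

The options I have considered:

Using string comparison
Creating a lookup file and matching against it

But I would like to know whether there is a more efficient way to do this.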


1 answer


#1 · Posted 2024-04-27 00:52:36

Define the list of strings to check and use the built-in any() function:

terms = [
    'The resource you are looking for has expired',
    'The resource is not available'
]

has_terms = any(term in content for term in terms)
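Since the question mentions thousands of strings, one alternative to try is compiling all terms into a single regular expression, so each page is scanned with one search call instead of one substring test per term. This is not part of the original answer, only a minimal sketch; whether it is actually faster than any() depends on the terms and the content, so it is worth measuring on real data. The helper name contains_term is made up for illustration.

```python
import re

terms = [
    'The resource you are looking for has expired',
    'The resource is not available',
]

# Escape each term so it is matched literally, then join them into one
# alternation. Assumes terms is non-empty: an empty pattern would match
# every string.
pattern = re.compile('|'.join(re.escape(term) for term in terms))


def contains_term(content):
    """Return True if content contains at least one of the terms."""
    return pattern.search(content) is not None
```

For very large term lists, a dedicated multi-pattern algorithm such as Aho-Corasick may scale better than either approach.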

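Note that the terms list should be defined outside process_item(), so that it is not rebuilt every time process_item() is called. A good idea is to configure it in the project settings.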
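One way to read the terms from the project settings is the pipeline's from_crawler() class method, which Scrapy calls once when it builds the pipeline. The sketch below assumes a hypothetical setting named BLOCKED_TERMS and a hypothetical has_terms key on the item; neither name comes from the original answer.

```python
class TermCheckPipeline:
    """Flags items whose body contains any of the configured terms."""

    def __init__(self, terms):
        # The list is built once, when the pipeline is created.
        self.terms = list(terms)

    @classmethod
    def from_crawler(cls, crawler):
        # BLOCKED_TERMS is a hypothetical setting name; define it as a
        # list of strings in settings.py.
        return cls(crawler.settings.getlist('BLOCKED_TERMS'))

    def process_item(self, item, spider):
        content = item['body']
        # Assumes the item is a dict, or an Item that declares a
        # has_terms field.
        item['has_terms'] = any(term in content for term in self.terms)
        return item
```

The pipeline still has to be enabled through ITEM_PIPELINES in the project settings, like any other item pipeline.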

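Also, if you want to skip items that contain any of the defined terms, consider moving the check to the spider level. This avoids the overhead of passing those items from the spider to the pipeline.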
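A spider-level check could look roughly like the helper below, which a spider's parse callback would delegate to. It is only a sketch: the names TERMS and parse_page and the item keys url and body are assumptions for illustration, and response is expected to be a Scrapy text response exposing .text and .url.

```python
TERMS = (
    'The resource you are looking for has expired',
    'The resource is not available',
)


def parse_page(response, terms=TERMS):
    """Yield one item for the page unless its text contains a term.

    Intended to be called from a spider callback, for example:

        def parse(self, response):
            yield from parse_page(response)
    """
    content = response.text  # decoded body of the response
    if any(term in content for term in terms):
        # Skip the page: nothing is yielded, so no item reaches the pipeline.
        return
    yield {'url': response.url, 'body': content}
```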
