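I'm writing a Scrapy spider that should check whether specific strings appear in the text content of websites. I have many websites (a few thousand) and many strings to look for, which is why I use lists bound to variables in my code. Some of the lists are imported from other Python files.
The problem is that the code seems to produce positive "hits" even though, after manually checking the URL with dev tools, I cannot find the string at that URL. Below are the code and a sample of the results.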
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from list_loop import *
import re

word_to_find = 'pharmacy'

class TestSpider(CrawlSpider):
    name = 'test'
    # these are lists of a lot of domains imported from another
    # file called list_loop.py
    allowed_domains = strip_url
    start_urls = merch_url
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Here I clean up the parsed text not to include /n or whitespace.
        words = response.xpath("//a//text()").getall()
        cleaned_words = [word.strip() for word in words]
        cleaned_words = [word.lower() for word in cleaned_words if len(word) > 0]
        # Then I loop through the cleaned_words in order to find a match
        for single_word in cleaned_words:
            re.search(r'\b%s\b' % word_to_find, single_word)
            yield {
                'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
            }
        else:
            pass
The allowed_domains and start_urls lists contain alibaba.com along with many other sites. After running the spider, I get output like this:
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
{"Matching": "Found the word pharmacy in https://www.alibaba.com/?from_http=1"},
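The same happens for many other sites that do not actually have the word "pharmacy" anywhere in their content or HTML. Any idea what is going wrong here?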
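I believe you are missing an if statement. In your code, the result of re.search is discarded, so the item is yielded for every entry in cleaned_words whether or not there is a match.
I believe what you want is to yield only when re.search actually returns a match.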
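A minimal sketch of that fix, assuming the rest of the spider stays as in the question. parse_item is written here as a plain function so it can be run without Scrapy installed; in the real spider it would be the method body of TestSpider. The re.escape call is an addition not present in the question, included on the assumption that search terms may contain regex metacharacters.

```python
import re

word_to_find = 'pharmacy'


def parse_item(self, response):
    # Same extraction as in the question: text nodes inside <a> elements.
    words = response.xpath("//a//text()").getall()
    # Strip whitespace, drop empty strings, and lowercase in one pass.
    cleaned_words = [word.strip().lower() for word in words if word.strip()]

    # re.escape is an assumption added here so the term is matched literally.
    pattern = r'\b%s\b' % re.escape(word_to_find)

    for single_word in cleaned_words:
        # re.search returns None when nothing matches, so the yield
        # only runs for link texts that really contain the word.
        if re.search(pattern, single_word):
            yield {
                'Matching': 'Found the word {} in {}'.format(word_to_find, response.url)
            }
```

Note that this still yields one item per matching link text, so a page with several matching links will appear several times in the output. Also, the XPath only inspects text inside links; if the goal is the whole page text, a broader selector would be needed.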