I am building a web crawler using the code from "Introduction to Computing Using Python". What I want to do is avoid certain websites, such as Google or Yahoo, because of their size and their potential to send my crawler off to Andromeda.
So I created a self.prohibited list intended to filter out certain web pages. However, it doesn't work. Do you have any suggestions for a fix? Thanks in advance.
from urllib.request import urlopen
from csv import writer
# Collector and frequency come from the textbook's supporting module

def analyze(url):
    '''returns the list of http links
    in absolute format in the web page with URL url'''
    print('Visiting:', url)  # for testing
    # obtain links in the web page
    content = urlopen(url).read().decode()
    collector = Collector(url)
    collector.feed(content)
    urls = collector.getLink()
    # compute word frequencies
    content = collector.getData()
    freq = frequency(content)
    out = open('test.csv', 'a')
    csv = writer(out)
    # write the header row; the original print(out, 'URL', 'word', 'count')
    # printed the file object to the screen instead of writing to the file
    csv.writerow(('URL', 'word', 'count'))
    # print the frequency of every text data word in the web page
    print('\n{:50}{:10}{:5}'.format('URL', 'word', 'count'))
    for word in freq:
        row1 = (url, word, freq[word])
        print('\n{:50} {:10} {:5}'.format(url, word, freq[word]))
        csv.writerow(row1)
    print('\n{:50} {:10}'.format('URL', 'link'))
    for link in urls:
        print('\n{:50} {:10}'.format(url, link))
        row2 = (url, link)
        csv.writerow(row2)
    out.close()
    return urls
class Crawler:
    'a web crawler'

    def __init__(self):
        self.visited = set()
        self.prohibited = ['*google.com/*', '*yahoo.com/*']

    def crawl(self, url):
        '''calls analyze() on web page url
        and calls itself on every link to an unvisited web page'''
        links = analyze(url)
        self.visited.add(url)
        for link in links:
            if link not in self.visited and self.prohibited:
                try:
                    self.crawl(link)
                except:
                    pass
The condition link not in self.visited and self.prohibited is essentially equivalent to link not in self.visited, because self.prohibited always evaluates as true in that expression (it is a non-empty list). I think you should replace self.prohibited with: not any(re.match(x, link) for x in self.prohibited). For each prohibited regexp, this code checks whether the link matches it.
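One caveat with the re.match suggestion: the patterns as written ('*google.com/*') are shell-style wildcards, not valid regular expressions, and re.match would raise re.error on a leading '*'. A minimal sketch of the filter, assuming the patterns stay in wildcard form and using fnmatch instead of re (is_prohibited is a hypothetical helper name, not part of the textbook code):

```python
# Sketch: wildcard-based URL filtering with fnmatch.
# Assumes patterns like '*google.com/*' as in the question;
# with re.match you would need regex syntax such as '.*google\\.com/.*'.
from fnmatch import fnmatch

prohibited = ['*google.com/*', '*yahoo.com/*']

def is_prohibited(link, patterns=prohibited):
    'True if link matches any prohibited wildcard pattern'
    return any(fnmatch(link, p) for p in patterns)
```

The loop in crawl() would then read: if link not in self.visited and not is_prohibited(link): ...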