About Python Web

Posted 2024-04-19 20:28:34


I am building a web crawler from the code in "Introduction to Computing Using Python". I wish to avoid certain websites, such as Google or Yahoo, because of their size and their potential to take me straight to Andromeda.

So I created the self.prohibited part to screen out certain web pages. However, it doesn't work. Do you have any suggestions for fixing it? Thanks in advance.

# imports needed by this function; Collector and frequency come from the
# textbook's crawler module in "Introduction to Computing Using Python"
from urllib.request import urlopen
from csv import writer

def analyze(url):
    '''returns the list of http links
    in absolute format in the web page with URL url'''

    print('Visiting:', url)  # for testing

    # obtain links in the web page
    content = urlopen(url).read().decode()
    collector = Collector(url)
    collector.feed(content)
    urls = collector.getLink()

    # compute word frequencies
    content = collector.getData()
    freq = frequency(content)

    out = open('test.csv', 'a')
    csv = writer(out)
    csv.writerow(('URL', 'word', 'count'))  # header row; was printed to stdout by mistake

    # print the frequency of every text data word in the web page
    print('\n{:50}{:10}{:5}'.format('URL', 'word', 'count'))
    for word in freq:
        print('\n{:50} {:10} {:5}'.format(url, word, freq[word]))
        csv.writerow((url, word, freq[word]))

    print('\n{:50} {:10}'.format('URL', 'link'))
    for link in urls:
        print('\n{:50} {:10}'.format(url, link))
        csv.writerow((url, link))

    out.close()
    return urls


class Crawler:
    'a web crawler'
    def __init__(self):
        self.visited = set()
        self.prohibited = ['*google.com/*', '*yahoo.com/*']

    def crawl(self, url):
        '''calls analyze() on web page url
        and calls itself on every link to an unvisited web page'''
        links = analyze(url)
        self.visited.add(url)

        for link in links:
            if link not in self.visited and self.prohibited:
                try:
                    self.crawl(link)
                except:
                    pass
1 Answer

link not in self.visited and self.prohibited is essentially equivalent to link not in self.visited, because in this expression self.prohibited always evaluates to a truthy value (self.prohibited is a non-empty list).
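A quick sketch of that point, with hypothetical values for visited and a sample link:

```python
prohibited = ['*google.com/*', '*yahoo.com/*']
visited = set()
link = 'http://google.com/search'

# A non-empty list is truthy, so the right operand of `and` never
# blocks anything: the whole test reduces to `link not in visited`.
result = link not in visited and prohibited
print(bool(result))  # True: the prohibited link would still get crawled
```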

I think you should replace self.prohibited with: not any(re.match(x, link) for x in self.prohibited). For each prohibited regexp, this code checks whether the link matches that regexp.
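One caveat: the patterns in self.prohibited are glob-style wildcards, not regular expressions, and re.match('*google.com/*', link) would raise re.error because a regex cannot start with `*`. A minimal sketch of the same any()-based check using fnmatch, which understands `*` wildcards directly (the pattern list is copied from the question; the helper name allowed is my own):

```python
from fnmatch import fnmatch

prohibited = ['*google.com/*', '*yahoo.com/*']

def allowed(link):
    """Return True unless the link matches any prohibited pattern."""
    return not any(fnmatch(link, pattern) for pattern in prohibited)

print(allowed('http://www.google.com/search'))  # False: filtered out
print(allowed('http://example.com/page'))       # True: may be crawled
```

Inside crawl(), the condition would then read: if link not in self.visited and allowed(link).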
