Trying to create a Python web crawler

import urllib2

seed = raw_input('Enter a url : ')

def getAllNewLinksOnPage(page, prevLinks):
    response = urllib2.urlopen(page)
    html = response.read()
    links, pos, allFound = [], 0, False
    while not allFound:
        aTag = html.find("<a href=", pos)
        if aTag > -1:
            href = html.find('"', aTag+1)
            endHref = html.find('"', href+1)
            url = html[href+1:endHref]
            if url[:7] == "http://":
                if url[-1] == "/":
                    url = url[:-1]
                if not url in links and not url in prevLinks:
                    links.append(url)
                    print url
            closeTag = html.find("</a>", aTag)
            pos = closeTag+1
        else:
            allFound = True
    return links

toCrawl = [seed]
crawled = []
while toCrawl:
    url = toCrawl.pop()
    crawled.append(url)
    newLinks = getAllNewLinksOnPage(url, crawled)
    toCrawl = list(set(toCrawl) | set(newLinks))
print crawled

1 answer

User

#1 · Posted on 2024-05-16 20:07:50

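What you have implemented so far is a kind of random-order search, because you keep a set of links to crawl. (Strictly speaking, you keep a list, but you convert it to a set and back on every iteration, which scrambles whatever order it had.)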
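As a small illustration of the ordering point, here is one way to merge newly found links into the pending list without losing order. This is only a sketch: the helper name merge_preserving_order and the example URLs are made up for illustration, and it relies on dict preserving insertion order (guaranteed from Python 3.7).

```python
def merge_preserving_order(pending, new_links):
    """Return pending followed by any links from new_links not already present."""
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(list(pending) + list(new_links)))


pending = ["http://a.example", "http://b.example"]
found = ["http://b.example", "http://c.example"]
print(merge_preserving_order(pending, found))
# ['http://a.example', 'http://b.example', 'http://c.example']
```

By contrast, list(set(pending) | set(found)) gives no ordering guarantee at all, which is why the original loop visits pages in an effectively arbitrary order.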

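To turn this into a depth-first search, the usual solution is to do it recursively. Then you don't need any external storage of links still to be crawled. You do still need to keep track of the links crawled so far, both to avoid repeats and because you want to sort the links at the end (which requires having something to sort), but that's all. So: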

def crawl(seed):
    crawled = set()  # every link visited so far
    def crawl_recursively(link):
        if link in crawled:
            return
        # Record the link being visited (not the seed) before following its links.
        crawled.add(link)
        newLinks = getAllLinksOnPage(link)
        for newLink in newLinks:
            crawl_recursively(newLink)
    crawl_recursively(seed)
    return sorted(crawled)
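To see the depth-first order without touching the network, the recursive approach can be run against a small in-memory graph. Everything here is hypothetical scaffolding: FAKE_WEB, get_all_links_on_page and the visit_order list are stand-ins added for illustration, not part of the original answer.

```python
# Hypothetical in-memory "web": page -> links found on that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}


def get_all_links_on_page(link):
    """Stand-in for getAllLinksOnPage; reads FAKE_WEB instead of the network."""
    return FAKE_WEB.get(link, [])


def crawl(seed):
    crawled = set()
    visit_order = []  # recorded only to make the traversal order visible

    def crawl_recursively(link):
        if link in crawled:
            return
        crawled.add(link)
        visit_order.append(link)
        for new_link in get_all_links_on_page(link):
            crawl_recursively(new_link)

    crawl_recursively(seed)
    return visit_order, sorted(crawled)


order, pages = crawl("a")
print(order)  # ['a', 'b', 'd', 'c']
print(pages)  # ['a', 'b', 'c', 'd']
```

Note that "d" is reached through "b" before "c" is visited at all, which is the depth-first behaviour. On a real site a long chain of links can exceed Python's recursion limit, so the recursive form is best suited to small crawls.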

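If you don't want to use recursion, the alternative is to use an explicit stack of links to crawl. But you can't keep reorganizing that stack, or it stops being a stack. Again, a separate set of crawled links solves the problem of avoiding repeated lookups.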

def crawl(seed):
    crawled = set()
    to_crawl = [seed]
    while to_crawl:
        link = to_crawl.pop()
        if link in crawled:
            continue
        crawled.add(link)
        newLinks = getAllLinksOnPage(link)
        to_crawl.extend(newLinks)
    return sorted(crawled)
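Both versions of crawl call a getAllLinksOnPage helper that is not shown. The following is one possible Python 3 sketch of it, not the answerer's implementation. The LinkCollector class and extract_links function are hypothetical names; accepting https:// links as well as http://, and decoding pages as UTF-8, are assumptions on my part.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                url = value.rstrip("/")  # trim trailing slashes, similar to the question
                if url not in self.links:
                    self.links.append(url)


def extract_links(html_text):
    """Return the de-duplicated absolute links found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


def getAllLinksOnPage(link):
    # Assumption: pages decode as UTF-8; undecodable bytes are replaced.
    with urlopen(link) as response:
        html_text = response.read().decode("utf-8", errors="replace")
    return extract_links(html_text)


sample = ('<p><a href="http://example.com/">one</a> '
          '<a href="/local">two</a> '
          '<a href="http://example.com">dup</a></p>')
print(extract_links(sample))  # ['http://example.com']
```

Using a real HTML parser avoids the fragile string searching in the question, for example when an href is written with single quotes or other attributes come before it.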

Turning the stack into a queue (just change one line to to_crawl.pop(0)) makes this a breadth-first search.
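The difference between the two orders can be checked on a toy graph. This sketch uses collections.deque rather than a plain list, because list.pop(0) has to shift every remaining element while deque.popleft() does not. FAKE_WEB, get_links and the breadth_first flag are hypothetical names used only for this illustration.

```python
from collections import deque

# Hypothetical in-memory "web": page -> links found on that page.
FAKE_WEB = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}


def get_links(link):
    """Stand-in for getAllLinksOnPage."""
    return FAKE_WEB.get(link, [])


def crawl_order(seed, breadth_first):
    crawled = set()
    order = []
    to_crawl = deque([seed])
    while to_crawl:
        # Queue (oldest first) gives breadth-first; stack (newest first) gives depth-first.
        link = to_crawl.popleft() if breadth_first else to_crawl.pop()
        if link in crawled:
            continue
        crawled.add(link)
        order.append(link)
        to_crawl.extend(get_links(link))
    return order


print(crawl_order("a", breadth_first=True))   # ['a', 'b', 'c', 'd', 'e']
print(crawl_order("a", breadth_first=False))  # ['a', 'c', 'e', 'b', 'd']
```

The stack version follows the last link on each page first, so its order differs from the recursive version, which follows the first link first. Pushing reversed(newLinks) would make the two agree.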

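If you're worried about to_crawl growing too large because it fills up with repeats, you can strip them out as you go. Since you want to go depth-first, you want to strip out the copies that are stacked up for later, not the new ones. The simplest way to do this is probably with an OrderedSet (e.g., the recipe linked from the Python documentation):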

    to_crawl = OrderedSet()
    # …
        new_link_set = OrderedSet(newLinks)
        to_crawl -= new_link_set
        to_crawl |= new_link_set - crawled
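If pulling in an OrderedSet recipe is not convenient, a plain dict can play the same role on Python 3.7+, since dicts keep insertion order and popitem() removes the most recently added key. This is a swapped-in technique, not the recipe the answer refers to. FAKE_WEB, get_links and the returned visit order are hypothetical and exist only to make the sketch runnable.

```python
# Hypothetical in-memory "web": page -> links found on that page.
FAKE_WEB = {"a": ["b", "c"], "b": [], "c": ["b", "d"], "d": []}


def get_links(link):
    """Stand-in for getAllLinksOnPage."""
    return FAKE_WEB.get(link, [])


def crawl(seed):
    crawled = set()
    order = []
    to_crawl = {seed: None}  # dict used as an insertion-ordered set
    while to_crawl:
        link, _ = to_crawl.popitem()  # LIFO: takes the most recently added link
        crawled.add(link)
        order.append(link)
        for new_link in get_links(link):
            to_crawl.pop(new_link, None)   # drop the older pending copy, if any
            if new_link not in crawled:
                to_crawl[new_link] = None  # re-add it on top of the stack
    return order


print(crawl("a"))  # ['a', 'c', 'd', 'b']
```

Here "b" is pending when page "c" links to it again. The older entry is removed and "b" is pushed back on top, so to_crawl never holds the same link twice and never holds a link that has already been crawled.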
