Getting all URLs from a website with Python

Published 2024-04-23 07:25:27


I'm learning to build web crawlers and am currently working on getting all the URLs from a site. I've been playing around with it and no longer have the same code I had before, but I have been able to get all the links. My problem is the recursion: I need to do the same thing over and over for every page, and I think the recursion is where it goes wrong, even though it does exactly what the code I wrote tells it to. My code is below.

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    page = urllib2.urlopen( url ).read()
    urlList = []
    try:
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin('http://bobthemac.com', anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin('http://bobthemac.com', anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        length = len(urlList)

        for url in urlList:
            getAllUrl(url)

        return urlList
    except urllib2.HTTPError, e:
        print e

if __name__ == "__main__":
    urls = getAllUrl('http://bobthemac.com')
    for x in urls:
        print x

What I'm trying to accomplish is to get all the URLs for a site. With the program's current setup it just runs until it exhausts memory; all I want is the URLs from a single site. Does anyone have an idea how to do this? I think I have the right approach and just need a few small changes to the code.
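(A side note on the "single site" requirement: comparing the host part of each link with urlparse is a stricter check than looking for 'http://' in the href. The sameSite helper below is only an illustrative sketch built around the bobthemac.com domain used in the code above.)

import urlparse

def sameSite(link, base='http://bobthemac.com'):
    # A link belongs to the site if its host matches the base host,
    # or if it is relative (empty host) and will be joined onto the base.
    host = urlparse.urlparse(link).netloc
    return host == '' or host == urlparse.urlparse(base).netloc

print sameSite('/about.html')                   # True  - relative link
print sameSite('http://bobthemac.com/contact')  # True  - same host
print sameSite('http://example.com/elsewhere')  # False - external link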

EDIT

Below is my working code that gets all the URLs of a site; some people may find it useful. It's not the best code and needs some work, but with a bit more effort it could be quite good.

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    urlList = []
    try:
        page = urllib2.urlopen( url ).read()
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin('http://bobthemac.com', anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin('http://bobthemac.com', anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        return urlList

    except urllib2.HTTPError, e:
        urlList.append( e )

if __name__ == "__main__":
    urls = getAllUrl('http://bobthemac.com')

    fullList = []

    for x in urls:
        listUrls = getAllUrl(x)
        try:
            for i in listUrls:
                if not i in fullList:
                    fullList.append(i)
        except TypeError, e:
            print 'Woops wrong content passed'

    for i in fullList:
        print i

2 answers

I think this works:

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    try:
        page = urllib2.urlopen( url ).read()
    except:
        return []
    urlList = []
    try:
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin(url, anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin(url, anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        length = len(urlList)

        return urlList
    except urllib2.HTTPError, e:
        print e

def listAllUrl(urls):
    # iterate over a copy so that removing items from urls is safe
    for x in urls[:]:
        print x
        urls.remove(x)
        urls_tmp = getAllUrl(x)
        for y in urls_tmp:
            urls.append(y)


if __name__ == "__main__":
    urls = ['http://bobthemac.com']
    while len(urls) > 0:
        listAllUrl(urls)

In your getAllUrl function you call getAllUrl again inside the for loop; that is what makes it recursive.

Once an element has been put into urlList it is never removed, so urlList never becomes empty and the recursion never terminates.

That is why your program never finishes until it runs out of memory.
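One way to make the crawl terminate is to keep a separate set of URLs that have already been fetched and only follow links that have not been seen yet. The sketch below is only illustrative: the crawlSite name is made up, and it assumes the same urllib2/BeautifulSoup setup and the bobthemac.com start page from the question.

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def crawlSite(startUrl):
    visited = set()       # every URL that has already been fetched
    queue = [startUrl]    # URLs still waiting to be fetched
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            page = urllib2.urlopen(url).read()
        except (urllib2.HTTPError, urllib2.URLError):
            continue
        soup = BeautifulSoup(page)
        for anchor in soup.findAll('a', href=True):
            # resolve relative links against the page they appear on
            link = urlparse.urljoin(url, anchor['href'])
            # stay on the one site the question is about
            if link.startswith('http://bobthemac.com') and link not in visited:
                queue.append(link)
    return visited

if __name__ == "__main__":
    for u in crawlSite('http://bobthemac.com'):
        print u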
