How to extract a specific CSV from a web page whose HTML contains links to many CSV files

import urllib
url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
from BeautifulSoup import *

def scrapper(url,k):
    c=0
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
#. Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        y= (tag.get('href', None))
        #print ((y))
        if y == 'csv/datasets/co2.csv':
            print y
            break
        c= c+ 1
        if c is k:
            return y
    print(type(y))

for w in range(29):
    print(scrapper(url,w))

1 answer

User

#1 · Posted on 2024-06-01 02:24:05

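You are downloading and re-parsing the entire HTML page 29 times (once per loop iteration) just to get the next CSV link and check whether it is the one you want. That is very inefficient, and it is not polite to the server. Read the HTML page once, then loop over the tags to check whether each one is the tag you want. When you find it, do something with it and stop the loop to avoid unnecessary further processing, since you said you only need one specific file.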

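A second issue related to your question is that the CSV hrefs in the HTML file are relative URLs, so you have to join them with the base URL of the document they appear in. urlparse.urljoin() does exactly that.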
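As a minimal illustration of the joining step, here is a sketch that assumes Python 3, where urljoin lives in urllib.parse rather than in the Python 2 urlparse module. The relative href is the one hard-coded in the question.

```python
from urllib.parse import urljoin

# The page that contains the links (from the question)
base_url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'
# A relative href as it appears in the page's anchor tags
href = 'csv/datasets/co2.csv'

# urljoin resolves the relative href against the page's own URL
csv_url = urljoin(base_url, href)
print(csv_url)
# https://vincentarelbundock.github.io/Rdatasets/csv/datasets/co2.csv
```

The last path segment of the base URL (datasets.html) is replaced by the relative path, which is the same resolution a browser performs when you click the link.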

Not directly related to the question, but you should also try to clean up your code:

Fix the indentation (the comment on line 9).
Choose better variable names; y/c/k/w mean nothing.

The result is:

import urllib
import urlparse

from BeautifulSoup import BeautifulSoup

url = 'https://vincentarelbundock.github.io/Rdatasets/datasets.html'


def scraper(url):
    # Download and parse the page only once
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup('a')
    for tag in tags:
        # Default to '' so that anchors without an href do not break endswith()
        href = tag.get('href', '')
        if href.endswith("/co2.csv"):
            # The href is relative, so resolve it against the page URL
            csv_url = urlparse.urljoin(url, href)
            # ... do something with the csv file....
            contents = urllib.urlopen(csv_url).read()
            print "csv file size=", len(contents)
            break   # we only needed this one file, so we end the loop.

scraper(url)
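The code above targets Python 2 and BeautifulSoup 3. For readers on Python 3, the same idea (parse the page once, stop at the first matching link, join the relative href) might look like the sketch below. It is not the answerer's code: the names HrefCollector, find_csv_url and download_csv are hypothetical, and it uses the standard library's html.parser instead of BeautifulSoup so that it has no third-party dependency.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip anchors that have no href at all
                self.hrefs.append(href)


def find_csv_url(html, base_url, suffix):
    """Return the absolute URL of the first link ending in suffix, or None."""
    collector = HrefCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if href.endswith(suffix):
            # hrefs in the page are relative, so resolve against the page URL
            return urljoin(base_url, href)
    return None


def download_csv(page_url, suffix):
    """Fetch the page once, then fetch the one matching CSV (or return None)."""
    html = urlopen(page_url).read().decode('utf-8', errors='replace')
    csv_url = find_csv_url(html, page_url, suffix)
    if csv_url is None:
        return None
    return urlopen(csv_url).read()
```

With the url variable from above, download_csv(url, '/co2.csv') would return the raw bytes of the file, or None if no link matches. Keeping the link search in find_csv_url, separate from the network calls, makes it possible to check the matching logic against a small HTML string without contacting the server.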
