Extracting HTML links with Python

from collections import defaultdict
import urllib2
import re

def PrintLinks(website):
    counter = 0
    regexp_link = r'''<frame src =((http|ftp)s?://.*?)'''
    pattern = re.compile(regexp_link)
    links = [None]*len(website)
    for x in website:
        html_page = urllib2.urlopen(website[counter])
        html = html_page.read()
        links[counter] = re.findall(pattern,html)
        counter += 1
    return links

def main():
    website=["A.com","B.com","C.com"]

1 answer

User

#1 · Posted on 2024-04-25 02:16:31

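You don't need to reinvent the wheel with regular expressions. There are several excellent Python packages that can do this for you, the best known being BeautifulSoup.

Install BeautifulSoup and httplib2 with pip, then try the following: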

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

sites = ['http://www.site1.com', 'http://www.site2.com', 'http://www.site3.com']
http = httplib2.Http()

for site in sites:
    # http.request() returns a (response headers, body) pair
    headers, content = http.request(site)
    # Parse only the <iframe> tags and print each one's src next to its site
    for iframe in BeautifulSoup(content, parseOnlyThese=SoupStrainer('iframe')):
        print site + ' ' + iframe['src']
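
The snippet above targets Python 2 and the old BeautifulSoup 3 import path. For comparison, here is a rough Python 3 sketch of the same idea (let an HTML parser find the tags instead of a regex) using only the standard library's html.parser. The names FrameSrcParser, extract_frame_sources and SAMPLE_HTML are made up for illustration, and the sketch parses an inline HTML string rather than fetching a live site so that it stays reproducible. It collects both frame and iframe tags, since the question's regex looks for frame while the answer looks for iframe.

```python
from html.parser import HTMLParser


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:  # skip tags that have no src attribute
                self.sources.append(src)


def extract_frame_sources(html):
    """Return the src values of all frame/iframe tags, in document order."""
    parser = FrameSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical sample page; in practice this would be the downloaded HTML.
SAMPLE_HTML = """
<html><body>
<iframe src="http://www.example.com/a"></iframe>
<IFRAME SRC='ftp://files.example.com/b'></IFRAME>
<frame src="https://www.example.com/c">
<iframe></iframe>
</body></html>
"""

if __name__ == "__main__":
    for src in extract_frame_sources(SAMPLE_HTML):
        print(src)
```

Unlike the regex in the question, the parser does not care about the spacing around src=, the quoting style, or the case of the tag name, which is the main reason to prefer a parser here.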
