Get all URLs on a page (Python)

import re
import time
import urllib.request

fwcURL = "http://www.microsoft.com"  # URL to read
with urllib.request.urlopen(fwcURL) as response:
    mylines = response.read().decode("utf-8", errors="replace").splitlines()

print("Found URLs:")
time.sleep(1)  # Pause execution for a bit
for item in mylines:
    lowered = item.lower()  # search case-insensitively, slice the original line
    if "http://" in lowered:  # For http
        # Cut at the first ' or " after the URL, for example in href=
        print(item[lowered.index("http://"):].split("'")[0].split('"')[0])
    if "https://" in lowered:  # For https
        print(item[lowered.index("https://"):].split("'")[0].split('"')[0])  # Ditto

3 answers

User

Answer 1 · edited 2024-04-18 16:02:59

I would use lxml and do the following:

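import lxml.html

# parse() accepts a URL or a file name; getroot() returns the root <html> element
page = lxml.html.parse('http://www.microsoft.com').getroot()
# './/a' matches anchors at any depth; a bare 'a' only matches direct children of the root
anchors = page.findall('.//a')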

It's worth noting that if the links are generated dynamically (via JS or similar), you won't get them short of automating a browser in some way.

User

Answer 2 · edited 2024-04-18 16:02:59

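First, HTML is not a regular language, and no simple string manipulation like that will work on every page. You need a real HTML parser. I recommend lxml. Then it's just a matter of recursing through the tree and finding the elements you want.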
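To illustrate the point about using a real parser, here is a minimal sketch that visits every tag and collects anchor targets. It uses the standard library's html.parser rather than lxml, purely so that it runs without third-party packages. The names LinkCollector and extract_links, the sample markup, and the example.com base URL are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors with no href or an empty one
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the resolved link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page (assumption, not from the question).
sample = """
<p>Docs live at http://example.com/not-a-link (plain text, not an anchor).</p>
<A HREF='/downloads'>Downloads</A>
<a name="top">No href here</a>
<a href="https://example.org/a?x=1&amp;y=2">Absolute</a>
"""

for url in extract_links(sample, "http://example.com/"):
    print(url)
```

On this sample the sketch reports only the two real anchors: the relative one is resolved against the base URL, the entity in the second one is decoded, and the URL that merely appears in paragraph text is ignored. A line-by-line substring search cannot make that distinction.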

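Second, some pages may be dynamic, so you won't find everything in the HTML source. Google makes heavy use of JavaScript and AJAX (notice how it displays results without reloading the page).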
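A small sketch of why static parsing misses such content, assuming a made-up page where one link is in the markup and another is written by a script at runtime. The class name AnchorHrefs and the page source are hypothetical. The standard library parser treats the script body as plain text, so only the first link is found.

```python
from html.parser import HTMLParser


class AnchorHrefs(HTMLParser):
    """Record href values of <a> tags present in the static markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Hypothetical page: one link is in the markup, the other is added by a script.
page_source = """
<a href="/static">In the source</a>
<script>
  document.write('<a href="/dynamic">Added at runtime</a>');
</script>
"""

parser = AnchorHrefs()
parser.feed(page_source)
parser.close()
print(parser.hrefs)  # only the link that exists before any script runs
```

Getting the second link would require actually executing the script, which is what a browser (or a browser automation tool) does and a parser does not.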

User

Answer 3 · edited 2024-04-18 16:02:59

Try Mechanize, BeautifulSoup, or lxml.

With BeautifulSoup you can easily get at all of the HTML/XML content.

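from urllib.request import urlopen
from bs4 import BeautifulSoup  # BeautifulSoup 4 (package: beautifulsoup4)

page = urlopen("some_url")  # placeholder, replace with the page to read
soup = BeautifulSoup(page.read(), "html.parser")
# href=True skips <a> tags without an href, which would raise KeyError below
links = soup.find_all("a", href=True)
for link in links:
    print(link["href"])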

BeautifulSoup is easy to learn and understand.
