How do I parse only the links from a web page in Python?

#my current output#
http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/"
http://www.asecuritysite.com/content/icon_clown.gif" alt="if broken see alex@school.ac.uk +44(0)1314552759" height="100"
http://www.rottentomatoes.com/m/sleeper/"
http://www.rottentomatoes.com/m/sleeper/trailer/"
http://www.rottentomatoes.com/m/star_wars/"
http://www.rottentomatoes.com/m/star_wars/trailer/"
http://www.rottentomatoes.com/m/wargames/"
http://www.rottentomatoes.com/m/wargames/trailer/"
https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php"> SANS to Offer "Hacking Exposed Live"
https://www.sans.org/webcasts/archive/2013"

#I want to get this when i run the module#
http://net.tutsplus.com/tutorials/other/8-regular-expressions-you-should-know/
http://www.asecuritysite.com/content/icon_clown.gif
http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/star_wars/
http://www.rottentomatoes.com/m/star_wars/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
https://www.sans.org/press/sans-institute-and-crowdstrike-partner-to-offer-hacking-exposed-live-webinar-series.php
https://www.sans.org/webcasts/archive/2013

3 Answers

User

Answer 1 · edited 2024-05-14 14:23:00

You should not use regular expressions for parsing HTML. There are dedicated tools for this called HTML parsers.

Here is an example using BeautifulSoup and requests:

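from bs4 import BeautifulSoup
import requests

# Download the page and parse it with the standard-library HTML parser.
page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
soup = BeautifulSoup(page.content, 'html.parser')

# Print the href of every <a> tag that has one.
for link in soup.find_all('a', href=True):
    print(link.get('href'))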

Prints:

http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
...
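
Note that the desired output in the question also lists an image URL (`icon_clown.gif`), which appears to come from an `<img>` tag rather than an `<a>` tag, so a search restricted to `a` elements would not return it. If installing third-party packages is not an option, the same parser-based idea can be sketched with the standard library alone. The class name `LinkCollector`, the helper `extract_links`, and the sample markup are illustrative assumptions, not part of the original page:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Map each tag we care about to the attribute that holds its URL.
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def extract_links(html):
    """Return the link URLs found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup modelled on the question's output.
sample_html = """
<a href="http://www.rottentomatoes.com/m/sleeper/">Sleeper</a>
<img src="http://www.asecuritysite.com/content/icon_clown.gif" alt="if broken see alex@school.ac.uk" height="100">
<a name="top">anchor without href</a>
"""

for url in extract_links(sample_html):
    print(url)
```

Because the parser reads attribute values directly, the trailing quotes and the `alt`/`height` text seen in the question's current output never end up in the result.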

User

Answer 2 · edited 2024-05-14 14:23:00

Using BeautifulSoup CSS selectors:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = requests.get('http://www.soc.napier.ac.uk/~cs342/CSN08115/cw_webpage/index.html')
>>> soup = BeautifulSoup(page.content, 'html.parser')
>>> for i in soup.select('a[href]'):
...     print(i['href'])
...
http://www.rottentomatoes.com/m/sleeper/
http://www.rottentomatoes.com/m/sleeper/trailer/
http://www.rottentomatoes.com/m/wargames/
http://www.rottentomatoes.com/m/wargames/trailer/
...

User

Answer 3 · edited 2024-05-14 14:23:00

\w+://\w+\.\w+\.\w+[^"]+

Try this. See the demo:

http://regex101.com/r/hQ9xT1/31
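
For reference, here is a minimal sketch of how the pattern above could be applied from Python with `re.findall`. The helper name `extract_urls` and the sample markup are assumptions made for illustration. The pattern stops at the first double quote, which is what removes the trailing `"` from the question's current output:

```python
import re

# Pattern from the answer: a scheme, "://", three dot-separated word parts,
# then one or more characters up to (but not including) the next double quote.
URL_PATTERN = re.compile(r'\w+://\w+\.\w+\.\w+[^"]+')


def extract_urls(text):
    """Return every substring of text that matches URL_PATTERN."""
    return URL_PATTERN.findall(text)


# Hypothetical markup modelled on the question's output.
sample_html = (
    '<a href="http://www.rottentomatoes.com/m/sleeper/">Sleeper</a>\n'
    '<img src="http://www.asecuritysite.com/content/icon_clown.gif" '
    'alt="if broken see alex@school.ac.uk" height="100">\n'
    '<a href="https://www.sans.org/webcasts/archive/2013">Webcasts</a>\n'
)

for url in extract_urls(sample_html):
    print(url)
```

On the sample this yields the three URLs without quotes or attribute text. The pattern has limits worth knowing: it needs a host with at least three dot-separated parts, so `http://example.com/` is skipped, and a URL that is not followed by a double quote will match onward until the next quote or the end of the text.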
