Python：如何使用BeautifulSoup从HTML页面提取URL？

3条回答

网友

1楼 · 编辑于 2024-05-19 00:02:57

from bs4 import BeautifulSoup as BS
html = # Your HTML
soup = BS(html)
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print links.get('href')

打印内容：

http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece    
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece    
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments

网友

2楼 · 编辑于 2024-05-19 00:02:57

在处理完文档后，我按照以下方式完成了，谢谢大家的回答，我很感激他们

>>> import urllib2
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f.fp)
>>> for link in soup.select('.article-additional-info'):
...   print link.find('a').attrs['href']
... 
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article4323210.ece
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece
>>>

网友

3楼 · 编辑于 2024-05-19 00:02:57

根据你的标准，它返回三个url（不是两个）-你想过滤掉第三个吗？

基本思想是遍历HTML，只拉出类中的那些元素，然后遍历该类中的所有链接，拉出实际链接：

In [1]: from bs4 import BeautifulSoup

In [2]: html = # your HTML

In [3]: soup = BeautifulSoup(html)

In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print link.get('href')
   ...:         
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments

这将限制您只搜索那些带有article-additional-info类标记的元素，并且在其中查找所有锚（a）标记并获取它们相应的href链接。

相关问题更多 >

编程相关推荐

热门问题

热门文章

Python：如何使用BeautifulSoup从HTML页面提取URL？

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >