Python:如何使用BeautifulSoup从HTML页面提取URL?

2024-05-19 00:02:57 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个包含多个div的HTML页面,比如

<div class="article-additional-info">
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t...
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece">
<span class="arrows">»</span>
</a>
</div>

<div class="article-additional-info">
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe...
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece">
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments">
</div>

我需要得到类为article-additional-info的所有div的<a href=>值 我是新来的美女

所以我需要网址

"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"

实现这一目标的最佳方法是什么?


Tags: thetoindivcomhttpwwwarticle
3条回答
from bs4 import BeautifulSoup as BS
html = # Your HTML
soup = BS(html)
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print links.get('href')

打印内容:

http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece    
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece    
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments

在处理完文档后,我按照以下方式完成了,谢谢大家的回答,我很感激他们

>>> import urllib2
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f.fp)
>>> for link in soup.select('.article-additional-info'):
...   print link.find('a').attrs['href']
... 
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article4323210.ece
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece
>>> 

根据你的标准,它返回三个url(不是两个)-你想过滤掉第三个吗?

基本思想是遍历HTML,只拉出类中的那些元素,然后遍历该类中的所有链接,拉出实际链接:

In [1]: from bs4 import BeautifulSoup

In [2]: html = # your HTML

In [3]: soup = BeautifulSoup(html)

In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print link.get('href')
   ...:         
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments

这将限制您只搜索那些带有article-additional-info类标记的元素,并且在其中查找所有锚(a)标记并获取它们相应的href链接。

相关问题 更多 >

    热门问题