Beautiful Soup: extracting an a href from a Google search result

Posted 2024-06-07 01:09:04


A Google search returns the following HTML for the first result:

<h3 class="r"><a href="https://rads.stackoverflow.com/amzn/click/com/0470284889" rel="nofollow noreferrer" class="l vst" onmousedown="return rwt(this,'','','','1','AFQjCNEv1W9YC2jcSKYdEo2kNqBMJ-Utmg','k89K9hF4cVNpxQYHtEKiUQ','0CCoQFjAA',null,event)"><em>Quantitative Trading</em>: <em>How to Build Your Own Algorithmic</em> <b>...</b> - Amazon</a></h3>

I want to extract the link http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889 from it, but when I use Beautiful Soup to extract it with the code below

^{pr2}$

I get the following string:

/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGeA7A

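I know the link is in there, and I could parse it out by stripping the leading /url?q= and everything after the first & symbol, but I would like to know whether there is a cleaner solution.

Thanks!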


1 answer
User
#1 · Posted 2024-06-07 01:09:04

You can use a combination of urlparse.urlparse and urlparse.parse_qs (from the Python 2 urlparse module), for example:

>>> import urlparse
>>> url = '/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'
>>> data = urlparse.parse_qs(
...     urlparse.urlparse(url).query
... )
>>> data
{'ei': ['P2ycT6OoNuasiAL2ncV5'],
 'q': ['http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'],
 'sa': ['U'],
 'usg': ['AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'],
 'ved': ['0CBIQFjAA']}
>>> data['q'][0]
'http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'
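
The session above uses the Python 2 urlparse module. In Python 3 the same two functions live in urllib.parse, so the approach might be sketched as follows. The helper name extract_target and its fallback behaviour (returning the href unchanged when there is no q parameter) are assumptions made for illustration; they are not part of the original answer.

```python
from urllib.parse import parse_qs, urlparse


def extract_target(href):
    """Return the URL carried in the q parameter of a Google redirect href.

    Hypothetical helper: falls back to the href itself when no q
    parameter is present.
    """
    # Split off the query string, then parse it into {key: [values]}.
    query = parse_qs(urlparse(href).query)
    # parse_qs maps every key to a list of values; take the first q value.
    return query.get("q", [href])[0]


href = (
    "/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889"
    "&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe"
)
print(extract_target(href))
```

The href string passed in would typically be whatever Beautiful Soup returns for the tag's href attribute. Note that parse_qs also percent-decodes the values, which manual string slicing would not do.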
