How do I extract email addresses using Scrapy?

Posted 2024-05-16 11:35:59


I am trying to extract the email address of every restaurant on TripAdvisor.

I tried the following, but it always returns an empty list ([]):

response.xpath('//*[@class= "restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--89flT6"]')

The relevant snippet from the TripAdvisor page is as follows:


3 Answers

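First: your class name is wrong.

Second: that class belongs to the <div>, but the href attribute is on the <a>. The <a> is not a direct child of the <div>, so you need the descendant axis (//a):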

'//*[@class="..."]//a/@href'

(I have omitted the class name because it is too long to display.)


Instead of matching on such a long class name, you could try

'//a[contains(@href, "mailto")]/@href'

Test code:

text = '''<div class="restaurants-detail-overview-cards-LocationOverviewCard__detailLink iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem 1flT6">
<span><a href="mailto:info@canopylounge.my?subject=?">
<span class="ui_icon email restaurants-detail-overview-cards-LocationOverviewCard__detailLinkIcon T_k32"></span>
<span class="restaurants-detail-overview-cards-LocationOverviewCard__detailLinkText co3ei">Email</span>
<span class="ui_icon external-link-no-box restaurants-detail-overview-cards-LocationOverviewCard__upLinkIcon 1oVn1"></span>
</a></span>
</div>'''

import lxml.html

soup = lxml.html.fromstring(text)

print(soup.xpath('//*[@class="restaurants-detail-overview-cards-LocationOverviewCard__detailLink iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem 1flT6"]//a/@href'))
print(soup.xpath('//a[contains(@href, "mailto")]/@href'))
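
If lxml is not installed, the idea behind the second XPath (ignore class names and collect the href of every mailto link) can be sketched with the standard library's html.parser. This is a swapped-in technique rather than XPath: it checks that the href starts with "mailto:" instead of merely containing it. The class name MailtoCollector, the "contact-card" class, and the example.com link are made up for illustration.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect the href of every <a> tag whose link starts with 'mailto:'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; a value can be None
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            self.hrefs.append(href)


# Shortened, hypothetical variant of the snippet used in the test code
sample = '''<div class="contact-card">
<span><a href="mailto:info@canopylounge.my?subject=?">
<span class="ui_icon email"></span><span>Email</span>
</a></span>
<a href="https://example.com/menu">Menu</a>
</div>'''

collector = MailtoCollector()
collector.feed(sample)
print(collector.hrefs)
```

Only the mailto link is collected; the ordinary https link is skipped.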

Selector also has a .re() method for extracting data with regular expressions.

In [2]: response.xpath('//a[contains(@href, "mailto")]/@href')
Out[2]: [<Selector xpath='//a[contains(@href, "mailto")]/@href' data='mailto:info@coinopsf.com?subject=?'>]

In [3]: response.xpath('//a[contains(@href, "mailto")]/@href').get()
Out[3]: 'mailto:info@coinopsf.com?subject=?'

In [4]: response.xpath('//a[contains(@href, "mailto")]/@href').re(r'mailto:(.*)\?\w')
Out[4]: ['info@coinopsf.com']

In [5]: response.xpath('//a[contains(@href, "mailto")]/@href').re('mailto:([^?]*)')
Out[5]: ['info@coinopsf.com']
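
As an alternative to a regular expression, the address can also be pulled out of the href with the standard library's URL parser. The sketch below assumes the href is a plain mailto: link like the one shown in the session; the helper name email_from_mailto is made up for illustration. If a link lists several comma-separated recipients, they are returned as one string.

```python
from urllib.parse import unquote, urlsplit


def email_from_mailto(href):
    """Return the address part of a mailto: link, or '' if it is not one."""
    parts = urlsplit(href.strip())
    if parts.scheme.lower() != "mailto":
        return ""
    # For 'mailto:user@host?subject=x' the address is the path;
    # the query string ('subject=x') is dropped.
    return unquote(parts.path)


print(email_from_mailto("mailto:info@coinopsf.com?subject=?"))  # info@coinopsf.com
```

In a Scrapy callback this could be applied to the value returned by .get(), after checking that the value is not None.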

Here is one way to do it:

import requests
from scrapy import Selector

site_link = 'https://www.tripadvisor.com/Restaurant_Review-g60713-d11882449-Reviews-Coin_Op_Game_Room-San_Francisco_California.html'

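res = requests.get(site_link)
# Build the selector from the HTML text, since res is a requests response, not a Scrapy one
sel = Selector(text=res.text)
# Match on a stable fragment of the class name, then take the mailto link inside it
email = sel.xpath("//*[contains(@class,'LocationOverviewCard__contactItem ')]//a[contains(@href,'mailto:')]/@href").get()
# Keep only the address: drop the "mailto:" prefix and any "?subject=..." query
email = email.split("mailto:")[1].split("?")[0] if email else ""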
print(email)

Output:

info@coinopsf.com
