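I am using Scrapy and Splash to scrape values from a JavaScript-based website. The code runs fine and the spider scrapes all the values I am interested in. The problem is that it saves all of these values into a single item.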
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_splash import SplashRequest

    # TestItem is the project's Item class; its import is not shown in the question.


    class Spider(CrawlSpider):
        name = "test"
        start_urls = ["http://example.com/results"]
        rules = (
            Rule(
                LinkExtractor(restrict_xpaths='//div[contains(@class, "products")]'),
                callback="parse",
                follow=False,
            ),
        )

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, callback=self.parse, endpoint='render.html', args={'wait': 25.5})

        def parse(self, response):
            product_list = response.xpath('//div[contains(@class, "products")]').extract()
            for items in product_list:
                item = TestItem()
                item['CompanyName'] = response.xpath('').extract()
                item['Revenue'] = response.xpath('').extract()
                item['Tag'] = response.xpath('').extract()
                yield item
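I can't see what is wrong with the code above. All the fields of one entry live inside a single div, but there are multiple divs that each contain these fields. The website shows many results on one page, and I need to get the values from each of them. For example, inside the products div there are 10 different divs that each contain these fields.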
The output looks like this:
CompanyName,Tagline,Revenue
XcompanyName, YcompanyName, ZcompanyName
Xtagline, Ytagline, Ztagline
Xrevenue, Yrevenue, Zrevenue
whereas I would like it to be:
CompanyName,Tagline,Revenue
XcompanyName, Xtagline, Xrevenue
YcompanyName, Ytagline, Yrevenue
ZcompanyName, Ztagline, Zrevenue
The site's HTML:
<div class="products">
<div id="ember1" class="product ember-view"><a href="/product/NameCompany" id="ember1" class="product-link ember-view"> <div class="product-card-header">
<div id="ember1" class="product-card-logo ember-view"><img src="https://storage.googleapis.com/" id="ember1" class="product-avatar-img ember-view">
</div>
<div class="product-card-header-t">
<span class="product-card__name">NameCompany</span>
<span class="product-card__tagline">Simple</span>
</div>
</div>
<!---->
<div class="product-card-revenue">
<div class="product-card-revenue-t">
<span class="product-card-revenue-r">
$0
<span class="product-card-slash">/</span>
<span class="product-card-period">month</span>
</span>
<span class="product-revenue">
<!----> reported
</span>
</div>
</div>
</div>
Edit:
If I use extract_first() on the XPath for each field, the file has the correct format, but it only saves the information from one div and ignores the rest.
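@Umair's answer is right. Inside the loop I need to run the XPath queries on items (not item) rather than on the response object, so that each query is scoped to the div in question. The output now has the correct format.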