Scrapy: extracting <li> text when the <li> contains a <span>

Published 2024-06-16 13:00:54


I am trying to extract text from this HTML structure:

<div class="col-6 col-lg-3">
    <span class="font-weight-bold">List of Birds</span>
        <ul class="bird-forms">
            <li>Crow <span class="color">Black</span></li>
            <li>Peacock <span class="color">Multicolored</span></li>
            <li>Dove <span class="color">Multicolored</span></li>
            <li>Sparrow <span class="color">Brown</span></li>
            <li>Goose <span class="color">Multicolored</span></li>
            <li>Ostrich <span class="color">Multicolored</span></li>
        </ul>
</div>

Using Scrapy: response.css('ul.bird-forms li ::text').extract()

I want the result to look like this:

['Crow Black', 
 'Peacock Multicolored',
 'Dove Multicolored', 
 'Sparrow Brown', 
 'Goose Multicolored',
 'Ostrich Multicolored']

Instead of this:

['Crow',
 'Black', 
 'Peacock',
 'Multicolored', 
 'Dove', 
 'Multicolored', 
 'Sparrow', 
 'Brown',
 'Goose', 
 'Multicolored',
 'Ostrich', 
 'Multicolored']

3 answers

Just use XPath string(), which returns all the text inside each li as a single string:

birds = []
for li in response.xpath('//ul[@class="bird-forms"]/li'):
    bird = li.xpath('string(.)').get()
    birds.append(bird)
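
string(.) concatenates every text node under the current element in document order, which is why each li yields one string instead of two. To check that behavior without a live Scrapy response, here is a minimal sketch using only the standard library. It assumes the markup is well-formed enough for an XML parser, and it uses ElementTree's itertext() as a stand-in for string(.); it is not Scrapy API.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (well-formed, so an XML parser accepts it)
HTML = """<div class="col-6 col-lg-3">
    <span class="font-weight-bold">List of Birds</span>
    <ul class="bird-forms">
        <li>Crow <span class="color">Black</span></li>
        <li>Peacock <span class="color">Multicolored</span></li>
        <li>Dove <span class="color">Multicolored</span></li>
        <li>Sparrow <span class="color">Brown</span></li>
        <li>Goose <span class="color">Multicolored</span></li>
        <li>Ostrich <span class="color">Multicolored</span></li>
    </ul>
</div>"""

root = ET.fromstring(HTML)

# itertext() yields every text node under the element in document order;
# joining them mirrors what XPath string(.) returns for that element
birds = [
    "".join(li.itertext())
    for li in root.findall(".//ul[@class='bird-forms']/li")
]
print(birds)
```

In a spider, the loop above with li.xpath('string(.)').get() remains the thing to use; this sketch only shows why it produces one entry per li.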

We can extract the bird names and the colors separately, then merge them pair by pair:

li_tags = response.xpath(".//ul[@class='bird-forms']//li/text()").extract()
color_tags = response.xpath(".//ul[@class='bird-forms']//span[@class='color']/text()").extract()

[" ".join(entry) for entry in zip(li_tags, color_tags)]

['Crow  Black',
 'Peacock  Multicolored',
 'Dove  Multicolored',
 'Sparrow  Brown',
 'Goose  Multicolored',
 'Ostrich  Multicolored']

You need to select the li tags first, then select the text inside each li tag:

data = []
for li_tag in response.css("ul.bird-forms li"):
    data.append(" ".join(li_tag.css("*::text").extract()))

The same thing as a Python list comprehension:

data = [" ".join(x.css("*::text").extract()) for x in response.css("ul.bird-forms li")]

print(data)
# output <class 'list'>: ['Crow  Black', 'Peacock  Multicolored',
# 'Dove  Multicolored', 'Sparrow  Brown', 'Goose  Multicolored', 'Ostrich  Multicolored']
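
Note that the output above has two spaces between the name and the color, because the li text ('Crow ') already ends with a space before " ".join adds another. If the single-space form from the question is wanted, one option is to collapse whitespace after joining. A minimal sketch follows; join_fragments is a hypothetical helper name, and the hard-coded lists stand in for what li_tag.css("*::text").extract() returns for the sample HTML.

```python
def join_fragments(fragments):
    """Join the text fragments of one <li> and collapse runs of whitespace."""
    # str.split() with no argument splits on any whitespace run and drops
    # leading/trailing whitespace, so re-joining yields single spaces
    return " ".join(" ".join(fragments).split())


# Stand-in for li_tag.css("*::text").extract() on each <li> of the sample HTML
per_li_fragments = [
    ["Crow ", "Black"],
    ["Peacock ", "Multicolored"],
    ["Dove ", "Multicolored"],
    ["Sparrow ", "Brown"],
    ["Goose ", "Multicolored"],
    ["Ostrich ", "Multicolored"],
]

data = [join_fragments(fragments) for fragments in per_li_fragments]
print(data)
```

Inside a spider this would read join_fragments(li_tag.css("*::text").extract()) in place of the plain " ".join(...) call.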
