Scrapy: selecting the text between two HTML elements?

class Profiles(scrapy.Spider):
    name = 'profiles'
    allowed_domains = ['url.com']
    start_urls = ['https://www.url/profiles/']

    def parse(self, response):
        for profile in response.css('.herald-entry-content p'):
            url = response.urljoin(profile.css('a::attr(href)').extract_first())
            yield scrapy.Request(url=url, callback=self.parse_profile, dont_filter=True)

    def parse_profile(self, response):
        birth_name = response.xpath("//*[@id='post-19807']/div/div[1]/div/div[2]/div/p[1]/text()[1]").extract()
        profile = Profile(
            birth_name=birth_name
        )
        yield profile

<div class="herald-entry-content">
    Profile: Facts
    Stage Name: Any name
    Birth Name: Any name
    Birthday: July 10, 1994
    Zodiac Sign: Cancer
    Height: 178 cm
</div>

1 answer

#1 · Posted 2024-05-12 23:46:41

In this case you will probably have to fall back on regular expressions.

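Without knowing the full structure of the page it is hard to give you exactly what you need, but here is an example built on the snippet you provided: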

import scrapy

sel = scrapy.Selector(text="""
 <div class="herald-entry-content">
        <p><b>Profile: Facts<br>
        </b><br>
            <span>Stage Name:</span> Any name<br>
            <span>Birth Name:</span> Any name<br>
            <span>Birthday:</span> July 10, 1994<br>
            <span>Zodiac Sign:</span> Cancer<br>
            <span>Height:</span> 178 cm <br>
        </p>
    </div>
""")

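# Selector.re() returns a flat list of captured groups: [label, value, label, value, ...]
info = sel.re(r"<span>(.+):</span>\s(.+)<br>")
# Pair up consecutive items into (label, value) tuples and build a dict
output = dict(zip(*[iter(info)] * 2))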
print(output)

This gives you:

{'Stage Name': 'Any name', 
 'Birth Name': 'Any name', 
 'Birthday': 'July 10, 1994', 
 'Zodiac Sign': 'Cancer', 
 'Height': '178 cm '}

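The slightly cryptic dict(zip(*[iter(info)] * 2)) is an idiom for grouping a flat list into pairs: zip receives the same iterator twice, so each tuple it builds consumes two consecutive items, here a label followed by its value.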
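To make the pairing idiom less opaque, here is a minimal sketch in plain Python. The `info` list is a hand-written stand-in for what `sel.re(...)` returns, and the names `it`, `pairs` and `by_slices` are only illustrative.

```python
# Stand-in for the flat list returned by sel.re(...): label, value, label, value, ...
info = ['Stage Name', 'Any name', 'Birth Name', 'Any name', 'Birthday', 'July 10, 1994']

# One iterator passed twice: zip takes a label, then a value, from the same stream.
it = iter(info)
pairs = list(zip(it, it))

# An equivalent and arguably more readable version using slices:
# even positions are labels, odd positions are values.
by_slices = dict(zip(info[::2], info[1::2]))

# Both spellings agree with the one-liner from the answer.
assert dict(pairs) == by_slices == dict(zip(*[iter(info)] * 2))
print(by_slices)
```

The slice version copies the list twice, which hardly matters for a handful of profile fields, so choosing between the two is mostly a question of readability.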

Note that you should not need to use scrapy.Selector directly; inside your spider you should be able to do the following:

def parse_profile(self, response):
    herald_content = response.xpath('//div[@class="herald-entry-content"]')
    info = herald_content.re(r"<span>(.+):</span>\s(.+)<br>")
    # and so on from example above...
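
As one possible way to finish the "and so on" step, the sketch below uses only the standard-library `re` module on an HTML string, so it can be tried without a running spider. `extract_fields` and `FIELD_RE` are hypothetical names, not part of Scrapy, and the sample markup is assumed to match the reconstruction used earlier in this answer. The pattern is a slightly relaxed variant of the one above that also trims the whitespace around each value.

```python
import re

# Hypothetical helper pattern: matches lines shaped like "<span>Label:</span> value<br>".
FIELD_RE = re.compile(r"<span>(.+?):</span>\s*(.+?)\s*<br>")

def extract_fields(html):
    """Return a {label: value} dict with surrounding whitespace stripped."""
    # findall yields (label, value) tuples directly, so no pairing step is needed.
    return {label.strip(): value.strip() for label, value in FIELD_RE.findall(html)}

# Sample markup, assumed to resemble the real page.
sample_html = """
<div class="herald-entry-content">
    <p><b>Profile: Facts<br>
    </b><br>
        <span>Stage Name:</span> Any name<br>
        <span>Birth Name:</span> Any name<br>
        <span>Birthday:</span> July 10, 1994<br>
        <span>Height:</span> 178 cm <br>
    </p>
</div>
"""

fields = extract_fields(sample_html)
birth_name = fields.get("Birth Name")  # None if the label is missing
print(birth_name)
```

Inside parse_profile, the string could come from `herald_content.get()`, which returns None when nothing matched, so passing `herald_content.get() or ""` would be a reasonable guard.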
