Scrapy: Extracting links

Posted 2024-03-29 15:02:35

I am trying to extract links with Scrapy.

This is my scrape.py code; it is the spider file:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re

ss_base_url = "https://www.springfieldspringfield.co.uk/episode_scripts.php"

class Script(Item):
    url = Field()
    episode_name = Field()
    script = Field()

class SubtitleSpider(CrawlSpider):
    name = "scrape"
    allowed_domains = ['www.springfieldspringfield.co.uk']
    start_urls = [ss_base_url]
    rules = (
        Rule(LinkExtractor(allow=['/episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+']),
             callback="parse_script",
             follow=True),)

    def fix_field_names(self, field_name):
        field_name = re.sub(" ","_", field_name)
        field_name = re.sub(":","", field_name)
        return field_name

    def parse_script(self, response):
        x = HtmlXPathSelector(response)
        script = Script()
        script['url'] = response.url
        script['episode_name'] = "".join(x.select("//h3/text()").extract())
        script['script'] = "\n".join(x.select("//div[@class='episode_script']/text()").extract())
        return script

I am trying to extract the subtitles for every season from https://www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=bojack-horseman-2014

The subtitles are at these links:

^{pr2}$

When I run

 scrapy crawl --nolog scrape

I expect to get the links above as output, but nothing is returned. Where am I going wrong?


1 Answer
User
#1 · Posted 2024-03-29 15:02:35

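The regular expression used to match the links contains a question mark, which must be escaped to match literally; left unescaped, it makes the preceding character optional and the pattern never matches the episode URLs. Change the pattern to the following, which also targets view_episode_scripts.php rather than episode_scripts.php: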

r'\/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+'

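Also, when you run the spider with --nolog, Scrapy does not log the extracted links, so drop that flag as well and run scrapy crawl scrape instead.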
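
The effect of the unescaped question mark can be checked with Python's re module alone, without running the spider. This is a minimal sketch that assumes LinkExtractor applies each allow pattern as a regex search against the absolute URL. The sample URL and its episode id s01e01 are hypothetical, chosen only for illustration, and url_matches is a helper introduced here, not part of Scrapy.

```python
import re

# Hypothetical episode URL; the id "s01e01" is an assumption for illustration.
SAMPLE_URL = (
    "https://www.springfieldspringfield.co.uk/view_episode_scripts.php"
    "?tv-show=bojack-horseman-2014&episode=s01e01"
)

# Unescaped: "." matches any character and "p?" means "an optional p",
# so the literal "?" in the URL is never consumed and the search fails.
UNESCAPED = r"/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+"

# Escaped: "\." and "\?" match the literal dot and question mark.
ESCAPED = r"/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+"


def url_matches(pattern: str, url: str) -> bool:
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    print("unescaped:", url_matches(UNESCAPED, SAMPLE_URL))
    print("escaped:  ", url_matches(ESCAPED, SAMPLE_URL))
```

For this sample URL the unescaped pattern reports False and the escaped one reports True. Writing the patterns as raw strings (the r prefix) keeps the backslashes from being interpreted as string escapes by Python.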
