Scrapy is not returning web page text with my current syntax

Asked 2025-04-18 14:49

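I am using Python 2.7 (64-bit, from Python.org) on Windows Vista 64-bit. I have successfully written a recursive web scraper with Scrapy that parses all the text in Wikipedia articles. However, when I point the same code at a different site (the one named in the code), it does not return any body text: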

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]
    #rules = [Rule(SgmlLinkExtractor(allow=()), 
                  #follow=True),
             #Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    #]
    #rules = [
        #Rule(
            #SgmlLinkExtractor(allow=('Regions/252/Tournaments/2',)), 
            #callback='parse_item',
            #follow=True,
        #)
    #]
    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')  


execute(['scrapy','crawl','goal3'])

An example of a page I would like to scrape is:

http://www.whoscored.com/Articles/pn4gahfw90kjwje-yx7ztq/Show/Player-Focus-Potential-Change-in-System-may-Convince-Vidal-to-Leave-Juventus

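As I understand it, the code above should extract all the text found on a page and join it together. In the HTML of the example page, the text is wrapped in <p> tags, so I do not understand why this is not working. Can anyone see why this code only gives me the footer of the page?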


1 Answer

The parse_item() function is a bit messy. Here is a corrected version that extracts the text from every paragraph (p tag) and joins it together:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.markup import remove_tags


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

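    def parse_item(self, response):
        # Serialize every <p> element on the page to an HTML string.
        paragraphs = response.selector.xpath("//p").extract()
        # Strip the markup from each paragraph and join the results (UTF-8 byte strings on Python 2).
        text = "".join(remove_tags(paragraph).encode('utf-8') for paragraph in paragraphs)
        print text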

For this page, it outputs:

"There is no budget, there is money. We are in a very strong financial position. We can make big signings." Music to the ears of Manchester United fans as vice-chairman Ed Woodward confirmed the club can make big-money acquisitions in this very transfer window. In a bid to return to the summit of England’s top tier, Woodward has effectively given the green light to a spending spree that has supporters rubbing their hands with glee. Ander Herrara and Luke Shaw have arrived for a combined £59m already this summer and the carousel through the Old Trafford entrance door shows no sign of slowing down. Ángel Di María, Mats Hummels and Daley Blind, amongst others, have all been linked with a move to United, while reports suggesting midfield pitbull Arturo Vidal is set to join Louis van Gaal’s side refuse to die down.  "I’m still on holiday at the moment. Can I say I’m staying at Juve? I don’t know. On Monday I’ll talk to (Juventus manager, Massimili
...
 Contact Us | About Us | Glossary | Privacy Policy | WhoScored Ratings
            Copyright © 2014 WhoScored.com
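
To see what the "select every p element, strip the tags, join the text" step does without running a crawl, here is a minimal stand-alone sketch. It is an illustration only: it uses Python 3 and the standard library's html.parser rather than Scrapy, and the names ParagraphTextExtractor, paragraph_text and SAMPLE_HTML are made up for this example. The sample markup is invented and is not taken from whoscored.com. It also assumes every <p> is explicitly closed, which real pages do not always guarantee.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect the text inside each <p> element, dropping nested tags.

    Simplification: assumes every <p> has a matching </p>.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished paragraph strings
        self._depth = 0       # greater than 0 while inside a <p>
        self._chunks = []     # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Paragraph finished: store its text and reset the buffer.
                self.paragraphs.append("".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        # Keep text only while inside a paragraph.
        if self._depth > 0:
            self._chunks.append(data)


def paragraph_text(html, separator="\n"):
    """Return the text of all <p> elements in html, joined by separator."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.paragraphs)


# Made-up sample markup (hypothetical, not from the real site).
SAMPLE_HTML = (
    "<html><head><title>Demo</title></head><body>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>Not in a paragraph.</div>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

if __name__ == "__main__":
    print(paragraph_text(SAMPLE_HTML))
```

Joining with an empty string, as the spider above does, runs adjacent paragraphs together; passing a separator such as a newline keeps them apart, which may make long output easier to read.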

Answered 2025-04-18 by Python大师
