Web scraping error: exceptions.MemoryError

Posted 2024-04-23 11:00:22


I am trying to download data from gsmarena. The sample code below downloads the specifications of the HTC One ME from this page: "http://www.gsmarena.com/htc_one_me-7275.php".

The data on the site is organized into tables and table rows. The data format is as follows:

table header > td[@class='ttl'] > td[@class='nfo']

In the items.py file:

^{pr2}$

The spider file:

from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem

class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # extract whatever stuffs you want and yield items here
        hxs = Selector(response)
        phone = gsmArenaDataItem()
        tableRows = hxs.css("div#specs-list table")
        for tableRows in tableRows:
            phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0]
            for ttl in tableRows.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                colonSign = ": "
                commaSign = ", "
                seq = [ttl_value, colonSign, nfo_value, commaSign]
                seq = seq.join(seq)
        phone['phoneDetails'] = seq
        yield phone

However, the code fails at runtime with this error:

File "C:\Users\ajhavery\Desktop\gsmarena_data\gsmarena_data\spiders\test.py", line 26, in parse
            sequenceNew = sequenceNew.join(seq)
        exceptions.MemoryError:

The goal is to get the data in the following format:

Table row title: its data, table row title: its data, ...

For example:

Network Technology: GSM / HSPA / LTE, Network 2G bands: GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2, Battery Stand-by: Up to 598 h (2G) / Up to 626 h (3G), Battery Talk time: Up to 23 h (2G) / Up to 13 h (3G),

Update 1:

Using the code suggested by @alecxe:

def parse(self, response):
        # extract whatever stuffs you want and yield items here
        phone = gsmArenaDataItem()
        details = []

        for tableRows in response.css("div#specs-list table"):
            phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0]
            for ttl in tableRows.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                details.append('{title}: {info}'.format(title=ttl_value, info=nfo_value))
        phone['phoneDetails'] = ", ".join(details)
        yield phone

This gives the error:

File "C:\Users\ajhavery\Desktop\gsmarena_data\gsmarena_data\spiders\test.py", line 22, in parse
details.append('{title}: {info}'.format(title=ttl_value, info=nfo_value))
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
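
For context, the character u'\xa0' in this message is a non-breaking space (`&nbsp;` in HTML), which scraped table cells may contain. In Python 2, formatting a byte-string template with unicode arguments implicitly encodes them as ASCII, and that fails on this character. The sketch below reproduces the same failure explicitly in Python 3 and shows one possible way to normalize the whitespace; the sample string and the helper name `normalize_spaces` are made up for illustration and are not part of the original code.

```python
def normalize_spaces(text):
    """Replace non-breaking spaces with plain spaces and trim the ends."""
    return text.replace("\xa0", " ").strip()


# Hypothetical cell text; a cell holding only "&nbsp;" would yield "\xa0".
sample = "\xa0Up to 23 h (2G)"

# Encoding to ASCII fails on the non-breaking space, which is the same
# failure that Python 2 triggers implicitly in the traceback above.
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

cleaned = normalize_spaces(sample)
print(ascii_ok)
print(cleaned)
```

Whether to strip these characters or keep them and encode the text as UTF-8 depends on how the output will be consumed.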

1 answer
User
#1 · Posted 2024-04-23 11:00:22

Collect the phone details in a list and join() them after the loop:

def parse(self, response):
    phone = gsmArenaDataItem()
    details = []  # one "title: info" string per table row

    for tableRows in response.css("div#specs-list table"):
        # each table's header cell is stored as the name (the last table wins)
        phone['phoneName'] = tableRows.xpath(".//th/text()").extract()[0]
        for ttl in tableRows.xpath(".//td[@class='ttl']"):
            ttl_value = " ".join(ttl.xpath(".//text()").extract())
            nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
            # Python 2: encode to UTF-8 bytes so str.format does not fall back to ASCII
            details.append('{title}: {info}'.format(title=ttl_value.encode("utf-8"), info=nfo_value.encode("utf-8")))

    # join once, after all rows have been collected
    phone['phoneDetails'] = ", ".join(details)
    yield phone
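
A note on why the original version ran out of memory, judging from the traceback line `sequenceNew = sequenceNew.join(seq)`: `str.join` uses the string it is called on as the separator, so each pass copies the whole accumulated string between the four pieces of the new row, roughly tripling its length. The Python 3 sketch below contrasts that with the list-then-join pattern used above. The sample rows and the function names are assumptions made for illustration, and the starting value of the accumulated string is a guess, since the question does not show it.

```python
def self_join_lengths(rows, start=", "):
    """Track the accumulated string's length when it is reused as the separator."""
    acc = start  # assumed initial value; not shown in the question
    lengths = []
    for title, info in rows:
        seq = [title, ": ", info, ", "]
        acc = acc.join(seq)  # acc is copied between every pair of pieces
        lengths.append(len(acc))
    return lengths


def format_details(rows):
    """Collect 'title: info' strings in a list and join them once at the end."""
    details = []
    for title, info in rows:
        details.append("{title}: {info}".format(title=title, info=info))
    return ", ".join(details)


# Hypothetical rows standing in for the scraped (ttl, nfo) pairs.
rows = [("Technology", "GSM / HSPA / LTE"), ("SIM", "Dual SIM")] * 5

lengths = self_join_lengths(rows)
print(lengths)                    # each entry is at least three times the previous one
print(len(format_details(rows)))  # grows linearly with the number of rows
```

Because the length at least triples on every row, a few dozen rows are enough to exhaust memory, while the list-based version only ever holds one short string per row.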
