Parsing JavaScript with Scrapy

7 votes
2 answers
3986 views
Asked 2025-04-18 06:34

I have a piece of JavaScript like this on the page:

new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",

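I want to get "185310341". I have searched Google for hours without finding anything, so I hope you can help. How can I extract this JavaScript and get the ID?

I tried the following code: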

id = sel.search('"id":(.*?),',text).group(1)
print id

But I got:

exceptions.AttributeError: 'Selector' object has no attribute 'search'

2 Answers

6

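Besides regular expressions, an alternative is to run the script through a JavaScript parser, convert the parser's output into an XML document, and then query that document with XPath.

This approach is implemented in js2xml, which uses slimit and lxml (disclaimer: I wrote js2xml; warning: it is not very stable).

For your case, see this example scrapy shell session, which uses js2xml.jsonlike.getall():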

paul:~$ scrapy shell http://2loom.com/products/2loom-design-siyah-beyaz-kalpli
2014-05-19 16:12:00+0200 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
2014-05-19 16:12:00+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-05-19 16:12:00+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled item pipelines: 
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-19 16:12:00+0200 [default] INFO: Spider opened
2014-05-19 16:12:01+0200 [default] DEBUG: Crawled (200) <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f8552946610>
[s]   item       {}
[s]   request    <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s]   response   <200 http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x7f8552384b90>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
/usr/local/lib/python2.7/dist-packages/IPython/frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated. All its subpackages have been moved to the top `IPython` level.
  warn("The top-level `frontend` package has been deprecated. "

In [1]: scripts = response.selector.xpath('//script/text()').extract()

In [2]: import js2xml, js2xml.jsonlike

In [3]: js = js2xml.parse(scripts[-1])

In [4]: js2xml.jsonlike.getall(js)
Out[4]: 
[{'onVariantSelected': 'selectCallback',
  'product': {'available': True,
   'compare_at_price': None,
   'compare_at_price_max': 0,
   'compare_at_price_min': 0,
   'compare_at_price_varies': False,
   'content': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
   'created_at': '2013-11-29T13:37:11+02:00',
   'description': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
   'featured_image': '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
   'handle': '2loom-design-siyah-beyaz-kalpli',
   'id': 185310341,
   'images': ['//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
    '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwarte_hartjes_ak_girls.jpg?v=1389259259',
    '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_boys.jpg?v=1389259264',
    '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwartje_hartjes_ak_boys.jpg?v=1389259264'],
   'options': ['Size'],
   'price': 15900,
   'price_max': 15900,
   'price_min': 15900,
   'price_varies': False,
   'published_at': '2013-11-29T13:34:20+02:00',
   'tags': [u'2\xb7Loom',
    'Beyaz',
    'Design',
    'Ekrek',
    u'Kad\u0131n',
    'Kalpli',
    'Lacivert'],
   'title': '10. Design | Siyah & beyaz kalpli',
   'type': '2 Loom Limiteds',
   'variants': [{'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584985,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 3,
     'option1': 'XS (34-36: 1.60m-1.70m)',
     'option2': None,
     'option3': None,
     'options': ['XS (34-36: 1.60m-1.70m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-XS',
     'taxable': True,
     'title': 'XS (34-36: 1.60m-1.70m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584989,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 3,
     'option1': 'S (36-38: 1.65m-1.75m)',
     'option2': None,
     'option3': None,
     'options': ['S (36-38: 1.65m-1.75m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-S',
     'taxable': True,
     'title': 'S (36-38: 1.65m-1.75m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424584997,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 7,
     'option1': 'M (38-40: 1.70m-1.80m)',
     'option2': None,
     'option3': None,
     'options': ['M (38-40: 1.70m-1.80m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-M',
     'taxable': True,
     'title': 'M (38-40: 1.70m-1.80m)',
     'weight': 0},
    {'available': True,
     'barcode': None,
     'compare_at_price': None,
     'id': 424585001,
     'inventory_management': 'shopify',
     'inventory_policy': 'deny',
     'inventory_quantity': 7,
     'option1': 'L (40-42: 1.75m-1.85m)',
     'option2': None,
     'option3': None,
     'options': ['L (40-42: 1.75m-1.85m)'],
     'price': 15900,
     'requires_shipping': True,
     'sku': 'T01-BLWH-1-L',
     'taxable': True,
     'title': 'L (40-42: 1.75m-1.85m)',
     'weight': 0}],
   'vendor': u'2\xb7Loom'}}]

In [5]:

7

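Scrapy selectors have built-in support for regular expressions through the .re() method; see the Scrapy selectors documentation for details: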

sel.xpath('<xpath_to_find_the_element_text>').re(r'"id":(\d+)')

Here is a demonstration of this particular regular expression at work:

>>> import re
>>> s = 'new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",'
>>> re.search(r'"id":(\d+)', s).group(1)
'185310341'
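If more than the ID is needed, a different technique from the regex-only one above may help: locate the `product:` key with a regex and let the standard-library JSON decoder parse the object that follows. This is only a sketch. It assumes the object literal is strict JSON (as it appears to be in the question), and the sample string is shortened and closed by hand for illustration; `extract_product` is a hypothetical helper name.

```python
import json
import re

# Sample script text, shortened from the question and closed by hand for illustration.
script = (
    'new Shopify.OptionSelectors("product-select", { product: '
    '{"id":185310341,"title":"10. Design | Siyah \\u0026 beyaz kalpli"}, '
    'onVariantSelected: selectCallback });'
)


def extract_product(js_text):
    """Return the object literal that follows `product:` as a dict, or None if absent."""
    match = re.search(r'product:\s*', js_text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given index
    # and ignores whatever JavaScript follows it.
    product, _end = json.JSONDecoder().raw_decode(js_text, match.end())
    return product


product = extract_product(script)
print(product["id"])
print(product["title"])
```

Because the decoder handles escapes such as `\u0026`, the title comes back as ordinary text. If the literal is not valid JSON (unquoted keys, single quotes), `raw_decode` raises a `ValueError`, and a JavaScript parser such as js2xml is the better fit.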
