Extracting script data with Scrapy using regular expressions

Published 2024-05-13 20:34:50


I'm trying to use Scrapy to extract the contents of a script tag on a store locator page, but I'm a bit stuck.

In view source, the script content looks like this:

<script>
    var map_locations = [{"col_id":"1","col_postcode":"DN18 5DE","col_latitude":"53.6825556","col_longitude":"-0.438675","col_address1":"9a Market Lane","col_name":"XX","col_website":"https:\/\/branches.XX.co.uk\/barton-upon-humber\/9a-market-lane.html?type=0&stores=DN18+5DE?utm_source=directories&utm_medium=local&utm_campaign=yext&utm_content=1444","col_facebook":"https:\/\/www.facebook.com\/XXDN185DE\/","col_city":"Barton-Upon-Humber","col_state":"North Lincolnshire","col_yextid":"1444"}...
</script>

I copied the XPath and retrieved the script in the terminal with response.xpath('/html/body/script[1]/text()').

Now I'd like to parse the information in the script into separate columns and eventually load it into a CSV file.

How should I parse this information? For example, what if I only want col_postcode? I've read some other posts where people use regex and json.


1 answer
User
#1 · Posted 2024-05-13 20:34:50

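In the pattern \[(.*)\], the .* captures zero or more characters enclosed in [ ]. The match is greedy, so it runs from the first [ to the last ] on the line, which covers the whole JSON array.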
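To see what the pattern does in isolation, here is a minimal sketch that runs outside Scrapy. The `script` string is a shortened, made-up stand-in for the real script text (the second record is invented for illustration), so treat it as an assumption rather than the page's actual data.

```python
import json
import re

# Hypothetical, shortened stand-in for the script text; not the real page content.
script = (
    'var map_locations = [{"col_id":"1","col_postcode":"DN18 5DE"},'
    '{"col_id":"2","col_postcode":"AB1 2CD"}];'
)

# \[ and \] match literal brackets; (.*) greedily captures everything between
# the first "[" and the last "]" (on one line, since "." does not match
# newlines by default).
match = re.search(r"\[(.*)\]", script)

# group() with no argument returns the whole match, brackets included,
# so the result is a JSON array that json.loads can parse directly.
locations = json.loads(match.group())
postcodes = [loc["col_postcode"] for loc in locations]
print(postcodes)
```

Applied to the Scrapy response, the same pattern is used as follows: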

import json
import re

# response.xpath() returns a list of Selector objects; extract() returns the matched text as strings.
for script in response.xpath("/html/body/script[1]/text()").extract():
    # A raw string keeps the backslashes intact and avoids invalid-escape warnings.
    search_ = re.search(r"\[(.*)\]", script)
    # If several script tags are selected, only process the ones that contain a match.
    if search_:
        # group() returns the whole match, brackets included, i.e. the JSON array.
        for doc in json.loads(search_.group()):
            print(doc['col_postcode'])

Output

DN18 5DE
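Since the question also asks about separate columns and a CSV file, here is a rough sketch of that last step using only the standard library. The helper names `locations_from_script` and `write_locations_csv` and the sample string are hypothetical, and the sketch assumes every record has the same keys as the first one.

```python
import csv
import io
import json
import re


def locations_from_script(script_text):
    """Return the list of dicts embedded in the script text, or [] if no array is found."""
    match = re.search(r"\[(.*)\]", script_text)
    return json.loads(match.group()) if match else []


def write_locations_csv(locations, fileobj):
    """Write one CSV row per location; the header comes from the first record's keys."""
    if not locations:
        return
    writer = csv.DictWriter(
        fileobj,
        fieldnames=list(locations[0].keys()),
        restval="",             # fill keys missing from later records with ""
        extrasaction="ignore",  # skip keys that the first record did not have
    )
    writer.writeheader()
    writer.writerows(locations)


# Hypothetical sample standing in for the extracted script text.
sample = (
    'var map_locations = [{"col_id":"1","col_postcode":"DN18 5DE",'
    '"col_city":"Barton-Upon-Humber"}];'
)

rows = locations_from_script(sample)
buffer = io.StringIO()  # in-memory file so the sketch runs without touching disk
write_locations_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In a spider, the in-memory buffer could be replaced by a real file opened with open(path, "w", newline="", encoding="utf-8"). Alternatively, if the spider yields each dict as an item, Scrapy's built-in feed exports can write the CSV without any csv code.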
