python scrapy不会提取<script>标记

2024-05-14 14:07:25 发布

您现在位置:Python中文网/ 问答频道 /正文

我是Scrapy新手,我的代码无法从HTML响应中提取所需的脚本标记

每个请求的HTML响应都以以下结构开始,如浏览器开发人员工具中所示(例如,随机响应,省略其余HTML,因为相关数据始终位于同一个脚本标记上):

<html  xmlns="http://www.w3.org/1999/xhtml"><head><script src="/pje/a4j/g/3_3_3.Final/org/ajax4jsf/framework.pack.js" type="text/javascript"></script><script src="/pje/a4j/g/3_3_3.Final/org/richfaces/ui.pack.js" type="text/javascript"></script><link class="component" href="/pje/a4j/s/3_3_3.Finalorg/richfaces/renderkit/html/css/basic_classes.xcss/DATB/eAELXT5DOhSIAQ!sA18_" rel="stylesheet" type="text/css" /><link class="component" href="/pje/a4j/s/3_3_3.Finalorg/richfaces/renderkit/html/css/extended_classes.xcss/DATB/eAELXT5DOhSIAQ!sA18_" media="rich-extended-skinning" rel="stylesheet" type="text/css" /><link class="component" href="/pje/a4j/s/3_3_3.Final/org/richfaces/skin.xcss/DATB/eAELXT5DOhSIAQ!sA18_" rel="stylesheet" type="text/css" /><script id="org.ajax4jsf.queue_script" type="text/javascript">if (typeof A4J != 'undefined') { if (A4J.AJAX) { with (A4J.AJAX) {if (!EventQueue.getQueue('org.richfaces.queue.global')) { EventQueue.addQueue(new EventQueue('org.richfaces.queue.global',null,null)) };}}};</script><script type="text/javascript">window.RICH_FACES_EXTENDED_SKINNING_ON=true;</script><link class="user" href="/pje/stylesheet/estilos/bootstrap/bootstrap.min.css" rel="stylesheet" type="text/css" /><link class="user" href="/pje/stylesheet/dropzone/dropzone.css" rel="stylesheet" type="text/css" /><link class="user" href="/pje/stylesheet/estilos/richfaces/tema.css" rel="stylesheet" type="text/css" /><link class="user" href="/pje/stylesheet/padrao.css" rel="stylesheet" type="text/css" /><link class="user" href="/pje/stylesheet/autos-digitais.css" rel="stylesheet" type="text/css" /><script src="/pje/js/modernizr.custom.js" type="text/javascript"></script><script src="/pje/js/jquery-2.1.4.min.js" type="text/javascript"></script><script src="/pje/js/bootstrap/bootstrap.min.js" type="text/javascript"></script><script src="/pje/js/jquery.maskedinput.min.js" type="text/javascript"></script><script src="/pje/js/mousetrap/mousetrap.min.js" type="text/javascript"></script><script src="/pje/js/mousetrap/plugins/global-bind/mousetrap-global-bind.js" type="text/javascript"></script><script src="/pje/js/pje/menu.js" type="text/javascript"></script><script src="/pje/js/global.js" type="text/javascript"></script><script src="/pje/js/pje/autos-digitais.js" type="text/javascript"></script><link class="user" href="/pje/stylesheet/estilos/icomoon/style.css" rel="stylesheet" type="text/css" /><script src="/pje/js/jquery.maskMoney.js" type="text/javascript"></script><script src="/pje/js/pje.js" type="text/javascript"></script><script src="/pje/js/pjeOffice.js" type="text/javascript"></script><script src="/pje/js/signerApplet.js" type="text/javascript"></script></head><script>window.open('https://api-pjestorage.tjdft.jus.br/2021063010s/0709994-47.2021.8.07.0020-1625061173643-2414698-processo.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio-pje%2F20210630%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210630T135253Z&X-Amz-Expires=120&X-Amz-SignedHeaders=host&X-Amz-Signature=3348dc1ce55f1306d4555fb04f933af24ce5fa0b9c2540f5493a04bc83143be5');</script>
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="http://java.sun.com/jsp/jstl/core">

    <head>
        
    
    <title>0709994-47.2021.8.07.0020 &middot; Processo Judicial Eletr&ocirc;nico - 1&ordm; Grau</title>

我的目标是提取HTML响应的这一部分中始终包含的URL,在上面的示例中是:https://api-pjestorage.tjdft.jus.br/2021063010s/0709994-47.2021.8.07.0020-1625061173643-2414698-processo.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio-pje%2F20210630%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210630T135253Z&X-Amz-Expires=120&X-Amz-SignedHeaders=host&X-Amz-Signature=3348dc1ce55f1306d4555fb04f933af24ce5fa0b9c2540f5493a04bc83143be5

我使用response.css("script").extract()作为测试,打算使用re模块从提到的脚本标记中提取URL,但是,出于某种原因,Scrapy跳过了所需的标记并产生以下结果(我省略了列表的其余部分,因为我只对提取提到的URL感兴趣,但Scrapy继续下一个脚本标记,跳过了我需要的一个):

'<script src="/pje/a4j/g/3_3_3.Final/org/ajax4jsf/framework.pack.js" type="text/javascript"></script>', '<script src="/pje/a4j/g/3_3_3.Final/org/richfaces/u
i.pack.js" type="text/javascript"></script>', '<script id="org.ajax4jsf.queue_script" type="text/javascript">if (typeof A4J != \'undefined\') { if (A4J.AJAX) { with (
A4J.AJAX) {if (!EventQueue.getQueue(\'org.richfaces.queue.global\')) { EventQueue.addQueue(new EventQueue(\'org.richfaces.queue.global\',null,null)) };}}};</script>',
 '<script type="text/javascript">window.RICH_FACES_EXTENDED_SKINNING_ON=true;</script>', '<script src="/pje/js/modernizr.custom.js" type="text/javascript"></script>',
 '<script src="/pje/js/jquery-2.1.4.min.js" type="text/javascript"></script>', '<script src="/pje/js/bootstrap/bootstrap.min.js" type="text/javascript"></script>', '<
script src="/pje/js/jquery.maskedinput.min.js" type="text/javascript"></script>', '<script src="/pje/js/mousetrap/mousetrap.min.js" type="text/javascript"></script>',
 '<script src="/pje/js/mousetrap/plugins/global-bind/mousetrap-global-bind.js" type="text/javascript"></script>', '<script src="/pje/js/pje/menu.js" type="text/javasc
ript"></script>', '<script src="/pje/js/global.js" type="text/javascript"></script>', '<script src="/pje/js/pje/autos-digitais.js" type="text/javascript"></script>',
'<script src="/pje/js/jquery.maskMoney.js" type="text/javascript"></script>', '<script src="/pje/js/pje.js" type="text/javascript"></script>', '<script src="/pje/js/p
jeOffice.js" type="text/javascript"></script>', '<script src="/pje/js/signerApplet.js" type="text/javascript"></script>', '<script type="text/javascript">\n\t//<![CDA
TA[\n\t(function($){\n\t\t  $(document).ready(function() {\n\t\t\tvar selector = \'dtInicioInputDate\';\n\n\t\t\t//Seleciona elemento por id\n\t\t\tvar $input = $("in
put[id$=\'" + selector + "\']");\n\t\t\t\n\t\t\tif($input.length < 1){\n\t\t\t\t//Seleciona elemento por class\n\t\t\t\t$input = $("input" + selector);\n\t\t\t}\n\t\t
\t\n\t\t\tif (\'99/99/9999\' == \'\') {\n\t\t\t\t$input.unmask();\n\t\t\t} else {\n\t\t\t\t$input.mask(\'99/99/9999\');\n\t\t\t}\n\t\t });\n\t})(jQuery_21);\n\t//]]>\
n\t</script>'

上面列表中的最后一个元素出现在所需的脚本标记之后(从HTML响应中省略,因为它与任务无关):

'<script type="text/javascript">\n\t//<![CDATA[\n\t(function($){\n\t\t  $(document).ready(function() {\n\t\t\tvar selector = \'dtInicioInputDate\';\n\n\t\t\t//Seleciona elemento por id\n\t\t\tvar $input = $("in
put[id$=\'" + selector + "\']");\n\t\t\t\n\t\t\tif($input.length < 1){\n\t\t\t\t//Seleciona elemento por class\n\t\t\t\t$input = $("input" + selector);\n\t\t\t}\n\t\t
\t\n\t\t\tif (\'99/99/9999\' == \'\') {\n\t\t\t\t$input.unmask();\n\t\t\t} else {\n\t\t\t\t$input.mask(\'99/99/9999\');\n\t\t\t}\n\t\t });\n\t})(jQuery_21);\n\t//]]>\
n\t</script>'

我也试过使用

pattern = re.compile(r"'(https://api-pjestorage.tjdft.jus.br/.+)'")
lst_find = re.findall(pattern=pattern, string=response.text)

但它返回一个空列表,即使当我将HTML复制为字符串并尝试它时,该模式工作正常,这表明所需的脚本标记不包含在“response.text”中,原因我不明白

如何使用regex获取完整的原始HTML文本,或者如何确保response.css(或xpath)将提取所需的脚本标记

为什么Scrapy跳过了一个脚本标记,但正确地提取了所有其他脚本标记

不幸的是,我无法共享我正试图抓取的页面,因为需要登录名和密码

如果您有任何建议,我们将不胜感激。对不起,您的英语太差了


Tags: text标记orgsrcinputtypejsscript
1条回答
网友
1楼 · 发布于 2024-05-14 14:07:25

您正在从中删除的HTML页面的HTML格式似乎不正确。例如,您有两个<html>元素和两个<head>元素。这种格式错误的HTML可能会阻止scrapy找到您的脚本

解决这个问题的一个简单方法是纯粹通过字符串操作和正则表达式

  1. 仅将HTML的第一行保存到变量firstLine(在第一行中断之前\nfirstLine = response.text.split('\n')[0]
  2. 应用正则表达式:
    lst_find = re.findall(pattern=pattern, string=firstLine)
    

相关问题 更多 >

    热门问题