How do I use BeautifulSoup to get data from a var inside a <script>?

Posted 2024-03-28 16:02:16


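I'm scraping a web page with the ASSIONS library and then parsing the page source with BeautifulSoup. The soup holds a large HTML document containing many scripts. I need the ninth one from the end (index -9).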

    page_source = await session.get_page_source()
    soup = bs(page_source, 'html.parser')
    scripts = soup.find_all('script')
    script9 = scripts[-9].next

Here is script9:

    sometext;
var thumbdata = {
  thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ",chatid: "0",userid: "2088789", age:"21",city:"Cebu"},{avatar: "/p/2021-08/Cristina266/ava-1629535964.jpg", username: "Cristina266",la:"0 second ",chatid: "0",userid: "2095868", age:"26",city:"Pasig City"}  ] }; 
  var source = sometext;

Then I followed the example you shared:

    pattern = re.compile(r"var thumbdata = {\n"
                         r"(.*?);")

    m = pattern.match(script9.string)
    thumbs = json.loads(m.groups()[0])

    for thumb in thumbs:
        print(thumb)

I checked my regex and it looks correct. But when I run this code, I get an AttributeError:

AttributeError: 'NoneType' object has no attribute 'groups'

1 answer
User
#1 · Posted 2024-03-28 16:02:16

There are still a few problems with your approach:

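  1. A string passed to json.loads() has to be valid JSON; otherwise an exception is raised. For the content you want to capture, the leading { token needs to be part of the capture group. Merging your two separate patterns into one gives: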

    var thumbdata = ({\n.*?);
    


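  2. You'll notice that even with the pattern changed to capture the leading brace, the string you extract is still not valid JSON. Unlike plain old JavaScript objects, JSON requires every key name to be wrapped in double quotes, and the text you are extracting does not do this. So you need to replace the built-in JSON parser (which follows the spec strictly and will not parse this data as-is) with something like hjson, which implements a format without this restriction.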


  3. re.match() does not do what you think it does. Digging into the documentation for this method is instructive in this particular case (emphasis mine):

    Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

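    This matters because the string data in script9 does not begin with anything your pattern would count as a match. Instead, replace the call to re.match() with re.search().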
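The three points above can be checked with the standard library alone. The following is a minimal sketch, assuming a shortened stand-in for script9 (fewer fields per record than the real data); the variable names first_try, captured, and strict_json_ok are only illustrative.

```python
import json
import re

# Shortened stand-in for script9; the real string has more fields per record.
script9 = '''    sometext;
var thumbdata = {
  thumbs: [{username: "IslandGirlSearching", age:"21"}] };
  var source = sometext;
'''

# The leading "{" is inside the capture group (point 1).
pattern = re.compile(r"var thumbdata = ({\n.*?);")

# match() only tries position 0, where the text is "    sometext;" (point 3).
first_try = pattern.match(script9)
print("match():", first_try)

# search() scans the whole string and finds the assignment.
m = pattern.search(script9)
captured = m.group(1)
print("search():", captured)

# The captured text starts with "{", but its keys are unquoted, so the
# strict parser in the standard library still rejects it (point 2).
try:
    json.loads(captured)
    strict_json_ok = True
except json.JSONDecodeError as exc:
    strict_json_ok = False
    print("json.loads failed:", exc)
```

Since match() returns None here, calling .groups() on its result is what raises the AttributeError from the question.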

With the changes above and a few more tweaks, your code would look more like this:

import re
import hjson  # third-party package (pip install hjson); accepts unquoted keys

script9 = '''    sometext;
var thumbdata = {
  thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ",chatid: "0",userid: "2088789", age:"21",city:"Cebu"},{avatar: "/p/2021-08/Cristina266/ava-1629535964.jpg", username: "Cristina266",la:"0 second ",chatid: "0",userid: "2095868", age:"26",city:"Pasig City"}  ] }; 
  var source = sometext;
'''

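# Capture from the opening brace up to (not including) the first semicolon.
pattern = re.compile(r"var thumbdata = ({\n.*?);")

# search() scans the whole string; match() would only try position 0.
m = pattern.search(script9)
thumbs = list(hjson.loads(m.group(1)).items())
print(thumbs)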


Output:

[('thumbs', [OrderedDict([('avatar', '/i/nophoto.jpg'), ('username', 'IslandGirlSearching'), ('la', '0 second '), ('chatid', '0'), ('userid', '2088789'), ('age', '21'), ('city', 'Cebu')]), OrderedDict([('avatar', '/p/2021-08/Cristina266/ava-1629535964.jpg'), ('username', 'Cristina266'), ('la', '0 second '), ('chatid', '0'), ('userid', '2095868'), ('age', '26'), ('city', 'Pasig City')])])]
('thumbs', [OrderedDict([('avatar', '/i/nophoto.jpg'), ('username', 'IslandGirlSearching'), ('la', '0 second '), ('chatid', '0'), ('userid', '2088789'), ('age', '21'), ('city', 'Cebu')]), OrderedDict([('avatar', '/p/2021-08/Cristina266/ava-1629535964.jpg'), ('username', 'Cristina266'), ('la', '0 second '), ('chatid', '0'), ('userid', '2095868'), ('age', '26'), ('city', 'Pasig City')])])
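If installing hjson is not an option, one possible fallback is to quote the bare keys with a regular expression and then hand the result to the standard json module. This is a swapped-in technique, not part of the answer above, and only a rough sketch: quote_bare_keys is a hypothetical helper, the sample text is shortened, and the substitution does not understand string boundaries, so it would corrupt any value that itself contains something like `, word:`.

```python
import json
import re

# Shortened stand-in for the text captured by the regex (hypothetical sample).
captured = '''{
  thumbs: [{avatar: "/i/nophoto.jpg", username: "IslandGirlSearching",la:"0 second ", age:"21",city:"Cebu"}  ] }'''

# A bare identifier that follows "{" or "," and is followed by ":".
BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)\s*:')


def quote_bare_keys(js_object_text):
    """Wrap unquoted object keys in double quotes (naive, see caveat above)."""
    return BARE_KEY.sub(r'\1"\2":', js_object_text)


data = json.loads(quote_bare_keys(captured))
for thumb in data["thumbs"]:
    print(thumb["username"], thumb["age"], thumb["city"])
```

For anything beyond simple, well-behaved data like this, a tolerant parser such as hjson remains the safer choice.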
