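Getting a string from page source

2 answers

User

#1 · edited 2024-05-11 03:33:30

I want to share some other approaches that use Beautiful Soup. It may have some advantages over simply using regular expressions, because it parses the page data much as a real web browser would.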

# Sample content based on the format of <https://pastebin.com/raw/YGPupvjj>
content = '''
<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Fake Page</title>
    <script type="text/javascript">
    (function() { var xyz = 'Some other irrelevant script block'; })();
    </script>
  </head>
  <body>
    <p>Dummy body content</p>
    <script type="text/javascript">
        window._sharedData = {
            "entry_data": {
                "PostPage": [{
                    "graphql": {
                        "shortcode_media": {
                            "edge_media_to_tagged_user": {
                                "edges": [{
                                    "node": {
                                        "user": {
                                            "full_name": "John Doe",
                                            "id": "132389782",
                                            "is_verified": false,
                                            "profile_pic_url": "https://example.com/something.jpg",
                                            "username": "johndoe"
                                        }
                                    }
                                }]
                            }
                        }
                    }
                }]
            }
        };
    </script>
  </body>
</html>
'''

Alternatively, if you want to try this with the actual page data, you can fetch it:

import requests
# requests.get returns a response object; .content holds the raw body as bytes
response = requests.get('https://pastebin.com/raw/YGPupvjj')
content = response.content

Parse the web content with Beautiful Soup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

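Beautiful Soup gives us easy access to the <script> block that contains the data, but it only returns it as a string. It cannot parse JavaScript. Here are two ways to extract the data.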
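To make the "only a string" point concrete, here is a minimal sketch that uses the standard library's html.parser as a stand-in for Beautiful Soup (a swapped-in technique, so the example runs without third-party packages). The class name ScriptCollector and the one-line sample page are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
            self.scripts.append('')

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive as plain text; the HTML parser never interprets them.
        if self.in_script:
            self.scripts[-1] += data


# Assumed one-line sample page, much smaller than the content shown earlier.
sample = '<html><body><script>window._sharedData = {"a": 1};</script></body></html>'
collector = ScriptCollector()
collector.feed(sample)
collector.close()
script_text = collector.scripts[0]
print(type(script_text).__name__, script_text)
```

Whichever HTML parser is used, the result is the same kind of value: an opaque string of JavaScript source that still has to be taken apart by other means.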

Method #1: Use a regular expression to find the JSON data, parse it with the Python json library, and search the loaded JSON data

import json
import re

# Search JSON data recursively and yield any dict item value with
# key "profile_pic_url"
def search(d):

    if isinstance(d, list):
        for x in d:
            yield from search(x)
        return

    if not isinstance(d, dict):
        return

    url = d.get('profile_pic_url')
    if url:
        yield url

    for v in d.values():
        yield from search(v)


for script_block in soup.find_all('script'):

    if not script_block.string:
        continue

    m = re.fullmatch(r'(?s)\s*window\._sharedData\s*=\s*({.*\});\s*', script_block.string)

    if m is not None:
        data = json.loads(m.group(1))
        for x in search(data):
            print(x)
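As a variation on Method #1, the end of the object does not have to be located by a regular expression. The sketch below, which is an alternative and not the code above, lets json.JSONDecoder.raw_decode find where the JSON value ends. The function name extract_assigned_json and the shortened sample script are assumptions for illustration, and like the regex version it only works when the assigned value is strict JSON.

```python
import json


def extract_assigned_json(script_text, marker='window._sharedData'):
    """Return the JSON value assigned to `marker`, or None if it is absent."""
    start = script_text.find(marker)
    if start == -1:
        return None
    brace = script_text.find('{', start)
    if brace == -1:
        return None
    # raw_decode parses exactly one JSON value starting at index `brace`
    # and ignores whatever follows it (here, the trailing semicolon).
    data, _end = json.JSONDecoder().raw_decode(script_text, brace)
    return data


# Assumed, shortened stand-in for the text of a <script> block.
script_text = '''
    window._sharedData = {"user": {"profile_pic_url": "https://example.com/something.jpg"}};
'''
data = extract_assigned_json(script_text)
print(data['user']['profile_pic_url'])
```

The returned dict can be passed to the same recursive search function used in Method #1.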

Method #2: Use pyjsparser to parse the JavaScript `<script>` block, and search the parsed syntax tree for the literal key

import pyjsparser

# Search the syntax tree recursively and yield value of
# JS Object property with literal key "profile_pic_url"
def search(d):

    if isinstance(d, list):
        for x in d:
            yield from search(x)
        return

    if not isinstance(d, dict):
        return

    # .get() avoids a KeyError on dicts that carry no 'type' entry
    if d.get('type') == 'ObjectExpression':
        for p in d['properties']:
            if (p['key']['type'] == 'Literal'
                    and p['value']['type'] == 'Literal'
                    and p['key']['value'] == 'profile_pic_url'):
                yield p['value']['value']
            yield from search(p['key'])
            yield from search(p['value'])
        return

    for k, v in d.items():
        yield from search(v)

for script_block in soup.find_all('script'):

    if not script_block.string:
        continue

    try:
        code = pyjsparser.parse(script_block.string)
    except pyjsparser.JsSyntaxError:
        continue

    for found in search(code):
        print(found)

User

#2 · edited 2024-05-11 03:33:30

Solution

You can use a regular expression (regex) to do this. You need to import re, then use the following to get a list of all the video_urls:

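import re
# Raw string so that \s reaches the regex engine unchanged.
# content should be a str; if it is bytes, use content.decode() instead of str(content).
re.findall(r'"video_url":\s*"(.[^\s]*)"\s', str(content))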

Dummy data

# suppose this is the text in your "content"
content = '''
"video_url":"https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com\u0026_nc_cat=109\u0026_nc_ohc=waOdsa3MtFcAX83adIS\u0026oe=5E8413A8\u0026oh=d6ba6cb583afd7f341f6844c0fd02dbf"

jhasbvvlb
duyd7f97tyqubgjn ] \
f;vjnus0fjgr9eguer
Vn d[sb]-u54ldb 
"video_url":  -
"video_url": "https://www.google.com"
'''

Code

The following then gives you a list of video URLs:

import re
re.findall(r'"video_url":\s*"(.[^\s]*)"\s', content)

Output:

['https://scontent-lhr8-1.cdninstagram.com/v/t50.2886-16/90894630_221502022556337_2214905061309385826_n.mp4?_nc_ht=scontent-lhr8-1.cdninstagram.com&_nc_cat=109&_nc_ohc=waOdsa3MtFcAX83adIS&oe=5E8413A8&oh=d6ba6cb583afd7f341f6844c0fd02dbf',
 'https://www.google.com']
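One caveat: the output above shows & only because Python converts the \u0026 escapes while reading the dummy string literal. In text fetched from a real page, the escapes would most likely still be literal. The sketch below is one possible way to handle that, under the assumption that each value is a valid JSON string literal. The example.com URL and the variable names are made up for illustration.

```python
import json
import re

# Raw string: the \u0026 escapes stay literal, as they would in fetched page source.
raw = r'''
"video_url":"https://example.com/v.mp4?a=1\u0026b=2"
"video_url":  -
"video_url": "https://www.google.com"
'''

# Capture the whole quoted value, including the quotes, allowing backslash escapes.
pattern = re.compile(r'"video_url":\s*("(?:[^"\\]|\\.)*")')

# Each capture is a complete JSON string literal, so json.loads decodes its escapes.
urls = [json.loads(m) for m in pattern.findall(raw)]
print(urls)
# → ['https://example.com/v.mp4?a=1&b=2', 'https://www.google.com']
```

Capturing the quotes along with the value is what lets json.loads do the unescaping, so no hand-written replacement table is needed.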

References

I also encourage you to read more about using regular expressions in Python.

See: https://developers.google.com/edu/python/regular-expressions
