Getting CDATA from BeautifulSoup in Python

Posted 2024-05-23 16:14:17


I have HTML source that contains a CDATA section, and that section holds some information I want.

When I try the following:

switch_url = switch_soup.find_all(text=re.compile(('Switches')))

I get this output:

['//<![CDATA[\n    "url":"https://xxxx.meraki.com/xxxxxxx/n/xxxxx/manage/nodes/list","name":"Switches","admin_only":false},{"is_current":false,"url":"https://nxx.meraki.com/xxxxx/n/xxxxx/manage/configure/switchports","name":"Switch ports","admin_only":false},{"is_current":false,"url":"https://xxxx.meraki.com/Dormitory/n/xxxxxxx/manage/configure/dhcp_servers"//]]>\n  ']

How can I get the "Switches" URL out of the CDATA output, i.e. "https://xxxx.meraki.com/xxxxxxx/n/xxxxx/manage/nodes/list"?

Thanks in advance!


1 answer
User
#1 · Posted 2024-05-23 16:14:17

This is what you need:

import re
from bs4 import BeautifulSoup

# source.html contains your HTML above
with open('source.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# First text node whose content mentions "CDATA"
cdata = soup.find(string=re.compile("CDATA"))
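
Once the CDATA text is in hand (the `cdata` value above), the "Switches" URL still has to be pulled out of it. The fragment shown in the question is not complete JSON, so a regular expression is probably the simplest route. The following is a minimal sketch, not a definitive solution: `find_url_by_name` is a hypothetical helper, the sample string is shortened from the question's output, and it assumes each `"url"` value sits directly before its `"name"` value, as it does in that output.

```python
import re

# Sample text shaped like the output in the question (placeholder hosts kept as-is).
cdata = (
    '//<![CDATA[\n'
    '    "url":"https://xxxx.meraki.com/xxxxxxx/n/xxxxx/manage/nodes/list",'
    '"name":"Switches","admin_only":false},'
    '{"is_current":false,'
    '"url":"https://nxx.meraki.com/xxxxx/n/xxxxx/manage/configure/switchports",'
    '"name":"Switch ports","admin_only":false}'
    '//]]>\n  '
)


def find_url_by_name(text, name):
    """Hypothetical helper: return the "url" value that directly
    precedes "name":"<name>" in text, or None if there is no match."""
    # Assumption: "url" is immediately followed by "name" in each entry.
    pattern = r'"url":"([^"]+)","name":"' + re.escape(name) + r'"'
    match = re.search(pattern, text)
    return match.group(1) if match else None


switch_url = find_url_by_name(cdata, "Switches")
print(switch_url)
```

Because the pattern requires the closing quote right after the name, "Switches" does not accidentally match the "Switch ports" entry. If the key order in the real page differs, the pattern would need adjusting.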

Or you can try this:

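# Drop <script> and <style> elements; note this also discards any CDATA
# that lives inside a <script> tag
for script in soup(['script', 'style']):
    script.decompose()

text = soup.get_text()
# Strip each line, split on double spaces, and drop empty chunks
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)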
