Updating an array in an HTML file with BeautifulSoup

Posted 2024-04-30 01:30:09


In a script I run with Python, I want to open a local HTML file that has a JavaScript array inside a <script> tag, and I want to update that array.

Here is my test code:

from bs4 import BeautifulSoup

html = '''
<script>
    var myArray = [
        {'name':'Michael', 'age':'30', 'birthdate':'11/10/1989'},
        {'name':'Mila', 'age':'32', 'birthdate':'10/1/1989'},
        {'name':'Paul', 'age':'29', 'birthdate':'10/14/1990'},
        {'name':'Dennis', 'age':'25', 'birthdate':'11/29/1993'},
        {'name':'Tim', 'age':'27', 'birthdate':'3/12/1991'},
        {'name':'Erik', 'age':'24', 'birthdate':'10/31/1995'},
    ]
    
    buildTable(myArray)
   '''

soup = BeautifulSoup(html, 'lxml')
scripts = soup.find_all('script')  # successfully captures the <script> element
for script in scripts:
    print(script)

I don't know how to select the myArray variable and update it with another variable that I have in my script.


1 Answer
User
#1 · Posted 2024-04-30 01:30:09

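You can't select the myArray variable directly, because it is JavaScript and BeautifulSoup only parses HTML. Everything inside <script> is therefore treated as raw text.

This means that to update the <script> tag you need something like a regex, for example: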

from bs4 import BeautifulSoup
import re

newArray = [
    {'name':'Bobby', 'age':'29', 'birthdate':'11/11/1988'}
]


html = '''
<script>
    var myArray = [
        {'name':'Michael', 'age':'30', 'birthdate':'11/10/1989'},
        {'name':'Mila', 'age':'32', 'birthdate':'10/1/1989'},
        {'name':'Paul', 'age':'29', 'birthdate':'10/14/1990'},
        {'name':'Dennis', 'age':'25', 'birthdate':'11/29/1993'},
        {'name':'Tim', 'age':'27', 'birthdate':'3/12/1991'},
        {'name':'Erik', 'age':'24', 'birthdate':'10/31/1995'},
    ]
    
    buildTable(myArray)
   '''

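soup = BeautifulSoup(html, 'lxml')
scripts = soup.find_all('script')  # every <script> element in the document
for script in scripts:
    # script.string is None for empty or external (src=...) scripts, so skip those
    if script.string and 'myArray' in script.string:
        # Keep "myArray = " (group 1) and swap the bracketed array for str(newArray)
        script.string = re.sub(r"(myArray.*)\[[^\]]*\]", r"\1" + str(newArray), script.string)

print(soup)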

# <html><head><script>
#    var myArray = [{'name': 'Bobby', 'age': '29', 'birthdate': '11/11/1988'}]
#
#    buildTable(myArray)
#   </script></head></html>
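
The snippet above relies on str(newArray), which happens to be valid JavaScript for this data but would not be for values such as True or None. A slightly more defensive variant might look like the sketch below. It is only a sketch: replace_js_array is a hypothetical helper name, not part of BeautifulSoup or of the answer above, and it assumes the array literal contains no nested square brackets. It works on the plain script text, so it needs only the standard library.

```python
import json
import re


# Hypothetical helper (not from the original answer): swaps the array literal
# assigned to `var_name` inside a piece of JavaScript source text.
def replace_js_array(js_source, var_name, new_items):
    # Match "<var_name> = [ ... ]" up to the first closing bracket.
    # Assumption: the array literal contains no nested "[" or "]".
    pattern = re.compile(r"(\b" + re.escape(var_name) + r"\s*=\s*)\[[^\]]*\]")
    # json.dumps produces JSON, which is also a valid JavaScript array literal.
    literal = json.dumps(new_items)
    # A function replacement keeps backslashes in the data from being
    # interpreted as group references by re.sub.
    return pattern.sub(lambda m: m.group(1) + literal, js_source, count=1)


js = """
    var myArray = [
        {'name':'Michael', 'age':'30', 'birthdate':'11/10/1989'},
    ]

    buildTable(myArray)
"""
new_array = [{"name": "Bobby", "age": "29", "birthdate": "11/11/1988"}]
updated = replace_js_array(js, "myArray", new_array)
print(updated)
```

Inside the loop from the answer, this could be used as script.string = replace_js_array(script.string, "myArray", newArray). To change the local file itself, the result of str(soup) would still have to be written back to disk.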
