Deleting an HTML div with Python

0 votes
3 answers
3025 views
Asked 2025-04-17 21:28

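I am trying to use the BeautifulSoup library in Python to delete a div from an HTML page by its id. I also need to add an attribute to a specific tag in the same page.

My code looks like this:

Original HTML: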

<html>
    <head>
    </head>
    <body>
        <div class="my_class">Div wanted with a new added attribute</div>
        <div id="to_delete">
            Parental div which I want to delete, that contains two other divs, one of which containing a table too.
            <div></div>
            <div>
                <table></table>
            </div>
        </div>
    </body>
</html>

Desired final HTML:

<html>
    <head>
    </head>
    <body>
        <div class="my_class" id="my_new_id">Wanted div, with a new attribute</div>
    </body>
</html>

My Python code:

from bs4 import BeautifulSoup

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

html_data = open("index.html").read()

old_wanted_div = '''<div class="my_class"'''
new_wanted_div = '''<div class="my_class" id="my_new_id"'''

soup = BeautifulSoup(html_data)
old_unwanted_div = soup.find("div", attrs={"id": "to_delete"})
old_unwanted_div_str = '''{}'''.format(str(old_unwanted_div))
new_unwanted_div = ''' '''

reps = {old_wanted_div:new_wanted_div, old_unwanted_div_str:new_unwanted_div}

new_html = replace_all(html_data, reps)

f = open("index.html", "w")
f.write(new_html)
f.close()

This code adds the attribute successfully, but it does not delete the unwanted div, and I cannot see where the problem is.

3 Answers

0

Would this solve the problem?

import re

# oldhtml is assumed to hold the page source as a string
newhtml = re.sub(r'<div id="to_delete">.*body>', '</body>', oldhtml, flags=re.DOTALL)

0

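Thank you very much for your help and replies!

I went with the approach Sudeep Juvekar proposed, which is close to the code I had already written, but thanks to bsoist all the same!

It took me a while to get it working.

I ran into this error: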

TypeError: expected a character buffer object

I resolved it with the help of this resource.

Altogether, the code that now works looks like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import shlex, subprocess
from subprocess import Popen, PIPE

# adding of a new attribute into the wanted DIV
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text
old_html = open("index.html", "r")
old_data = old_html.read()
old_html.close()
old_wanted_div = '''<div class="my_class"'''
new_wanted_div = '''<div class="my_class" id="my_new_id"'''
replacements = {old_wanted_div:new_wanted_div}
new_data_1 = replace_all(old_data, replacements)
f = open("index.html", 'w')
f.write(new_data_1)
f.close()

# script to delete the DIV with id="to_delete", written in another python file
py_del_div = """from bs4 import BeautifulSoup
old_html = open("index.html", "r")
old_data = old_html.read()
old_html.close()
soup = BeautifulSoup(old_data)
old_div_unwanted = soup.find_all("div", attrs={"id": "to_delete"})
new_div_unwanted = old_div_unwanted[0].replace_with("")
new_data_2 = str(soup)
new_file = open("index.html", "w")
new_file.write(new_data_2)
new_file.close()
exit()"""

py_script = open ("index.py", 'w')
py_script.write(py_del_div)
py_script.close()
py1_cmd = "pythonw ./index.py"
html_1 = shlex.split(py1_cmd)
subprocess1 = subprocess.Popen(html_1, shell=False)
subprocess1.wait()
subprocess1.terminate()

exit()

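Unfortunately, for now I have had to split the Python code, because the HTML is generated by a subprocess. The replacement seemed to start before the HTML had been generated, which caused this error: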

IOError: [Errno 2] No such file or directory: './index.html'

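So I decided to split the script, put the part that deletes the div into a separate Python script, and run that as a subprocess...

If anyone knows a simpler way, please share it.

In any case, thanks to Sudeep Juvekar!

Regards,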
Riccardo
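
On the ordering problem described in this answer: if the page really is produced by a child process, one option may be to block until that process has exited before opening the file, instead of moving the edit into a second script. The sketch below is written for Python 3 and only illustrates the sequencing. The generator command is a hypothetical stand-in (a one-line child process that writes index.html into a temporary directory), because the real generator is not shown in the thread.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the real generator: a child process that writes index.html.
workdir = tempfile.mkdtemp()
html_path = os.path.join(workdir, "index.html")
generator_code = (
    "import sys\n"
    "with open(sys.argv[1], 'w') as f:\n"
    "    f.write('<html><body><div id=\"to_delete\"></div></body></html>')\n"
)
generator_cmd = [sys.executable, "-c", generator_code, html_path]

# check_call blocks until the child exits and raises CalledProcessError
# if the exit status is non-zero.
subprocess.check_call(generator_cmd)

# The child has finished, so the generated file can be opened now.
with open(html_path) as f:
    generated_html = f.read()

print(os.path.exists(html_path))
```

With the file guaranteed to exist at this point, the div removal and the attribute change can run in the same script right after the `check_call` line.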

2

BeautifulSoup lets you replace elements in the HTML directly, so there is no need to manipulate strings.
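
One plausible reason the string-based approach in the question finds nothing to replace is that str() of a tag is BeautifulSoup's own serialization, which need not match the source text character for character. The sketch below (Python 3, using the built-in "html.parser") illustrates this with a hypothetical snippet whose attribute is single-quoted. Whether that is what happened with the asker's real page is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical source: the id attribute uses single quotes.
html_data = "<body><div id='to_delete'>unwanted</div><p>kept</p></body>"

soup = BeautifulSoup(html_data, "html.parser")
unwanted = soup.find("div", id="to_delete")

# BeautifulSoup re-serializes the tag and writes attribute values in double quotes.
serialized = str(unwanted)
print(serialized)

# The serialized form is not a substring of the source,
# so str.replace has nothing to match and returns the text unchanged.
found_in_source = serialized in html_data
unchanged = html_data.replace(serialized, " ") == html_data
print(found_in_source, unchanged)
```

Editing the parsed tree and then serializing the whole soup avoids this mismatch entirely.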

To replace the element with the id to_delete, first find it in the soup.

tg = soup.find_all(attrs={"id": "to_delete"})
print(tg)
out:
     [<div id="to_delete">
        Parental div which I want to delete, that contains two other divs, one of which containing a table too.
        <div></div>
        <div>
          <table></table>
        </div>
      </div>]

This returns a list of results. You can then replace a result with replace_with.

tg[0].replace_with("")

This returns the element that was replaced and also performs the replacement inside the soup.
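
If the goal is only to remove the element rather than to put something in its place, the tag's decompose() method is an alternative to replacing it with an empty string. A minimal sketch for Python 3, assuming the built-in "html.parser" and a shortened version of the page from the question:

```python
from bs4 import BeautifulSoup

html_data = """<body>
<div class="my_class">keep me</div>
<div id="to_delete">outer<div></div><div><table></table></div></div>
</body>"""

soup = BeautifulSoup(html_data, "html.parser")

unwanted = soup.find(id="to_delete")
if unwanted is not None:   # find returns None when nothing matches
    unwanted.decompose()   # removes the tag together with all of its children

remaining_divs = soup.find_all("div")
print(len(remaining_divs))
```

The None check guards against pages where the div is missing, which indexing into the find_all result with [0] would not.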

print(soup)
out: <html>
     <head>
     </head>
     <body>
       <div class="my_class">Div wanted with a new added attribute</div>
     </body>
     </html>

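After the deletion you can change the id of the first div in a similar way, for example soup.div["id"] = "new_id". HTML attributes are set with item assignment on the tag rather than with attribute access. For more on replace_with, see the documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/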
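
The attribute change on its own can be sketched as follows. This is written for Python 3 with the built-in "html.parser", and it uses the class name and the new id from the question.

```python
from bs4 import BeautifulSoup

html_data = '<body><div class="my_class">Div wanted</div></body>'

soup = BeautifulSoup(html_data, "html.parser")

# Look the div up by its class; class_ is used because class is a Python keyword.
wanted = soup.find("div", class_="my_class")
if wanted is not None:
    wanted["id"] = "my_new_id"   # item assignment adds or overwrites the HTML attribute

new_html = str(soup)
print(new_html)
```

The resulting string can then be written back to the file with open("index.html", "w"), as in the question.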
