Deleting an HTML div with Python
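I am trying to use the BeautifulSoup library in Python to delete a div from an HTML page by its id. I also need to add an attribute to a specific tag in the same page.
Here is what I have:
Original HTML: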
<html>
<head>
</head>
<body>
<div class="my_class">Div wanted with a new added attribute</div>
<div id="to_delete">
Parental div which I want to delete, that contains two other divs, one of which containing a table too.
<div></div>
<div>
<table></table>
</div>
</div>
</body>
</html>
Desired final HTML:
<html>
<head>
</head>
<body>
<div class="my_class" id="my_new_id">Wanted div, with a new attribute</div>
</body>
</html>
My Python code:
from bs4 import BeautifulSoup
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text
html_data = open("index.html").read()
old_wanted_div = '''<div class="my_class"'''
new_wanted_div = '''<div class="my_class" id="my_new_id"'''
soup = BeautifulSoup(html_data)
old_unwanted_div = soup.find("div", attrs={"id": "to_delete"})
old_unwanted_div_str = '''{}'''.format(str(old_unwanted_div))
new_unwanted_div = ''' '''
reps = {old_wanted_div:new_wanted_div, old_unwanted_div_str:new_unwanted_div}
new_html = replace_all(html_data, reps)
f = open("index.html", "w")
f.write(new_html)
f.close()
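This code successfully adds the attribute, but it does not delete the unwanted div, and I don't understand what is going wrong.
3 Answers
Would this solve the problem?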
import re
newhtml = re.sub(re.compile('<div id="to_delete">.*body>',re.DOTALL),'</body>',oldhtml)
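A self-contained variant of the regex approach, for illustration only. It assumes, as in the question, that the unwanted div is the last element before </body>; the sample string and the helper name remove_trailing_div are hypothetical. Regular expressions are brittle for HTML in general, so treat this as a sketch rather than a general solution.

```python
import re

# Sample input mirroring the question's HTML (assumed for illustration).
old_html = """<html>
<head>
</head>
<body>
<div class="my_class">Div wanted with a new added attribute</div>
<div id="to_delete">
Parental div which I want to delete.
<div></div>
<div>
<table></table>
</div>
</div>
</body>
</html>"""

# re.DOTALL lets "." match newlines, so the pattern spans from the opening
# tag of the unwanted div up to and including "</body>".
pattern = re.compile(r'<div id="to_delete">.*</body>', re.DOTALL)


def remove_trailing_div(html):
    """Cut from the to_delete div through </body>, then put </body> back."""
    return pattern.sub("</body>", html)


new_html = remove_trailing_div(old_html)
print(new_html)
```

This only works because nothing that should be kept sits between the unwanted div and </body>; any such content would be removed as well.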
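Thank you very much for your help and replies!
I tried the approach suggested by Sudeep Juvekar, which is quite similar to the code I had written before, but thanks to bsoist as well!
It took me a while to get it working.
I ran into this error:
TypeError: expected a character buffer object
It was resolved with the help of this resource.
All in all, the code that now works looks like this: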
#!/usr/bin/python
# -*- coding: utf-8 -*-
import shlex, subprocess
from subprocess import Popen, PIPE
# adding of a new attribute into the wanted DIV
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text
old_html = open("index.html", "r")
old_data = old_html.read()
old_html.close()
old_wanted_div = '''<div class="my_class"'''
new_wanted_div = '''<div class="my_class" id="my_new_id"'''
replacements = {old_wanted_div:new_wanted_div}
new_data_1 = replace_all(old_data, replacements)
f = open("index.html", 'w')
f.write(new_data_1)
f.close()
# script to delete the DIV with id="to_delete", written in another python file
py_del_div = """from bs4 import BeautifulSoup
old_html = open("index.html", "r")
old_data = old_html.read()
old_html.close()
soup = BeautifulSoup(old_data)
old_div_unwanted = soup.find_all("div", attrs={"id": "to_delete"})
new_div_unwanted = old_div_unwanted[0].replace_with("")
new_data_2 = str(soup)
new_file = open("index.html", "w")
new_file.write(new_data_2)
new_file.close()
exit()"""
py_script = open ("index.py", 'w')
py_script.write(py_del_div)
py_script.close()
py1_cmd = "pythonw ./index.py"
html_1 = shlex.split(py1_cmd)
subprocess1 = subprocess.Popen(html_1, shell=False)
subprocess1.wait()
subprocess1.terminate()
exit()
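Unfortunately, for now I have had to split the Python code up, because the HTML is generated by a subprocess and the replacement seems to start before the HTML has been generated, which causes this error: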
IOError: [Errno 2] No such file or directory: './index.html'
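So I decided to split the script, put the part that deletes the div into a separate Python script, and run that script as a subprocess.
If anyone knows a simpler way, please share it.
In any case, thanks to Sudeep Juvekar!
Regards,
Riccardo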
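The IOError described above suggests the HTML file is being read before the generating subprocess has finished writing it. One possible way to avoid splitting the script, sketched below, is to block until the generator exits and only then post-process the file. The generator command here is a stand-in (a one-line Python child process that writes a file), and the file name generated.html is hypothetical; the real generator is not shown in the post.

```python
import os
import subprocess
import sys
import tempfile

# Work in a temporary directory so the sketch does not touch real files.
work_dir = tempfile.mkdtemp()
html_path = os.path.join(work_dir, "generated.html")

# Stand-in for the real HTML generator (assumption): a child Python
# process that writes a small page to the path given as its argument.
generator_cmd = [
    sys.executable,
    "-c",
    "import sys, pathlib; "
    "pathlib.Path(sys.argv[1]).write_text('<div class=\"my_class\">x</div>')",
    html_path,
]

# check_call blocks until the child exits and raises CalledProcessError
# on a non-zero exit status, so the file is complete once it returns.
subprocess.check_call(generator_cmd)

with open(html_path) as f:
    html_data = f.read()

# The same string replacement the post uses to add the id attribute.
new_html = html_data.replace(
    '<div class="my_class"', '<div class="my_class" id="my_new_id"'
)

with open(html_path, "w") as f:
    f.write(new_html)
```

With the generator awaited in this way, the div removal could run in the same script right after the file is read, instead of being written out to a second file and launched separately.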
BeautifulSoup lets you replace elements in the HTML directly, so there is no need to manipulate strings.
To replace the element with the id to_delete, first find it in soup.
tg = soup.find_all(attrs={"id": "to_delete"})
print tg
out:
[<div id="to_delete">
Parental div which I want to delete, that contains two other divs, one of which containing a table too.
<div></div>
<div>
<table></table>
</div>
</div>]
This returns a list of results. You can then call replace_with on the result to replace it.
tg[0].replace_with("")
This returns the element that was replaced, and the replacement also takes effect in soup itself.
print soup
out: <html>
<head>
</head>
<body>
<div class="my_class">Div wanted with a new added attribute</div>
</body>
</html>
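After the deletion you can change the id of the first div in a similar way. Attributes are set with dictionary-style access, e.g. soup.div["id"] = "new_id"; writing soup.div.id = "new_id" only sets a Python attribute on the object and does not change the HTML. For more about replace_with, see the documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/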
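Putting both steps together, here is a minimal sketch that removes the div and adds the attribute in one pass, without any string replacement. It assumes Python 3 with the bs4 package installed and uses the built-in html.parser. decompose() is used here in place of replace_with("") as another way to remove a tag together with its children, and the inline sample HTML stands in for index.html.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the contents of index.html (assumption).
html_data = """<html>
<head>
</head>
<body>
<div class="my_class">Div wanted with a new added attribute</div>
<div id="to_delete">
Parental div which I want to delete.
<div></div>
<div>
<table></table>
</div>
</div>
</body>
</html>"""

# Naming the parser explicitly avoids depending on which one is installed.
soup = BeautifulSoup(html_data, "html.parser")

# Remove the unwanted div and everything nested inside it.
unwanted = soup.find("div", id="to_delete")
if unwanted is not None:
    unwanted.decompose()

# HTML attributes are set with dictionary-style access on the tag.
wanted = soup.find("div", class_="my_class")
if wanted is not None:
    wanted["id"] = "my_new_id"

new_html = str(soup)
print(new_html)
```

Because the edits are made on the parsed tree, the result does not depend on how the original file was formatted, which is where matching against a serialized string can go wrong.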