Posted 2024-04-19 12:36:33
User
I'd like to know how to remove all HTML tags together with their contents using BeautifulSoup.
Input:
... text <strong>ha</strong> ... text
Output:
... text ... text
This is written for XML; if you want to use it for HTML, change the import from BeautifulStoneSoup to BeautifulSoup.
try:
    # Using bs4 (BeautifulStoneSoup is a deprecated alias there)
    from bs4 import BeautifulStoneSoup
    from bs4 import Tag
except ImportError:
    # Using bs3
    from BeautifulSoup import BeautifulStoneSoup
    from BeautifulSoup import Tag


def info_extract(isoup):
    '''
    Recursively walk a nested list and, upon finding a non-iterable,
    return its string
    '''
    tlist = []

    def info_extract_helper(inlist, count=0):
        if isinstance(inlist, list):
            for q in inlist:
                if isinstance(q, Tag):
                    # Descend into the tag's children, one level deeper
                    info_extract_helper(q.contents, count + 1)
                else:
                    # Text node: keep it only if non-empty and nested deep enough
                    extracted_str = q.strip()
                    if extracted_str and (count > 1):
                        tlist.append(extracted_str)

    info_extract_helper([isoup])
    return tlist


xml_str = '''
<?xml version="1.0" encoding="UTF-8"?>
<first-tag>
  <second-tag>
    <events-data>
      <event-date someattrib="test">
        <date>20040913</date>
      </event-date>
    </events-data>
    <events-data>
      <event-date>
        <date>20040913</date>
      </event-date>
    </events-data>
  </second-tag>
</first-tag>
'''

soup = BeautifulStoneSoup(xml_str)
print(info_extract(soup))
Use replace_with() (or replaceWith()):
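from bs4 import BeautifulSoup

text = "text <strong>ha</strong> ... text"

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(text, "html.parser")
for tag in soup.find_all('strong'):
    # Replace each <strong> element, including its contents, with an empty string
    tag.replaceWith('')

print(soup.get_text())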
Prints:
text ... text
Alternatively, as @mata suggested, you can use tag.decompose() instead of tag.replaceWith(''). It produces the same result but reads as the more appropriate call.
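A minimal sketch of the decompose() variant, assuming bs4 is installed and using the built-in "html.parser". The helper name strip_tags_with_content and its tag_names parameter are made up for illustration; they are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def strip_tags_with_content(html, tag_names):
    """Return the text of `html` with the named tags and their contents removed.

    Hypothetical helper for illustration; `tag_names` is a list of tag names.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of names and returns every matching element
    for tag in soup.find_all(tag_names):
        # decompose() removes the tag and everything inside it from the tree
        tag.decompose()
    return soup.get_text()


result = strip_tags_with_content("... text <strong>ha</strong> ... text", ["strong"])
print(result)
```

The whitespace that surrounded the removed tag stays in the text, so you may want to normalize it afterwards, for example with " ".join(result.split()).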