Convert a file of HTML entities to Unicode (using BeautifulSoup and Python?)

Posted 2024-04-26 15:15:19


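I have installed Python 2.7.13, pip and BeautifulSoup on Windows 10. I want to convert a large file containing HTML entities into Unicode characters, but I don't know how to do it (I don't know much Python). The file content looks like this: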

<b>&#947;&#941;&#961;&#969;&#957;</b>, <i>&#959;&#957;&#964;&#959;&#962;, &#8001;</i>, Wurzel <i>&#915;&#917;&#929;</i>, verwandt mit <i>&#947;&#941;&#961;&#945;&#962;, &#947;&#949;&#961;&#945;&#961;&#972;&#962;, &#947;&#949;&#961;&#945;&#953;&#972;&#962;</i>

I can do small jobs with EmEditor (using Edit > Encode/Decode Selection > HTML/XML Character Reference to Unicode), but it is far too slow for converting a large file.

I would be happy with any (offline) solution.


3 answers
import bs4

html = '''<b>&#947;&#941;&#961;&#969;&#957;</b>, <i>&#959;&#957;&#964;&#959;&#962;, &#8001;</i>, Wurzel <i>&#915;&#917;&#929;</i>, verwandt mit <i>&#947;&#941;&#961;&#945;&#962;, &#947;&#949;&#961;&#945;&#961;&#972;&#962;, &#947;&#949;&#961;&#945;&#953;&#972;&#962;</i>'''

soup = bs4.BeautifulSoup(html, 'lxml')

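Output: printing `soup` shows the same markup with every entity decoded, for example `<b>γέρων</b>` in place of `<b>&#947;&#941;&#961;&#969;&#957;</b>`.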

Document

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))  # you can open your file here

soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.

This is HTML-encoded; try the following:

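# Python 2.7 (on Python 3, use html.unescape() instead of HTMLParser().unescape())
import io
from HTMLParser import HTMLParser

h = HTMLParser()
# Read and write as text (UTF-8 assumed) so the decoded characters can be saved;
# writing the unicode result to a plain open(..., 'w') file raises UnicodeEncodeError
with io.open("myfile.txt", encoding="utf-8") as f:
    new_file_content = h.unescape(f.read())
with io.open("newfile.txt", "w", encoding="utf-8") as new_file:
    new_file.write(new_file_content)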
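
For a large file, the same unescaping can be done one line at a time so the whole file never has to sit in memory. This is only a sketch under some assumptions: it targets Python 3, where `html.unescape` takes the place of `HTMLParser().unescape`; it assumes the file is UTF-8 and that no entity is split across a line break; `convert_file` is a hypothetical helper name, not part of any library.

```python
import html
import io


def convert_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, replacing HTML entities with Unicode characters."""
    # convert_file is a hypothetical helper written for this example.
    with io.open(src_path, encoding=encoding) as src, \
            io.open(dst_path, "w", encoding=encoding) as dst:
        for line in src:  # stream line by line to keep memory use low
            dst.write(html.unescape(line))


# Example call, using the file names from the answer above:
# convert_file("myfile.txt", "newfile.txt")
```

Tags such as `<b>` and `<i>` are left as they are; only the entity references are replaced.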

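There is a built-in method for this, called .decode(); just add it to the end of the line when you read the file in. (Strictly speaking, .decode() is a method of Python byte strings rather than of BeautifulSoup, and it only turns bytes into Unicode text; numeric entities such as &#947; still have to be unescaped separately.)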

Example:

site_read = site_download.read().decode('utf-8')
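
To see what `.decode('utf-8')` does and does not do, here is a small sketch, assuming Python 3 and its standard `html.unescape`; the sample bytes are taken from the question. Decoding turns bytes into text but leaves the entity references in place, and a separate unescape step replaces them.

```python
import html

raw = b"<b>&#947;&#941;&#961;&#969;&#957;</b>"  # bytes, as read from a file or download

text = raw.decode("utf-8")       # bytes -> str; the entities are still there
unescaped = html.unescape(text)  # entities -> Unicode characters

# Check whether any numeric entity reference remains in each string
print("&#" in text, "&#" in unescaped)  # prints: True False
```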
