Converting a file of HTML entities to Unicode (using BeautifulSoup and Python?)

γέρων, οντος, ὁ, Wurzel ΓΕΡ, verwandt mit γέρας, γεραρός, γεραιός

3 answers

User

Answer 1 · edited 2024-04-26 15:15:19

import bs4

html = '''<b>&#947;&#941;&#961;&#969;&#957;</b>, <i>&#959;&#957;&#964;&#959;&#962;, &#8001;</i>, Wurzel <i>&#915;&#917;&#929;</i>, verwandt mit <i>&#947;&#941;&#961;&#945;&#962;, &#947;&#949;&#961;&#945;&#961;&#972;&#962;, &#947;&#949;&#961;&#945;&#953;&#972;&#962;</i>'''

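soup = bs4.BeautifulSoup(html, 'lxml')
print(soup.get_text())  # the entities are decoded while parsing

Output:

γέρων, οντος, ὁ, Wurzel ΓΕΡ, verwandt mit γέρας, γεραρός, γεραιός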

From the documentation:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))  # you can open your file here

soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:
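
The quoted documentation stops before its example, so here is a minimal sketch of the behavior it describes. It assumes bs4 is installed and uses the standard-library "html.parser" backend, which is my choice for the illustration (the answer above used 'lxml'); the sample string is a shortened version of the one in this answer.

```python
from bs4 import BeautifulSoup

# Entity-encoded markup, shortened from the string used in the answer.
html_text = "<b>&#947;&#941;&#961;&#969;&#957;</b>, Wurzel <i>&#915;&#917;&#929;</i>"

# "html.parser" ships with Python, so no extra parser package is required.
soup = BeautifulSoup(html_text, "html.parser")

# get_text() drops the tags; the entities were already decoded during parsing.
plain = soup.get_text()
print(plain)

# str(soup) keeps the tags but emits the characters instead of the entities.
markup = str(soup)
print(markup)
```

Use get_text() if you only want the words, and str(soup) if the <b> and <i> tags should survive the conversion.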

User

Answer 2 · edited 2024-04-26 15:15:19

The text is HTML-entity encoded. Try the following:

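import html  # Python 3.4+; on Python 2 this was HTMLParser().unescape

# Read the entity-encoded file and replace every entity with its character.
with open("myfile.txt", encoding="utf-8") as f:
    new_file_content = html.unescape(f.read())

# Write the result explicitly as UTF-8 so the non-ASCII characters survive.
with open("newfile.txt", "w", encoding="utf-8") as new_file:
    new_file.write(new_file_content)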
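
In Python 3 the unescaping step is available as html.unescape in the standard library (HTMLParser.unescape was removed in Python 3.9). As a rough sketch of what that step does to a string, with sample inputs I made up from the word in the question:

```python
import html

# Decimal, hexadecimal, and named references all resolve to characters.
decimal_form = html.unescape("&#947;&#941;&#961;&#969;&#957;")
hex_form = html.unescape("&#x3b3;&#x3ad;&#x3c1;&#x3c9;&#x3bd;")
named_form = html.unescape("&gamma;&epsilon;&rho;")

print(decimal_form)
print(hex_form)
print(named_form)

# unescape makes a single pass, so doubly escaped input needs a second call.
once = html.unescape("&amp;#947;")
twice = html.unescape(once)
print(once, twice)
```

If the converted file still shows entities like &#947;, the source was probably escaped twice, and running the conversion a second time should clear it.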

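User

Answer 3 · edited 2024-04-26 15:15:19

Python has a built-in method for decoding the bytes you read, called .decode() (it belongs to bytes objects, not to BeautifulSoup). Just append it to the end of the line where you read the file! Note that it only turns bytes into text; entities such as &#947; are left as they are and still need to be unescaped afterwards.

Example: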

site_read = site_download.read().decode('utf-8')
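
To make the difference concrete, here is a small sketch that decodes the bytes and then unescapes the entities as two separate steps. The helper name bytes_to_text and the sample bytes are hypothetical, chosen only for illustration; html.unescape is from the Python 3 standard library.

```python
import html


def bytes_to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes to str, then replace HTML entities with characters."""
    text = raw.decode(encoding)   # bytes -> str; entities are still present
    return html.unescape(text)    # "&#947;" -> the character itself


# Stand-in for what site_download.read() might return.
raw = b"<b>&#947;&#941;&#961;&#969;&#957;</b>"

decoded_only = raw.decode("utf-8")
converted = bytes_to_text(raw)

print(decoded_only)
print(converted)
```

decode() on its own leaves the &#...; sequences untouched, so it is the unescape call (or parsing with BeautifulSoup) that actually produces the Greek characters.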
