Tags being converted to HTML entities?

Posted 2024-05-20 00:38:29

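I want to use BeautifulSoup to parse some messy HTML. One such page is http://f10.5post.com/forums/showthread.php?t=1142017

It turns out that, first, the tree is missing a large chunk of the page. Second, tostring(tree) converts tags such as <div> on half of the page into HTML entities such as &lt;div&gt;. For example:

Original: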

<div class="smallfont" align="center">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>

tostring(tree) gives:

&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;
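
To make the symptom concrete, here is a small sketch using only the standard `html` module from Python 3 (an assumption on my part, since the post itself uses Python 2). It shows that the escaped output is the same markup with `<` and `>` replaced by entities; the variable names are illustrative.

```python
import html

original = (
    '<div class="smallfont" align="center">All times are GMT -4. '
    'The time now is <span class="time">02:12 PM</span>.</div>'
)

# Escape only &, < and >; quote=False leaves the double quotes untouched,
# which matches the escaped output quoted in the question.
escaped = html.escape(original, quote=False)
print(escaped)

# Unescaping recovers the markup, so nothing is lost. This suggests the
# parser treated that part of the page as plain text rather than as tags.
assert html.unescape(escaped) == original
```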

Here is my code:

from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)

print soup

Thanks.

1 Answer

#1 · Posted 2024-05-20 00:38:29

Use beautifulsoup4 with the extremely lenient html5lib parser:

import urllib2
from bs4 import BeautifulSoup  # NOTE: importing beautifulsoup4 here

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")

print soup
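
For readers on Python 3, a rough port of the same idea might look like the sketch below. It assumes the `beautifulsoup4` package is installed and uses `html5lib` when it is available. The fallback to the built-in `html.parser`, the helper names `pick_parser` and `parse_lenient`, and the sample markup are my own additions for illustration, not part of the answer.

```python
from bs4 import BeautifulSoup


def pick_parser():
    """Prefer html5lib (very lenient); fall back to the stdlib parser."""
    try:
        import html5lib  # noqa: F401  (only checking that it is installed)
        return "html5lib"
    except ImportError:
        return "html.parser"


def parse_lenient(markup):
    """Hypothetical helper: parse markup with the most lenient parser found."""
    return BeautifulSoup(markup, pick_parser())


# Deliberately sloppy sample markup: the <span> is never closed.
sample = (
    '<div class="smallfont" align="center">The time now is '
    '<span class="time">02:12 PM.</div>'
)

soup = parse_lenient(sample)
div = soup.find("div", class_="smallfont")
rendered = str(soup)

print(div.get_text())
print(rendered)
```

To try it on the thread from the question, the object returned by `urllib.request.urlopen(...)` (the Python 3 counterpart of `urllib2.urlopen`) can be passed in place of `sample`.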
