Handling Indian languages in BeautifulSoup

3 votes

2 answers

1642 views

Asked on 2025-04-17 13:15

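I am trying to scrape news headlines from the NDTV website. This is the page I am using as the HTML source. I am using BeautifulSoup (bs4) to process the HTML, and everything else works fine, but my code breaks when it reaches the Hindi headlines on the page.

My code so far is: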

import urllib2
from bs4 import BeautifulSoup

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")
fptr.seek(0)

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll( 'li')
for link_tag in li:
   hypref = link_tag.find('a').contents[0]
   strhyp = str(hypref)
   fptr.write(strhyp)
   fptr.write("\n")

The error I get is:

Traceback (most recent call last):
  File "./ScrapeTemplate.py", line 30, in <module>
    strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

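I get the same error even if I leave out the from_encoding parameter. I initially used fromEncoding, but Python warned me that it is deprecated.

How can I fix this? From what I have read, I need to either skip the Hindi headlines or explicitly encode them as non-ASCII text, but I do not know how to do that. Any help would be greatly appreciated!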


2 Answers

strhyp = hypref.encode('utf-8')

Here is a link to an article about Unicode: http://joelonsoftware.com/articles/Unicode.html.
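As a possible alternative to encoding every string by hand, the output file can be opened with an explicit encoding so that unicode text is written directly. This is only a minimal sketch: the `headlines` list and the `write_headlines` helper are hypothetical stand-ins for the strings BeautifulSoup returns, and the temporary directory is there just to keep the example self-contained. It uses `io.open`, which should behave the same way on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-ins for the text of the <a> tags. NavigableString derives
# from the unicode string type, so plain unicode strings behave the same way here.
headlines = [
    u"NDTV",
    u"\u0939\u093f\u0928\u094d\u0926\u0940 \u0938\u092e\u093e\u091a\u093e\u0930",
]


def write_headlines(path, titles):
    """Write one headline per line; io.open encodes the text as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as fptr:
        for title in titles:
            fptr.write(title + u"\n")


# Write to a scratch directory, then read the file back to check the round trip.
out_path = os.path.join(tempfile.mkdtemp(), "NDTV_2012_01.txt")
write_headlines(out_path, headlines)

with io.open(out_path, "r", encoding="utf-8") as fptr:
    lines = fptr.read().splitlines()
```

With the file opened this way, no call to str() is needed, so the implicit ASCII conversion that raised the UnicodeEncodeError never happens.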

Answered on 2025-04-17 by Python大师

What you are seeing is an instance of NavigableString (which derives from Python's unicode type):

(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)

You need to convert it to UTF-8 with:

hypref.encode('utf-8')
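To illustrate why str() fails while encode('utf-8') works, here is a small sketch that uses a plain unicode string in place of the NavigableString. The `title` value is a made-up example, not text taken from the NDTV page. In Python 2, str() on a unicode object implicitly encodes with the ASCII codec, which is simulated here with an explicit encode("ascii") so the snippet behaves the same on Python 3.

```python
# -*- coding: utf-8 -*-

# Hypothetical Hindi text, six code points long, standing in for hypref.
title = u"\u0928\u092e\u0938\u094d\u0924\u0947"

# The ASCII codec cannot represent Devanagari characters, so this raises
# the same UnicodeEncodeError that str(hypref) triggers in Python 2.
try:
    title.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every code point; each of these takes three bytes.
utf8_bytes = title.encode("utf-8")

# Decoding the bytes with the same codec recovers the original text.
round_trip = utf8_bytes.decode("utf-8")
```

The resulting byte string is what can safely be passed to fptr.write() on a file opened in the default Python 2 mode.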

Answered on 2025-04-17 by Python大师
