python - Trouble with string encoding and curly (accented) quotes/apostrophes

Asked 2025-04-17 17:05

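I have a simple RSS feed script that takes the content of each article, runs some simple processing on it, and saves it to a database.

The problem is that after the text is processed, all of the curly (accented) apostrophes and quotation marks have been stripped out.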

from BeautifulSoup import BeautifulSoup  # assumes BeautifulSoup 3 on Python 2
import unicodedata

# this is just an example string, I use feed_parser to download the feeds
string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>"""

text = BeautifulSoup(string)
# does some simple soup processing

string = text.renderContents()
string = string.decode('utf-8', 'ignore')
string = string.replace('<html>','')
string = string.replace('</html>','')
string = string.replace('<body>','')
string = string.replace('</body>','')
string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore')
print "".join([x for x in string if ord(x)<128])

This results in:

> <p>  </p><p>This is a sentence. This is a sentence. I'm a programmer. Im a programmer, however I dont graphic design.</p>

All of the HTML-entity quotes and apostrophes are stripped out. How can I fix this?
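To see where the characters go, here is a minimal Python 3 sketch of the same normalize-then-filter steps. It is not the script from the question: it assumes the standard library's `html.unescape` as a stand-in for entity decoding, and the variable names are made up for illustration. NFKD folds the non-breaking space (U+00A0) into a plain space, but the right single quotation mark (U+2019) has no ASCII decomposition, so the `ord(ch) < 128` filter silently drops it.

```python
import html
import unicodedata

# Shortened version of the example string from the question.
raw = "&#160;I&#8217;m a programmer, however I don&#8217;t graphic design."

# Assumption: html.unescape stands in for whatever decodes the entities.
decoded = html.unescape(raw)  # &#160; -> U+00A0, &#8217; -> U+2019

# NFKD turns U+00A0 into a regular space but leaves U+2019 unchanged.
normalized = unicodedata.normalize("NFKD", decoded)

# Keeping only code points below 128 discards every remaining U+2019.
ascii_only = "".join(ch for ch in normalized if ord(ch) < 128)

print(ascii_only)
```

The apostrophes survive normalization and only disappear at the final filter, so dropping that filter (or replacing the characters before it runs) is what keeps them.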

Tags: text processing, string encoding, database, HTML entities, RSS feeds

1 Answer

The code below works for me. You probably overlooked the convertEntities argument that the BeautifulSoup constructor needs:

from BeautifulSoup import BeautifulSoup  # assumes BeautifulSoup 3 on Python 2

string = """&#160; <p>This is a sentence. This is a sentence. I'm a programmer. I&#8217;m a programmer, however I don&#8217;t graphic design.</p>"""

# Note the convertEntities argument: it tells the parser to convert HTML entities
text = BeautifulSoup(string, convertEntities=BeautifulSoup.HTML_ENTITIES)
# does some simple soup processing

string = text.renderContents()
string = string.decode('utf-8')
# remove the wrapper tags
string = string.replace('<html>','')
string = string.replace('</html>','')
string = string.replace('<body>','')
string = string.replace('</body>','')
# I don't know why you are doing this
#string = unicodedata.normalize('NFKD', string).encode('utf-8', 'ignore')
print string
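If the database really does need ASCII-only text, one option is to map the typographic punctuation to ASCII equivalents instead of filtering it out. The following is only a sketch in Python 3 under stated assumptions: `PUNCTUATION_TO_ASCII` and `to_ascii_punctuation` are hypothetical names, the mapping covers just a few common characters, and `html.unescape` from the standard library is used in place of BeautifulSoup's convertEntities.

```python
import html

# Hypothetical mapping; extend it to cover the punctuation your feeds contain.
PUNCTUATION_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00a0": " ",   # non-breaking space
})


def to_ascii_punctuation(text):
    """Decode HTML entities, then swap typographic punctuation for ASCII.

    Intended for text content: html.unescape also decodes &lt; and &amp;,
    so running it over full markup would change the markup itself.
    """
    return html.unescape(text).translate(PUNCTUATION_TO_ASCII)


sample = "&#160;I&#8217;m a programmer, however I don&#8217;t graphic design."
cleaned = to_ascii_punctuation(sample)
print(cleaned)
```

Storing the text as Unicode and leaving the curly quotes alone is usually simpler. The mapping only makes sense when a downstream system cannot handle non-ASCII characters.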

Answered 2025-04-17 by Python大师
