Image to text - removing non-ASCII characters in Python 2.7

Asked 2025-04-18 14:37

I'm using pytesser to recognize the text in a small image and convert it to a string:

image = Image.open(ImagePath)
text = image_to_string(image)
print text

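However, pytesser sometimes recognizes non-ASCII characters. The problem shows up when I try to print what was just recognized: in Python 2.7, which I'm using, the program crashes.

Is there a way to make pytesser not return any non-ASCII characters? Perhaps there is a setting for this in tesseract OCR?

Alternatively, is there a way to test whether a string contains non-ASCII characters (without crashing the program) and then simply not print that line?

Some people suggest using Python 3.4, but from my research pytesser does not seem to support that version: Pytesser in Python 3.4: name 'image_to_string' is not defined?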


2 Answers

Is there a way to make pytesser not return any non-ASCII characters?

You can restrict the characters that tesseract recognizes with the tessedit_char_whitelist option.

For example:

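import string
# Allow only digits and ASCII letters ("_-." is appended in the config string below)
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
image = Image.open(ImagePath)
# Assumes image_to_string accepts a config argument, as pytesseract's version does
text = image_to_string(image,
    config="-c tessedit_char_whitelist=%s_-." % char_whitelist)
print text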

See also this link for more information: https://github.com/tesseract-ocr/tesseract/wiki/FAQ-Old#how-do-i-recognize-only-digits
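
If the OCR engine in use does not honor the whitelist, a similar effect can be approximated after the fact by filtering the returned string in Python. The following is only a minimal sketch and is not part of pytesser or tesseract: `ALLOWED`, `WHITESPACE`, and `keep_allowed` are hypothetical names, and the character set simply mirrors the whitelist in the example (digits, ASCII letters, and "_-.").

```python
import string

# Same character set as the whitelist example: digits, ASCII letters, "_-."
ALLOWED = set(string.digits + string.ascii_letters + "_-.")
# Whitespace is kept so that words and lines stay separated (an assumption)
WHITESPACE = set(" \t\r\n")


def keep_allowed(text, allowed=ALLOWED):
    """Return `text` with every character outside `allowed` removed.

    Hypothetical helper for illustration; whitespace is always kept.
    """
    return "".join(ch for ch in text if ch in allowed or ch in WHITESPACE)
```

For instance, `keep_allowed(image_to_string(image))` would drop any accented letters or typographic punctuation that the OCR step let through, at the cost of silently losing those characters.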

Answered 2025-04-18 by Python大师

I recommend the Unidecode library. It converts non-ASCII characters into their closest ASCII equivalents.

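from unidecode import unidecode  # import the function, not just the module
image = Image.open(ImagePath)
text = image_to_string(image)
# If text is a byte string (as in Python 2), decode it first, e.g. text.decode('utf-8')
print unidecode(text)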

This should work!
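
If adding a dependency is not an option, the other part of the question (detecting non-ASCII characters without crashing, then skipping that line) can probably be handled with the standard library alone. This is a rough sketch rather than a definitive solution; `is_ascii`, `strip_non_ascii`, and `printable_lines` are hypothetical helper names, and the code assumes `text` is the string returned by `image_to_string`.

```python
def is_ascii(text):
    """Return True if `text` contains only ASCII characters."""
    try:
        text.encode("ascii")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError
        return False
    return True


def strip_non_ascii(text):
    """Return `text` with every non-ASCII character removed."""
    return "".join(ch for ch in text if ord(ch) < 128)


def printable_lines(text):
    """Yield only the lines of `text` that are pure ASCII."""
    for line in text.splitlines():
        if is_ascii(line):
            yield line
```

A loop such as `for line in printable_lines(text): print(line)` would then print only the lines that are safe for an ASCII-only console, while `strip_non_ascii(text)` keeps every line but drops the offending characters.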

Answered 2025-04-18 by Python大师
