Java Tesseract configuration for alphanumeric strings: confusing 2 with Z and 6 with G


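I am trying to configure Tesseract to recognize alphanumeric strings that are 10 characters long (all uppercase).

This works well, except that it often seems to confuse the following characters:

2 and Z

6 and G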

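After some research I may have found the cause, though I am not sure: Tesseract probably has some mechanism that makes predictions based on the position of a character relative to a dictionary (or other data). I think Tesseract might pick a 2 instead of a Z because the two look alike, but also because that character always occurs in a word; in this case that does not make much sense. At least, that is how I understand it.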
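To make the suspicion above concrete, here is a toy sketch of how a context prior could flip a decision between two look-alike characters. This is not Tesseract's actual algorithm; the class `ToyRescoring`, the scoring formula, and all the numbers are hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ToyRescoring {

    /**
     * Returns the candidate with the highest combined score, where
     * combined = shapeScore + contextWeight * contextScore.
     * Each map value is {shapeScore, contextScore}.
     */
    public static char pick(Map<Character, double[]> candidates, double contextWeight) {
        char best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Character, double[]> entry : candidates.entrySet()) {
            double shapeScore = entry.getValue()[0];
            double contextScore = entry.getValue()[1];
            double total = shapeScore + contextWeight * contextScore;
            if (total > bestScore) {
                bestScore = total;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Hypothetical scores: 'Z' looks slightly better, '2' fits the context better. */
    public static Map<Character, double[]> exampleCandidates() {
        Map<Character, double[]> candidates = new LinkedHashMap<>();
        candidates.put('Z', new double[] {0.52, 0.10});
        candidates.put('2', new double[] {0.48, 0.60});
        return candidates;
    }

    public static void main(String[] args) {
        Map<Character, double[]> candidates = exampleCandidates();
        System.out.println("shape only:   " + pick(candidates, 0.0));
        System.out.println("with context: " + pick(candidates, 0.5));
    }
}
```

With the weight at zero the decision rests on shape alone, which is the behavior the question is trying to reach.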

I want to disable this behavior, so I tried several options, without success:

    tesseract = new Tesseract();
    tesseract.setOcrEngineMode(TessAPI.TessOcrEngineMode.OEM_TESSERACT_ONLY); // legacy engine only
    tesseract.setPageSegMode(7); // treat the image as a single text line

    // Do not load any of the dictionaries (DAWGs)
    tesseract.setTessVariable("load_system_dawg", "0");
    tesseract.setTessVariable("load_freq_dawg", "0");
    tesseract.setTessVariable("load_punc_dawg", "0");
    tesseract.setTessVariable("load_number_dawg", "0");
    tesseract.setTessVariable("load_unambig_dawg", "0");
    tesseract.setTessVariable("load_bigram_dawg", "0");
    tesseract.setTessVariable("load_fixed_length_dawgs", "0");

    // Turn off adaptive classification
    tesseract.setTessVariable("classify_enable_learning", "0");
    tesseract.setTessVariable("classify_enable_adaptive_matcher", "0");

    // Zero the dictionary-related segmentation penalties
    tesseract.setTessVariable("segment_penalty_garbage", "0");
    tesseract.setTessVariable("segment_penalty_dict_nonword", "0");
    tesseract.setTessVariable("segment_penalty_dict_frequent_word", "0");
    tesseract.setTessVariable("segment_penalty_dict_case_ok", "0");
    tesseract.setTessVariable("segment_penalty_dict_case_bad", "0");
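Since every variable in the snippet above is set to "0", the same configuration can be kept in one list and applied in a loop, which makes it easier to toggle individual settings while experimenting. The following is a minimal sketch; the class `DictionaryFreeConfig` and its methods are hypothetical helpers, and the setter is passed in as a `BiConsumer` so that the sketch runs without any OCR library on the classpath. In real code one would presumably pass `tesseract::setTessVariable`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

class DictionaryFreeConfig {

    /** The variable names from the question, each mapped to "0". */
    public static Map<String, String> variables() {
        String[] names = {
            "load_system_dawg", "load_freq_dawg", "load_punc_dawg",
            "load_number_dawg", "load_unambig_dawg", "load_bigram_dawg",
            "load_fixed_length_dawgs",
            "classify_enable_learning", "classify_enable_adaptive_matcher",
            "segment_penalty_garbage", "segment_penalty_dict_nonword",
            "segment_penalty_dict_frequent_word",
            "segment_penalty_dict_case_ok", "segment_penalty_dict_case_bad"
        };
        // LinkedHashMap keeps the variables in the order listed above.
        Map<String, String> vars = new LinkedHashMap<>();
        for (String name : names) {
            vars.put(name, "0");
        }
        return vars;
    }

    /** Passes every name/value pair to the given setter. */
    public static void apply(Map<String, String> vars, BiConsumer<String, String> setter) {
        vars.forEach(setter);
    }

    public static void main(String[] args) {
        // Stand-in setter that only prints each pair.
        apply(variables(), (name, value) -> System.out.println(name + " = " + value));
    }
}
```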

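Note that this is Java code, but my question is not limited to Java.

I have no real experience with Tesseract, and I find the documentation very unclear. I hope someone can help me.

To give some more background: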

How I trained Tesseract

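I trained Tesseract by combining more than 200 images into a single image. Each image contains 10 alphanumeric characters. I am also confident that the box file is correct.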
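One way to double-check a box file is to count how often each symbol occurs and to flag anything that is not an uppercase letter or digit. The sketch below assumes the usual box-file layout of one character per line in the form `symbol left bottom right top page`; the class `BoxFileCheck` and the sample lines in `main` are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BoxFileCheck {

    /** Counts occurrences of each symbol (the first field of every non-empty line). */
    public static Map<String, Integer> countSymbols(List<String> boxLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : boxLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 5) {
                throw new IllegalArgumentException("Malformed box line: " + line);
            }
            counts.merge(fields[0], 1, Integer::sum);
        }
        return counts;
    }

    /** True if the symbol is exactly one uppercase letter or digit. */
    public static boolean isExpectedSymbol(String symbol) {
        return symbol.matches("[A-Z0-9]");
    }

    public static void main(String[] args) throws Exception {
        // Reads the file given as the first argument, or falls back to made-up sample lines.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("2 10 10 30 50 0", "Z 40 10 60 50 0", "g 70 10 90 50 0");
        for (Map.Entry<String, Integer> entry : countSymbols(lines).entrySet()) {
            String note = isExpectedSymbol(entry.getKey()) ? "" : "  <-- unexpected symbol";
            System.out.println(entry.getKey() + ": " + entry.getValue() + note);
        }
    }
}
```

Very low counts for 2, Z, 6, or G would suggest that the training data contains few examples of exactly the characters that get confused.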

I built the final language data by running the following batch script:

    tesseract qwe.combined.jpg qwe.combined.box nobatch box.train
    echo combined 1 0 0 0 0 > font_properties
    unicharset_extractor qwe.combined.box
    shapeclustering -F font_properties -U unicharset qwe.combined.box.tr
    mftraining -F font_properties -U unicharset -O qwe.unicharset qwe.combined.box.tr
    cntraining qwe.combined.box.tr
    copy inttemp qwe.inttemp
    copy normproto qwe.normproto
    copy pffmtable qwe.pffmtable
    copy shapetable qwe.shapetable
    combine_tessdata qwe.

How can I get Tesseract to distinguish better between 2, Z, 6, and G?
