symspellpy
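symspellpy is a Python port of SymSpell v6.3, which provides much higher speed and lower memory consumption. Unit tests from the original project are implemented to ensure the accuracy of the port.
Please note that the port has not been optimized for speed.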
Usage
Installing the symspellpy module
pip install -U symspellpy
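Copying the frequency dictionary to your project
Copy frequency_dictionary_en_82_765.txt (found in the inner symspellpy directory) to your project directory so that you end up with the following layout: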
project_dir
+-frequency_dictionary_en_82_765.txt
\-project.py
Adding new terms
- Use load_dictionary(corpus=<path/to/dictionary.txt>, <term_index>, <count_index>). dictionary.txt should contain:
<term> <count>
<term> <count>
...
<term> <count>
where term_index indicates the column number of the terms and count_index indicates the column number of the counts/frequencies.
- Append <term> <count> to the provided frequency_dictionary_en_82_765.txt
- Use the method create_dictionary_entry(key=<term>, count=<count>)
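To make the role of term_index and count_index concrete, here is a minimal sketch of how whitespace-separated <term> <count> lines can be read into a term-to-count mapping. The helper parse_frequency_lines and the sample data are hypothetical and exist only for illustration; this is not how symspellpy implements load_dictionary.

```python
def parse_frequency_lines(lines, term_index, count_index):
    """Hypothetical helper: build {term: count} from '<term> <count>' lines."""
    counts = {}
    for line in lines:
        parts = line.split()
        # skip blank or short lines that lack one of the requested columns
        if len(parts) <= max(term_index, count_index):
            continue
        try:
            count = int(parts[count_index])
        except ValueError:
            continue  # skip lines whose count column is not an integer
        term = parts[term_index]
        # assumption of this sketch: repeated terms have their counts summed
        counts[term] = counts.get(term, 0) + count
    return counts


# made-up sample lines; the blank and malformed lines are skipped
sample = ["apple 10", "banana 5", "", "broken line"]
print(parse_frequency_lines(sample, term_index=0, count_index=1))
```

With term_index=0 and count_index=1 the term is read from the first column and the count from the second; swapping the two indices would read a file laid out as <count> <term> instead.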
Sample usage (create_dictionary)
import os

from symspellpy.symspellpy import SymSpell  # import the module


def main():
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 2
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # create dictionary using corpus.txt
    corpus_path = "<path/to/corpus.txt>"  # replace with the path to your corpus
    if not sym_spell.create_dictionary(corpus_path):
        print("Corpus file not found")
        return

    # print every word found in the corpus together with its count
    for key, count in sym_spell.words.items():
        print("{} {}".format(key, count))


if __name__ == "__main__":
    main()
corpus.txt should contain:
abc abc-def abc_def abc'def abc qwe qwe1 1qwe q1we 1234 1234
Expected output:
abc 4
def 2
abc'def 1
qwe 1
qwe1 1
1qwe 1
q1we 1
1234 2
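The expected output shows that hyphens and underscores split words while apostrophes and digits do not. The sketch below reproduces those counts with the standard library only. The regular expression and the helper count_words are assumptions chosen to match the output above; they are not taken from the symspellpy source.

```python
import re
from collections import Counter

# Assumed tokenisation rule for this sketch: a word is a run of letters,
# digits, or apostrophes; anything else (spaces, '-', '_') separates words.
WORD_PATTERN = re.compile(r"(?:[^\W_]|['’])+")


def count_words(text):
    """Hypothetical helper: count lowercased words in a corpus string."""
    return Counter(m.group() for m in WORD_PATTERN.finditer(text.lower()))


corpus = "abc abc-def abc_def abc'def abc qwe qwe1 1qwe q1we 1234 1234"
# words are printed in order of first appearance, one "<word> <count>" per line
for word, count in count_words(corpus).items():
    print("{} {}".format(word, count))
```

Here abc is counted four times because abc-def and abc_def each contribute one abc and one def, whereas abc'def stays a single word.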
Sample usage (lookup and lookup_compound)
Using project.py (the code is more verbose than required, to allow explanation of the method arguments)
import os

from symspellpy.symspellpy import SymSpell, Verbosity  # import the module


def main():
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 2
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # load dictionary
    dictionary_path = os.path.join(os.path.dirname(__file__),
                                   "frequency_dictionary_en_82_765.txt")
    term_index = 0  # column of the term in the dictionary text file
    count_index = 1  # column of the term frequency in the dictionary text file
    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return

    # lookup suggestions for single-word input strings
    input_term = "memebers"  # misspelling of "members"
    # max edit distance per lookup
    # (max_edit_distance_lookup <= max_edit_distance_dictionary)
    max_edit_distance_lookup = 2
    suggestion_verbosity = Verbosity.CLOSEST  # TOP, CLOSEST, ALL
    suggestions = sym_spell.lookup(input_term, suggestion_verbosity,
                                   max_edit_distance_lookup)
    # display suggestion term, edit distance, and term frequency
    for suggestion in suggestions:
        print("{}, {}, {}".format(suggestion.term, suggestion.distance,
                                  suggestion.count))

    # lookup suggestions for multi-word input strings (supports compound
    # splitting & merging)
    input_term = ("whereis th elove hehad dated forImuch of thepast who "
                  "couqdn'tread in sixtgrade and ins pired him")
    # max edit distance per lookup (per single word, not per whole input string)
    max_edit_distance_lookup = 2
    suggestions = sym_spell.lookup_compound(input_term,
                                            max_edit_distance_lookup)
    # display suggestion term, edit distance, and term frequency
    for suggestion in suggestions:
        print("{}, {}, {}".format(suggestion.term, suggestion.distance,
                                  suggestion.count))


if __name__ == "__main__":
    main()
Expected output:
members, 1, 226656153
where is the love he had dated for much of the past who couldn't read in six grade and inspired him, 9, 300000
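The second value printed for each suggestion is an edit distance. As an illustration of what that number measures, the sketch below computes the optimal string alignment variant of Damerau-Levenshtein distance, in which an insertion, deletion, substitution, or swap of two adjacent characters each cost one edit. The function name osa_distance is hypothetical, and this textbook implementation is only assumed to be close to the distance the library reports; it is not symspellpy's own code.

```python
def osa_distance(a, b):
    """Textbook optimal string alignment distance between strings a and b."""
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            # swap of two adjacent characters counts as a single edit
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(osa_distance("memebers", "members"))  # one deleted "e" -> 1
```

A lookup with max_edit_distance_lookup = 2 only considers dictionary terms whose distance to the input is at most 2, which is why "members" at distance 1 qualifies.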
Sample usage (word_segmentation)
Using project.py (the code is more verbose than required, to allow explanation of the method arguments)
import os

from symspellpy.symspellpy import SymSpell  # import the module


def main():
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 0
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # load dictionary
    dictionary_path = os.path.join(os.path.dirname(__file__),
                                   "frequency_dictionary_en_82_765.txt")
    term_index = 0  # column of the term in the dictionary text file
    count_index = 1  # column of the term frequency in the dictionary text file
    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return

    # a sentence without any spaces
    input_term = "thequickbrownfoxjumpsoverthelazydog"

    result = sym_spell.word_segmentation(input_term)
    # display the segmented string, the edit distance sum, and the
    # log probability sum
    print("{}, {}, {}".format(result.corrected_string, result.distance_sum,
                              result.log_prob_sum))


if __name__ == "__main__":
    main()
Expected output:
the quick brown fox jumps over the lazy dog, 8, -34.491167981910635
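To give a feel for how an unspaced string can be split into words, here is a small dynamic-programming sketch that picks the split with the highest summed log probability. Everything in it is an assumption made for illustration: the function segment, the toy counts in TOY_COUNTS, and the penalty applied to unknown chunks. It does not reproduce the algorithm or the scores of word_segmentation.

```python
import math

# made-up word counts, just large enough to cover the sample sentence
TOY_COUNTS = {"the": 500, "quick": 50, "brown": 40, "fox": 30,
              "jumps": 20, "over": 100, "lazy": 10, "dog": 60}


def segment(text, counts, max_word_len=20):
    """Hypothetical helper: return (score, words) for the best split of text."""
    total = sum(counts.values())

    def log_prob(word):
        if word in counts:
            return math.log10(counts[word] / total)
        # unknown chunk: penalise heavily, more so for longer chunks
        return -(math.log10(total) + len(word))

    # best[i] is the (score, words) pair of the best split of text[:i]
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            score, words = best[start]
            word = text[start:end]
            candidates.append((score + log_prob(word), words + [word]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1]


score, words = segment("thequickbrownfoxjumpsoverthelazydog", TOY_COUNTS)
print(" ".join(words))
print(len(words) - 1)  # number of spaces that had to be inserted
```

The sentence splits into nine words, so eight spaces are inserted, which matches the distance of 8 in the expected output above; the log probability in this sketch differs from the library's value because the counts are invented.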
Transfer casing
To transfer the casing of the original phrase to the corrected phrase, use the transfer_casing boolean flag of the lookup() and lookup_compound() methods:
lookup_compound():
suggestions = sym_spell.lookup_compound(input_term,
                                        max_edit_distance_lookup,
                                        transfer_casing=True)
lookup():
suggestions = sym_spell.lookup(input_term,
                               suggestion_verbosity,
                               max_edit_distance_lookup,
                               transfer_casing=True)
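As a rough picture of what transferring casing means, the sketch below copies the casing pattern of a misspelled word onto its lowercase correction. The helper transfer_word_casing and its three-way rule (all uppercase, leading capital, otherwise lowercase) are hypothetical simplifications; the library's own casing logic may well differ, especially for mixed-case input.

```python
def transfer_word_casing(original, corrected):
    """Hypothetical helper: apply the casing pattern of original to corrected."""
    if original.isupper():
        return corrected.upper()       # MEMEBERS -> MEMBERS
    if original[:1].isupper():
        return corrected.capitalize()  # Memebers -> Members
    return corrected.lower()           # memebers -> members


for misspelled in ("MEMEBERS", "Memebers", "memebers"):
    print(transfer_word_casing(misspelled, "members"))
```

Without such a step, a corrected term comes back in whatever casing the dictionary stores, so the capitalization of the user's input would be lost.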
Changelog
6.3.9 (2019-08-06)
- Added transfer_casing to lookup and lookup_compound
- Fixed prefix length check in _edits_prefix
6.3.8 (2019-03-21)
- Implemented delete_dictionary_entry
- Improved performance by using Python's built-in hashing
- Added versioning of the pickle
6.3.7 (2019-02-18)
- Fixed include_unknown in lookup
- Removed unused initial_capacity argument
- Improved _get_str_hash performance
- Implemented save_pickle and load_pickle to avoid having to create the dictionary every time
6.3.6 (2019-02-11)
- Added create_dictionary() feature
6.3.5 (2019-01-14)
- Fixed lookup_compound() to return the correct distance
6.3.4 (2019-01-04)
- Added <self._replaced_words = dict()> to track the number of misspelled words
- Added ignore_token to word_segmentation() to ignore words matching a regular expression
6.3.3 (2018-12-05)
- Added word_segmentation() feature
6.3.2 (2018-10-23)
- Added encoding option to load_dictionary()
6.3.1 (2018-08-30)
- Created package for symspellpy
6.3.0 (2018-08-13)
- Ported SymSpell v6.3