A reimplementation of hfst-optimized-lookup in Python. It also includes a wrapper around the original hfst-optimized-lookup.
hfstol
hfst-optimized-lookup in Python
pip install hfstol
All examples below are based on two .hfstol files: crk-descriptive-analyzer.hfstol and crk-normative-generator.hfstol.
Usage
With crk-descriptive-analyzer.hfstol:

    from hfstol import HFSTOL

    hfst = HFSTOL.from_file('crk-descriptive-analyzer.hfstol')

    hfst.feed('niska')
    # returns:
    # (('niska', '+N', '+A', '+Sg'), ('niska', '+N', '+A', '+Obv'))

    hfst.feed_in_bulk(['niska', 'kinipânânaw'])
    # returns:
    # {'niska': {('niska', '+N', '+A', '+Obv'), ('niska', '+N', '+A', '+Sg')}, 'kinipânânaw': {('nipâw', '+V', '+AI', '+Ind', '+Prs', '+12Pl')}}

    hfst.feed_in_bulk_fast(['niska', 'kinipânânaw'])
    # returns:
    # {'niska': {'niska+N+A+Obv', 'niska+N+A+Sg'}, 'kinipânânaw': {'nipâw+V+AI+Ind+Prs+12Pl'}}
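The tuples returned by feed mix lemma symbols and tag symbols. As a minimal sketch, the hypothetical helper below (split_analysis is not part of hfstol) separates the two, assuming that every symbol starting with '+' is a tag and everything else belongs to the lemma. That assumption holds for the sample output shown here but may not hold for every analyzer.

```python
from typing import List, Tuple


def split_analysis(analysis: Tuple[str, ...]) -> Tuple[str, List[str]]:
    """Split one analysis tuple into (lemma, tags).

    Assumption (not from hfstol itself): symbols starting with '+'
    are tags, and all remaining symbols make up the lemma.
    """
    lemma = "".join(s for s in analysis if not s.startswith("+"))
    tags = [s for s in analysis if s.startswith("+")]
    return lemma, tags


# Sample data copied from the documented output of hfst.feed('niska'),
# so this snippet runs without an .hfstol file.
analyses = (("niska", "+N", "+A", "+Sg"), ("niska", "+N", "+A", "+Obv"))
for analysis in analyses:
    lemma, tags = split_analysis(analysis)
    print(lemma, tags)
```

Because the lemma is rebuilt by joining symbols, the same helper also accepts the character-by-character tuples produced with concat=False.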
With crk-normative-generator.hfstol:

    from hfstol import HFSTOL

    hfst = HFSTOL.from_file('crk-normative-generator.hfstol')

    hfst.feed('niska+N+A+Pl')
    # returns:
    # (('niskak',),)

    hfst.feed_in_bulk(['niska+N+A+Pl', 'nipâw+V+AI+Ind+Prs+12Pl'])
    # returns:
    # {'niska+N+A+Pl': {('niskak',)}, 'nipâw+V+AI+Ind+Prs+12Pl': {('kinipânânaw',), ('kinipânaw',)}}

    hfst.feed_in_bulk_fast(['niska+N+A+Pl', 'nipâw+V+AI+Ind+Prs+12Pl'], multi_process=4)
    # returns:
    # {'niska+N+A+Pl': {'niskak'}, 'nipâw+V+AI+Ind+Prs+12Pl': {'kinipânânaw', 'kinipânaw'}}
For comprehensive API behavior, including edge cases (for example, what happens if I call feed('absolute garbage')), see this test file.
API signatures

    import pathlib
    from typing import Dict, Iterable, List, Set, Tuple, Union


    class HFSTOL:

        # HFSTOL.from_file
        @classmethod
        def from_file(cls, filename: Union[str, pathlib.Path]):
            """
            :param filename: the `.hfstol` file
            :return: an `HFSTOL` instance, which you can use to convert surface forms to deep forms
            """
            pass

        # HFSTOL.feed
        def feed(self, surface_form: str, concat: bool = True) -> Tuple[Tuple[str, ...], ...]:
            """
            feed surface form to hfst

            :param surface_form: the surface form
            :param concat: whether to concatenate single characters

            example output for `surface_form` = 'niskak', with `crk-descriptive-analyzer.hfstol`
            - True: (('niska', '+N', '+A', '+Pl'), ('nîskâw', '+V', '+II', '+II', '+Cnj', '+Prs', '+3Sg'))
            - False: (('n', 'i', 's', 'k', 'a', '+N', '+A', '+Pl'), ('n', 'î', 's', 'k', 'â', 'w', '+V', '+II', '+II', '+Cnj', '+Prs', '+3Sg'))

            example output for `surface_form` = 'niska+N+A+Pl' with `crk-normative-generator.hfstol`
            - True: (('niskak',),)
            - False: (('n', 'i', 's', 'k', 'a', 'k'),)

            example output for `surface_form` = 'nipâw+V+AI+Ind+Prs+12Pl' with `crk-normative-generator.hfstol` (an inflection that has two spellings)
            - True: (('kinipânaw',), ('kinipânânaw',))
            - False: (('k', 'i', 'n', 'i', 'p', 'â', 'n', 'a', 'w'), ('k', 'i', 'n', 'i', 'p', 'â', 'n', 'â', 'n', 'a', 'w'))
            """
            pass

        # HFSTOL.feed_in_bulk
        def feed_in_bulk(self, surface_forms: List[str], concat=True) -> Dict[str, Set[Tuple[str, ...]]]:
            """
            feed multiple surface forms to hfst at once

            :param surface_forms:
            :return: a dictionary with keys being each surface form fed in, values being their corresponding deep forms
            """
            pass

        # HFSTOL.feed_in_bulk_fast
        def feed_in_bulk_fast(self, strings: Iterable[str], multi_process: int = 1) -> Dict[str, Set[str]]:
            """
            calls `hfst-optimized-lookup`. Evaluation is magnitudes faster. Note that the generated symbols will all be concatenated.
            e.g. instead of ['n', 'i', 's', 'k', 'a', '+N', '+A', '+Pl'] it returns ['niska+N+A+Pl']

            :keyword multi_process: Defaults to 1. Specify how many parallel processes you want to speed up computation. A rule is to have processes at most your machine core count.
            """
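The multi_process docstring suggests using at most as many processes as the machine has cores. One possible way to follow that rule is sketched below; pick_process_count is a hypothetical helper, not part of hfstol, and it relies only on the standard library's os.cpu_count().

```python
import os


def pick_process_count(requested: int) -> int:
    """Clamp a requested process count to the range [1, core count].

    Hypothetical helper: os.cpu_count() can return None when the count
    cannot be determined, in which case this falls back to 1.
    """
    cores = os.cpu_count() or 1
    return max(1, min(requested, cores))


# The result could then be passed as multi_process, e.g.
# hfst.feed_in_bulk_fast(forms, multi_process=pick_process_count(8))
print(pick_process_count(8))
```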
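Using feed_in_bulk_fast
feed_in_bulk_fast calls compiled C code and can be roughly 100 times faster than feed_in_bulk.
It requires hfst-optimized-lookup to be installed. Version 1.2 has been tested and works. On Linux, installation can be as simple as sudo apt install hfst. For other systems, see the installation guide.
If hfst-optimized-lookup cannot be found, calling feed_in_bulk_fast raises ImportError.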
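Since feed_in_bulk_fast raises ImportError when hfst-optimized-lookup is missing, one option is to fall back to the pure Python feed_in_bulk. The sketch below is an illustration rather than part of hfstol: analyze_all is a hypothetical helper, and it assumes that joining the symbols of each tuple from feed_in_bulk yields the same strings that feed_in_bulk_fast would return, as the examples in this README suggest.

```python
from typing import Dict, Iterable, Set


def analyze_all(hfst, forms: Iterable[str]) -> Dict[str, Set[str]]:
    """Try the fast lookup first; fall back to the pure Python lookup.

    `hfst` is expected to be an HFSTOL instance (or anything exposing
    feed_in_bulk_fast and feed_in_bulk with the documented signatures).
    """
    forms = list(forms)  # materialize, so the input can be reused on fallback
    try:
        return hfst.feed_in_bulk_fast(forms)
    except ImportError:
        # hfst-optimized-lookup is not installed: use the slower method
        # and concatenate each tuple of symbols into a single string.
        slow = hfst.feed_in_bulk(forms)
        return {
            form: {"".join(symbols) for symbols in results}
            for form, results in slow.items()
        }
```

With an HFSTOL instance loaded from an analyzer file, this could be called as analyze_all(hfst, ['niska', 'kinipânânaw']).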