Tools for analysing and searching across http://www.cs.cmu.edu/~dbamman/latin.html
Detailed description of the archives_org_latin_toolkit Python project
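What?
This software is meant to be used with the 11k Latin texts compiled by David Bamman (http://www.cs.cmu.edu/~dbamman/latin.html). It supports only the plain-text format and the metadata CSV file from the GitHub repo. It has only been tested with Python 3. New features and backward-compatibility support are welcome.
How to install?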
- Development version:
    - Clone the repository:
      git clone https://github.com/ponteineptique/archives_org_latin_toolkit.git
    - Go to the directory:
      cd archives_org_latin_toolkit
    - Install the source with the develop option:
      python setup.py install
- With pip:
    - Install from pip:
      pip install archives_org_latin_toolkit
Example
The example below should run with the data in tests/test_data. It can be run with python example.py
# We import the main classes from the module
from archives_org_latin_toolkit import Repo, Metadata
from pprint import pprint

# We initiate a Metadata object and a Repo object
metadata = Metadata("./test/test_data/latin_metadata.csv")
# We want the text to be set in lowercase
repo = Repo("./test/test_data/archive_org_latin/", metadata=metadata, lowercase=True)

# We define a list of tokens we want to search for
tokens = ["ecclesiastico", "ecclesia", "ecclesiis"]

# We instantiate a result storage
results = []

# We iterate over the texts that contain those tokens.
# Note that we need to "unzip" the list
for text_matching in repo.find(*tokens):
    # For each text, we iterate over the embeddings found in the text.
    # We want 3 words left, 3 words right,
    # and we want to keep the original token (default behaviour)
    for embedding in text_matching.find_embedding(*tokens, window=3, ignore_center=False):
        # We add it to the results
        results.append(embedding)

# We print the result (list of lists of strings)
pprint(results)
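To make the idea of a windowed embedding more concrete, here is a minimal, standalone sketch of how the words around each matching token could be collected. It is not the toolkit's implementation: the function name `window_contexts` is hypothetical, and splitting on whitespace is an assumption made only for illustration. The `window` and `ignore_center` parameters are named after the arguments used in the example above, and their behaviour here is a guess at the intent (N words on each side, optionally dropping the matched token).

```python
from typing import Iterator, List, Set


def window_contexts(words: List[str], targets: Set[str], window: int = 3,
                    ignore_center: bool = False) -> Iterator[List[str]]:
    """Yield the words around each occurrence of a target word.

    Hypothetical helper for illustration only; not part of
    archives_org_latin_toolkit.
    """
    for i, word in enumerate(words):
        if word not in targets:
            continue
        # Up to `window` words on each side, clipped at the text boundaries
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        # Keep or drop the matched token itself
        center = [] if ignore_center else [word]
        yield left + center + right


# Assumption: naive lowercase + whitespace tokenisation
text = "In omnibus ecclesiis sanctorum mulieres in ecclesia taceant"
contexts = list(
    window_contexts(text.lower().split(), {"ecclesia", "ecclesiis"}, window=2)
)
for context in contexts:
    print(context)
```

Each yielded item is a list of strings, which matches the "list of lists of strings" shape printed by the example above. Near the start or end of a text the window is simply shorter.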