Tokenize SMILES with substructure units

Detailed description of the SmilesPE Python project


SMILES Pair Encoding (SmilesPE)

SMILES Pair Encoding (SmilesPE) trains a substructure tokenizer from a large set of SMILES strings (e.g., ChEMBL) based on byte-pair-encoding (BPE).
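To make the BPE idea concrete, here is a minimal sketch of how merge rules could be learned from a list of SMILES strings. It does not use the SmilesPE package: the names `TOKEN_PATTERN`, `merge_pair`, and `learn_merges` are hypothetical, the token pattern is a simplification, and the stopping rule (stop when the best pair occurs fewer than twice) is an assumption made for illustration.

```python
import re
from collections import Counter

# Simplified SMILES token pattern (an assumption, not necessarily the one
# SmilesPE uses): bracket atoms, Br/Cl, two-digit ring closures, then any
# other single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def merge_pair(toks, left, right):
    """Replace every adjacent (left, right) pair in toks with one merged token."""
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and toks[i] == left and toks[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out


def learn_merges(smiles_list, num_merges):
    """Learn up to num_merges BPE merge rules from a list of SMILES strings."""
    corpus = [TOKEN_PATTERN.findall(s) for s in smiles_list]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the whole corpus.
        pair_counts = Counter()
        for toks in corpus:
            pair_counts.update(zip(toks, toks[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:  # assumed stopping rule: nothing worth merging remains
            break
        merges.append((left, right))
        corpus = [merge_pair(toks, left, right) for toks in corpus]
    return merges, corpus


merges, corpus = learn_merges(["CCO", "CCN", "CCC"], num_merges=10)
print(merges)
print(corpus)
```

On this toy corpus the only pair that occurs more than once is `('C', 'C')`, so a single merge is learned. A real run on a dataset such as ChEMBL would learn many merges, each one building a larger substructure token from smaller ones.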

Overview

Installation

pip install SmilesPE

Usage

Basic Tokenizers

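  1. Atom-level Tokenizer
from SmilesPE.pretokenizer import atomwise_tokenizer
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = atomwise_tokenizer(smi)
print(toks)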
['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']
  2. K-mer Tokenizer
from SmilesPE.pretokenizer import kmer_tokenizer
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = kmer_tokenizer(smi, ngram=4)
print(toks)
['CC[N+](', 'C[N+](C', '[N+](C)', '(C)(', 'C)(C', ')(C)', '(C)C', 'C)Cc', ')Cc1', 'Cc1c', 'c1cc', '1ccc', 'cccc', 'cccc', 'ccc1', 'cc1Br']
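For readers who want to see what these two tokenizers do without installing the package, the following is a small stand-alone sketch. The functions `atomwise_sketch` and `kmer_sketch` are hypothetical names, and the regular expression is a simplified assumption rather than the pattern SmilesPE actually uses; it only aims to reproduce the two outputs shown above.

```python
import re

# Simplified token pattern (an assumption for illustration): bracket atoms,
# the two-letter halogens Br and Cl, two-digit ring closures such as %12,
# then any other single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atomwise_sketch(smi):
    """Split a SMILES string into atom-level tokens."""
    return TOKEN_PATTERN.findall(smi)


def kmer_sketch(smi, ngram=4):
    """Slide a window of `ngram` atom-level tokens over the string."""
    units = atomwise_sketch(smi)
    return ["".join(units[i:i + ngram]) for i in range(len(units) - ngram + 1)]


smi = 'CC[N+](C)(C)Cc1ccccc1Br'
print(atomwise_sketch(smi))
print(kmer_sketch(smi, ngram=4))
```

The k-mer step is a sliding window over the atom-level tokens, so a string with 19 tokens yields 16 overlapping 4-mers. In this sketch a string with fewer tokens than `ngram` gives an empty list; the library may handle that case differently.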

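The basic tokenizers are also compatible with SELFIES and DeepSMILES. The corresponding packages need to be installed.

Example of SELFIES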

import selfies
from SmilesPE.pretokenizer import atomwise_tokenizer, kmer_tokenizer

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
sel = selfies.encoder(smi)
print(f'SELFIES string: {sel}')
# >>> SELFIES string: [C][C][N+][Branch1_2][epsilon][C][Branch1_3][epsilon][C][C][c][c][c][c][c][c][Ring1][Branch1_1][Br]

toks = atomwise_tokenizer(sel)
print(toks)
# >>> ['[C]', '[C]', '[N+]', '[Branch1_2]', '[epsilon]', '[C]', '[Branch1_3]', '[epsilon]', '[C]', '[C]', '[c]', '[c]', '[c]', '[c]', '[c]', '[c]', '[Ring1]', '[Branch1_1]', '[Br]']

toks = kmer_tokenizer(sel, ngram=4)
print(toks)
# >>> ['[C][C][N+][Branch1_2]', '[C][N+][Branch1_2][epsilon]', '[N+][Branch1_2][epsilon][C]', '[Branch1_2][epsilon][C][Branch1_3]', '[epsilon][C][Branch1_3][epsilon]', '[C][Branch1_3][epsilon][C]', '[Branch1_3][epsilon][C][C]', '[epsilon][C][C][c]', '[C][C][c][c]', '[C][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][Ring1]', '[c][c][Ring1][Branch1_1]', '[c][Ring1][Branch1_1][Br]']

Example of DeepSMILES

import deepsmiles
from SmilesPE.pretokenizer import atomwise_tokenizer, kmer_tokenizer

converter = deepsmiles.Converter(rings=True, branches=True)
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
deepsmi = converter.encode(smi)
print(f'DeepSMILES string: {deepsmi}')
# >>> DeepSMILES string: CC[N+]C)C)Ccccccc6Br

toks = atomwise_tokenizer(deepsmi)
print(toks)
# >>> ['C', 'C', '[N+]', 'C', ')', 'C', ')', 'C', 'c', 'c', 'c', 'c', 'c', 'c', '6', 'Br']

toks = kmer_tokenizer(deepsmi, ngram=4)
print(toks)
# >>> ['CC[N+]C', 'C[N+]C)', '[N+]C)C', 'C)C)', ')C)C', 'C)Cc', ')Ccc', 'Cccc', 'cccc', 'cccc', 'cccc', 'ccc6', 'cc6Br']

Use the Pre-trained SmilesPE Tokenizer

Download 'SPE_ChEMBL.txt'.

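import codecs
from SmilesPE.tokenizer import SPE_Tokenizer

# Open the downloaded vocabulary file (adjust the path to where it was saved)
spe_vob = codecs.open('../SPE_ChEMBL.txt')
spe = SPE_Tokenizer(spe_vob)
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
spe.tokenize(smi)
# >>> 'CC [N+](C) (C)C c1ccccc1 Br'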
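As a rough illustration of what a BPE-style tokenizer does with a learned vocabulary, the sketch below applies a ranked list of merge rules to a SMILES string and returns the tokens joined by spaces, like the output above. It is not the library's implementation: `apply_merges` is a hypothetical helper, the token pattern is simplified, and `toy_merges` is a made-up two-rule list, not the contents of 'SPE_ChEMBL.txt'.

```python
import re

# Simplified token pattern (an assumption): bracket atoms, Br/Cl,
# two-digit ring closures, then any other single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def apply_merges(smi, merges):
    """Tokenize smi by applying merge rules in priority order (earlier = higher)."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    toks = TOKEN_PATTERN.findall(smi)
    while len(toks) > 1:
        # Pick the adjacent pair with the best (lowest) rank, if any is known.
        pairs = list(zip(toks, toks[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break
        left, right = best
        merged, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and toks[i] == left and toks[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(toks[i])
                i += 1
        toks = merged
    return " ".join(toks)


# Hypothetical toy merge list, highest priority first.
toy_merges = [("c", "c"), ("C", "C")]
print(apply_merges("CCc1ccccc1", toy_merges))
```

With only two toy rules the result stays close to atom level. A vocabulary trained on a large dataset contains many more rules, which is how chemically meaningful units such as `c1ccccc1` can end up as single tokens.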

Train a SmilesPE Tokenizer with a Custom Dataset

See train_SPE.ipynb for an example of training an SPE tokenizer on ChEMBL data.
