Tokenize SMILES with substructure units
Detailed description of the SmilesPE Python project
SMILES Pair Encoding (SmilesPE)
SMILES Pair Encoding (SmilesPE) trains a substructure tokenizer from a large set of SMILES strings (e.g., ChEMBL) based on byte-pair-encoding (BPE).
Overview
Installation
pip install SmilesPE
Usage Instructions
Basic Tokenizers
- Atom-level Tokenizer
from SmilesPE.pretokenizer import atomwise_tokenizer
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = atomwise_tokenizer(smi)
print(toks)
['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']
- K-mer Tokenizer
from SmilesPE.pretokenizer import kmer_tokenizer
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = kmer_tokenizer(smi, ngram=4)
print(toks)
['CC[N+](', 'C[N+](C', '[N+](C)', '(C)(', 'C)(C', ')(C)', '(C)C', 'C)Cc', ')Cc1', 'Cc1c', 'c1cc', '1ccc', 'cccc', 'cccc', 'ccc1', 'cc1Br']
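To make the two basic tokenizers concrete, here is a minimal standalone sketch of how atom-level and k-mer tokenization can be done with a regular expression and a sliding window. It is not the SmilesPE source code: the names `SMILES_ATOM_PATTERN`, `atomwise_sketch`, and `kmer_sketch` are hypothetical, and the pattern is a commonly used SMILES atom-level regex that may differ from the one the library uses.

```python
import re

# Commonly used atom-level SMILES pattern (an assumption, not taken from SmilesPE):
# bracket atoms, two-letter halogens, organic-subset atoms, bonds, branches,
# and ring-closure digits.
SMILES_ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atomwise_sketch(smi):
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_ATOM_PATTERN.findall(smi)
    # Refuse to silently drop characters the pattern does not cover.
    if "".join(tokens) != smi:
        raise ValueError(f"Unrecognized characters in: {smi}")
    return tokens


def kmer_sketch(smi, ngram=4):
    """Slide a window of `ngram` atom-level tokens over the string."""
    units = atomwise_sketch(smi)
    # Inputs with fewer than `ngram` tokens yield an empty list in this sketch.
    return ["".join(units[i:i + ngram]) for i in range(len(units) - ngram + 1)]


smi = 'CC[N+](C)(C)Cc1ccccc1Br'
print(atomwise_sketch(smi))
print(kmer_sketch(smi, ngram=4))
```

For the example molecule, the 19 atom-level tokens produce 16 overlapping 4-mers, which is why the k-mer list above contains repeated entries such as 'cccc'.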
The basic tokenizers are also compatible with SELFIES and DeepSMILES. The corresponding package must be installed first.
Example of SELFIES
import selfies
from SmilesPE.pretokenizer import atomwise_tokenizer, kmer_tokenizer
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
# Convert SMILES to SELFIES (exact symbols may differ across selfies versions)
sel = selfies.encoder(smi)
print(f'SELFIES string: {sel}')
>>> SELFIES string: [C][C][N+][Branch1_2][epsilon][C][Branch1_3][epsilon][C][C][c][c][c][c][c][c][Ring1][Branch1_1][Br]
toks = atomwise_tokenizer(sel)
print(toks)
>>> ['[C]', '[C]', '[N+]', '[Branch1_2]', '[epsilon]', '[C]', '[Branch1_3]', '[epsilon]', '[C]', '[C]', '[c]', '[c]', '[c]', '[c]', '[c]', '[c]', '[Ring1]', '[Branch1_1]', '[Br]']
toks = kmer_tokenizer(sel, ngram=4)
print(toks)
>>> ['[C][C][N+][Branch1_2]', '[C][N+][Branch1_2][epsilon]', '[N+][Branch1_2][epsilon][C]', '[Branch1_2][epsilon][C][Branch1_3]', '[epsilon][C][Branch1_3][epsilon]', '[C][Branch1_3][epsilon][C]', '[Branch1_3][epsilon][C][C]', '[epsilon][C][C][c]', '[C][C][c][c]', '[C][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][Ring1]', '[c][c][Ring1][Branch1_1]', '[c][Ring1][Branch1_1][Br]']
Example of DeepSMILES
import deepsmiles
from SmilesPE.pretokenizer import atomwise_tokenizer, kmer_tokenizer
converter = deepsmiles.Converter(rings=True, branches=True)
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
# Convert SMILES to DeepSMILES
deepsmi = converter.encode(smi)
print(f'DeepSMILES string: {deepsmi}')
>>> DeepSMILES string: CC[N+]C)C)Ccccccc6Br
toks = atomwise_tokenizer(deepsmi)
print(toks)
>>> ['C', 'C', '[N+]', 'C', ')', 'C', ')', 'C', 'c', 'c', 'c', 'c', 'c', 'c', '6', 'Br']
toks = kmer_tokenizer(deepsmi, ngram=4)
print(toks)
>>> ['CC[N+]C', 'C[N+]C)', '[N+]C)C', 'C)C)', ')C)C', 'C)Cc', ')Ccc', 'Cccc', 'cccc', 'cccc', 'cccc', 'ccc6', 'cc6Br']
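Use the Pre-trained SmilesPE Tokenizer
Download 'SPE_ChEMBL.txt'.
import codecs
from SmilesPE.tokenizer import *
# Open the pre-trained vocabulary (adjust the path to where the file was saved)
spe_vob = codecs.open('../SPE_ChEMBL.txt')
spe = SPE_Tokenizer(spe_vob)
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
spe.tokenize(smi)
>>> 'CC [N+](C) (C)C c1ccccc1 Br'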
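Conceptually, a BPE-style tokenizer starts from atom-level tokens and repeatedly joins adjacent pairs according to a ranked list of learned merges. The sketch below illustrates that idea only. The function `apply_merges` and the three-entry merge list are hypothetical; they are not the contents of SPE_ChEMBL.txt, and the actual `SPE_Tokenizer` implementation may differ.

```python
def apply_merges(tokens, merges):
    """Greedily apply ranked merge rules to a list of tokens.

    `merges` is an ordered list of (left, right) pairs; earlier pairs
    have higher priority.
    """
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(tokens)
    while len(tokens) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        candidates = [
            (rank.get(pair, float('inf')), i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float('inf'):
            break  # no learned merge applies any more
        # Replace the two tokens at positions i and i + 1 with their concatenation.
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# Hypothetical merge list, for illustration only.
toy_merges = [('c', 'c'), ('c', '1'), ('cc', 'cc')]
atoms = ['c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']
print(' '.join(apply_merges(atoms, toy_merges)))
# prints: c1 cccc c1 Br
```

With a vocabulary learned from a large corpus, frequent substructures such as a whole aromatic ring can end up as a single token, as in the 'c1ccccc1' token of the output above.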
Train a SmilesPE Tokenizer with a Custom Dataset
See train_SPE.ipynb for an example of training an SPE tokenizer on ChEMBL data.
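As a rough illustration of what BPE-based training does, the sketch below learns merge rules from a tiny corpus of pre-tokenized strings by repeatedly merging the most frequent adjacent pair. It is not the training code from train_SPE.ipynb: the function `learn_merges`, its parameters, and the stopping threshold of two occurrences are assumptions made for this example.

```python
from collections import Counter


def learn_merges(corpus, num_merges):
    """Learn an ordered list of merge pairs from a corpus of token lists."""
    corpus = [list(toks) for toks in corpus]  # work on copies
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        counts = Counter()
        for toks in corpus:
            counts.update(zip(toks, toks[1:]))
        if not counts:
            break
        (a, b), freq = counts.most_common(1)[0]
        if freq < 2:
            break  # arbitrary threshold chosen for this sketch
        merges.append((a, b))
        # Rewrite the corpus with the new merged token.
        for toks in corpus:
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                i += 1
    return merges


toy_corpus = [['C', 'C', 'O'], ['C', 'C', 'N'], ['C', 'C', 'O']]
print(learn_merges(toy_corpus, num_merges=5))
# prints: [('C', 'C'), ('CC', 'O')]
```

In practice the corpus would be the atom-level tokenization of a large SMILES collection such as ChEMBL, and the number of merges controls the vocabulary size.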