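TensorFlow wrapper for deep neural text generation on character or word level with RNNs/LSTMs
Detailed description of the tensorlm Python project
Generate Shakespeare poems with 4 lines of code.
Installation
tensorlm is written in and for Python 3.4+ and TensorFlow 1.1+.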
pip3 install tensorlm
Basic Usage
Use the CharLM or WordLM class:
import tensorflow as tf
from tensorlm import CharLM

with tf.Session() as session:
    # Create a new model. You can also use WordLM
    model = CharLM(session, "datasets/sherlock/tinytrain.txt", max_vocab_size=96,
                   neurons_per_layer=100, num_layers=3, num_timesteps=15)

    # Train it
    model.train(session, max_epochs=5, max_steps=500)

    # Let it generate a text
    generated = model.sample(session, "The ", num_steps=100)
    print("The " + generated)
This should output something like:
The ee e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e
Command Line Usage
Train:
python3 -m tensorlm.cli --train=True --level=char --train_text_path=datasets/sherlock/tinytrain.txt --max_vocab_size=96 --neurons_per_layer=100 --num_layers=2 --batch_size=10 --num_timesteps=15 --save_dir=out/model --max_epochs=300 --save_interval_hours=0.5
Sample:
python3 -m tensorlm.cli --sample=True --level=char --neurons_per_layer=400 --num_layers=3 --num_timesteps=160 --save_dir=out/model
Evaluate:
python3 -m tensorlm.cli --evaluate=True --level=char --evaluate_text_path=datasets/sherlock/tinyvalid.txt --neurons_per_layer=400 --num_layers=3 --batch_size=10 --num_timesteps=160 --save_dir=out/model
See python3 -m tensorlm.cli --help for all options.
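Advanced Usage
Custom Input Data
The inputs and targets don't have to be text. GeneratingLSTM only expects token ids, so you can use any data type for the sequences, as long as you can encode the data to integer ids.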
import numpy as np
import tensorflow as tf
from tensorlm import GeneratingLSTM

# We use integer ids from 0 to 19, so the vocab size is 20. The range of ids must always start
# at zero.
batch_inputs = np.array([[1, 2, 3, 4], [15, 16, 17, 18]])  # one batch of 2 sequences, 4 time steps each
batch_targets = np.array([[2, 3, 4, 5], [16, 17, 18, 19]])  # the inputs shifted by one step

# Create the model in a TensorFlow graph
model = GeneratingLSTM(vocab_size=20, neurons_per_layer=10, num_layers=2, max_batch_size=2)

with tf.Session() as session:
    # Initialize all defined TF Variables
    session.run(tf.global_variables_initializer())

    for _ in range(5000):
        model.train_step(session, batch_inputs, batch_targets)

    sampled = model.sample_ids(session, [15], num_steps=3)
    print("Sampled: " + str(sampled))
This should output something like:
Sampled: [16, 18, 19]
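As a rough illustration of "encode the data to integer ids", the sketch below maps arbitrary hashable items to ids that start at zero and then builds next-step targets by shifting the id sequence by one position, mirroring the inputs/targets pattern in the example above. The helpers `build_vocab` and `make_batches` and the note-name data are hypothetical and not part of tensorlm; they only show one way to prepare such arrays.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance.

    Ids start at zero and are contiguous, as the model requires.
    """
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def make_batches(ids, num_timesteps):
    """Cut an id sequence into rows of inputs and next-step targets.

    Row i of the targets is row i of the inputs shifted by one position.
    Trailing ids that do not fill a complete row are dropped.
    """
    num_rows = (len(ids) - 1) // num_timesteps
    inputs, targets = [], []
    for row in range(num_rows):
        start = row * num_timesteps
        inputs.append(ids[start:start + num_timesteps])
        targets.append(ids[start + 1:start + num_timesteps + 1])
    return inputs, targets


# Hypothetical non-text data: a short melody given as note names.
notes = ["C", "E", "G", "E", "C", "E", "G", "E", "C"]

vocab = build_vocab(notes)
ids = [vocab[note] for note in notes]
batch_inputs, batch_targets = make_batches(ids, num_timesteps=4)

print("vocab size:", len(vocab))
print("inputs: ", batch_inputs)
print("targets:", batch_targets)
```

The resulting lists can be wrapped in `np.array(...)` and passed to `train_step` as in the example above, with `vocab_size=len(vocab)` and `max_batch_size` at least the number of rows.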
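Custom Training, Dropout etc.
Use the GeneratingLSTM class directly. This class is agnostic to the dataset type. It expects integer ids and returns integer ids.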
import tensorflow as tf
from tensorlm import Vocabulary, Dataset, GeneratingLSTM

BATCH_SIZE = 20
NUM_TIMESTEPS = 15

with tf.Session() as session:
    # Generate a token -> id vocabulary based on the text
    vocab = Vocabulary.create_from_text("datasets/sherlock/tinytrain.txt", max_vocab_size=96,
                                        level="char")

    # Obtain input and target batches from the text file
    dataset = Dataset("datasets/sherlock/tinytrain.txt", vocab, BATCH_SIZE, NUM_TIMESTEPS)

    # Create the model in a TensorFlow graph
    model = GeneratingLSTM(vocab_size=vocab.get_size(), neurons_per_layer=100, num_layers=2,
                           max_batch_size=BATCH_SIZE, output_keep_prob=0.5)

    # Initialize all defined TF Variables
    session.run(tf.global_variables_initializer())

    # Do the training
    epoch = 1
    step = 1
    for epoch in range(20):
        for inputs, targets in dataset:
            loss = model.train_step(session, inputs, targets)

            if step % 100 == 0:
                # Evaluate from time to time
                dev_dataset = Dataset("datasets/sherlock/tinyvalid.txt", vocab,
                                      batch_size=BATCH_SIZE, num_timesteps=NUM_TIMESTEPS)
                dev_loss = model.evaluate(session, dev_dataset)
                print("Epoch: %d, Step: %d, Train Loss: %f, Dev Loss: %f" % (
                    epoch, step, loss, dev_loss))

                # Sample from the model from time to time
                print("Sampled: \"The " + model.sample_text(session, vocab, "The ") + "\"")

            step += 1
This should output something like:
Epoch: 3, Step: 100, Train Loss: 3.824941, Dev Loss: 3.778008
Sampled: "The                                                                                                     "
Epoch: 7, Step: 200, Train Loss: 2.832825, Dev Loss: 2.896187
Sampled: "The                                                                                                     "
Epoch: 11, Step: 300, Train Loss: 2.778579, Dev Loss: 2.830176
Sampled: "The                                                                                         eee         "
Epoch: 15, Step: 400, Train Loss: 2.655153, Dev Loss: 2.684828
Sampled: "The ee e e e e e e e e e e e e e e e e e e e e e e e e e e e "
Epoch: 19, Step: 500, Train Loss: 2.444502, Dev Loss: 2.479753
Sampled: "The an an an on on on on on on on on on on on on on on on on on on on on on o"
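The training example above passes `output_keep_prob=0.5` to GeneratingLSTM. This README does not describe how tensorlm applies that value internally, so the following is only a sketch of the common "inverted dropout" convention, assuming the parameter has the same meaning as `output_keep_prob` in TensorFlow 1.x's `tf.nn.rnn_cell.DropoutWrapper`: the probability of keeping each output unit, not of dropping it. The function `dropout` below is hypothetical and uses plain Python lists instead of tensors.

```python
import random


def dropout(outputs, keep_prob, rng):
    """Inverted dropout on a list of activations.

    Each value is kept with probability keep_prob and scaled by 1 / keep_prob,
    otherwise it is replaced by 0.0. The scaling keeps the expected value of
    each activation unchanged.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in outputs]


rng = random.Random(0)  # fixed seed so repeated runs give the same mask
layer_outputs = [1.0] * 10000  # stand-in for the outputs of one LSTM layer

dropped = dropout(layer_outputs, keep_prob=0.5, rng=rng)

kept_fraction = sum(1 for x in dropped if x != 0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)

print("fraction of units kept: %.3f" % kept_fraction)
print("mean activation after dropout: %.3f" % mean_activation)
```

Under this convention, `keep_prob=1.0` disables dropout, and a lower value regularizes more strongly. Dropout is normally switched off during evaluation and sampling; whether tensorlm does so automatically is not stated here.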