# theano-word2vec
An implementation of Mikolov's word2vec in Python 2, using Theano and Lasagne.
## About this package
This package was written with the modularity of its components in mind, in the hope that they can be reused when creating variants of the standard word2vec. Full documentation, guides for customizing and extending the package, and a setup tutorial are coming soon. For now, enjoy this quick-start guide.
## Quick start
Note: this package currently only works with Python 2.
### Install
Install from the Python Package Index:
```bash
pip install theano-word2vec
```
Or, install a hackable version:
```bash
git clone https://github.com/enewe101/word2vec.git
cd word2vec
python setup.py develop
```
### Usage
The simplest way to train a word2vec embedding:
```python
>>> from word2vec import word2vec
>>> embedder, dictionary = word2vec(files=['corpus/file1.txt', 'corpus/file2.txt'])
```
The input files should be formatted with one sentence per line, with tokens separated by spaces.
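As a concrete illustration of that corpus format (the file name and sentences below are made up for the example, not part of the package), a tiny corpus file could be prepared like this:

```python
import os
import tempfile

# One sentence per line, tokens separated by spaces.
sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a sentence to embed",
]

# Write a tiny corpus file in the expected format (the path is illustrative).
corpus_path = os.path.join(tempfile.mkdtemp(), "file1.txt")
with open(corpus_path, "w") as f:
    f.write("\n".join(sentences) + "\n")

# Each line then splits cleanly back into its tokens.
with open(corpus_path) as f:
    tokenized = [line.split() for line in f]

print(tokenized[1])  # ['a', 'sentence', 'to', 'embed']
```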
Once trained, the embedder can be used to convert words into vectors:
```python
>>> tokens = 'A sentence to embed'.split()
>>> token_ids = dictionary.get_ids(tokens)
>>> vectors = embedder.embed(token_ids)
```
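The package does not prescribe how to compare the resulting vectors, but cosine similarity is the usual choice for word embeddings. A minimal sketch using numpy, with toy vectors standing in for real embedder output:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for the embedder's output.
king = np.array([0.9, 0.1, 0.4])
queen = np.array([0.85, 0.2, 0.45])
banana = np.array([-0.2, 0.9, -0.1])

print(cosine_similarity(king, queen))   # close to 1: similar directions
print(cosine_similarity(king, banana))  # much lower: dissimilar directions
```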
The word2vec() function exposes most of the basic parameters of Mikolov's skip-gram model based on noise-contrastive estimation:
```python
>>> embedder, dictionary = word2vec(
...     # Directory in which to save embedding parameters (deepest dir created if it doesn't exist)
...     savedir='data/my-embedding',
...
...     # List of files comprising the corpus
...     files=['corpus/file1.txt', 'corpus/file2.txt'],
...
...     # Include whole directories of files (deep files not included)
...     directories=['corpus', 'corpus/additional'],
...
...     # Indicate files to exclude using regexes
...     skip=[re.compile(r'.*\.bk$'), re.compile('exclude-from-corpus')],
...
...     # Number of passes through the training corpus
...     num_epochs=5,
...
...     # Specify the mapping from tokens to ints (else create it automatically)
...     unigram_dictionary=preexisting_dictionary,
...
...     # Number of "noise" examples included for every "signal" example
...     noise_ratio=15,
...
...     # Relative probability of skip-gram sampling centered on the query word
...     kernel=[1, 2, 3, 3, 2, 1],
...
...     # Threshold used to calculate the discard probability for query words
...     t=1.0e-5,
...
...     # Size of minibatches during training
...     batch_size=1000,
...
...     # Dimensionality of the embedding vector space
...     num_embedding_dimensions=500,
...
...     # Initializer for embedding parameters (can also be a numpy array)
...     word_embedding_init=lasagne.init.Normal(),
...
...     # Initializer for context embedding parameters (can also be a numpy array)
...     context_embedding_init=lasagne.init.Normal(),
...
...     # Size of stochastic gradient descent steps during training
...     learning_rate=0.1,
...
...     # Amount of Nesterov momentum during training
...     momentum=0.9,
...
...     # Print messages during training
...     verbose=True,
...
...     # Number of parallel corpus-reading processes
...     num_example_generators=3
... )
```
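The `t` parameter above is described as a frequency threshold for discarding query words, which matches Mikolov's subsampling scheme: a word with corpus frequency f is discarded with probability 1 - sqrt(t / f), so very frequent words are aggressively dropped while rare words are always kept. A sketch of that computation (assuming this package follows the standard formula; the toy corpus and the larger t value here are for illustration only):

```python
import math
from collections import Counter

def discard_probability(freq, t=1.0e-5):
    # Standard Mikolov subsampling: a word with frequency `freq` is
    # discarded with probability 1 - sqrt(t / freq); words with
    # freq <= t are always kept (probability clamped to 0).
    return max(0.0, 1.0 - math.sqrt(t / freq))

# A toy corpus; a larger t than the real default keeps the numbers readable.
corpus = "the cat sat on the mat the end".split() * 1000
counts = Counter(corpus)
total = float(len(corpus))

for word in ("the", "cat"):
    f = counts[word] / total
    print("%s %.3f" % (word, discard_probability(f, t=1.0e-3)))
```

Frequent words like "the" end up with a much higher discard probability than less frequent ones, which speeds up training and improves the embeddings of rarer words.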
For further customization, see the documentation (coming soon) to learn how to assemble your own training setup from the classes provided in word2vec.