I need to process a large number of txt files to build a word2vec model.
Right now my txt files are a bit messy. I need to strip all `\n` newlines from the loaded strings (the txt files), read out all the sentences, and then tokenize each sentence so it can be used by the word2vec model. The problem is that I can't read the files line by line, because some sentences don't end at the end of a line. So I use `nltk.tokenizer.tokenize()`, which splits a file into sentences.
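For example, a quick sanity check of that step (assuming the punkt model has been downloaded via `nltk.download('punkt')`):

    import nltk

    raw = "This sentence runs\nacross two lines. This one does not."
    text = raw.replace('\n', ' ')  # flatten hard line breaks first
    print(nltk.sent_tokenize(text))
    # ['This sentence runs across two lines.', 'This one does not.']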
I can't figure out how to convert a list of strings into a list of lists, where each sub-list contains one tokenized sentence, while passing it through a generator.
Or do I really need to save every sentence to a new file (one sentence per line) in order to pass it through a generator?
My code looks like this:
    import nltk

    # initialize the punkt tokenizer for splitting text into sentences
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    class Raw_Sentences(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            # Note: file_loads holds the full paths of the files (e.g. 'C:/Users/text-file1.txt')
            for file in file_loads:
                with open(file, 'r', encoding='utf-8') as t:
                    # flatten newlines, then split the whole file into sentences
                    storage = tokenizer.tokenize(t.read().replace('\n', ' '))
                    # temporarily store the list of sentences for iteration
                    for sentence in storage:
                        print(nltk.word_tokenize(sentence))
                        yield nltk.word_tokenize(sentence)
So the goal is:

Load file 1: `'some messy text here. And another sentence'`

Tokenize it into sentences: `['some messy text here.', 'And another sentence']`

Then split every sentence into words.

Load file 2: `'some other messy text. sentence1. sentence2.'`

And so on.
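In other words, every file should collapse into a list of token lists. A minimal sketch of that transformation as a plain function (`file_to_token_lists` is a hypothetical helper name):

    import nltk

    def file_to_token_lists(path):
        # read the whole file, flatten newlines, split into sentences,
        # then split each sentence into words: one sub-list per sentence
        with open(path, 'r', encoding='utf-8') as t:
            text = t.read().replace('\n', ' ')
        return [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]

Called on file 1 above, this should return `[['some', 'messy', 'text', 'here', '.'], ['And', 'another', 'sentence']]`.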
Feed the sentences into the word2vec model:

    sentences = Raw_Sentences(directory)
    model = gensim.models.Word2Vec(sentences)
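One note on why the class-based design matters here (my understanding of gensim's behavior): `gensim.models.Word2Vec` iterates over the corpus more than once, first to build the vocabulary and then once per training epoch, so `sentences` must be a restartable iterable. A class with `__iter__` qualifies, while a one-shot generator would not:

    # A one-shot generator is exhausted after a single pass:
    gen = (tokens for tokens in [['some', 'sentence']])
    print(list(gen))  # [['some', 'sentence']]
    print(list(gen))  # [] -- nothing left for gensim's second pass

    # Raw_Sentences restarts cleanly: each pass calls __iter__ again,
    # which re-opens the files from scratch.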
Hmm... after writing this down and rethinking it... I think I solved the problem myself. Please correct me if I'm wrong: to iterate over every sentence produced by the NLTK punkt sentence tokenizer, the tokenizer output just has to be passed directly into the for loop:

    for sentence in tokenizer.tokenize(t.read().replace('\n', ' ')):
        yield nltk.word_tokenize(sentence)

As always, there is also the alternative of

    yield gensim.utils.simple_preprocess(sentence, deacc=True)

Feeding it into `sentences = Raw_Sentences(directory)` then builds a correctly working word2vec model: `gensim.models.Word2Vec(sentences)`.
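For completeness, a minimal sketch of what that alternative `__iter__` could look like (using `nltk.sent_tokenize` for sentence splitting and keeping the `file_loads` list from the snippet above; `simple_preprocess` lower-cases, drops punctuation, and with `deacc=True` strips accents):

    import gensim
    import nltk

    class Raw_Sentences(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            for file in file_loads:  # file_loads: list of file paths, as above
                with open(file, 'r', encoding='utf-8') as t:
                    for sentence in nltk.sent_tokenize(t.read().replace('\n', ' ')):
                        # lower-case, strip punctuation and (with deacc=True) accents
                        yield gensim.utils.simple_preprocess(sentence, deacc=True)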