Training a word2vec model from streamed file data and tokenizing into sentences

Published 2024-04-25 06:21:02


I need to process a large number of txt files to build a word2vec model. Right now my txt files are somewhat messy: I need to remove all `\n` newlines, read all the sentences out of the loaded string (the txt file), and then tokenize each sentence for use in the word2vec model.

The problem is that I can't read the files line by line, because some sentences do not end at the end of a line. So I use the NLTK punkt tokenizer's `tokenize()`, which splits a file into sentences.
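For illustration, a minimal sketch of what punkt does here (it assumes the punkt data has been downloaded once via nltk.download('punkt')):

import nltk

# Load the pre-trained punkt sentence splitter.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# A sentence that is broken across two lines in the raw file:
raw = 'This sentence runs over\ntwo lines. And here is another one.'

# Joining the lines first lets punkt find the real sentence boundaries.
print(tokenizer.tokenize(raw.replace('\n', ' ')))
# ['This sentence runs over two lines.', 'And here is another one.']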

I can't figure out how to convert a list of strings into a list of lists, where each sub-list contains one sentence's tokens, while passing it through a generator.

Or do I really need to save every sentence to a new file (one sentence per line) in order to pass it through a generator?
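For context, gensim's Word2Vec does not need one file per sentence; it accepts any iterable that yields one list of tokens per sentence. A minimal sketch of the target shape:

import nltk

def token_lists(raw_sentences):
    # Yield one list of word tokens per sentence string,
    # without materializing the whole corpus in memory.
    for s in raw_sentences:
        yield nltk.word_tokenize(s)

print(list(token_lists(['some messy text here.', 'And another sentence.'])))
# [['some', 'messy', 'text', 'here', '.'], ['And', 'another', 'sentence', '.']]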

My code looks like this:

import nltk

# initialize tokenizer for processing sentences
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

class Raw_Sentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for file in file_loads:  # Note: file_loads includes directory name of files (e.g. 'C:/Users/text-file1.txt')
            with open(file, 'r', encoding='utf-8') as t:
                # temporarily store the list of sentences for iteration
                storage = tokenizer.tokenize(t.read().replace('\n', ' '))
                for sentence in storage:
                    print(nltk.word_tokenize(sentence))
                    yield nltk.word_tokenize(sentence)
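Note that the snippet never defines file_loads (and self.dirname goes unused). One plausible completion, assuming file_loads is meant to hold the full paths of all .txt files under dirname (a hypothetical helper, not shown in the original post):

import glob
import os

class Raw_Sentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
        # Hypothetical: collect the full paths of all .txt files in dirname.
        self.file_loads = glob.glob(os.path.join(dirname, '*.txt'))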

So the goal is: load file 1: 'some messy text here. And another sentence', tokenize it into sentences ['some messy text here', 'And another sentence'], and then split each sentence into words [['some', 'messy', 'text', 'here'], ['And', 'another', 'sentence']].

Load file 2: 'some other messy text. sentence1. sentence2.' And so on.

Feed the sentences into the word2vec model:

sentences = Raw_Sentences(directory)
model = gensim.models.Word2Vec(sentences)
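Something like the following end-to-end sketch should then work (the directory path and the query word 'text' are placeholders, not values from the original post):

import gensim

sentences = Raw_Sentences('C:/Users')         # directory containing the .txt files
model = gensim.models.Word2Vec(sentences)     # streams the corpus during training

# Quick sanity checks once training has finished:
print(model.wv['text'])                       # the learned vector for a word
print(model.wv.most_similar('text', topn=5))  # its nearest neighbours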


Tags: file, text, model, txt, sentences, word2vec
1 Answer

User · #1 · Posted 2024-04-25 06:21:02

Hmm... after writing it down and thinking it over again, I think I solved the problem myself. Correct me if I'm wrong:

To iterate over each sentence created by the NLTK punkt sentence tokenizer, you have to pass it directly into the for loop:

def __iter__(self):
    for file in file_loads:
        with open(file, 'r') as t:
            for sentence in tokenizer.tokenize(t.read().replace('\n', ' ')):
                yield nltk.word_tokenize(sentence)

As always, an alternative is to yield gensim.utils.simple_preprocess(sentence, deacc=True) instead.
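For comparison, simple_preprocess lowercases, strips punctuation, optionally removes accents (deacc=True), and drops very short or very long tokens, which can be handy for messy text:

from gensim.utils import simple_preprocess

print(simple_preprocess('Some MESSY text, here!', deacc=True))
# ['some', 'messy', 'text', 'here']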

Feeding this in as sentences = Raw_Sentences(directory) builds a properly working word2vec model: gensim.models.Word2Vec(sentences).
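One detail worth spelling out: Word2Vec iterates over the corpus more than once (a vocabulary-building pass plus the training epochs), which is why Raw_Sentences is a class with __iter__ rather than a bare generator; a generator would be exhausted after the first pass:

import nltk

gen = (nltk.word_tokenize(s) for s in ['one sentence.', 'another one.'])
print(list(gen))  # [['one', 'sentence', '.'], ['another', 'one', '.']]
print(list(gen))  # [] -- a bare generator is consumed after a single pass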
