How can I make my text-parsing function more efficient/friendly?

Posted 2024-05-15 09:07:04


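I am trying to preprocess a large corpus (about 2 MB) so that each word in the text is grouped with the two words that follow it (i.e. in groups of three words). For the input 'The man ate the apple', I would get (The, man, ate), (man, ate, the), (ate, the, apple). I then want to vectorize each word, build a dataset in which the first two words are the input and the third word is the output, and feed it to an LSTM.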
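The grouping described above can be sketched as a sliding window over the token list. This is a minimal illustration rather than the code from the post; the function name `trigrams` is made up here.

```python
def trigrams(words):
    """Return every run of three consecutive words as a tuple."""
    return list(zip(words, words[1:], words[2:]))


words = "The man ate the apple".split()
for first, second, target in trigrams(words):
    # The first two words are the model input, the third is the target.
    print((first, second), "->", target)
```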

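When I run the following code on a Google Compute Engine instance, the process is always killed once I increase the maximum number of words the (Keras) tokenizer accepts. Any ideas on how to make the code more efficient?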

import numpy as np
from keras.preprocessing.text import Tokenizer  # tensorflow.keras.preprocessing.text in TF 2.x

size_of_vocabulary = 1000

# load_corpus() and filename are not shown in this post.

def preprocess_corpus():
    text = load_corpus(filename)
    print("Preprocessing...")

    # num_words only limits the tokenizer's output methods; word_index still holds every word.
    tokenizer = Tokenizer(num_words=size_of_vocabulary)
    tokenizer.fit_on_texts([text])

    word_index = tokenizer.word_index
    reverse_word_index = {index: word for word, index in word_index.items()}

    return text, word_index, reverse_word_index

def trie_data():

    def clean_text(text):
        # Replace punctuation with spaces and lowercase, matching the Tokenizer defaults.
        filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
        translate_map = str.maketrans(filters, " " * len(filters))
        return text.translate(translate_map).lower()

    def vectorize_word(word):
        # Dense one-hot vector for the target word.
        word_vector = np.zeros(size_of_vocabulary - 1, dtype='float32')
        word_vector[word_index[word]] = 1.0
        return word_vector

    text, word_index, reverse_word_index = preprocess_corpus()
    words = clean_text(text).split()

    X_data = []
    Y_data = []

    # Yield (two input word ids, one-hot target) for every trigram whose
    # target word falls inside the vocabulary limit.
    def enumerate_data():
        for index in range(len(words) - 2):
            if word_index[words[index + 2]] < size_of_vocabulary - 1:
                yield (np.asarray([word_index[words[index]], word_index[words[index + 1]]]),
                       vectorize_word(words[index + 2]))

    for x, y in enumerate_data():
        X_data.append(x)
        Y_data.append(y)

    return np.asarray(X_data), np.asarray(Y_data), word_index
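
One likely reason the process is killed is memory rather than speed: `trie_data` stores a dense float32 one-hot vector of length `size_of_vocabulary - 1` for every trigram, so the target array grows linearly with the vocabulary size. The sketch below is not from the original post. It estimates that cost and shows one possible alternative that keeps the targets as integer ids, which Keras can train on with the `sparse_categorical_crossentropy` loss. The helper names `one_hot_bytes` and `integer_trigrams`, the sample ids, and the figure of 350,000 words for a 2 MB corpus are assumptions for illustration.

```python
def one_hot_bytes(n_samples, vocab_size, itemsize=4):
    """Bytes needed for a dense float32 one-hot target matrix."""
    return n_samples * (vocab_size - 1) * itemsize


def integer_trigrams(ids, vocab_size):
    """Yield ((w1, w2), w3) as integer ids, skipping out-of-vocabulary targets.

    `ids` is the corpus already mapped to word ids (hypothetical input,
    e.g. built by looking each cleaned word up in word_index).
    """
    for w1, w2, w3 in zip(ids, ids[1:], ids[2:]):
        if w3 < vocab_size - 1:
            yield (w1, w2), w3


# Assumed corpus size: about 350,000 words for roughly 2 MB of English text.
n_words = 350_000
for vocab_size in (1_000, 10_000):
    gib = one_hot_bytes(n_words - 2, vocab_size) / 2**30
    print(f"vocab={vocab_size}: about {gib:.1f} GiB of one-hot targets")

ids = [1, 2, 3, 1, 4]  # "the man ate the apple" with made-up ids
pairs = list(integer_trigrams(ids, vocab_size=1_000))
print(pairs)
```

With integer targets, each sample costs one small integer instead of a full vocabulary-sized vector, so raising the vocabulary limit no longer multiplies the size of the dataset.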


0 answers

No answers yet.
