ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray) when the array is longer than 4000


My code seems to throw this error whenever 'input_data' gets longer than 4000, but I want to train on an array 180,000 long. I just finished a text-generation class and am trying to get my model to generate some Eminem lyrics; honestly, even with only 5% of all of Eminem's words (4k out of 180k) the results aren't too bad.


import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import string
import numpy as np
import pandas as pd


# Eminem lyrics https://www.kaggle.com/thaddeussegura/eminem-lyrics-from-all-albums

from urllib.request import urlopen

data = urlopen('https://storage.googleapis.com/kagglesdsdata/datasets/835677/1426970/eminem_lyrics/ALL_eminem.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20200924%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200924T201536Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=9e8afd7dba5915b209e33905c68e93f2bfb1d3baac9456e1a0d16d1b74a0b482baa26bb6f348c2f901b46b63555b1a2bcc900c9db7d17321c27fe4578cc5d12463ca6b3e7c8998cf66a05a33b4b324dba3e48341d010f13a423debb8d1c2f52536870a9cc3ddfa72a4ca9bda874e934bcfdd21512e413e068bbd8c0a2a4042df66358d978080d164ead2f9e0edf1eee4bf66cf2f5c0aa63a5b7e9cea80ca6c211a0558aca9e7671235f105074f5f3f74abb882001acec29573c84b8ed9bf044b7233fb270a12fefe01bd40fe64b44cc0b89d54469357719d14404bb3c6033961c25af43c5c5f9c20fc090cf38fe03946058ecb9b67ebdfe4022c564480a2c73c').read().decode('utf-8')


# split
text = data.split()

# remove punctuation, make everything lowercase
import re

dataset = []
for s in text:
    s = re.sub(r'[^\w\s]', '', s).lower()
    dataset.append(s)


def tokenize_corpus(corpus, num_words=-1):
  # Fit a Tokenizer on the corpus
  if num_words > -1:
    tokenizer = Tokenizer(num_words=num_words)
  else:
    tokenizer = Tokenizer()
  tokenizer.fit_on_texts(corpus)
  return tokenizer

# Tokenize the corpus
tokenizer = tokenize_corpus(dataset)

total_words = len(tokenizer.word_index) + 1
print(total_words)


# get inputs and outputs: each 11-word window becomes a 10-word input and a 1-word label
input_data = []
labels = []
for i in range(180000):
    tokens = np.array(sum(tokenizer.texts_to_sequences(dataset[i:i+11]), []))
    input_data.append(tokens[:-1])
    labels.append(tokens[-1])

input_data = np.array(input_data)
labels = np.array(labels)

#print(input_data)
#print(labels)

# One-hot encode the labels
one_hot_labels = tf.keras.utils.to_categorical(labels, num_classes=total_words)

I also tried converting 'input_data' to a tensor, casting it to different dtypes, and so on, which only produced a variety of other errors. However, if I change 180000 to anything below 4000, everything works.

If the model can't handle all 180,000 sequences at once, could I split them into 45 arrays of 4000 each and train for 5-10 epochs on each one?

The model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional

model = Sequential()
model.add(Embedding(total_words, 64, input_length=10))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(input_data, one_hot_labels, epochs=100, verbose=1)

The last line throws the error, so maybe I should change something in the model itself?

The rest is below; the 'seed_text' is just copied from the class lab:

seed_text = "im feeling chills getting these bills still while having meal"
next_words = 100
  
for _ in range(next_words):
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = pad_sequences([token_list], maxlen=10, padding='pre')
  predicted_probs = model.predict(token_list)[0]
  predicted = np.random.choice([x for x in range(len(predicted_probs))],
                               p=predicted_probs)
  output_word = ""
  for word, index in tokenizer.word_index.items():
    if index == predicted:
      output_word = word
      break
  seed_text += " " + output_word
print(seed_text)

Please help me fix this error, and let me know if you have any ideas for improving the model overall.


1 Answer

I found that after about 4,000 words the tokenizer, for some reason, starts producing sequences of different lengths (not the specified 10). My guess at the cause (an inference from the preprocessing above, not something the traceback states): stripping punctuation turns punctuation-only tokens into empty strings, which texts_to_sequences drops, so some 11-word windows yield fewer than 11 ids and input_data ends up ragged.
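A quick way to check that hypothesis (a toy two-word vocabulary, purely for illustration):

from tensorflow.keras.preprocessing.text import Tokenizer

# toy tokenizer, purely for illustration
t = Tokenizer()
t.fit_on_texts(["hello", "world"])

# an empty string (e.g. a token that was pure punctuation before stripping)
# contributes no ids, so this 3-item window flattens to only 2 ids
print(sum(t.texts_to_sequences(["hello", "", "world"]), []))  # -> [1, 2]

Either way, one extra line of padding fixes it: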

padded = pad_sequences(input_data, maxlen=10, padding="pre")
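
For completeness, here is how that fix could slot into the question's pipeline (a sketch reusing the variable names from the question; nothing else is assumed):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# left-pad (and truncate) every window to exactly 10 ids so NumPy can
# build a regular 2-D array instead of a ragged object array
padded = pad_sequences(input_data, maxlen=10, padding="pre")

# train on the padded inputs instead of the ragged ones
history = model.fit(padded, one_hot_labels, epochs=100, verbose=1)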
