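I am writing preprocessing functions for natural language processing, including one for LSA (latent semantic analysis). All of my other functions, such as tfidf and remove_stopwords, work with the unit tests I created. However, when I test the LSA function, it keeps giving me the following error:
"Expected 2D array, got 1D array instead: array=['I ate dinner at Olive Garden', 'we are buying a house', 'I did not eat dinner at Olive Garden', 'our neighbors are buying a house']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
Here is the code for my LSA function, followed by the test code: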
import pandas as pd
import nltk
import string
import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer
def LSA(data, tfidf = True, remove_stopwords=True):
    # done with stop word removal and tf-idf weighting keeping the 100 most common concepts
    text = data.iloc[:,-1] #isolate text column
    #Define the LSA function
    vectors = sklearn.decomposition.TruncatedSVD(n_components = 2, algorithm = 'randomized', n_iter = 100, random_state = 100)
    vectors.fit(text.tolist())
    svd_matrix = vectors.fit_transform(text.tolist())
    svd_matrix = Normalizer(copy=False).fit_transform(text.tolist())
    dense = svd_matrix.todense()
    denselist = dense.tolist()
    data["cleaned_vectorized_document"] = denselist
    return data
Here is the test code that throws the error:
p = pd.DataFrame({'two':[1,2,3,4],
                  'test':['I ate dinner at Olive Garden', 'we are buying a house',
                          'I did not eat dinner at Olive Garden', 'our neighbors are buying a house']})
print(LSA(p))
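I'm not sure whether this is your problem, but your array is missing commas between its items, which would at least raise the following error:
Try the following: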
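One possible direction, offered as a sketch rather than the definitive fix: TruncatedSVD expects a numeric 2D matrix, while the function in the question passes it a plain list of strings, which is most likely what triggers the "Expected 2D array" message. The version below vectorizes the text with TfidfVectorizer first and then feeds that matrix to the SVD and the Normalizer. The function name lsa, the n_components parameter, and the choice of the built-in English stop-word list are assumptions made for illustration; they are not taken from the original post.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer


def lsa(data, n_components=2, remove_stopwords=True):
    """Hypothetical rewrite of LSA(): tf-idf -> truncated SVD -> L2 normalization."""
    # Isolate the text column (assumed to be the last column, as in the question).
    text = data.iloc[:, -1].astype(str)

    # Turn the raw strings into a numeric (n_documents x n_terms) sparse matrix.
    vectorizer = TfidfVectorizer(stop_words="english" if remove_stopwords else None)
    tfidf_matrix = vectorizer.fit_transform(text)

    # Reduce the tf-idf matrix to n_components latent dimensions.
    svd = TruncatedSVD(n_components=n_components, algorithm="randomized",
                       n_iter=100, random_state=100)
    svd_matrix = svd.fit_transform(tfidf_matrix)  # dense ndarray, so no .todense() is needed

    # Scale each document vector to unit length.
    svd_matrix = Normalizer(copy=False).fit_transform(svd_matrix)

    # Work on a copy so the caller's DataFrame is left unchanged.
    result = data.copy()
    result["cleaned_vectorized_document"] = pd.Series(svd_matrix.tolist(), index=result.index)
    return result


p = pd.DataFrame({'two': [1, 2, 3, 4],
                  'test': ['I ate dinner at Olive Garden', 'we are buying a house',
                           'I did not eat dinner at Olive Garden',
                           'our neighbors are buying a house']})
print(lsa(p))
```

Note that fit_transform on TruncatedSVD returns a dense NumPy array, so the .todense() call from the question is dropped here, and the Normalizer is applied to the SVD output rather than to the raw text.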