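In the code below, I perform sentiment analysis on a Twitter dataset. I use a pipeline that carries out the following steps:
1) Perform some basic text preprocessing
2) Vectorize the tweet text
3) Add an extra feature (text length)
4) Classify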
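I want to add one more feature: the scaled number of followers. I wrote a function that takes the whole dataframe (df) as input and returns a new dataframe with the follower count scaled. However, I am finding it challenging to add this step to the pipeline, i.e. to combine this feature with the other features using an sklearn pipeline.
Any help or advice on this problem would be much appreciated.
The question and code below are derived from Ryan's post: pipelines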
import nltk
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
def import_data(filename, sep, eng, header=None, skiprows=1):
    # read csv
    dataset = pd.read_csv(filename, sep=sep, engine=eng, header=header, skiprows=skiprows)
    # rename columns
    dataset.columns = ['text', 'followers', 'sentiment']
    return dataset
df = import_data('apple_v3.txt','\t','python')
X, y = df.text, df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y)
tokenizer = nltk.casual.TweetTokenizer(preserve_case=False, reduce_len=True)
count_vect = CountVectorizer(tokenizer=tokenizer.tokenize)
classifier = LogisticRegression()
def get_scalled_followers(df):
    scaler = MinMaxScaler()
    df[['followers']] = df[['followers']].astype(float)
    df[['followers']] = scaler.fit_transform(df[['followers']])
    followers = df['followers'].values
    followers_reshaped = followers.reshape((len(followers), 1))
    return df
def get_tweet_length(text):
    return len(text)
import numpy as np
def genericize_mentions(text):
    return re.sub(r'@[\w_-]+', 'thisisanatmention', text)
def reshape_a_feature_column(series):
    return np.reshape(np.asarray(series), (len(series), 1))
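from sklearn.preprocessing import FunctionTransformer
def pipelinize_feature(function, active=True):
    def list_comprehend_a_function(list_or_series, active=True):
        if active:
            # apply the function to every item and return a single column
            processed = [function(i) for i in list_or_series]
            processed = reshape_a_feature_column(processed)
            return processed
        else:
            # feature switched off: return a column of zeros
            return reshape_a_feature_column(np.zeros(len(list_or_series)))
    # wrap the inner function in a transformer so it can be used as a pipeline step
    return FunctionTransformer(list_comprehend_a_function, validate=False, kw_args={'active': active})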
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn_helpers import pipelinize, genericize_mentions, train_test_and_evaluate
sentiment_pipeline = Pipeline([
    ('genericize_mentions', pipelinize(genericize_mentions, active=True)),
    ('features', FeatureUnion([
        ('vectorizer', count_vect),
        ('post_length', pipelinize_feature(get_tweet_length, active=True))
    ])),
    ('classifier', classifier)
])
sentiment_pipeline, confusion_matrix = train_test_and_evaluate(sentiment_pipeline, X_train, y_train, X_test, y_test)
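You can use FeatureUnion to combine features extracted from different columns of the dataframe. You should feed the whole dataframe to the pipeline and use FunctionTransformer to extract specific columns. It might look like this (I haven't run it, so there may be some errors). Another solution is to not use Pipeline at all and simply stack the features together with np.hstack.
The best explanation I have found so far is in the following post: pipelines
My data includes heterogeneous features, and the following step-by-step approach works well and is easy to understand: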
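The approach described in the answers above can be sketched roughly as follows. This is not the answerers' original code, only a minimal illustration under some assumptions: `toy_df` is made-up data with the same column names as the question's dataframe, and `select_text`, `select_followers`, `text_length`, and `followers_pipeline` are hypothetical names introduced here. Each branch of the FeatureUnion pulls its own column out of the dataframe with a FunctionTransformer, and the follower count is scaled inside the pipeline. The last few lines show the np.hstack alternative; since CountVectorizer returns a sparse matrix, the sketch converts it to a dense array before stacking.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

# Hypothetical toy data with the same columns as the question's dataframe.
toy_df = pd.DataFrame({
    "text": ["love my new iphone", "apple support is awful",
             "great update from apple", "worst battery ever"],
    "followers": [10, 2500, 40, 900],
    "sentiment": [1, 0, 1, 0],
})

def select_text(frame):
    # 1-D series of strings, the input CountVectorizer expects
    return frame["text"]

def select_followers(frame):
    # 2-D array of shape (n_samples, 1), the input MinMaxScaler expects
    return frame[["followers"]].to_numpy(dtype=float)

def text_length(frame):
    # One numeric column holding the character count of each tweet
    return frame["text"].str.len().to_numpy(dtype=float).reshape(-1, 1)

followers_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("words", Pipeline([
            ("select", FunctionTransformer(select_text, validate=False)),
            ("vectorize", CountVectorizer()),
        ])),
        ("post_length", FunctionTransformer(text_length, validate=False)),
        ("followers", Pipeline([
            ("select", FunctionTransformer(select_followers, validate=False)),
            ("scale", MinMaxScaler()),
        ])),
    ])),
    ("classifier", LogisticRegression()),
])

# The whole dataframe goes in; each branch picks the column it needs.
followers_pipeline.fit(toy_df, toy_df["sentiment"])
predictions = followers_pipeline.predict(toy_df)
feature_matrix = followers_pipeline.named_steps["features"].transform(toy_df)

# The np.hstack alternative: build each block by hand, then stack them.
# CountVectorizer returns a sparse matrix, so it is densified first here.
word_counts = CountVectorizer().fit_transform(toy_df["text"]).toarray()
scaled_followers = MinMaxScaler().fit_transform(select_followers(toy_df))
stacked = np.hstack([word_counts, text_length(toy_df), scaled_followers])
```

To use something like this with the question's data, X would presumably need to be the dataframe itself (for example `df[['text', 'followers']]`) rather than `df.text`, so that both columns reach the pipeline after `train_test_split`. Keeping MinMaxScaler inside the pipeline also means it is fitted on the training split only.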