How to combine numeric and text data in scikit-learn

2 Answers

User

Answer 1 · edited 2024-05-16 09:14:33

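You can use the "ItemSelector" from the example linked below to apply a different processing method to each column of your data, then combine everything with sklearn's FeatureUnion.

I think you mean that you want to treat the price as categorical data? I would convert the numbers to strings, then use CountVectorizer with the binary flag set to True.

This is a very useful example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py

Edit: also, the count vectorizer strips out periods, so it is best to call it with a no-op tokenizer like this. There may be a more elegant solution in fewer lines, but this works.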

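from sklearn.feature_extraction.text import CountVectorizer

def no_tokenizer(t):
    # Return the whole string as a single token so periods are kept.
    return [t]

# binary=True, as suggested above, records presence rather than counts.
CountVectorizer(binary=True, tokenizer=no_tokenizer)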
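To make the idea concrete, here is a minimal sketch of that vectorizer applied to prices that were first converted to strings. The price values are made up for illustration; only CountVectorizer and the no_tokenizer helper come from the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer


def no_tokenizer(t):
    # Treat the whole string as one token, so "3.5" is not split at the period.
    return [t]


# Hypothetical prices, converted to strings so each distinct value is a category.
prices = [10.0, 3.5, 0.99, 3.5]
price_strings = [str(p) for p in prices]

vectorizer = CountVectorizer(binary=True, tokenizer=no_tokenizer)
X = vectorizer.fit_transform(price_strings)

# One column per distinct price string, one row per sample.
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Each distinct price becomes its own indicator column, which is effectively a one-hot encoding of the price. Recent scikit-learn versions may warn that token_pattern is unused when a tokenizer is supplied; that warning is harmless here.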

Edit: below is a pipeline example that uses ItemSelector. I have a pandas DataFrame, and I pass a key to choose whether the text or the numeric data is returned.

[code block not preserved in the source]
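The following is a rough sketch of what such a pipeline could look like, not the answerer's actual code. The ItemSelector class is modeled on the one in the linked scikit-learn example; the column names, sample rows, and the choice of StandardScaler for the numeric branch are assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ItemSelector(TransformerMixin, BaseEstimator):
    """Select DataFrame column(s) by key (modeled on the linked example)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.key]


# Hypothetical data: one text column and one numeric column.
df = pd.DataFrame({
    "name": ["whole milk", "salted butter", "free range eggs"],
    "cost": [10.0, 3.5, 0.99],
})

features = FeatureUnion([
    # A string key returns a Series of documents for the vectorizer.
    ("text", Pipeline([
        ("select", ItemSelector("name")),
        ("vectorize", CountVectorizer()),
    ])),
    # A list key returns a 2-D DataFrame, which the scaler expects.
    ("numeric", Pipeline([
        ("select", ItemSelector(["cost"])),
        ("scale", StandardScaler()),
    ])),
])

X = features.fit_transform(df)
print(X.shape)  # rows x (vocabulary size + 1 numeric column)
```

The FeatureUnion stacks the outputs of both branches side by side, so a downstream estimator sees a single feature matrix. Because the text branch is sparse, the combined result is a sparse matrix as well.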

User

Answer 2 · edited 2024-05-16 09:14:33

You can use one-hot encoding with pandas.

import pandas as pd

# You can load the data with pd.read_csv().
# Or, if the data is in a NumPy array, use pd.DataFrame(data) and pass the column names to the columns parameter as a list of strings.

# For this example,
df = pd.DataFrame({'name': ['milk', 'butter', 'eggs'],
                   'cost': [10, 3.50, 0.99]})

print(df)

     name   cost
0    milk  10.00
1  butter   3.50
2    eggs   0.99

# One-hot encode the name column; dtype=int gives 0/1 columns (pandas 2.x defaults to booleans).
df = pd.get_dummies(data=df, columns=['name'], dtype=int)
print(df)

    cost  name_butter  name_eggs  name_milk
0  10.00            0          0          1
1   3.50            1          0          0
2   0.99            0          1          0

You can then get the dataset as a NumPy array via df.values.

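As a small illustration of that last step, the sketch below rebuilds the same example frame, encodes it, and extracts the array. Using dtype=float and to_numpy() are choices made here for a uniform numeric array; they are not taken from the answer.

```python
import pandas as pd

df = pd.DataFrame({"name": ["milk", "butter", "eggs"],
                   "cost": [10, 3.50, 0.99]})

# dtype=float keeps the dummy columns numeric regardless of pandas version.
encoded = pd.get_dummies(df, columns=["name"], dtype=float)

# to_numpy() returns the same array as the .values attribute.
X = encoded.to_numpy()

print(encoded.columns.tolist())
print(X)
```

The resulting array has one row per item and one column for the cost plus one per distinct name, and can be passed directly to a scikit-learn estimator.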
