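I'm using sklearn's train_test_split function to split my training and test data. After splitting the data and running the classifier, I need to be able to trace the feature and label values back to the original data records. How can I do this? Is there a way to include some kind of hidden id feature that the classifier ignores?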
import json
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
json_data = r"""
[
{ "id": 101, "label": 1, "f1": 1, "f2":2, "f3": 3 },
{ "id": 653, "label": 0, "f1": 2, "f2":7, "f3": 8 },
{ "id": 219, "label": 0, "f1": 4, "f2":9, "f3": 2 },
{ "id": 726, "label": 1, "f1": 6, "f2":1, "f3": 0 },
{ "id": 403, "label": 0, "f1": 1, "f2":5, "f3": 4 }
]"""
data = json.loads(json_data)
feature_names = ["f1", "f2", "f3"]
labels = []
features = []
for item in data:
    temp_list = []
    labels.append(item["label"])
    for feature_name in feature_names:
        temp_list.append(item[feature_name])
    features.append(temp_list)
labels_train, labels_test, features_train, features_test = train_test_split(labels, features, test_size = .20, random_state = 99)
print(labels_test)
print(features_test)
## this will give us labels_test = [0], features_test = [[4,9,2]] which corresponds to record with id = 219
## how can I efficiently correlate the split data back to the original records without comparing feature values?
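I usually store the input data in a Pandas DataFrame and do the train/test split on that, so every row in either split keeps the index of its original record. The same approach works for your example.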
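The code that originally accompanied this answer is not preserved here, so the following is only a minimal sketch of the idea, not the answerer's exact code. It assumes the record "id" is unique and can serve as the DataFrame index, and it uses a DecisionTreeClassifier purely as a stand-in for whatever classifier you run. The variable names (df, X_train, results, and so on) are illustrative.

```python
import json

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # stand-in classifier (assumption)

json_data = """
[
  {"id": 101, "label": 1, "f1": 1, "f2": 2, "f3": 3},
  {"id": 653, "label": 0, "f1": 2, "f2": 7, "f3": 8},
  {"id": 219, "label": 0, "f1": 4, "f2": 9, "f3": 2},
  {"id": 726, "label": 1, "f1": 6, "f2": 1, "f3": 0},
  {"id": 403, "label": 0, "f1": 1, "f2": 5, "f3": 4}
]"""

feature_names = ["f1", "f2", "f3"]

# Make the record id the index: it travels with every row,
# but it is not one of the feature columns the classifier sees.
df = pd.DataFrame(json.loads(json_data)).set_index("id")

# Splitting pandas objects preserves their index in each part.
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_names], df["label"], test_size=0.20, random_state=99
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Attach predictions to the original ids by reusing the test index.
results = pd.DataFrame(
    {"label": y_test, "predicted": clf.predict(X_test)}, index=X_test.index
)

print(results)                 # one row per test record, keyed by id
print(df.loc[X_test.index])    # the full original records behind the test split
```

If you would rather keep plain lists, train_test_split accepts any number of equal-length arrays, so passing the list of ids as a third argument splits it in step with the labels and features.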