How to convert data from an Excel spreadsheet into a suitable representation for training a scikit-learn mod

    # Imports assumed from the calls used (xlrd, xlutils); the original snippet omitted them.
    from xlrd import open_workbook
    from xlutils.copy import copy

    rb = open_workbook('subjectcat.xlsx')  # C:/Users/5460/Desktop/
    wb = copy(rb)  # making a copy
    sheet = rb.sheet_by_index(0)

    data = ()
    for row_index in range(1, 500):  # train using 500
        temp, add = (), ()
        subject, cat = 0, 0  # trial
        for col_index in range(1, 3):
            if col_index == 1:
                # Column 1 holds the subject text.
                subject = sheet.cell(row_index, col_index).value
                subject = "'" + subject
            elif col_index == 2:
                # Column 2 holds the category; build the (subject, cat) pair.
                cat = sheet.cell(row_index, col_index).value
                cat = "'" + cat + "'"
                add = add + (subject, cat)
            # Appended once per column, so every other entry is an empty tuple.
            data = data + (add,)
    print('done')

    training_data = list(data)
    training_data = training_data[1:][::2]  # removing the even items

1 answer

User

#1 · Posted on 2024-04-25 01:17:15

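Wrap your input data as a 2D numpy array: one row per example / instance / observation. The columns of the array should store the numerical descriptors (features) of the samples.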

You need to store the output / target classes as another array of integers. Each target class should be assigned an integer (e.g. 0 for "ham", 1 for "spam").

The output / target classes array should have as many entries as there are rows in the input data (one label per example).
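As a rough illustration of that layout, here is a minimal sketch. The feature values, the labels and the names `rows`, `labels` and `class_to_int` are made up for the example and do not come from the question's spreadsheet.

```python
import numpy as np

# Hypothetical toy data: two numeric features per example.
rows = [[0.5, 3.0], [1.5, 0.0], [0.2, 4.0], [2.0, 1.0]]
labels = ["ham", "spam", "ham", "spam"]

# Input array: one row per example, one column per feature.
X = np.asarray(rows, dtype=float)

# Target array: each class name is mapped to an integer.
class_to_int = {"ham": 0, "spam": 1}
y = np.asarray([class_to_int[label] for label in labels], dtype=int)

# There must be exactly one label per row of X.
assert X.shape[0] == y.shape[0]
print(X.shape, y.shape)
```

Arrays shaped like `X` and `y` are what scikit-learn estimators expect as the arguments to `fit(X, y)`.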

If you don't know how to convert Python lists to numpy arrays, read the numpy documentation. You can start here:

http://docs.scipy.org/doc/numpy/user/basics.creation.html

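To get good predictive accuracy with Support Vector Machines, you also need to make sure that your features are meaningful (e.g. don't encode categorical input features as strings or integers; use a one-hot-encoded feature expansion instead) and that the data is normalized: centered and scaled to unit variance. In particular, have a look at: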

http://scikit-learn.org/stable/modules/preprocessing.html
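A possible sketch of both preprocessing steps, assuming one categorical column and two numeric columns. The data and variable names are invented for illustration, and `OneHotEncoder` and `StandardScaler` are just one way to do this with scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one categorical column and two numeric columns.
categories = np.array([["red"], ["green"], ["blue"], ["green"]])
numeric = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0], [4.0, 200.0]])

# One-hot expansion: one 0/1 column per distinct category value.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(categories).toarray()

# Center each numeric column and scale it to unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(numeric)

# Final feature matrix: one-hot columns followed by the scaled columns.
X = np.hstack([onehot, scaled])
print(X.shape)
```

In practice the encoder and scaler would be fitted on the training rows only and then reused, via `transform`, on any data the model is later asked to predict.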

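Edit: I had not seen your last sentence: if your input data is raw email text, you have to extract features (numerical descriptors that statistically summarize the content of the emails). In that case you need to extract text features: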

http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction
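For example, a minimal sketch of text feature extraction feeding a linear SVM could look like the following. The four-message corpus and its labels are hypothetical, and the choice of `TfidfVectorizer` with `LinearSVC` is an assumption made for illustration, not something the question specifies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; real training data would come from the spreadsheet.
emails = [
    "win a free prize now",
    "cheap pills free offer",
    "meeting agenda for monday",
    "lunch with the project team",
]
y = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# The vectorizer turns each text into a row of TF-IDF weights;
# the linear SVM is then trained on that numeric matrix.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(emails, y)

# Inspect the extracted feature matrix: one row per email, one column per word.
X_text = model.named_steps["tfidfvectorizer"].transform(emails)
print(X_text.shape)

# Predict the class of a new, unseen message.
prediction = model.predict(["free prize offer"])
print(prediction)
```

Wrapping both steps in a pipeline means new text is vectorized with the same vocabulary that was learned during training.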
