How can I handle missing_values='?' with sklearn?
Suppose I have a .txt file whose content is
2,3,4,?,5
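I want to replace the missing value '?' with the mean of the other values. Any good ideas? And if it is a list of strings, I want to replace '?' with the most frequent string, which here would be 'a':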
'a','b','c','?','a','a'
I have tried a few approaches, but none of them worked. I started with
import numpy as np
from sklearn.preprocessing import Imputer
row = np.genfromtxt('a.txt',missing_values='?',dtype=float,delimiter=',',usemask=True)
# This gives row = [2 3 4 -- 5]. As far as I can tell, it uses filling_values=-1 for the missing entry,
# but if I add filling_values=np.nan, it raises the error 'cannot convert float into int'.
imp = Imputer(missing_values=-1, strategy='mean')
imp.fit_transform(row)
# This gives array([2., 3., 4., 5.]), so the missing value was not replaced by the mean.
If I could replace '?' with np.nan, I think I could get this to work.
1 Answer
I cannot reproduce the error you mention, 'cannot convert float into int'.
Try this:
>>> row = np.genfromtxt('a.txt',missing_values='?',dtype=float,delimiter=',')
>>> np.mean(row[~np.isnan(row)])
3.5
>>> mean = np.mean(row[~np.isnan(row)])
>>> row[np.isnan(row)] = mean
>>> row
array([ 2. , 3. , 4. , 3.5, 5. ])
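The same replacement can also be written without boolean indexing. The following is only a minimal sketch: it assumes the file content is the single line shown in the question and feeds it through io.StringIO as a stand-in for a.txt, and the names text and filled are illustrative.

```python
import io
import numpy as np

# Stand-in for a.txt so the sketch runs without a file on disk (assumption).
text = io.StringIO("2,3,4,?,5")

# With dtype=float and no usemask, the '?' field comes back as nan.
row = np.genfromtxt(text, delimiter=',', dtype=float, missing_values='?')

# np.nanmean ignores nan entries; np.where builds a filled copy of row.
filled = np.where(np.isnan(row), np.nanmean(row), row)
print(filled)
```

Once the gaps are np.nan, an imputer can also handle them. Note that Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; its replacement is sklearn.impute.SimpleImputer, which expects a 2-D array, so a single row like this one would have to be reshaped first.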
Addendum
If you want to work with strings, here is a solution using a plain list.
>>> row = ['a','b','c','?','c','b','?','?','b']
>>> from collections import Counter
>>> letter_counts = Counter(letter for letter in row if letter != '?')
>>> letter_counts.most_common(1)
[('b', 3)]
>>> most_common_letter = letter_counts.most_common(1)[0][0]
>>> [letter if letter != '?' else most_common_letter
... for letter in row]
['a', 'b', 'c', 'b', 'c', 'b', 'b', 'b', 'b']
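The steps above can be wrapped in a small helper. This is a sketch rather than a definitive implementation: fill_with_mode and its missing parameter are hypothetical names, and when several values are tied for most frequent it simply takes whichever one Counter.most_common returns first.

```python
from collections import Counter

def fill_with_mode(values, missing='?'):
    """Return a copy of values with each missing marker replaced by the most common other value."""
    counts = Counter(v for v in values if v != missing)
    if not counts:
        # Nothing but missing markers: there is no value to fill with.
        return list(values)
    mode = counts.most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

# The string list from the question.
print(fill_with_mode(['a', 'b', 'c', '?', 'a', 'a']))
```

For the list in the question, the '?' entry is replaced by 'a', as requested.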