Implementation of state-of-the-art distance metrics from research papers which can handle mixed-type data and missing values.
Detailed description of the distython Python project
Distance
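Implementation of state-of-the-art distance metrics from research papers which can handle mixed-type data and missing values. At the moment, HEOM, HVDM and VDM are tested and working. VDM and HVDM were released only recently, so please report any bugs you find. Feel free to help and contribute to the project, as existing implementations of these distance metrics are scarce.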
Installation
Clone the repository using git clone.
Install the dependencies using pipenv install.
Example: HEOM
# Example code of how the HEOM metric can be used together with Scikit-Learn
# Note: load_boston was removed in scikit-learn 1.2, so this example
# needs an older scikit-learn release to run as written.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import load_boston

# Importing a custom metric class
from HEOM import HEOM

# Load the dataset from sklearn
boston = load_boston()
boston_data = boston["data"]

# Categorical variables in the data
categorical_ix = [3, 8]

# The problem here is that NearestNeighbors can't handle np.nan
# So we have to set up the NaN equivalent
nan_eqv = 12345

# Introduce some missingness to the data for the purpose of the example
row_cnt, col_cnt = boston_data.shape
for i in range(row_cnt):
    for j in range(col_cnt):
        rand_val = np.random.randint(20, size=1)
        if rand_val == 10:
            boston_data[i, j] = nan_eqv

# Declare the HEOM with a correct NaN equivalent value
heom_metric = HEOM(boston_data, categorical_ix, nan_equivalents=[nan_eqv])

# Declare NearestNeighbor and link the metric
neighbor = NearestNeighbors(metric=heom_metric.heom)

# Fit the model which uses the custom distance metric
neighbor.fit(boston_data)

# Return 5-Nearest Neighbors to the 1st instance (row 1)
result = neighbor.kneighbors(boston_data[0].reshape(1, -1), n_neighbors=5)
print(result)
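To make clearer what such a metric computes, here is a minimal sketch of HEOM as it is defined in the paper linked in this README: a per-attribute distance of 1 when either value is missing, a 0/1 overlap for categorical attributes, and a range-normalised absolute difference for numeric ones, combined as the square root of the sum of squares. It is not the library's own code; the function name `heom_distance` and its parameters are hypothetical, and it may differ from the package in details.

```python
import numpy as np


def heom_distance(x, y, categorical_ix, ranges, nan_eqv):
    """Illustrative HEOM between two rows (hypothetical helper, not the library API).

    ranges[a] is assumed to hold max - min of attribute a over the dataset.
    nan_eqv is the placeholder value that marks a missing entry.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    d = np.zeros(len(x))
    for a in range(len(x)):
        if x[a] == nan_eqv or y[a] == nan_eqv:
            # A missing value on either side gives the maximal distance of 1
            d[a] = 1.0
        elif a in categorical_ix:
            # Overlap metric: 0 if the categories match, 1 otherwise
            d[a] = 0.0 if x[a] == y[a] else 1.0
        else:
            # Numeric attribute: absolute difference normalised by the range
            d[a] = abs(x[a] - y[a]) / ranges[a] if ranges[a] > 0 else 0.0
    return float(np.sqrt(np.sum(d ** 2)))


# Attribute 1 is categorical; 12345 marks a missing value
print(heom_distance([2.0, 0.0, 5.0], [7.0, 0.0, 5.0], [1], [10.0, 1.0, 10.0], 12345))
```

Because every per-attribute term lies in [0, 1], numeric and categorical attributes contribute on a comparable scale, which is what lets the metric mix both types.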
Research papers
The code is based on the following literature: HEOM, VDM and HVDM: https://arxiv.org/pdf/cs/9701101.pdf
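As a rough illustration of the VDM from the same paper, the distance between two values of one categorical attribute can be sketched as the sum, over output classes, of the difference in conditional class probabilities raised to a power q. The helper below is hypothetical (name, signature and the default q=2 are assumptions for illustration) and is not taken from the package.

```python
import numpy as np


def vdm_attribute(values, labels, x, y, q=2):
    """Illustrative per-attribute VDM (hypothetical helper, not the library API).

    values: the attribute's value for every training row
    labels: the class label for every training row
    x, y:   the two attribute values to compare; both are assumed to occur
            in `values`, otherwise the class probabilities are undefined
    """
    values = np.asarray(values)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        # P(class = c | attribute = x), estimated from counts
        p_x = np.mean(labels[values == x] == c)
        # P(class = c | attribute = y), estimated from counts
        p_y = np.mean(labels[values == y] == c)
        total += abs(p_x - p_y) ** q
    return float(total)


# "a" always goes with class 0 and "b" with class 1, so the values are far apart
print(vdm_attribute(["a", "a", "b", "b"], [0, 0, 1, 1], "a", "b"))
```

Two values are considered close when they are associated with similar class distributions, regardless of how the values themselves are labelled.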