Python - Calculating distances with different data types in Python
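I have a dataset with 11 attributes, and I want to calculate the distance between these attributes. Say the attributes are (x1, x2, ..., x11), where x1 and x2 are nominal (categorical), x3, x4, ... x10 are ordinal (ranked), and x11 is binary (only two states). How can I read these attributes in Python? And how can I distinguish between them in Python so that I can calculate the distance between them? Can anyone tell me how to do this? Thanks!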
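Sample data:
x1 (forestry, plantation, other, forestry)
x2 (plantation, plantation, shrubs, forest)
x3 (high, high, medium, low)
x4 (low, medium, high, high)
x5 (high, low, medium, high)
x6 (medium, low, high, medium)
x7 (3, 1, 0, 4)
x8 (low, low, high, medium)
x9 (297, 298, 299, 297)
x10 (1, 2, 0, 4)
x11 (true, true, true, false)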
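As a starting point for "reading" the attributes, here is a minimal sketch that holds the sample data column by column, tags each attribute with the kind stated in the question, and transposes the columns into one tuple per record. The names `columns`, `kinds`, `names`, and `rows` are illustrative assumptions, not part of the original post.

```python
# Sample data from the question, stored column-wise (one list per attribute).
columns = {
    "x1": ["forestry", "plantation", "other", "forestry"],
    "x2": ["plantation", "plantation", "shrubs", "forest"],
    "x3": ["high", "high", "medium", "low"],
    "x4": ["low", "medium", "high", "high"],
    "x5": ["high", "low", "medium", "high"],
    "x6": ["medium", "low", "high", "medium"],
    "x7": [3, 1, 0, 4],
    "x8": ["low", "low", "high", "medium"],
    "x9": [297, 298, 299, 297],
    "x10": [1, 2, 0, 4],
    "x11": [True, True, True, False],
}

# Attribute names in order: x1 .. x11.
names = ["x%d" % i for i in range(1, 12)]

# Kind of each attribute, as described in the question (hypothetical labels).
kinds = {name: "ordinal" for name in names}
kinds["x1"] = kinds["x2"] = "nominal"
kinds["x11"] = "binary"

# Transpose the columns so that each row is one record with 11 values.
rows = list(zip(*(columns[name] for name in names)))

for row in rows:
    print(row)
```

Keeping the kind of each attribute next to the data makes it possible to pick a different per-attribute comparison later, which is what the answers below build toward.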
2 Answers
0
You can do it like this:
    def distance(x, y):
        # fraction of positions at which the two records differ
        p = len(x)
        m = sum(1 for a, b in zip(x, y) if a == b)  # number of matching attributes
        return float(p - m) / p
For example:
    x1 = ("forestry", "plantation", "high", "low", "high", "medium", 3, "low", 297, 1, True)
    x2 = ("plantation", "plantation", "high", "medium", "low", "low", 1, "low", 298, 2, True)
    print(distance(x1, x2))  # result: about 0.636 = (11 - 4) / 11
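To extend this to the whole sample, the following sketch applies the same matching-based distance to every pair of the four records from the question. The record labels `a` to `d` and the `pairwise` dictionary are assumptions made for illustration. Note that this measure treats every attribute as nominal, so the ordering of the ordinal attributes is ignored.

```python
from itertools import combinations

def distance(x, y):
    """Fraction of positions at which two equal-length records differ."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)  # matching attributes
    return float(p - m) / p

# The four records of the question's sample data (labels are hypothetical).
records = {
    "a": ("forestry", "plantation", "high", "low", "high", "medium", 3, "low", 297, 1, True),
    "b": ("plantation", "plantation", "high", "medium", "low", "low", 1, "low", 298, 2, True),
    "c": ("other", "shrubs", "medium", "high", "medium", "high", 0, "high", 299, 0, True),
    "d": ("forestry", "forest", "low", "high", "high", "medium", 4, "medium", 297, 4, False),
}

# Distance for every unordered pair of records.
pairwise = {
    (i, j): distance(records[i], records[j])
    for i, j in combinations(sorted(records), 2)
}

for (i, j), dist in sorted(pairwise.items()):
    print("{} to {}: {:.3f}".format(i, j, dist))
```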
0
I rewrote this as follows:
First, I created a type factory called "Nominal":
    class BaseNominalType:
        name_values = {}   # <= subclass must override this

        def __init__(self, name):
            self.name = name
            self.value = self.name_values[name]

        def __str__(self):
            return self.name

        def __sub__(self, other):
            assert type(self) == type(other), "Incompatible types, subtraction is undefined"
            return self.value - other.value

    # class factory function
    def make_nominal_type(name_values):
        try:
            # accept a mapping (or a sequence of pairs) of name -> value
            nv = dict(name_values)
        except ValueError:
            # otherwise treat it as a sequence of names and number them 0, 1, 2, ...
            nv = {item: i for i, item in enumerate(name_values)}

        # make custom type
        class MyNominalType(BaseNominalType):
            name_values = nv
        return MyNominalType
Now I can define your nominal types,
Forest = make_nominal_type(["shrubs", "plantation", "forestry", "other"])
Level = make_nominal_type(["low", "medium", "high"])
Bool = make_nominal_type({"f":False, "t":True})
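To show how these generated types behave, here is a small self-contained sketch that repeats the factory and then subtracts a few values. The variables `level_gap`, `forest_gap`, and `bool_gap` are illustrative names only. For truly nominal types such as `Forest`, the numeric gap depends on the arbitrary order of the list, so only "zero or non-zero" is meaningful there.

```python
class BaseNominalType:
    name_values = {}  # overridden by each generated subclass

    def __init__(self, name):
        self.name = name
        self.value = self.name_values[name]  # KeyError for unknown names

    def __str__(self):
        return self.name

    def __sub__(self, other):
        assert type(self) == type(other), "Incompatible types, subtraction is undefined"
        return self.value - other.value

def make_nominal_type(name_values):
    try:
        nv = dict(name_values)  # mapping of name -> value
    except ValueError:
        nv = {item: i for i, item in enumerate(name_values)}  # names numbered 0, 1, 2, ...

    class MyNominalType(BaseNominalType):
        name_values = nv
    return MyNominalType

Forest = make_nominal_type(["shrubs", "plantation", "forestry", "other"])
Level = make_nominal_type(["low", "medium", "high"])
Bool = make_nominal_type({"f": False, "t": True})

level_gap = Level("high") - Level("low")            # positions 2 and 0
forest_gap = Forest("forestry") - Forest("shrubs")  # positions 2 and 0, order is arbitrary
bool_gap = Bool("t") - Bool("f")                    # True - False

print(level_gap, forest_gap, bool_gap)
print(str(Level("medium")))

# Subtracting values of different generated types trips the assertion.
try:
    Level("high") - Forest("other")
    mixed_ok = True
except AssertionError:
    mixed_ok = False
print("mixed subtraction allowed:", mixed_ok)
```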
Next, I created a "MixedVector" type factory:
    # base class
    class BaseMixedVectorType:
        types = []          # <= subclass must
        distance_fn = None  # <= override these

        def __init__(self, values):
            # convert each raw value with the type declared for its position
            self.values = [type_(value) for type_, value in zip(self.types, values)]

        def dist(self, other):
            # per-attribute absolute differences, reduced by the pluggable distance function
            return self.distance_fn([abs(s - o) for s, o in zip(self.values, other.values)])

    # class factory function
    def make_mixed_vector_type(types, distance_fn):
        tl = list(types)
        df = distance_fn

        class MyVectorType(BaseMixedVectorType):
            types = tl
            distance_fn = df
        return MyVectorType
Then create your data type,
    # your mixed-vector type
    DataItem = make_mixed_vector_type(
        [Forest, Forest, Level, Level, Level, Level, int, Level, int, int, Bool],
        ???   # have to define an appropriate distance function!
    )
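...wait, we haven't defined a distance function yet! I wrote the class so that you can plug in any distance function you like, in the following form: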
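    # stored as class attributes and called as methods, so "_" receives the (unused) instance
    def manhattan_dist(_, vector):
        return sum(vector)

    def euclidean_dist(_, vector):
        return sum(v*v for v in vector) ** 0.5

    # the distance function per your description:
    # note: as written this is the fraction of attributes that match (difference 0),
    # so 1 minus this value is the mismatch fraction computed in the other answer
    def fractional_match_distance(_, vector):
        return float(sum(not v for v in vector)) / len(vector)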
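To make the three functions concrete, this sketch applies them to one vector of absolute differences. The vector `diffs` was worked out by hand for the first two sample records using the `Forest`, `Level`, and `Bool` encodings above; the variable names are assumptions. The functions are called directly here, so `None` is passed for the unused first parameter.

```python
def manhattan_dist(_, vector):
    # sum of absolute per-attribute differences
    return sum(vector)

def euclidean_dist(_, vector):
    # square root of the sum of squared differences
    return sum(v * v for v in vector) ** 0.5

def fractional_match_distance(_, vector):
    # fraction of attributes whose difference is 0, i.e. that match
    return float(sum(not v for v in vector)) / len(vector)

# Absolute per-attribute differences between the first two sample records:
# forestry/plantation, plantation/plantation, high/high, low/medium, high/low,
# medium/low, 3/1, low/low, 297/298, 1/2, t/t
diffs = [1, 0, 0, 1, 2, 1, 2, 0, 1, 1, 0]

manhattan = manhattan_dist(None, diffs)
euclidean = euclidean_dist(None, diffs)
match_fraction = fractional_match_distance(None, diffs)
mismatch_fraction = 1.0 - match_fraction

print("manhattan:", manhattan)
print("euclidean:", euclidean)
print("match fraction:", match_fraction)
print("mismatch fraction:", mismatch_fraction)
```

Four of the eleven differences are zero, so the match fraction is 4/11 and the mismatch fraction is 7/11, the same value the simpler `distance` function in the other answer returns for these two records.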
So we can finish creating the type:
    # your mixed-vector type
    DataItem = make_mixed_vector_type(
        [Forest, Forest, Level, Level, Level, Level, int, Level, int, int, Bool],
        fractional_match_distance
    )
and test it as follows:
    def main():
        raw_data = [
            ('forestry', 'plantation', 'high', 'low', 'high', 'medium', 3, 'low', 297, 1, 't'),
            ('plantation', 'plantation', 'high', 'medium', 'low', 'low', 1, 'low', 298, 2, 't'),
            ('other', 'shrubs', 'medium', 'high', 'medium', 'high', 0, 'high', 299, 0, 't'),
            ('forestry', 'forestry', 'low', 'high', 'high', 'medium', 4, 'medium', 297, 4, 'f')
        ]
        a, b, c, d = [DataItem(row) for row in raw_data]
        print("a to b, dist = {}".format(a.dist(b)))
        print("b to c, dist = {}".format(b.dist(c)))
        print("c to d, dist = {}".format(c.dist(d)))

    if __name__ == "__main__":
        main()
This gives us:
a to b, dist = 0.363636363636
b to c, dist = 0.0909090909091
c to d, dist = 0.0909090909091