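Creating a histogram in Python
I wrote a Python function that computes the histogram of some data. It takes a parameter, bins, which sets how many bins the data range is divided into.
My code is below; the data can be found at this link: https://gist.github.com/mesarvagya/11367012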
import numpy as np

def histogram_using_numpy(filename, bins=10):
    datas = np.loadtxt(filename, delimiter=" ", usecols=(0,))
    hist, bin_edges = np.histogram(datas, bins)
    return hist

print("from numpy %s" % histogram_using_numpy("ex.txt", bins=10))

def histogram_using_list(filename, bins=10, take_col=0):
    f = open(filename, "r")
    data = []
    for item in f.readlines():
        data.append(float(item.split()[take_col]))
    f.close()
    mi, ma = min(data), max(data)
    bin_length = (ma - mi) / bins

    def get_count(lis, low, diff):
        count = 0
        for item in lis:
            if item >= low and item < low + diff:
                count += 1
        return count

    tot = []
    for i in np.arange(mi, ma, bin_length):
        tot.append(get_count(data, i, bin_length))
    return tot

print("From my function %s " % histogram_using_list("ex.txt", bins=10))
With bins = 10, the two functions give:
from numpy [10 19 20 28 15 16 14 11 5 12]
From my function [10, 19, 20, 28, 16, 15, 14, 11, 5, 12]
The two results do not match, so my function is wrong. With bins = 15, I get:
from numpy [ 7 4 18 19 5 24 8 10 13 6 13 6 5 1 11]
From my function [7, 4, 18, 19, 10, 19, 8, 10, 13, 10, 9, 6, 5, 1, 11]
These do not match either. Assuming NumPy is correct, what is wrong with my code?
1 Answer
3
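It looks like what your code is missing is that the last bin of NumPy's histogram is closed, meaning it includes both of its edges, while all the preceding bins are half-open. In your code, every bin is half-open. (Source: see "Notes" in the np.histogram documentation.)
If a bin is defined by its edges, binmin and binmax, then a value x is assigned to that bin when:
For the first n-1 bins: binmin <= x < binmax
For the last bin: binmin <= x <= binmax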
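To make the rule above concrete, here is a minimal sketch in plain Python. The helper name count_with_rule and the sample values are made up for illustration; they are not taken from the question's data.

```python
def count_with_rule(values, edges, close_last_bin):
    """Count values per bin; optionally treat the last bin as closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for x in values:
        for i in range(len(counts)):
            binmin, binmax = edges[i], edges[i + 1]
            in_half_open = binmin <= x < binmax
            on_closed_end = close_last_bin and i == last and x == binmax
            if in_half_open or on_closed_end:
                counts[i] += 1
                break
    return counts

# Hypothetical sample: the maximum value sits exactly on the last edge.
values = [1, 2, 3, 4]
edges = [1, 2, 3, 4]

half_open = count_with_rule(values, edges, close_last_bin=False)
closed_last = count_with_rule(values, edges, close_last_bin=True)
print(half_open)    # [1, 1, 1]: the maximum, 4, falls outside every bin
print(closed_last)  # [1, 1, 2]: the maximum is counted in the last bin
```

With every bin half-open, the maximum is silently dropped. The closed-last-bin variant follows the rule that the np.histogram "Notes" describe for edges [1, 2, 3, 4].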
Likewise, np.arange() works on a half-open interval, so in the code below I used np.linspace() instead.
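The difference between the two calls can be seen in a small sketch. The values of lo, hi, and bins here are illustrative only, chosen so that the floating-point results are exact.

```python
import numpy as np

lo, hi, bins = 0.0, 1.0, 4             # illustrative values, not the question's data
step = (hi - lo) / bins

starts = np.arange(lo, hi, step)       # half-open: the stop value is excluded
edges = np.linspace(lo, hi, bins + 1)  # both endpoints are included

print(starts.tolist())  # [0.0, 0.25, 0.5, 0.75]
print(edges.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
```

np.arange returns only the left edge of each bin, whereas np.linspace returns all bins + 1 edges, including the maximum. The NumPy documentation also suggests preferring np.linspace for non-integer steps, since the number of elements np.arange returns can then depend on floating-point rounding.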
Consider the following:
import numpy as np

def histogram_using_numpy(filename, bins=10):
    datas = np.loadtxt(filename, delimiter=" ", usecols=(0,))
    hist, bin_edges = np.histogram(datas, bins)
    return hist, bin_edges

def histogram_using_list(filename, bins=10, take_col=0):
    # Read one column of floats from a whitespace-separated file.
    data = []
    with open(filename, "r") as f:
        for item in f:
            data.append(float(item.split()[take_col]))
    mi, ma = min(data), max(data)

    def get_count(lis, binmin, binmax, inclusive_endpoint=False):
        # Count items in [binmin, binmax), or in [binmin, binmax]
        # when inclusive_endpoint is set (used for the last bin).
        count = 0
        for item in lis:
            if item >= binmin and item < binmax:
                count += 1
            elif inclusive_endpoint and item == binmax:
                count += 1
        return count

    # bins + 1 evenly spaced edges, including both mi and ma.
    bin_edges = np.linspace(mi, ma, bins + 1)
    tot = []
    # list() is needed on Python 3, where zip() returns an iterator without len().
    binlims = list(zip(bin_edges[0:-1], bin_edges[1:]))
    for i, (binmin, binmax) in enumerate(binlims):
        inclusive = (i == (len(binlims) - 1))
        tot.append(get_count(data, binmin, binmax, inclusive))
    return tot, bin_edges

nump_hist, nump_bin_edges = histogram_using_numpy("ex.txt", bins=15)
func_hist, func_bin_edges = histogram_using_list("ex.txt", bins=15)

print("Histogram:")
# tolist() converts NumPy integers to plain Python ints for printing.
print("  From numpy: %s" % nump_hist.tolist())
print("  From my function %s" % list(func_hist))
print("")
print("Bin Edges:")
print("  From numpy: %s" % nump_bin_edges)
print("  From my function %s" % func_bin_edges)
For bins=10, the output is:
Histogram:
From numpy: [10, 19, 20, 28, 15, 16, 14, 11, 5, 12]
From my function [10, 19, 20, 28, 15, 16, 14, 11, 5, 12]
Bin Edges:
From numpy: [ 4.3 4.66 5.02 5.38 5.74 6.1 6.46 6.82 7.18 7.54 7.9 ]
From my function [ 4.3 4.66 5.02 5.38 5.74 6.1 6.46 6.82 7.18 7.54 7.9 ]
For bins=15, the output is:
Histogram:
From numpy: [7, 4, 18, 19, 5, 24, 8, 10, 13, 6, 13, 6, 5, 1, 11]
From my function [7, 4, 18, 19, 5, 24, 8, 10, 13, 6, 13, 6, 5, 1, 11]
Bin Edges:
From numpy: [ 4.3 4.54 4.78 5.02 5.26 5.5 5.74 5.98 6.22 6.46 6.7 6.94 7.18 7.42 7.66 7.9 ]
From my function [ 4.3 4.54 4.78 5.02 5.26 5.5 5.74 5.98 6.22 6.46 6.7 6.94 7.18 7.42 7.66 7.9 ]