Efficient way to load a file into a 2D numpy array

import numpy as np
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pandas as pd
from scipy import sparse
import os
import assoc

# read in data to a dict object - sums scripts by tuple (doc, drug)
dictObj = {}
rawData = 'subset.txt'
with open(rawData) as infile:
    for line in infile:
        parts = line.split(',')
        key = (parts[0], parts[1])
        val = float(parts[3])
        if key in dictObj:
            dictObj[key] += val
        else:
            dictObj[key] = val
# the with block closes the file, so no explicit close() is needed
print("stage 1 done")

# get the number of doctors and the number of drugs
keys = dictObj.keys()
docs = list(set([x[0] for x in keys]))
drugs = sorted(list(set([x[1] for x in keys])))

# read through the dict and build out a 2d numpy array
docC = 0
mat = np.empty([len(docs), len(drugs)])
for doc in docs:
    drugC = 0
    for drug in drugs:
        key = (doc, drug)
        if key in dictObj:
            mat[(docC, drugC)] = dictObj[key]
        else:
            mat[(docC, drugC)] = 0
        drugC += 1
    docC += 1

3 Answers

User

#1 · Edited 2024-04-27 10:29:29

Judging from the end of your question, it seems all you need is to turn a pandas DataFrame into a NumPy array. Here is how to do that:

#df is your DataFrame
data = np.asarray(df)

So now you should have no problem using pandas!

User

#2 · Edited 2024-04-27 10:29:29

I would probably do something like this:

>>> df = pd.read_csv("trans.csv", skipinitialspace=True)
>>> w = df.groupby(["person", "product"])["val"].sum().reset_index()
>>> w
  person product  val
0      A       x   20
1      A       y   20
2      B       x   20
3      B       y   15
4      C       z   40
>>> w.pivot(index="person", columns="product").fillna(0)
         val        
product    x   y   z
person              
A         20  20   0
B         20  15   0
C          0   0  40
>>> w.pivot(index="person", columns="product").fillna(0).values
array([[ 20.,  20.,   0.],
       [ 20.,  15.,   0.],
       [  0.,   0.,  40.]])

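That is the 2D array you're looking for. Note that you don't have to read the whole file into memory at once: you can pass the chunksize parameter to read_csv (see the pandas docs) and accumulate the table chunk by chunk.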
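A minimal sketch of that chunked accumulation, assuming the same person/product/val columns as above. The inline CSV text, the tiny chunksize, and the variable names are made up for illustration; a real file path and a much larger chunksize would take their place.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real file on disk.
csv_text = """person,product,date,val
A,x,1/1/2013,10
A,x,1/10/2013,10
B,x,1/2/2013,20
B,y,1/4/2013,15
A,y,1/8/2013,20
C,z,2/12/2013,40
"""

totals = None
# chunksize=2 is only to force several chunks on this tiny sample.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Sum within the chunk first, so only small aggregates are kept around.
    part = chunk.groupby(["person", "product"])["val"].sum()
    if totals is None:
        totals = part
    else:
        # Merge the running totals with this chunk's totals and re-sum.
        totals = pd.concat([totals, part]).groupby(level=[0, 1]).sum()

# Pivot the (person, product) totals into a person x product table.
table = totals.unstack(fill_value=0)
mat = table.to_numpy()
print(table)
```

Only the running per-pair totals stay in memory between chunks, so peak memory depends on the number of distinct (person, product) pairs rather than on the file size.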

User

#3 · Edited 2024-04-27 10:29:29

recfromcsv (or recfromtxt) will load the data into a record array:

data=np.recfromcsv('stack20179393.txt')

rec.array([('A', ' x', ' 1/1/2013', 10), ('A', ' x', ' 1/10/2013', 10),
       ('B', ' x', ' 1/2/2013', 20), ('B', ' y', ' 1/4/2013', 15),
       ('A', ' y', ' 1/8/2013', 20), ('C', ' z', ' 2/12/2013', 40)], 
      dtype=[('person', 'S1'), ('product', 'S2'), ('date', 'S10'), ('val', '<i4')])

data.person
# chararray(['A', 'A', 'B', 'B', 'A', 'C'], dtype='|S1')

data.val
# array([10, 10, 20, 15, 20, 40])

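Since person can appear in any order and with different frequencies (3 A's, 2 B's, 1 C), you can't easily turn this into a 2D array. So you will probably still need to iterate over the records and collect the values in something like a dictionary; I suggest collections.defaultdict. itertools.groupby is also a handy tool for collecting values into groups. However, it requires your records to be sorted.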

With defaultdict

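A minimal sketch of the defaultdict idea, not necessarily the code this answer originally showed. It assumes the records are available as (person, product, val) tuples; the sample values mirror the record array above, and all variable names are illustrative.

```python
from collections import defaultdict

import numpy as np

# Hypothetical records mirroring the sample data: (person, product, val).
records = [
    ("A", "x", 10), ("A", "x", 10), ("B", "x", 20),
    ("B", "y", 15), ("A", "y", 20), ("C", "z", 40),
]

# Sum val for every (person, product) pair; missing keys start at 0.0.
totals = defaultdict(float)
for person, product, val in records:
    totals[(person, product)] += val

# Fix a row/column order and map each label to its index.
persons = sorted({p for p, _ in totals})
products = sorted({d for _, d in totals})
row_of = {p: i for i, p in enumerate(persons)}
col_of = {d: j for j, d in enumerate(products)}

# Start from zeros and fill in only the pairs that actually occur.
mat = np.zeros((len(persons), len(products)))
for (person, product), total in totals.items():
    mat[row_of[person], col_of[product]] = total

print(mat)
```

Filling only the observed pairs avoids looping over every person/product combination, which matters when most combinations never occur.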

Or

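One possible itertools.groupby version, again only a sketch under the same assumption of (person, product, val) tuples with illustrative names. The records are sorted first because groupby only merges adjacent items with equal keys.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records mirroring the sample data: (person, product, val).
records = [
    ("A", "x", 10), ("A", "x", 10), ("B", "x", 20),
    ("B", "y", 15), ("A", "y", 20), ("C", "z", 40),
]

# Group on the (person, product) pair.
pair = itemgetter(0, 1)

# Sort by the grouping key, then sum val within each run of equal keys.
sums = {
    key: sum(rec[2] for rec in group)
    for key, group in groupby(sorted(records, key=pair), key=pair)
}

print(sums)
```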

A sparse approach takes advantage of how csr_matrix sums duplicate indices:

from scipy import sparse  
row=np.array([ord(a) for a in data.person])-65
col=np.zeros(row.shape, dtype=int)  # integer column indices, all zero
sparse.csr_matrix((data.val,(row,col))).T.toarray()
# array([[40, 35, 40]])
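The same trick can be extended to the full person x product table. The sketch below assumes the three columns are already plain arrays (the values mirror the sample data, the names are illustrative) and uses np.unique with return_inverse=True to turn labels into row and column indices.

```python
import numpy as np
from scipy import sparse

# Hypothetical columns mirroring the sample data.
persons = np.array(["A", "A", "B", "B", "A", "C"])
products = np.array(["x", "x", "x", "y", "y", "z"])
vals = np.array([10, 10, 20, 15, 20, 40])

# return_inverse gives, for each record, the index of its label
# in the sorted array of unique labels.
row_labels, row_idx = np.unique(persons, return_inverse=True)
col_labels, col_idx = np.unique(products, return_inverse=True)

# Entries sharing the same (row, col) pair are summed on construction.
mat = sparse.csr_matrix(
    (vals, (row_idx, col_idx)),
    shape=(len(row_labels), len(col_labels)),
).toarray()

print(mat)
```

If the table is large and mostly zeros, keeping the csr_matrix and skipping the final toarray() call may save a lot of memory.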
