从pandas DataFrame创建术语密度矩阵的内存使用情况
我有一个数据框(DataFrame),我从一个csv文件中保存和读取它,现在我想从中创建一个术语密度矩阵(Term Density Matrix)数据框。根据herrfz的建议,我使用了sklearn中的CountVectorizer。我把这段代码封装成了一个函数。
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
from scipy.sparse import coo_matrix, csc_matrix, hstack
def df2tdm(df,titleColumn,placementColumn):
'''
Takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
of the words appearing in the titleColumn
Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearrig in df.titleColumn
Credits:
https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
'''
tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(), columns=countvec.get_feature_names())
tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
return tdm_df
这个函数会返回一个术语密度矩阵的数据框,比如说:
df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ', 'Potato salad', 'Split orange','Something else'], 'page':[1, 1, 2, 3, 4]})
print df.head()
tdm_df = df2tdm(df,'title','page')
tdm_df.head()
boiled delicious egg else fried orange potato salad something \
0 1 1 1 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
2 0 0 0 0 0 0 1 1 0
3 0 0 0 0 0 1 0 0 0
4 0 0 0 1 0 0 0 0 1
split page
0 0 1
1 0 1
2 0 2
3 1 3
4 0 4
不过,这个实现有个问题,就是内存使用不太好:当我用一个大小为190 kB的utf8格式的数据框时,这个函数会用大约200 MB的内存来创建术语密度矩阵。而当csv文件大小是600 kB时,内存使用量会达到700 MB;如果csv文件是3.8 MB,函数就会把我所有的内存和交换文件(8 GB)都用完,导致崩溃。
我还尝试过使用稀疏矩阵和稀疏数据框的实现(见下文),但内存使用情况差不多,只是速度明显慢了很多。
def df2tdm_sparse(df,titleColumn,placementColumn):
'''
Takes in a DataFrame with at least two columns, and returns a dataframe with the term density matrix
of the words appearing in the titleColumn. This implementation uses sparse DataFrames.
Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearrig in df.titleColumn
Credits:
https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
'''
pm = df[[placementColumn]].values
tm = countvec.fit_transform(df[titleColumn])#.toarray()
m = csc_matrix(hstack([pm,tm]))
dfout = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) for i in np.arange(m.shape[0]) ])
dfout.columns = [placementColumn]+countvec.get_feature_names()
return dfout
有没有什么建议可以改善内存使用呢?我在想这是否和scikit的内存问题有关,比如说在这里提到的。
1 个回答
0
我也觉得问题可能出在把稀疏矩阵转换成稀疏数据框的过程中。
你可以试试这个函数(或者类似的东西)
def SparseMatrixToSparseDF(xSparseMatrix):
import numpy as np
import pandas as pd
def ElementsToNA(x):
x[x==0] = NaN
return x
xdf1 =
pd.SparseDataFrame([pd.SparseSeries(ElementsToNA(xSparseMatrix[i].toarray().ravel()))
for i in np.arange(xSparseMatrix.shape[0]) ])
return xdf1
你会发现它通过使用density这个函数来减小数据的大小。
df1.density
希望这能帮到你。