DataFrame subsetting returns NaN when an entire column matches the fill value

Posted 2021-10-17 13:52:42


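I'm running into some odd behavior in a DataFrame built from sparse Series. I can build the DF from a dictionary of values, with fill values and dtypes, without any problem, but when I try to subset the DF I get very strange results. What I can show reproducibly is this: when subsetting a DF built from sparse Series, if a column is identical throughout (that is, every entry matches the fill value), the subset turns that column into NaN and its dtype changes to float64. Here is some test code: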

import random
import numpy as np
import multiprocessing as mp
import pandas as pd


TEST_LINES = 10


samezeroint = [0 for i in range(TEST_LINES)]
sameoneint = [1 for i in range(TEST_LINES)]
samezerofloat = [0.0 for i in range(TEST_LINES)]
sameonefloat = [1.0 for i in range(TEST_LINES)]
indexone = [i for i in range(TEST_LINES)]

randomint = []
randomfloat = []

for i in range(TEST_LINES):
    randomint.append(random.randint(0,100))
    randomfloat.append(random.random())

testdict = {'indexone': indexone, "samezeroint": samezeroint, 'sameoneint': sameoneint, 'samezerofloat': samezerofloat, 'sameonefloat': sameonefloat, 'randomint': randomint, 'randomfloat': randomfloat}
filldict = {'indexone': 0, "samezeroint": 0, 'sameoneint': 1, 'samezerofloat': 0.0,
            'sameonefloat': 1.0, 'randomint': random.randint(0,100), 'randomfloat': random.random()}
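# np.float was only an alias for the builtin float and has been removed from NumPy; use np.float64
dtypedict = {'indexone': np.int8, "samezeroint": np.int8, 'sameoneint': np.int8, 'samezerofloat': np.float64,
            'sameonefloat': np.float64, 'randomint': np.int8, 'randomfloat': np.float64}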


dospar = {}
for l in testdict:
    try:
        fill = filldict[l]
    except KeyError:
        fill = None
    try:
        datatype = dtypedict[l]
    except KeyError:
        datatype = str  # np.str was an alias for the builtin str and has been removed from NumPy
    if fill is None:
        sparr = pd.Series(pd.array(testdict[l], dtype=datatype))
    else:
        sparr = pd.Series(pd.arrays.SparseArray(testdict[l], fill_value=fill, dtype=datatype))
    dospar[l] = sparr
testdf = pd.DataFrame.from_dict(dospar, orient='columns')

# Test a single series

print("\n\nSeries: All zeroes")
samezerointseries = pd.Series(pd.arrays.SparseArray(testdict['samezeroint'], fill_value=0, dtype=np.int8))
print("\nOriginal")
print(samezerointseries)
samezero = samezerointseries.isin([0])
samezerozero = samezerointseries[samezero]
print("\nFiltered: should be identical to above")
print(samezerozero)
sameone = samezerointseries.isin([1])
samezeroone = samezerointseries[sameone]
print("\nFiltered: should be empty")
print(samezeroone)


print("\n\nDataframe:")
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(testdf)
    print(testdf.dtypes)

print("\n\nDataframe: should be identical to above")
intone = testdf.loc[:, 'sameoneint'].isin([int(1)])
print(intone)
onedf = testdf[intone]
with pd.option_context('display.max_rows', None, 'display.max_columns',
                       None):  # more options can be specified also
    print(onedf)
    print(onedf.dtypes)

When I run this test, I get the following results:



Series: All zeroes

Original
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: Sparse[int8, 0]

Filtered: should be identical to above
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: Sparse[int8, 0]

Filtered: should be empty
Series([], dtype: Sparse[int8, 0])


Dataframe:
   indexone  samezeroint  sameoneint  samezerofloat  sameonefloat  randomint  \
0         0            0           1            0.0           1.0         60   
1         1            0           1            0.0           1.0         68   
2         2            0           1            0.0           1.0         65   
3         3            0           1            0.0           1.0        100   
4         4            0           1            0.0           1.0         53   
5         5            0           1            0.0           1.0         26   
6         6            0           1            0.0           1.0         16   
7         7            0           1            0.0           1.0         97   
8         8            0           1            0.0           1.0         50   
9         9            0           1            0.0           1.0         71   

   randomfloat  
0     0.417370  
1     0.970567  
2     0.836402  
3     0.029296  
4     0.179799  
5     0.928002  
6     0.354385  
7     0.646790  
8     0.191453  
9     0.088505  
indexone                             Sparse[int8, 0]
samezeroint                          Sparse[int8, 0]
sameoneint                           Sparse[int8, 1]
samezerofloat                   Sparse[float64, 0.0]
sameonefloat                    Sparse[float64, 1.0]
randomint                           Sparse[int8, 49]
randomfloat      Sparse[float64, 0.8838354729582943]
dtype: object


Dataframe: should be identical to above
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
Name: sameoneint, dtype: bool
   indexone  samezeroint  sameoneint  samezerofloat  sameonefloat  randomint  \
0         0          NaN         NaN            NaN           NaN         60   
1         1          NaN         NaN            NaN           NaN         68   
2         2          NaN         NaN            NaN           NaN         65   
3         3          NaN         NaN            NaN           NaN        100   
4         4          NaN         NaN            NaN           NaN         53   
5         5          NaN         NaN            NaN           NaN         26   
6         6          NaN         NaN            NaN           NaN         16   
7         7          NaN         NaN            NaN           NaN         97   
8         8          NaN         NaN            NaN           NaN         50   
9         9          NaN         NaN            NaN           NaN         71   

   randomfloat  
0     0.417370  
1     0.970567  
2     0.836402  
3     0.029296  
4     0.179799  
5     0.928002  
6     0.354385  
7     0.646790  
8     0.191453  
9     0.088505  
indexone                            Sparse[int64, 0]
samezeroint                       Sparse[float64, 0]
sameoneint                        Sparse[float64, 1]
samezerofloat                   Sparse[float64, 0.0]
sameonefloat                    Sparse[float64, 1.0]
randomint                           Sparse[int8, 49]
randomfloat      Sparse[float64, 0.8838354729582943]

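I'm using the latest versions of all the imported modules. As you can hopefully see, the filter selects every row (every value in the `sameoneint` column is 1), so I should get a DF identical to the original, rather than one in which only the columns with some variation keep their values.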
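One thing that may be worth checking is how much data each sparse column actually stores. The sketch below is not part of the test above; it builds two small columns (the names `all_fill` and `mixed` are made up for illustration) and reads `Series.sparse.npoints` and `Series.sparse.density`. A column whose every value equals the fill value stores no explicit points at all, and those are the columns that come back as NaN in the output. Whether that is the actual cause is an assumption on my part.

```python
import numpy as np
import pandas as pd

# Every value equals fill_value, so nothing is stored explicitly.
all_fill = pd.Series(pd.arrays.SparseArray([0] * 10, fill_value=0, dtype=np.int8))

# Only the values that differ from fill_value are stored.
mixed = pd.Series(pd.arrays.SparseArray([0, 5, 0, 7, 0], fill_value=0, dtype=np.int8))

# npoints counts the explicitly stored values; density is npoints / length.
print("all_fill:", all_fill.sparse.npoints, all_fill.sparse.density)
print("mixed:   ", mixed.sparse.npoints, mixed.sparse.density)
```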

For the subsetting command, I have tried every variant I could find:

newdf = testdf.loc[testdf['sameoneint'] == 1]

newdf = testdf.query('sameoneint == 1')

isone = testdf.loc[:, 'sameoneint'].isin([1])

newdf = testdf[isone]

None of these works any better, and some of them also raise warnings.
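To compare the variants without reading the printed frames by eye, a small helper can list the columns that changed after a filter that is supposed to keep every row. This is only a sketch: `changed_columns` and the two-column frame are hypothetical, and what it prints depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

def changed_columns(before: pd.DataFrame, after: pd.DataFrame) -> list:
    """List columns whose values or dtype differ between two frames.

    Assumes the filter kept every row, so `after` should match `before`.
    """
    changed = []
    for name in before.columns:
        old = np.asarray(before[name])  # dense view of the original column
        new = np.asarray(after[name])   # dense view of the filtered column
        if not np.array_equal(old, new) or before[name].dtype != after[name].dtype:
            changed.append(name)
    return changed

# Hypothetical frame: one constant column (all values equal the fill value)
# and one column with variation.
df = pd.DataFrame({
    "const": pd.arrays.SparseArray([1, 1, 1], fill_value=1, dtype=np.int8),
    "varied": pd.arrays.SparseArray([3, 4, 5], fill_value=0, dtype=np.int8),
})

# A dense boolean mask that is True for every row.
mask = np.asarray(df["const"]) == 1
subset = df[mask]

# An empty list means the subset matches the original.
print(changed_columns(df, subset))
```

The same helper can be pointed at the result of each variant above, with `testdf` as `before`.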

So, am I missing something in my code, or is there something about the way pandas works that I haven't figured out yet? Any advice is welcome!
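One possible workaround, offered as a sketch rather than a confirmed fix: convert the frame to dense with `DataFrame.sparse.to_dense()`, filter the dense copy, then cast back to the original sparse dtypes with `astype`. It assumes every column is sparse (the `.sparse` accessor on a DataFrame requires that) and that the dense copy fits in memory. `filter_via_dense` and the small frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def filter_via_dense(df: pd.DataFrame, mask) -> pd.DataFrame:
    """Filter rows on a dense copy, then restore each column's sparse dtype."""
    sparse_dtypes = df.dtypes.to_dict()   # remember the Sparse[...] dtype of each column
    dense = df.sparse.to_dense()          # requires all columns to be sparse
    subset = dense[np.asarray(mask, dtype=bool)]
    return subset.astype(sparse_dtypes)   # cast each column back to its sparse dtype

# Hypothetical frame: "const" matches its fill value everywhere.
df = pd.DataFrame({
    "const": pd.arrays.SparseArray([1, 1, 1], fill_value=1, dtype=np.int8),
    "varied": pd.arrays.SparseArray([3, 4, 5], fill_value=0, dtype=np.int8),
})

result = filter_via_dense(df, df["const"].isin([1]))
print(result)
print(result.dtypes)
```

The trade-off is that the dense copy gives up the memory savings of the sparse representation for the duration of the filter, so this only makes sense when the frame is small enough to densify.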