应用groupby后,数据帧不相等,但我不知道为什么

2024-06-13 00:11:17 发布

您现在位置:Python中文网/ 问答频道 /正文

我为下面的代码墙和糟糕的格式表示歉意。当我应用DataFrame.equals()或更高版本的df1==df2时,我尝试了尽可能多的方法来找出导致这些数据帧返回False的原因。我找不出他们之间有什么不同之处

我通过将groupby应用到除ORDER\u QTY之外的所有列的第一个数据帧(bdf),得到了第二个数据帧(dftest)。由于这两个数据帧的行数是相同的,所以我假设没有任何变化(这一点我并不惊讶),但是,为了确保我使用bdf.equals(dftest)对它们进行了比较,结果返回false。这是在我确定列的顺序正确之后。我注意到的另一件事是数据帧的大小不同。否则我就迷路了

In:
dftest = bdf.groupby(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'FW_END_DT', 'BPS_INCLUDE']).sum().reset_index()
dftest = dftest[['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE']]

print(bdf.equals(dftest))
print(bdf.columns)
print(dftest.columns)


Out:
False
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
       'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
       'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
      dtype='object')
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
       'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
       'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
      dtype='object')

^列似乎是相同的,但是bdf.equals(dftest)会产生False

In:
bdf.info()
dftest.info()


Out:
<class 'pandas.core.frame.DataFrame'>
Index: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE              53025 non-null object
CUST              53025 non-null object
ORDER_NUMBER      53025 non-null object
ORDER_DATE        53025 non-null datetime64[ns]
PURCHASE_ORDER    53025 non-null object
CHANNEL           53025 non-null object
SHIP_TO           53025 non-null object
PROD_LINE         53025 non-null object
GROUP_NUMBER      53025 non-null object
DESCRIPTION       53025 non-null object
ITEM              53025 non-null object
ORDER_QTY         53025 non-null int64
FW_END_DT     53025 non-null datetime64[ns]
BPS_INCLUDE       53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 6.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE              53025 non-null object
CUST              53025 non-null object
ORDER_NUMBER      53025 non-null object
ORDER_DATE        53025 non-null datetime64[ns]
PURCHASE_ORDER    53025 non-null object
CHANNEL           53025 non-null object
SHIP_TO           53025 non-null object
PROD_LINE         53025 non-null object
GROUP_NUMBER      53025 non-null object
DESCRIPTION       53025 non-null object
ITEM              53025 non-null object
ORDER_QTY         53025 non-null int64
FW_END_DT     53025 non-null datetime64[ns]
BPS_INCLUDE       53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 5.7+ MB

^所有的东西看起来都一样,除了尺寸,就像我提到的

In:
common = bdf.merge(dftest,on=['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'], how='outer', indicator=True)
print(common[common['_merge'] != 'both'])



Out:
Empty DataFrame
Columns: [SITE, CUST, ORDER_NUMBER, ORDER_DATE, PURCHASE_ORDER, CHANNEL, SHIP_TO, PROD_LINE, GROUP_NUMBER, DESCRIPTION, ITEM, ORDER_QTY, FW_END_DT, BPS_INCLUDE, _merge]
Index: []

尝试合并和选择不在两个df中的行

In:
bdf[(~bdf.SITE.isin(common.SITE))&(~bdf.CUST.isin(common.CUST))&(~bdf.ORDER_NUMBER.isin(common.ORDER_NUMBER))&(~bdf.ORDER_DATE.isin(common.ORDER_DATE))&(~bdf.PURCHASE_ORDER.isin(common.PURCHASE_ORDER))&(~bdf.CHANNEL.isin(common.CHANNEL))&(~bdf.SHIP_TO.isin(common.SHIP_TO))&(~bdf.PROD_LINE.isin(common.PROD_LINE))&(~bdf.GROUP_NUMBER.isin(common.GROUP_NUMBER))&(~bdf.DESCRIPTION.isin(common.DESCRIPTION))&(~bdf.ITEM.isin(common.ITEM))&(~bdf.ORDER_QTY.isin(common.ORDER_QTY))&(~bdf.FW_END_DT.isin(common.FW_END_DT))&(~bdf.BPS_INCLUDE.isin(common.BPS_INCLUDE))]



Out:
SITE    CUST    ORDER_NUMBER    ORDER_DATE  PURCHASE_ORDER  CHANNEL SHIP_TO PROD_LINE   GROUP_NUMBER    DESCRIPTION ITEM    ORDER_QTY   FW_END_DT   BPS_INCLUDE

什么也没做

In:
(bdf == dftest).all().all()




Out:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-6c2f52f55e60> in <module>()
----> 1 (bdf == dftest).all().all()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py in f(self, other)
   1611             # Another DataFrame
   1612             if not self._indexed_same(other):
-> 1613                 raise ValueError('Can only compare identically-labeled '
   1614                                  'DataFrame objects')
   1615             return self._compare_frame(other, func, str_rep)

ValueError: Can only compare identically-labeled DataFrame objects

它们的标签不一样

当我尝试搜索以下内容时,有人建议我尝试:

In:
bdf.eq(dftest)


Out:
    SITE    CUST    ORDER_NUMBER    ORDER_DATE  PURCHASE_ORDER  CHANNEL SHIP_TO PROD_LINE   GROUP_NUMBER    DESCRIPTION ITEM    ORDER_QTY   FW_END_DT   BPS_INCLUDE
0   False   False   False   False   False   False   False   False   False   False   False   False   False   False
1   False   False   False   False   False   False   False   False   False   False   False   False   False   False
2   False   False   False   False   False   False   False   False   False   False   False   False   False   False
3   False   False   False   False   False   False   False   False   False   False   False   False   False   False
4   False   False   False   False   False   False   False   False   False   False   False   False   False   False
5   False   False   False   False   False   False   False   False   False   False   False   False   False   False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52995   False   False   False   False   False   False   False   False   False   False   False   False   False   False
106050 rows × 14 columns

在这种情况下,似乎每对细胞都不匹配……:(

我是不是漏掉了什么明显的东西


Tags: falsenumberdateobjectsiteordercommonpurchase