我为下面的代码墙和糟糕的格式表示歉意。当我应用DataFrame.equals()或更高版本的df1==df2时,我尝试了尽可能多的方法来找出导致这些数据帧返回False的原因。我找不出他们之间有什么不同之处
我通过将groupby应用到除ORDER\u QTY之外的所有列的第一个数据帧(bdf),得到了第二个数据帧(dftest)。由于这两个数据帧的行数是相同的,所以我假设没有任何变化(这一点我并不惊讶),但是,为了确保我使用bdf.equals(dftest)对它们进行了比较,结果返回false。这是在我确定列的顺序正确之后。我注意到的另一件事是数据帧的大小不同。否则我就迷路了
In:
dftest = bdf.groupby(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'FW_END_DT', 'BPS_INCLUDE']).sum().reset_index()
dftest = dftest[['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE']]
print(bdf.equals(dftest))
print(bdf.columns)
print(dftest.columns)
Out:
False
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
dtype='object')
Index(['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER',
'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION',
'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'],
dtype='object')
^列似乎是相同的,但是bdf.equals(dftest)
会产生False
In:
bdf.info()
dftest.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Index: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE 53025 non-null object
CUST 53025 non-null object
ORDER_NUMBER 53025 non-null object
ORDER_DATE 53025 non-null datetime64[ns]
PURCHASE_ORDER 53025 non-null object
CHANNEL 53025 non-null object
SHIP_TO 53025 non-null object
PROD_LINE 53025 non-null object
GROUP_NUMBER 53025 non-null object
DESCRIPTION 53025 non-null object
ITEM 53025 non-null object
ORDER_QTY 53025 non-null int64
FW_END_DT 53025 non-null datetime64[ns]
BPS_INCLUDE 53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 6.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53025 entries, 0 to 53024
Data columns (total 14 columns):
SITE 53025 non-null object
CUST 53025 non-null object
ORDER_NUMBER 53025 non-null object
ORDER_DATE 53025 non-null datetime64[ns]
PURCHASE_ORDER 53025 non-null object
CHANNEL 53025 non-null object
SHIP_TO 53025 non-null object
PROD_LINE 53025 non-null object
GROUP_NUMBER 53025 non-null object
DESCRIPTION 53025 non-null object
ITEM 53025 non-null object
ORDER_QTY 53025 non-null int64
FW_END_DT 53025 non-null datetime64[ns]
BPS_INCLUDE 53025 non-null int64
dtypes: datetime64[ns](2), int64(2), object(10)
memory usage: 5.7+ MB
^所有的东西看起来都一样,除了尺寸,就像我提到的
In:
common = bdf.merge(dftest,on=['SITE', 'CUST', 'ORDER_NUMBER', 'ORDER_DATE', 'PURCHASE_ORDER', 'CHANNEL', 'SHIP_TO', 'PROD_LINE', 'GROUP_NUMBER', 'DESCRIPTION', 'ITEM', 'ORDER_QTY', 'FW_END_DT', 'BPS_INCLUDE'], how='outer', indicator=True)
print(common[common['_merge'] != 'both'])
Out:
Empty DataFrame
Columns: [SITE, CUST, ORDER_NUMBER, ORDER_DATE, PURCHASE_ORDER, CHANNEL, SHIP_TO, PROD_LINE, GROUP_NUMBER, DESCRIPTION, ITEM, ORDER_QTY, FW_END_DT, BPS_INCLUDE, _merge]
Index: []
尝试合并和选择不在两个df中的行
In:
bdf[(~bdf.SITE.isin(common.SITE))&(~bdf.CUST.isin(common.CUST))&(~bdf.ORDER_NUMBER.isin(common.ORDER_NUMBER))&(~bdf.ORDER_DATE.isin(common.ORDER_DATE))&(~bdf.PURCHASE_ORDER.isin(common.PURCHASE_ORDER))&(~bdf.CHANNEL.isin(common.CHANNEL))&(~bdf.SHIP_TO.isin(common.SHIP_TO))&(~bdf.PROD_LINE.isin(common.PROD_LINE))&(~bdf.GROUP_NUMBER.isin(common.GROUP_NUMBER))&(~bdf.DESCRIPTION.isin(common.DESCRIPTION))&(~bdf.ITEM.isin(common.ITEM))&(~bdf.ORDER_QTY.isin(common.ORDER_QTY))&(~bdf.FW_END_DT.isin(common.FW_END_DT))&(~bdf.BPS_INCLUDE.isin(common.BPS_INCLUDE))]
Out:
SITE CUST ORDER_NUMBER ORDER_DATE PURCHASE_ORDER CHANNEL SHIP_TO PROD_LINE GROUP_NUMBER DESCRIPTION ITEM ORDER_QTY FW_END_DT BPS_INCLUDE
什么也没做
In:
(bdf == dftest).all().all()
Out:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-6c2f52f55e60> in <module>()
----> 1 (bdf == dftest).all().all()
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\ops.py in f(self, other)
1611 # Another DataFrame
1612 if not self._indexed_same(other):
-> 1613 raise ValueError('Can only compare identically-labeled '
1614 'DataFrame objects')
1615 return self._compare_frame(other, func, str_rep)
ValueError: Can only compare identically-labeled DataFrame objects
它们的标签不一样
当我尝试搜索以下内容时,有人建议我尝试:
In:
bdf.eq(dftest)
Out:
SITE CUST ORDER_NUMBER ORDER_DATE PURCHASE_ORDER CHANNEL SHIP_TO PROD_LINE GROUP_NUMBER DESCRIPTION ITEM ORDER_QTY FW_END_DT BPS_INCLUDE
0 False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False
5 False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52995 False False False False False False False False False False False False False False
106050 rows × 14 columns
在这种情况下,似乎每对细胞都不匹配……:(
我是不是漏掉了什么明显的东西
您的数据中是否有任何nan/null/缺失值
如果是这样,groupby.sum()可以将这些值替换为0,例如,对于数字类型
如果以上是罪魁祸首,groupby.first()结果将与原始输入相同
相关问题 更多 >
编程相关推荐