如何在python/pandas中比较两个csv文件(csv1和csv2)列中的所有条目?

2024-05-15 01:34:06 发布

您现在位置:Python中文网/ 问答频道 /正文

我有2个CSV文件:

CSV1版本:

"Hypervisor","IP","ABCD","Operating System","Domain","Memory","No. CPU","Availability (%)","Last Collection Time","lol"
"lglac125.lss.com","10.247.52.125","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599031E9"
"lglac126.lss.com","10.247.52.126","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9"
"lglac127.lss.com","10.247.52.127","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","0.0","1.558599031E9"
"lglac128.lss.com","10.247.52.128","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9"
"lglac129.lss.com","10.247.52.129","VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9"

第2版:

"Hypervisor","IP","Arrays","Operating System","Domain","Memory","No. CPU","Availability (%)","Last Collection Time","DummyColumn"
"lglac125.lss.com","10.247.52.125",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599031E9","A"
"lglac126.lss.com","10.247.52.126",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","B"
"lglac127.lss.com","10.247.52.127",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","0.0","1.558599031E9","C"
"lglac128.lss.com","10.247.52.128",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","D"
"lglac129.lss.com","10.247.52.129",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","E"
"DummyRow","10.247.52.129",,"VMware ESXi 5.5.0 build-9919047","lss.com","524278.03125","4.0","100.0","1.558599931E9","F"

我试图将每列的所有条目(如果在csv2中可用)与相应的行进行比较。如果有任何条目丢失或更改,我需要升起一个标志。这两个文件中都有可能添加或删除任何列。因此我需要首先检查csv2中是否有x列,然后匹配csv1中相同列的条目。你知道吗

我已经为此挣扎了三天,但还是没有找到解决办法。我非常感谢你的帮助。你知道吗


Tags: 文件nobuildipcomdomainoperating条目
2条回答

IIUC公司

假设导入熊猫的csv1、csv2为df1df2。在列上使用intersection查找匹配的列并对其排序。将其传递给df1df2。最后,eqdf1df2的匹配列的子集上

matched_list = df1.columns.intersection(df2.columns).sort_values()
df1_mask = df1[matched_list].eq(df2[matched_list])

Out[853]:
   Availability (%)  Domain  Hypervisor     IP  Last Collection Time  Memory  \
0              True    True        True  False                  True    True
1              True    True        True   True                  True    True
2              True    True        True   True                  True    True
3              True    True        True   True                  True    True
4              True    True        True   True                  True    True
5             False   False       False  False                 False   False

   No. CPU  Operating System
0     True              True
1     True              True
2     True              True
3     True              True
4     True              True
5    False             False

注意:我将df1.loc[0, 'IP']改为10.247.52.124,以便在df1的第0行的一个值中显示False,以便演示

从这个df1_mask,您可以将它插入df1以检查NaN。任何NaN要么是原始值NaN,要么在df1df2之间更改

df1[df1_mask]

Out[854]:
         Hypervisor             IP                 Operating System   Domain  \
0  lglac125.lss.com            NaN  VMware ESXi 5.5.0 build-9919047  lss.com
1  lglac126.lss.com  10.247.52.126  VMware ESXi 5.5.0 build-9919047  lss.com
2  lglac127.lss.com  10.247.52.127  VMware ESXi 5.5.0 build-9919047  lss.com
3  lglac128.lss.com  10.247.52.128  VMware ESXi 5.5.0 build-9919047  lss.com
4  lglac129.lss.com  10.247.52.129  VMware ESXi 5.5.0 build-9919047  lss.com

         Memory  No. CPU  Availability (%)  Last Collection Time  lol
0  524278.03125      4.0             100.0          1.558599e+09  NaN
1  524278.03125      4.0             100.0          1.558600e+09  NaN
2  524278.03125      4.0               0.0          1.558599e+09  NaN
3  524278.03125      4.0             100.0          1.558600e+09  NaN
4  524278.03125      4.0             100.0          1.558600e+09  NaN

注意:您的df1lol列,但没有值,因此它原来是NaN


或者你可以查看df2

df2[df1_mask]

Out[855]:
         Hypervisor             IP  Arrays                 Operating System  \
0  lglac125.lss.com            NaN     NaN  VMware ESXi 5.5.0 build-9919047
1  lglac126.lss.com  10.247.52.126     NaN  VMware ESXi 5.5.0 build-9919047
2  lglac127.lss.com  10.247.52.127     NaN  VMware ESXi 5.5.0 build-9919047
3  lglac128.lss.com  10.247.52.128     NaN  VMware ESXi 5.5.0 build-9919047
4  lglac129.lss.com  10.247.52.129     NaN  VMware ESXi 5.5.0 build-9919047
5               NaN            NaN     NaN                              NaN

    Domain        Memory  No. CPU  Availability (%)  Last Collection Time  \
0  lss.com  524278.03125      4.0             100.0          1.558599e+09
1  lss.com  524278.03125      4.0             100.0          1.558600e+09
2  lss.com  524278.03125      4.0               0.0          1.558599e+09
3  lss.com  524278.03125      4.0             100.0          1.558600e+09
4  lss.com  524278.03125      4.0             100.0          1.558600e+09
5      NaN           NaN      NaN               NaN                   NaN

  DummyColumn
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
5         NaN

您可以用^{}indicator=True^{}来尝试both

matching_cols=df1.columns.intersection(df2.columns).tolist() #find matching columns to merge
df1.merge(df2,on=matching_cols,how='outer',indicator=True).query("_merge!='both'")

这将显示数据帧之间的不常见数据


             Hypervisor             IP                 Operating System  \
0  lglac125.lss.emc.com  10.247.52.125  VMware ESXi 5.5.0 build-9919047   
5  lglac125.lss.emc.com                 VMware ESXi 5.5.0 build-9919047   
6              DummyRow  10.247.52.129  VMware ESXi 5.5.0 build-9919047   

        Domain        Memory  No. CPU  Availability (%)  Last Collection Time  \
0  lss.emc.com  524278.03125      4.0             100.0          1.558599e+09   
5  lss.emc.com  524278.03125      4.0             100.0          1.558599e+09   
6  lss.emc.com  524278.03125      4.0             100.0          1.558600e+09   

   Arrays DummyColumn      _merge  
0     NaN         NaN   left_only  
5     NaN           A  right_only  
6     NaN           F  right_only  

相关问题 更多 >

    热门问题