<p>If all columns are the same across all files, I think you can use <code>pd.duplicated()</code> as follows:</p>
<pre><code>import pathlib
import pandas as pd
def read_txt_files(dir_path):
    df_list = []
    for filename in pathlib.Path(dir_path).glob('*.txt'):
        # print(filename)
        df = pd.read_csv(filename, index_col=0)
        df['filename'] = filename  # just to save the filename as an optional key
        df_list.append(df)
    return pd.concat(df_list)

df = read_txt_files(r'C:\...path')  # probably you should change the path in this line
df.set_index('filename', append=True, inplace=True)
print(df)
Name Description ...
timestamp filename
00000000B42852FA first.txt ADM_EIG Administratief eigenaar ...
000000005880959E first.txt OPZ Opzeggingen ...
00000000B42852FA second.txt ADM_EIG Administratief eigenaar ...
000000005880959K second.txt XYZ Opzeggingen ...
</code></pre>
<p>So you can get a boolean index of the duplicated rows:</p>
<pre><code>df.duplicated(keep='first')
Out:
timestamp filename
00000000B42852FA first.txt False
000000005880959E first.txt False
00000000B42852FA second.txt True
000000005880959K second.txt False
dtype: bool
</code></pre>
<p>and use it to filter the data:</p>
<pre><code>df[~df.duplicated(keep='first')]
Out:
Name Description ...
timestamp filename
00000000B42852FA first.txt ADM_EIG Administratief eigenaar ...
000000005880959E first.txt OPZ Opzeggingen ...
000000005880959K second.txt XYZ Opzeggingen ...
</code></pre>
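<p>The boolean-mask filter above is equivalent to <code>drop_duplicates(keep='first')</code>. A minimal self-contained sketch (the timestamps and values below are made up for illustration, not taken from your files):</p>
<pre><code>import pandas as pd

# Synthetic stand-in for the concatenated frame; note that duplicated()
# compares column values only, so the index is ignored.
df = pd.DataFrame(
    {'Name': ['ADM_EIG', 'OPZ', 'ADM_EIG']},
    index=pd.MultiIndex.from_tuples(
        [('00000000B42852FA', 'first.txt'),
         ('000000005880959E', 'first.txt'),
         ('00000000B42852FA', 'second.txt')],
        names=['timestamp', 'filename'],
    ),
)

filtered = df[~df.duplicated(keep='first')]   # boolean-mask filter
dropped = df.drop_duplicates(keep='first')    # built-in shortcut
print(filtered.equals(dropped))  # True
</code></pre>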
<p><strong>Edit:</strong> an example with different columns in different files but the same scheme.
first.txt:</p>
<pre><code>timestamp,Name,Descr,Column Layout,Analysis View Name
00000000B42852FA,ADM_EIG,Administratief eigenaar,ADM_EIG,ADM_EIG
000000005880959E,OPZ,Opzeggingen,STANDAARD,
</code></pre>
<p>second.txt:</p>
<pre><code>timestamp,Descr,Default Column Layout,Analysis View Name
00000000B42852FA,Administratief,ADM_EIG,ADM_EIG
000000005880959K,Opzeggingen,STANDAARD,
</code></pre>
<p>third.txt:</p>
<pre><code>timestamp,Descr,Default Column Layout,Analysis View Name
00000000B42852FA,Administratief eigenaar,ADM_EIG,ADM_EIG
000000005880959K,Opzeggingen,STANDAARD,
</code></pre>
<p>The last rows of second.txt and third.txt are duplicates.</p>
<p>Applying the same code:</p>
<pre><code>...
print(df)
Out: # partial because it's too wide
Analysis View Name Column Layout ...
timestamp filename
00000000B42852FA first.txt ADM_EIG ADM_EIG ...
000000005880959E first.txt NaN STANDAARD ...
00000000B42852FA second.txt ADM_EIG NaN ...
000000005880959K second.txt NaN NaN ...
00000000B42852FA third.txt ADM_EIG NaN ...
000000005880959K third.txt NaN NaN ...
</code></pre>
<p>Missing values (where a .txt file has no such column) are filled with NaN.
Locating the duplicate rows:</p>
<pre><code>df.duplicated(keep='first')
Out:
timestamp filename
00000000B42852FA first.txt False
000000005880959E first.txt False
00000000B42852FA second.txt False
000000005880959K second.txt False
00000000B42852FA third.txt False
000000005880959K third.txt True
</code></pre>
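<p>This works because <code>duplicated()</code> treats NaN values in the same column as equal, so rows that came from files missing the same columns can still match. A minimal sketch with illustrative stand-in data for the second.txt/third.txt rows:</p>
<pre><code>import numpy as np
import pandas as pd

# NaN marks columns a file did not contain (values are illustrative).
df = pd.DataFrame(
    {'Analysis View Name': [np.nan, np.nan],
     'Column Layout': [np.nan, np.nan],
     'Descr': ['Opzeggingen', 'Opzeggingen']},
    index=pd.MultiIndex.from_tuples(
        [('000000005880959K', 'second.txt'),
         ('000000005880959K', 'third.txt')],
        names=['timestamp', 'filename'],
    ),
)

# NaN == NaN for duplicated(), so the third.txt row is flagged...
print(df.duplicated(keep='first').tolist())  # [False, True]

# ...and the same boolean filter removes it:
print(df[~df.duplicated(keep='first')])
</code></pre>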