<p>Since this still seems to be an issue even with newer pandas versions, I wrote some functions to circumvent it as part of a larger pyspark helpers library:</p>
<pre><code>import datetime
import os

import pandas as pd


def read_parquet_folder_as_pandas(path, verbosity=1):
    """Read every parquet part file in a folder into one DataFrame."""
    # Keep only the data files; other entries in the folder are skipped.
    files = [f for f in os.listdir(path) if f.endswith("parquet")]

    if verbosity > 0:
        print("{} parquet files found. Beginning reading...".format(len(files)), end="")
        start = datetime.datetime.now()

    # Read each part file separately, then stack them into one frame.
    df_list = [pd.read_parquet(os.path.join(path, f)) for f in files]
    df = pd.concat(df_list, ignore_index=True)

    if verbosity > 0:
        end = datetime.datetime.now()
        print(" Finished. Took {}".format(end - start))
    return df


def read_parquet_as_pandas(path, verbosity=1):
    """Workaround for pandas not being able to read folder-style parquet files."""
    if os.path.isdir(path):
        if verbosity > 1:
            print("Parquet file is actually folder.")
        return read_parquet_folder_as_pandas(path, verbosity)
    else:
        return pd.read_parquet(path)
</code></pre>
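<p>This assumes that the relevant files inside the parquet "file", which is actually a folder, end with ".parquet". This works for parquet files exported by Databricks and might work with others as well (untested, happy about feedback in the comments).</p>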
<p>The function <code>read_parquet_as_pandas()</code> can be used if it is not known beforehand whether the path is a folder or a single file.</p>
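<p>To make the file-name assumption concrete, here is a minimal sketch that uses only the standard library. It builds a throwaway folder and shows which entries the name filter would pick up. The helper name <code>list_parquet_parts</code>, the folder name <code>export.parquet</code> and the file names are made up for illustration; a real export may contain different bookkeeping files.</p>

```python
import os
import tempfile


def list_parquet_parts(path):
    """Return the entries in `path` that the name filter would select.

    Hypothetical helper for illustration; it applies the same
    `endswith("parquet")` check as the functions above.
    """
    return sorted(f for f in os.listdir(path) if f.endswith("parquet"))


# Illustrative file names only (assumption): two data files plus
# two bookkeeping files that should not be read as data.
example_names = [
    "part-00000.snappy.parquet",
    "part-00001.snappy.parquet",
    "_SUCCESS",
    ".part-00000.snappy.parquet.crc",
]

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "export.parquet")
    os.mkdir(folder)
    for name in example_names:
        # Create empty placeholder files; no parquet data is written.
        open(os.path.join(folder, name), "w").close()

    is_folder = os.path.isdir(folder)  # the check used to pick the code path
    selected = list_parquet_parts(folder)

print(is_folder)
print(selected)
```

<p>Only the two <code>part-*</code> entries pass the filter, so marker and checksum files never reach <code>pd.read_parquet</code>. If an export names its data files differently, the filter would need to be adjusted.</p>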