I can't load data from HDFS using the Python pyarrow library, inside a Docker container

Published 2024-06-16 11:20:48


I have also configured the necessary environment variables so that Python can read from HDFS:

export ARROW_LIBHDFS_DIR='/opt/hadoop/lib/native'
export HADOOP_COMMON_LIB_NATIVE_DIR='/opt/hadoop/lib/native'
export HADOOP_OPTS="-Djava.library.path=/opt/hadoop/lib/"
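For reference, the same variables can be mirrored from inside Python before pyarrow is imported (a sketch; the commented CLASSPATH step assumes the `hadoop` binary is on PATH, which the post does not state):

```python
import os

# Mirror the shell exports so libhdfs can locate the native libraries.
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/hadoop/lib/native'
os.environ['HADOOP_COMMON_LIB_NATIVE_DIR'] = '/opt/hadoop/lib/native'
os.environ['HADOOP_OPTS'] = '-Djava.library.path=/opt/hadoop/lib/'

# libhdfs also needs the Hadoop jars on CLASSPATH; `hadoop classpath --glob`
# expands the wildcard entries into a list the JVM can use:
# import subprocess
# os.environ['CLASSPATH'] = subprocess.check_output(
#     ['hadoop', 'classpath', '--glob']).decode().strip()
```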

For ls $ARROW_LIBHDFS_DIR I get:

libhadoop.a   libhadooppipes.a    libhdfs.so        libnativetask.so
libhadoop.so  libhadooputils.a    libhdfs.so.0.0.0  libnativetask.so.1.0.0
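A quick sanity check (a sketch, not part of the original post) is to join that directory with libhdfs.so and try loading it with ctypes; CDLL raises OSError when the shared object or one of its dependencies cannot be resolved:

```python
import os

libdir = '/opt/hadoop/lib/native'  # the value of $ARROW_LIBHDFS_DIR above
libhdfs = os.path.join(libdir, 'libhdfs.so')

# Uncomment inside the actual container; ctypes.CDLL raises OSError
# if the library or one of its dependencies cannot be loaded:
# import ctypes
# ctypes.CDLL(libhdfs)
```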

My Python code:

import pandas as pd
pd.read_parquet('hdfs:///tmp/data/test.parquet', engine='pyarrow')

The error I get:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

 hdfsGetPathInfo(hdfs:///tmp/data/test.parquet): getFileInfo error:
    ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to com.google.protobuf.Message
    java.lang.ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to com.google.protobuf.Message
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:225)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
            at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
            at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
            at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
            at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1654)
            at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
            at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
            at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
            at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
            at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
    hdfsGetPathInfo(hdfs:///tmp/data/test.parquet): getFileInfo error:
    IllegalStateException: java.lang.IllegalStateException
            at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
            at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:117)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:162)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
            at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
            at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
            at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1654)
            at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
            at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
            at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
            at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
            at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.5/dist-packages/pandas/io/parquet.py", line 296, in read_parquet
        return impl.read(path, columns=columns, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/pandas/io/parquet.py", line 125, in read
        path, columns=columns, **kwargs
      File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1544, in read_table
        partitioning=partitioning)
      File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1173, in __init__
        open_file_func=partial(_open_dataset_file, self._metadata)
      File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1368, in _make_manifest
        .format(path))
    OSError: Passed non-file path: hdfs:///tmp/data/test.parquet
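A commonly suggested variant for this generation of pyarrow (a sketch; the live HDFS calls are commented out because they need a running cluster, and the connection settings are assumptions) is to strip the URI scheme and pass an explicit filesystem object instead of the hdfs:// URI:

```python
from urllib.parse import urlparse

uri = 'hdfs:///tmp/data/test.parquet'
path = urlparse(uri).path  # plain path without the hdfs:// scheme

# Hypothetical usage against a live cluster:
# import pyarrow as pa
# import pyarrow.parquet as pq
# fs = pa.hdfs.connect()  # picks up fs.defaultFS from core-site.xml
# df = pq.read_table(path, filesystem=fs).to_pandas()
```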

Tags: io, org, com, hadoop, lib, apache, hdfs, java