AWSCredentialsProvider error when reading parquet files from S3 locally with PySpark 3

Posted 2024-06-11 10:41:31


Trying to read parquet files from S3 locally fails with java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider.

The error occurs with both of the following combinations:

  • PySpark 3.0.2 with hadoop-aws 2.7.4
  • PySpark 3.1.2 with hadoop-aws 3.2.0

It runs on Python 3.8 under conda 4.10, with aws-java-sdk version 1.11.901, and the hadoop-aws JAR (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) added to the jars directory.
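
As far as I understand, hadoop-aws has to match the Hadoop version bundled with the local PySpark build, and the AWS SDK JAR in turn has to match that hadoop-aws. A minimal sketch, assuming a plain local session, to print the bundled Hadoop version and pick matching artifacts:

import pyspark
from pyspark.sql import SparkSession

# Print the Hadoop version this PySpark build bundles, so hadoop-aws
# (and its matching AWS SDK artifact) can be pinned to the same line.
spark = SparkSession.builder.master("local[*]").getOrCreate()
print("Spark:", pyspark.__version__)
print("Hadoop:", spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
spark.stop()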

Here is my code:

import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Pull hadoop-aws at startup; PYSPARK_SUBMIT_ARGS must be set before the JVM launches
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.4 pyspark-shell"

# Enable V4 request signing on both driver and executors
sc = SparkContext(
    conf=SparkConf()
         .set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
         .set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
)

# S3A credentials (redacted)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', '...')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '...')
spark = SparkSession(sc)

spark.read.parquet('s3a://bucket/parquet_files_path/')

Output:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-3-61076a366db7> in <module>
----> 1 spark.read.parquet('s3a://bucket/parquet_files_path/')

~/anaconda3/lib/python3.8/site-packages/pyspark/sql/readwriter.py in parquet(self, *paths, **options)
    351         self._set_opts(mergeSchema=mergeSchema, pathGlobFilter=pathGlobFilter,
    352                        recursiveFileLookup=recursiveFileLookup)
--> 353         return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
    354 
    355     @ignore_unicode_prefix

~/anaconda3/lib/python3.8/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

~/anaconda3/lib/python3.8/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    126     def deco(*a, **kw):
    127         try:
--> 128             return f(*a, **kw)
    129         except py4j.protocol.Py4JJavaError as e:
    130             converted = convert_exception(e.java_exception)

~/anaconda3/lib/python3.8/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o37.parquet.
: java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:398)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:758)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.auth.AWSCredentialsProvider
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
    ... 30 more

What am I missing?
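
For context, the missing class com.amazonaws.auth.AWSCredentialsProvider lives in the AWS SDK rather than in hadoop-aws itself, so a variant of the submit args that pins the SDK bundle alongside hadoop-aws might be relevant. This is only a sketch; the 1.11.375 version is an assumption taken from the hadoop-aws 3.2.0 POM:

import os

# Sketch for PySpark 3.1.2 (bundles Hadoop 3.2.0): pull hadoop-aws together
# with the aws-java-sdk-bundle it was compiled against.
# 1.11.375 is an assumption based on the hadoop-aws 3.2.0 POM, not verified here.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages=org.apache.hadoop:hadoop-aws:3.2.0,"
    "com.amazonaws:aws-java-sdk-bundle:1.11.375 pyspark-shell"
)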
