To access data stored in Amazon S3 from Spark applications, you can
use Hadoop file APIs (SparkContext.hadoopFile,
JavaHadoopRDD.saveAsHadoopFile, SparkContext.newAPIHadoopRDD, and
JavaHadoopRDD.saveAsNewAPIHadoopFile) for reading and writing RDDs,
providing URLs of the form s3a://bucket_name/path/to/file.txt.
You can read and write Spark SQL DataFrames using the Data Source API.
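A minimal sketch of the DataFrame read/write path, assuming pyspark is installed and the hadoop-aws / S3A connector and AWS credentials are configured; the bucket and key names are illustrative placeholders, not taken from the question:

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URL of the form the Hadoop S3A connector expects."""
    return f"s3a://{bucket}/{key}"


def read_write_example() -> None:
    """Read a JSON dataset from S3 and write it back as Parquet.

    Requires pyspark plus the hadoop-aws package on the classpath and
    valid AWS credentials; bucket/key names are placeholders.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-example").getOrCreate()
    df = spark.read.json(s3a_uri("bucket_name", "path/to/file.json"))
    df.write.mode("overwrite").parquet(s3a_uri("bucket_name", "path/to/output"))
    spark.stop()
```

The same `s3a://` URLs work with the RDD-level Hadoop APIs listed above; only the scheme and credentials setup differ from reading local files.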
# Answer 1
I suggest following the Cloudera tutorial Accessing Data Stored in Amazon S3 through Spark.
As for the file extension, there is little to solve: you simply take the extension from the file name (e.g. `file.txt`). If the files stored in the S3 bucket have had their extensions stripped, you can still determine the content type by looking at the metadata attached to each S3 object:
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html
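A sketch combining both ideas: take the extension from the key name when one is present, and otherwise fall back to the `Content-Type` that a HEAD Object request returns (here via boto3's `head_object`). The bucket name is a placeholder, and the boto3 fallback assumes credentials are configured:

```python
import os


def extension_of(key: str) -> str:
    """Return the extension of an S3 key, e.g. 'path/to/file.txt' -> '.txt'."""
    return os.path.splitext(key)[1]


def content_type_of(bucket: str, key: str) -> str:
    """Fall back to the object's metadata when the key has no extension.

    Issues the equivalent of a HEAD Object request. Requires boto3 and
    AWS credentials; bucket/key names are placeholders.
    """
    import boto3  # only needed for the metadata fallback

    s3 = boto3.client("s3")
    head = s3.head_object(Bucket=bucket, Key=key)
    return head["ContentType"]
```

For an object uploaded with its content type set, `content_type_of` would return a value like `text/plain` even if the key itself has no extension.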