How can I list all files available in a Spark cluster stored on Hadoop HDFS, using Scala or Python?

Question

What is the most efficient way to list the names of all files that are locally available in Spark? I'm using the Scala API, but Python would be fine too.

Answer 1

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.{ListBuffer, Stack}

// `sc` is the SparkContext (predefined in spark-shell)
val fs = FileSystem.get(sc.hadoopConfiguration)
val dirs = Stack[String]()            // directories still to visit
val files = ListBuffer.empty[String]  // file paths found so far
dirs.push("/user/username/")

// Iterative depth-first walk: pop a directory, list it, and either
// queue each entry (directory) or record it (file)
while (dirs.nonEmpty) {
  val status = fs.listStatus(new Path(dirs.pop()))
  status.foreach { x =>
    if (x.isDirectory) dirs.push(x.getPath.toString)
    else files += x.getPath.toString
  }
}

files.foreach(println)
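
As a possible alternative to the manual stack above, Hadoop's FileSystem also offers listFiles(path, recursive), which walks subdirectories itself and returns only files. Here is a minimal sketch; the helper name listFilesRecursively is made up for illustration, and /user/username/ is the same placeholder path as in the code above.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Hypothetical helper (not part of Spark or Hadoop): collect every file
// under `root`, letting Hadoop handle the recursion.
def listFilesRecursively(fs: FileSystem, root: String): List[String] = {
  // recursive = true makes Hadoop descend into subdirectories
  val it = fs.listFiles(new Path(root), true)
  val files = ListBuffer.empty[String]
  // RemoteIterator is not a Scala Iterator, so drain it with a plain loop
  while (it.hasNext) files += it.next().getPath.toString
  files.toList
}

// In spark-shell, where `sc` is the predefined SparkContext:
// val fs = FileSystem.get(sc.hadoopConfiguration)
// listFilesRecursively(fs, "/user/username/").foreach(println)
```

Taking the FileSystem as a parameter keeps the helper independent of the SparkContext, so it can also be tried against a local filesystem before pointing it at HDFS.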
