How can I efficiently join a Spark DataFrame with a directory of small files?

1 answer

User

#1 · Posted on 2024-05-16 10:38:43

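Option 1: read all the files

You can read the descriptions with wholeTextFiles. It returns an RDD that maps each file path to the file's content, which you can then join with the DataFrame: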

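// Needed for .toDF on an RDD; assumes a SparkSession named `spark`
// (already in scope in spark-shell)
import spark.implicits._

val descriptions = sc
    .wholeTextFiles(".../descriptions")
    // Extract the id from the file name; files whose name contains
    // no digits are skipped instead of failing on id.get
    .flatMap { case (path, desc) =>
        val fileName = path.split("/").last
        "[0-9]+".r.findFirstIn(fileName).map(id => id.toLong -> desc)
    }
    .toDF("id", "description")

val result = df1.join(descriptions, Seq("id"))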
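
The id parsing used above can be isolated into a small function so that it can be checked without a cluster. This is only a minimal sketch, assuming the id is the first run of digits in the file name, as the regex above implies. The names `PathIds`, `idFromPath` and `sampleIds`, as well as the sample paths, are hypothetical and not part of the original answer.

```scala
object PathIds {
  // Hypothetical helper: takes the last path segment (the file name),
  // finds its first run of digits and parses it as a Long.
  // Returns None when the name has no digits or the number overflows Long.
  def idFromPath(path: String): Option[Long] = {
    val fileName = path.split("/").last
    "[0-9]+".r
      .findFirstIn(fileName)
      .flatMap(digits => scala.util.Try(digits.toLong).toOption)
  }

  // Made-up paths, for illustration only
  val samples: Seq[String] = Seq(
    "hdfs://host/descriptions/42.txt",
    "hdfs://host/descriptions/desc_7.txt",
    "hdfs://host/descriptions/README"
  )

  // flatMap drops the paths for which no id could be extracted
  val sampleIds: Seq[Long] = samples.flatMap(idFromPath)
}
```

Returning an `Option` and using `flatMap` means a stray file such as a README is skipped rather than failing the whole job, which may or may not be what you want; throwing instead is a reasonable choice if every file is expected to carry an id.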

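Option 2: read only the files you need

For this you can use binaryFiles. It creates an RDD that maps each file path to a PortableDataStream, so the files are not read right away. You can then select all the distinct ids from df1, join them with the RDD, and read the content of only the files that are actually needed. The code looks like this: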

val idRDD = df1
    .select("id").distinct
    .rdd.map(_.getAs[Long]("id") -> true)

val descriptions = sc.binaryFiles(".../descriptions")
    // Extract the id as in option 1, but the file content is not read yet;
    // files whose name contains no digits are skipped
    .flatMap { case (path, descFile) =>
        val fileName = path.split("/").last
        "[0-9]+".r.findFirstIn(fileName).map(id => id.toLong -> descFile)
    }
    // Inner join with the ids we are interested in
    .join(idRDD)
    .map { case (id, (file, _)) => id -> file }
    // Only now read the files that survived the join
    .mapValues { file =>
        val reader = scala.io.Source.fromInputStream(file.open())
        // Close the reader even if reading fails
        try reader.getLines().mkString("\n") finally reader.close()
    }
    .toDF("id", "description")

val result = df1.join(descriptions, Seq("id"))
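
The idea behind option 2 (keep a cheap handle per file, filter by id first, open files last) can be illustrated without Spark. The following is a plain-Scala analogy, not Spark code and not part of the original answer: the names `LazyDescriptions`, `readAll` and `readWanted` are hypothetical, a `java.nio.file.Path` stands in for the stream handle, and a `Set` lookup stands in for the RDD join.

```scala
import java.nio.file.Path
import scala.io.Source

object LazyDescriptions {
  // Reads a whole file as text, closing the reader even if reading fails.
  // Lines are re-joined with "\n", so a trailing newline is not preserved.
  def readAll(file: Path): String = {
    val reader = Source.fromFile(file.toFile, "UTF-8")
    try reader.getLines().mkString("\n") finally reader.close()
  }

  // `handles` maps id -> file handle; nothing has been read at this point.
  // Only the handles whose id is in `wanted` are ever opened.
  def readWanted(handles: Map[Long, Path], wanted: Set[Long]): Map[Long, String] =
    handles
      .filter { case (id, _) => wanted.contains(id) } // stands in for the join
      .map { case (id, file) => id -> readAll(file) } // stands in for mapValues
}
```

Whether this ordering pays off in Spark depends on how many of the files are actually needed: if df1 references most of the ids anyway, the extra join may cost more than simply reading everything as in option 1.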
