How to read multiple tables from the same database and save each one to its own CSV file?

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true")
sc = SparkContext(conf=conf)  # pass conf by keyword; the first positional argument is master
sqlContext = SQLContext(sc)

# Read a single table over JDBC
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:sqlserver://DBServer:PORT")
      .option("databaseName", "xxx")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "xxx")
      .option("user", "xxx")
      .option("password", "xxxx")
      .load())
df.registerTempTable("test")

# Write that table out as CSV
df.write.format("com.databricks.spark.csv").save("poc/amitesh/csv")
exit()

1 Answer

User

#1 · Posted 2024-05-20 02:31:32

where in I have to save 4 table from same database in CSV format in 4 different files at a time through pyspark code.

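You have to write one transformation (a read followed by a write) for every table in the database, using sqlContext.read.format.

The only difference between the table-specific ETL pipelines is the dbtable option, which is different for each table. Once you have a DataFrame, save it to its own CSV file.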

The code could look as follows (it is in Scala, so I leave converting it to Python as a home exercise):

import org.apache.spark.sql.DataFrame

val datasetFromTABLE_ONE: DataFrame = sqlContext.
  read.
  format("jdbc").
  option("url","jdbc:sqlserver://DBServer:PORT").
  option("databaseName","xxx").
  option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").
  option("dbtable","TABLE_ONE").
  option("user","xxx").
  option("password","xxxx").
  load()

// save the dataset from TABLE_ONE as CSV (Spark writes a directory of part files at this path)
datasetFromTABLE_ONE.write.csv("table_one.csv")

Repeat the same code for every table you want to save to CSV.
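Rather than copying the block once per table, the repetition can be folded into a loop. Here is a minimal PySpark sketch, assuming a `sqlContext` like the one in the question. The helper names `read_table` and `export_tables` and the output directory layout are hypothetical, and the connection options are the placeholders from the question:

```python
import os

# Connection options shared by every table (placeholders copied from the question).
JDBC_OPTIONS = {
    "url": "jdbc:sqlserver://DBServer:PORT",
    "databaseName": "xxx",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "user": "xxx",
    "password": "xxxx",
}


def read_table(sql_context, table):
    """Load one table over JDBC; only the dbtable option differs per table."""
    reader = sql_context.read.format("jdbc")
    for key, value in JDBC_OPTIONS.items():
        reader = reader.option(key, value)
    return reader.option("dbtable", table).load()


def export_tables(sql_context, tables, out_dir):
    """Write each table to its own CSV output path and return those paths."""
    paths = []
    for table in tables:
        path = os.path.join(out_dir, table.lower())
        # DataFrameWriter.csv is available from Spark 2.0. Spark writes a
        # directory of part files at this path, not a single file.
        read_table(sql_context, table).write.csv(path, header=True)
        paths.append(path)
    return paths
```

A call such as `export_tables(sqlContext, ["TABLE_ONE", "TABLE_TWO"], "csv_out")` would then run the read and write for each table in turn, one after another.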

Done!

100-Table Case: Fair Scheduling

That solution prompted a follow-up question:

What when I have 100 or more tables? How to optimize the code for that? How to do it effectively in Spark? Any parallelization?

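The SparkContext that sits behind SparkSession is thread-safe, which means you can use it from multiple threads. If you are thinking of one thread per table, that is the right approach.

You could spawn as many threads as you have tables, say 100, and start them. Spark then decides what to execute and when.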
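The thread-per-table idea could be sketched in Python with a thread pool. This is an illustration under assumptions, not a tuned implementation: `export_one` stands for any caller-supplied function that reads one table and writes its CSV, and `export_in_parallel` and the `max_workers` cap are hypothetical choices.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def export_in_parallel(export_one, tables, max_workers=8):
    """Call export_one(table) from separate threads, one Spark job per table.

    export_one is a hypothetical callable that performs the JDBC read and
    the CSV write for a single table. Returns (results, errors): the return
    value for each table that succeeded and the exception for each that failed.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(export_one, table): table for table in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                results[table] = future.result()
            except Exception as exc:  # keep going if one table fails
                errors[table] = exc
    return results, errors
```

The driver threads spend most of their time waiting on Spark, so the pool size mainly bounds how many jobs are submitted at once. Collecting errors per table means one failing export does not hide the outcome of the others.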

That is something Spark does using Fair Scheduler Pools. It is not a widely known feature of Spark, but it is worth considering for this case:

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

With it, your loading and saving pipelines may get faster.
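For illustration, fair scheduling is switched on through the `spark.scheduler.mode` property, and a thread can direct its jobs to a named pool through the `spark.scheduler.pool` local property on the SparkContext. The sketch below assumes an existing SparkContext is passed in; the helper names `fair_scheduler_settings` and `run_in_pool` and any pool name are hypothetical.

```python
def fair_scheduler_settings():
    """Spark properties that switch in-application scheduling from FIFO to FAIR.

    Apply them before the SparkContext is created, for example with
    conf.set(key, value) on a SparkConf.
    """
    return {"spark.scheduler.mode": "FAIR"}


def run_in_pool(spark_context, pool_name, job):
    """Run job() with the calling thread's Spark jobs assigned to pool_name.

    job is a hypothetical zero-argument callable, such as a function that
    reads one table and writes its CSV.
    """
    # The pool is a thread-local property, so set it in the thread that
    # submits the job.
    spark_context.setLocalProperty("spark.scheduler.pool", pool_name)
    return job()
```

In PySpark, how thread-local properties behave depends on how Python threads map to JVM threads, so it is worth checking the job scheduling documentation for your Spark version before relying on per-thread pools.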
