Extracting "data" from an Amazon Ion file

2 answers

User

Answer 1 · edited 2024-05-14 15:41:34

诺西菲维

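AWS Glue can read Amazon Ion input. Many other services and applications cannot, however, so using Glue to convert Ion data to JSON is a good idea. Note that Ion is a superset of JSON that adds several data types, so converting Ion to JSON may involve some down-conversion.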
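To make "down-conversion" concrete, here is a minimal sketch in plain Python. It does not use the Ion library; `Decimal`, `datetime` and `bytes` stand in for Ion's decimal, timestamp and blob types, and the field names `Balance`, `RegisteredAt` and `Photo` are made up for illustration. The mapping (timestamps and blobs become strings, non-finite floats become null) loosely follows the down-conversion rules described in the Ion documentation, with one simplification: decimals are turned into floats here, which can lose precision.

```python
import base64
import json
import math
from datetime import datetime, timezone
from decimal import Decimal


def down_convert(value):
    """Map a Python stand-in for an Ion value onto something JSON can hold."""
    if isinstance(value, dict):
        return {key: down_convert(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [down_convert(item) for item in value]
    if isinstance(value, datetime):
        return value.isoformat()  # timestamp -> string
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(value).decode("ascii")  # blob -> base64 string
    if isinstance(value, Decimal):
        return float(value)  # decimal -> number (simplification, may lose precision)
    if isinstance(value, float) and not math.isfinite(value):
        return None  # nan and +/-inf have no JSON representation
    return value


# Hypothetical record; only the typed fields change shape on the way to JSON.
record = {
    "PersonId": "4tPW8xtKSGF5b6JyTihI1U",
    "Balance": Decimal("1234.5600"),
    "RegisteredAt": datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    "Photo": b"\x00\x01\x02",
}

as_json = json.dumps(down_convert(record), sort_keys=True)
print(as_json)
```

After the conversion a JSON consumer can no longer tell that `RegisteredAt` was a timestamp or that `Photo` was binary data, which is the information loss the answer is warning about.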

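A good way to access QLDB documents from a QLDB S3 export is to use Glue to extract the document data, store it in S3 as JSON, and query it with Amazon Athena. The process is as follows: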

Export your ledger data to S3
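Create a Glue crawler to crawl and catalog the exported data
Run a Glue ETL job to extract the revision data from the export files, convert it to JSON, and write it to S3
Create a Glue crawler to crawl and catalog the extracted data
Query the extracted document revision data with Amazon Athena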

Take a look at the PySpark script below. It extracts only the revision metadata and the data payload from the QLDB export files.

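The QLDB export records which table each document belongs to, but it stores that mapping separately from the revision data. You will need some extra coding to include the table name in the output revision data. The PySpark script below does not do that, so all revisions end up in a single table in the output.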
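As a rough illustration of that extra coding, the sketch below attaches a table name to each revision in plain Python, outside of Glue. It assumes each exported block has already been loaded into a dict, that the document-to-table mapping sits under `transactionInfo.documents` keyed by document ID with a `tableName` entry, and that each revision carries its document ID in `metadata.id`. Check those field names against your own export before relying on them; the function name and the sample values are hypothetical.

```python
def revisions_with_table_names(blocks):
    """Yield one flat record per revision, tagged with its table name."""
    for block in blocks:
        # Assumed layout: {document_id: {"tableName": ...}, ...}
        documents = block.get("transactionInfo", {}).get("documents", {})
        for revision in block.get("revisions", []):
            if revision.get("data") is None:
                continue  # skip revisions that carry no data payload
            doc_id = revision["metadata"]["id"]
            yield {
                "tableName": documents.get(doc_id, {}).get("tableName"),
                "metadata": revision["metadata"],
                "data": revision["data"],
            }


# Hypothetical sample blocks shaped like the assumed export layout.
sample_blocks = [
    {
        "transactionInfo": {
            "documents": {
                "doc-1": {"tableName": "Vehicle"},
                "doc-2": {"tableName": "Person"},
            }
        },
        "revisions": [
            {"metadata": {"id": "doc-1", "version": 0}, "data": {"Make": "Audi"}},
            {"metadata": {"id": "doc-2", "version": 0}, "data": {"FirstName": "Raul"}},
            {"hash": "no-data-here"},
        ],
    },
    {"transactionInfo": {"documents": {}}},  # a block without revisions
]

rows = list(revisions_with_table_names(sample_blocks))
for row in rows:
    print(row["tableName"], row["metadata"]["id"], row["data"])
```

With the table name on every record, the output can be partitioned or filtered per table instead of landing in one combined table.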

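Also note that you will get every revision present in the exported data. That is, for a given document ID you may get multiple document revisions. Depending on how you intend to use the data, you may need to work out how to keep only the latest revision of each document ID.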
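One possible way to keep only the latest revision per document ID is sketched below in plain Python. It assumes each revision's `metadata` holds an `id` and a `version` that increases with every new revision of the document; the function name and the sample data are hypothetical, and this is independent of the Glue job itself.

```python
def latest_revisions(revisions):
    """Return the highest-version revision for each document ID."""
    latest = {}
    for revision in revisions:
        meta = revision["metadata"]
        current = latest.get(meta["id"])
        # Keep this revision if it is the first or the newest seen so far.
        if current is None or meta["version"] > current["metadata"]["version"]:
            latest[meta["id"]] = revision
    return list(latest.values())


# Hypothetical revisions: doc-1 has three versions, doc-2 has one.
sample_revisions = [
    {"metadata": {"id": "doc-1", "version": 0}, "data": {"Owner": "A"}},
    {"metadata": {"id": "doc-1", "version": 2}, "data": {"Owner": "C"}},
    {"metadata": {"id": "doc-2", "version": 0}, "data": {"Owner": "X"}},
    {"metadata": {"id": "doc-1", "version": 1}, "data": {"Owner": "B"}},
]

current_docs = latest_revisions(sample_revisions)
for doc in current_docs:
    print(doc["metadata"]["id"], doc["metadata"]["version"], doc["data"])
```

The same idea carries over to Spark or Athena by grouping on the document ID and selecting the row with the maximum version.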

from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import explode
from pyspark.sql.functions import col
from awsglue.dynamicframe import DynamicFrame

# Initializations
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load data.  'vehicle-registration-ion' is the name of your database in the Glue catalog for the export data.  '2020' is the name of your table in the Glue catalog.
dyn0 = glueContext.create_dynamic_frame.from_catalog(database = "vehicle-registration-ion", table_name = "2020", transformation_ctx = "datasource0")

# Only give me exported records with revisions
dyn1 = dyn0.filter(lambda line: "revisions" in line)

# Now give me just the revisions element and convert to a Spark DataFrame.
df0 = dyn1.select_fields("revisions").toDF()

# Revisions is an array, so give me all of the array items as top-level "rows" instead of being a nested array field.
df1 = df0.select(explode(df0.revisions))

# Now I have a list of elements with "col" as their root node and the revision 
# fields ("data", "metadata", etc.) as sub-elements.  Explode() gave me the "col"
# root node and some rows with null "data" fields, so filter out the nulls.
df2 = df1.where(col("col.data").isNotNull())

# Now convert back to a DynamicFrame
dyn2 = DynamicFrame.fromDF(df2, glueContext, "dyn2")

# Prep and send the output to S3
applymapping1 = ApplyMapping.apply(frame = dyn2, mappings = [("col.data", "struct", "data", "struct"), ("col.metadata", "struct", "metadata", "struct")], transformation_ctx = "applymapping1")
datasink0 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://YOUR_BUCKET_NAME_HERE/YOUR_DESIRED_OUTPUT_PATH_HERE/"}, format = "json", transformation_ctx = "datasink0")

I hope this helps.

User
Answer 2 · edited 2024-05-14 15:41:34

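Have you tried using the Amazon Ion library?
Assuming the data mentioned in the question is in a file named "myIonFile.ion", and that the file contains only Ion objects, we can read the data from the file as follows: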
from amazon.ion import simpleion

# Open the file in binary mode and read its bytes
with open("myIonFile.ion", "rb") as file:
    data = file.read()

# single_value=False returns a list of the top-level Ion values
iondata = simpleion.loads(data, single_value=False)

# Print the field from each top-level struct
for value in iondata:
    print(value['PersonId'])  # should print "4tPW8xtKSGF5b6JyTihI1U"
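For further guidance on using the Ion library, see the Ion Cookbook.
Also, I am not sure about your use case, but you can also interact with QLDB through the QLDB Driver, which depends directly on the Ion library.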
