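Java: partitioning and storing a Dataset that has a string column whose values look numeric. When read back, the data is still a "string" but has lost its leading zeros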

In Spark 3.0.2, I am writing a Dataset to a Parquet file. My code ends like this:

etablissements = etablissements.repartition(col("codeDepartement"));
etablissements = etablissements.sortWithinPartitions(col("siret"));
etablissements = etablissements.persist();
      
// Write it to a file whose name contains the year of the data, the selections, and the sorting.
// The underlying statement that writes the Parquet file is:
// ds.write().partitionBy(colonnesPartionnement /* = codeDepartement */)
saveToStore(etablissements, new String[] {"codeDepartement"}, 
   "{0}_{1,number,#0}_{2}_{3}", "etablissements", anneeSIRENE,  actifsSeulement, 
   communesValides);

The codeDepartement column is a string, because French department codes are codes of up to three characters:

# schema() :
|-- codeDepartement: string (nullable = true)

It is visible in the last third of the show() output (three columns before the uppercase city name) and has, for example, the value "01":

+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|siren    |nic  |siret         |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse         |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie              |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex         |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1          |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|codeDepartement|arrondissement|typeNomEtCharniere|nomMajuscules           |nomCommune              |libelle                 |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI                           |libelleNAF                                                                                   |
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|015850944|00024|01585094400024|O                           |2007-04-01               |11                    |2017                       |null                             |2019-11-14T14:00:12  |false             |2                          |ZONE INDUSTRIELLE         |null      |null            |CHE       |DE THIL                  |01700     |null               |null                |01376      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2008-01-01            |A                             |null               |null     |null     |null                     |25.73B            |NAFRev2                       |O                              |true  |2017                        |true              |false|2             |COM        |84        |01             |012           |0                 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113      |null              |5            |210103768   |4006            |3967                |39                   |210103768         |240100800|CC de Miribel et du Plateau       |Fabrication d'autres outillages                                                              |
|015851793|00479|01585179300479|O                           |2005-01-01               |11                    |2017                       |null                             |2019-06-24T13:04:28  |false             |2                          |null                      |null      |null            |null      |ZONE INDUST LA FONTAINE  |01290     |null               |null                |01134      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2008-01-01            |A                             |null               |null     |null     |null                     |46.73A            |NAFRev2                       |O                              |true  |2017                        |true              |false|2             |COM        |84        |01             |012           |0                 |CROTTET                 |Crottet                 |Crottet                 |0123      |null              |3            |210101341   |1777            |1734                |43                   |210101341         |200070555|CC de la Veyle                    |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |
|015851793|00743|01585179300743|O                           |2012-09-01               |02                    |2017                       |null                             |2019-06-24T13:04:28  |false             |1                          |ZA ACTIPARC               |null      |null            |null      |PRE LION                 |01190     |null               |null                |01057      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2012-09-01            |A                             |null               |null     |null     |DORAS                    |46.73A            |NAFRev2                       |O                              |true  |2017                        |true              |false|1             |COM        |84        |01             |012           |0                 |BOZ                     |Boz                     |Boz                     |0117      |null              |3            |210100574   |519             |512                 |7                    |210100574         |200071371|CC Bresse et Saône                |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |
|015851793|00917|01585179300917|O                           |2020-01-01               |null                  |null                       |null                             |2020-01-31T16:13:25  |false             |1                          |null                      |28        |null            |AV        |DE MARBOZ                |01000     |null               |null                |01053      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2020-01-01            |A                             |CLEAU              |null     |null     |null                     |46.73A            |NAFRev2                       |O                              |true  |null                        |true              |false|1             |COM        |84        |01             |012           |0                 |BOURG EN BRESSE         |Bourg-en-Bresse         |Bourg-en-Bresse         |0199      |null              |8            |210100533   |43306           |41527               |1779                 |210100533         |200071751|CA du Bassin de Bourg-en-Bresse   |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |

I can see that the folders under my Parquet file look fine:

codeDepartement=01
codeDepartement=2A
codeDepartement=75
codeDepartement=971

Note: because of values such as 2A (for Corse), department codes can never be converted to numeric values.

The snappy.parquet chunks are stored in their respective folders, for example /data/tmp/etablissements_2020_true_true/codeDepartement=01, so that part is fine.

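Later, I read the contents of that store back, searching for cities whose city code (which in France starts with the department code) starts with "01". The appropriate Parquet file and chunk are picked up: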

2021-03-24 07:14:33.825  INFO 13860 --- [er for task 106] o.a.s.s.e.datasources.FileScanRDD        : Reading File path: file:/data/tmp/etablissements_2020_true_true/codeDepartement=01/part-00024-f7d33eea-6d79-4f1a-bf35-0666dcc5e0f5.c000.snappy.parquet, range: 0-5246504, partition values: [1]

When the department is displayed (the column is now at the end of the Dataset show() output), it has the value "1" instead of "01":

+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|siren    |nic  |siret         |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse         |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie              |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex         |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1          |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|arrondissement|typeNomEtCharniere|nomMajuscules           |nomCommune              |libelle                 |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI                           |libelleNAF                                                                                   |codeDepartement|
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|015850944|00024|01585094400024|O                           |2007-04-01               |11                    |2017                       |null                             |2019-11-14T14:00:12  |false             |2                          |ZONE INDUSTRIELLE         |null      |null            |CHE       |DE THIL                  |01700     |null               |null                |01376      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2008-01-01            |A                             |null               |null     |null     |null                     |25.73B            |NAFRev2                       |O                              |true  |2017                        |true              |false|2             |COM        |84        |012           |0                 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113      |null              |5            |210103768   |4006            |3967                |39                   |210103768         |240100800|CC de Miribel et du Plateau       |Fabrication d'autres outillages                                                              |1              |
|015851793|00479|01585179300479|O                           |2005-01-01               |11                    |2017                       |null                             |2019-06-24T13:04:28  |false             |2                          |null                      |null      |null            |null      |ZONE INDUST LA FONTAINE  |01290     |null               |null                |01134      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2008-01-01            |A                             |null               |null     |null     |null                     |46.73A            |NAFRev2                       |O                              |true  |2017                        |true              |false|2             |COM        |84        |012           |0                 |CROTTET                 |Crottet                 |Crottet                 |0123      |null              |3            |210101341   |1777            |1734                |43                   |210101341         |200070555|CC de la Veyle                    |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |1              |
|015851793|00743|01585179300743|O                           |2012-09-01               |02                    |2017                       |null                             |2019-06-24T13:04:28  |false             |1                          |ZA ACTIPARC               |null      |null            |null      |PRE LION                 |01190     |null               |null                |01057      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2012-09-01            |A                             |null               |null     |null     |DORAS                    |46.73A            |NAFRev2                       |O                              |true  |2017                        |true              |false|1             |COM        |84        |012           |0                 |BOZ                     |Boz                     |Boz                     |0117      |null              |3            |210100574   |519             |512                 |7                    |210100574         |200071371|CC Bresse et Saône                |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |1              |
|015851793|00917|01585179300917|O                           |2020-01-01               |null                  |null                       |null                             |2020-01-31T16:13:25  |false             |1                          |null                      |28        |null            |AV        |DE MARBOZ                |01000     |null               |null                |01053      |null |null                 |null            |null           |null                       |null                |null                      |null                |null                 |null                |null                |null                         |null                          |null                 |null           |null                  |null                      |null                     |2020-01-01            |A                             |CLEAU              |null     |null     |null                     |46.73A            |NAFRev2                       |O                              |true  |null                        |true              |false|1             |COM        |84        |012           |0                 |BOURG EN BRESSE         |Bourg-en-Bresse         |Bourg-en-Bresse         |0199      |null              |8            |210100533   |43306           |41527               |1779                 |210100533         |200071751|CA du Bassin de Bourg-en-Bresse   |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction         |1              |

Even though the Parquet file still declares it as a StringType:

|-- codeDepartement: string (nullable = true)

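What happened?

I am inclined to blame the repartition() statement for this mess, but I don't see how it could be responsible. If that command were at fault and data could not be partitioned by string values, how would a program partition data by alphabetic values such as red, blue, and yellow?

I don't understand the overall behavior (or problem?) I am facing.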


2 answers

  1. # Answer 1

    You can disable the option spark.sql.sources.partitionColumnTypeInference.enabled.

    From the Partition Discovery documentation:

    [...] Sometimes users may not want to automatically infer the data types of the partitioning columns. For these use cases, the automatic type inference can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which is default to true. When type inference is disabled, string type will be used for the partitioning columns.

    To set the option:

    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
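
    To illustrate what the quoted passage means for the question's folders, here is a plain-Java sketch of the effect of type inference on partition values. It is a simplified model that matches the behavior observed in this thread, not Spark's actual implementation; the class name PartitionInferenceSketch and its methods are hypothetical. Each directory value is parsed on its own, so "01" is first read as the number 1. Because "2A" is not numeric, the column as a whole stays a string, but the value that was already parsed keeps its numeric rendering.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    /** Simplified model of partition-value type inference; NOT Spark's code. */
    public class PartitionInferenceSketch {

        /** Returns true when the raw directory value parses as a whole number. */
        public static boolean looksNumeric(String raw) {
            try {
                Long.parseLong(raw);
                return true;
            } catch (NumberFormatException e) {
                return false;
            }
        }

        /**
         * Models how partition values come back as strings.
         * With inference on, numeric-looking values pass through a number first,
         * which drops leading zeros; with inference off, raw text is kept.
         */
        public static List<String> readBack(List<String> rawValues, boolean inferTypes) {
            List<String> result = new ArrayList<>();
            for (String raw : rawValues) {
                if (inferTypes && looksNumeric(raw)) {
                    result.add(String.valueOf(Long.parseLong(raw)));
                } else {
                    result.add(raw);
                }
            }
            return result;
        }

        public static void main(String[] args) {
            // Values taken from the codeDepartement=... folders in the question.
            List<String> dirs = Arrays.asList("01", "2A", "75", "971");
            System.out.println(readBack(dirs, true));   // prints [1, 2A, 75, 971]
            System.out.println(readBack(dirs, false));  // prints [01, 2A, 75, 971]
        }
    }
    ```

    With inference disabled, as the option above does, the model returns the folder values untouched, which is why "01" survives.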
    
  2. # Answer 2

    I can reproduce the problem:

    spark.sql("select '01' key, 123 val union all select 'ab', 456").show()
    +---+---+
    |key|val|
    +---+---+
    | 01|123|
    | ab|456|
    +---+---+
    
    spark.sql("select '01' key, 123 val union all select 'ab', 456").write().partitionBy("key").parquet("test")
    
    spark.read().parquet("test").show()
    +---+---+
    |val|key|
    +---+---+
    |456| ab|
    |123|  1|
    +---+---+
    

    To work around this, you can provide a schema when reading:

    spark.read().schema(spark.read().parquet("test").schema()).parquet("test").show()
    +---+---+
    |val|key|
    +---+---+
    |456| ab|
    |123| 01|
    +---+---+
    

    (Tested in PySpark; hopefully it also works in Java.)
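
    For Java, a minimal sketch of the same workaround might look like the following. It assumes Spark SQL is on the classpath and a local master; the class name ReadPartitionedAsString and the app name are hypothetical. Instead of reusing the schema of a first read, it spells the schema out with StructType, declaring the partition column key as a string.

    ```java
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class ReadPartitionedAsString {
        public static void main(String[] args) {
            // Local session for experimentation only (assumption).
            SparkSession spark = SparkSession.builder()
                .appName("read-partitioned-as-string")
                .master("local[*]")
                .getOrCreate();

            // Recreate the small test data set from this answer.
            spark.sql("select '01' key, 123 val union all select 'ab', 456")
                .write().mode("overwrite").partitionBy("key").parquet("test");

            // Explicit schema: the partition column is declared as a string,
            // so its values are not inferred from the directory names.
            StructType schema = new StructType()
                .add("val", DataTypes.IntegerType)
                .add("key", DataTypes.StringType);

            Dataset<Row> ds = spark.read().schema(schema).parquet("test");
            ds.show();

            spark.stop();
        }
    }
    ```

    Declaring the schema by hand avoids the extra read that fetches the schema first, at the cost of having to keep the column list in sync with the stored data.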