How to convert a Document type into a Spark RDD

Posted 2024-04-16 18:48:47


I'm trying to convert a `Document` type into a Spark RDD, but I don't know how. Basically, I'm trying to call the Google Cloud Natural Language API from Apache Spark. Here is my code:

Edit

from pyspark.sql.types import *
from pyspark.sql import SparkSession
import six
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

spark = SparkSession.builder.master('yarn-client').appName('SparkNLP').getOrCreate()
gcs_uri = 'gs://mybucket/reddit.json'
document = types.Document(gcs_content_uri=gcs_uri, type=enums.Document.Type.PLAIN_TEXT)
readRDD = spark.read.text(document)

As expected, the last line throws an error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 328, in text
    return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths)))
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1248, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1218, in _build_args
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 298, in get_command_part
AttributeError: _get_object_id

Can anyone point me in the right direction?
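One possible reading of the traceback: `spark.read.text` expects a path string (or list of paths), and py4j raises `AttributeError: _get_object_id` when asked to serialize a Python object, like the NLP `Document`, that has no JVM counterpart. A minimal sketch of one workaround, assuming the goal is to analyze each line of the GCS file: give Spark the URI string and build `Document` objects on the executors instead (the function name and per-partition structure here are illustrative, not from the original post):

```python
def analyze_partition(lines):
    """Run sentiment analysis on each text line of one partition.

    The google-cloud imports and client creation happen inside the
    function so they execute on the executor, once per partition,
    rather than being pickled from the driver.
    """
    from google.cloud import language
    from google.cloud.language import enums, types
    client = language.LanguageServiceClient()
    for text in lines:
        doc = types.Document(content=text,
                             type=enums.Document.Type.PLAIN_TEXT)
        yield client.analyze_sentiment(document=doc)

# On the driver (assuming `spark` is an active SparkSession):
#   lines_rdd = (spark.read.text('gs://mybucket/reddit.json')
#                .rdd.map(lambda row: row.value))
#   results = lines_rdd.mapPartitions(analyze_partition).collect()
```

Note that this trades the single `gcs_content_uri` call for one API call per line; whether that is acceptable depends on the data volume and API quota.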
