"Normalize" a DataFrame of sentences into a larger DataFrame of words

1 answer

User

#1 · Posted on 2024-05-15 15:48:18

It's actually quite simple. Let's start by creating a DataFrame:

from pyspark.sql import Row

# Assumes a PySpark shell, where `sc` (the SparkContext) is already defined.
df = sc.parallelize([
    Row(sentence_id=1, sentence=u'the dog ran the fastest.'),
    Row(sentence_id=2, sentence=u'the cat sat down.')
]).toDF()

Next we need a tokenizer:

[code block missing from the source page]

Finally we drop sentence and explode words:

from pyspark.sql.functions import explode, col

# Drop the raw sentence, then emit one row per element of the words array.
transformed = (tokenized
    .drop("sentence")
    .select(col("sentence_id"), explode(col("words")).alias("word")))

The final result is:

transformed.show()

## +-----------+-------+
## |sentence_id|   word|
## +-----------+-------+
## |          1|    the|
## |          1|    dog|
## |          1|    ran|
## |          1|    the|
## |          1|fastest|
## |          2|    the|
## |          2|    cat|
## |          2|    sat|
## |          2|   down|
## +-----------+-------+

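Notes:

Depending on the data, explode can be quite expensive because it copies the other columns. Make sure to apply all filters, for example with StopWordsRemover, before applying explode.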
