How do I flatten an RDD in Python?

Published 2024-05-16 15:43:17


I have a dataset of spam SMS messages. Its type is:

pyspark.rdd.PipelinedRDD

When I print a few of its elements, I get:

[["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"], ['WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'], ['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']]

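As you can see, each element of the list is wrapped in its own pair of brackets. How can I get rid of those inner brackets? I have tried many ways to flatten it, but none of them seems to work.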


3 answers

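You can use the RDD's flatMap method. It lets you produce multiple rows from a single input row, so returning each inner list unchanged spreads its elements into the top-level RDD.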

spams.flatMap(lambda x:x).take(3)
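To make the effect of flatMap concrete without a Spark session, here is a minimal local sketch. The helper `flat_map` and the sample `rows` are assumptions introduced for illustration only; they are plain Python stand-ins, not part of the PySpark API, and merely mimic what `flatMap` does to each row.

```python
from itertools import chain


def flat_map(func, rows):
    """Hypothetical local stand-in for RDD.flatMap.

    Applies func to every row, then concatenates the iterables it returns.
    """
    return list(chain.from_iterable(func(row) for row in rows))


# Sample data shaped like the question's output: one message per inner list.
rows = [["msg one"], ["msg two"], ["msg three"]]

# The identity function returns each inner list as-is; flattening then
# removes one level of nesting.
flattened = flat_map(lambda x: x, rows)
print(flattened)
```

On a real RDD the equivalent call is the one shown in the answer above; this sketch only models the semantics.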

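Your question does not make clear whether you want to remove the brackets after collecting the data into a list or before collecting it. Since other users have already answered the after-collection case, I will answer for the case where the data is still an RDD. It is very straightforward: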

# Keep only the first (and only) element of each inner list.
spams = spams.map(lambda x: x[0])
print(spams.take(3))

This removes the inner brackets.
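Taking the first element of each row and flattening each row give the same result for the data in the question, where every inner list holds exactly one message, but they differ on other shapes. The sketch below models both on a local list; the sample `rows` is an assumption chosen to show the difference, and the list comprehensions are plain Python stand-ins rather than PySpark calls.

```python
# Assumed sample: one row with a single item, one with two, one empty.
rows = [["a"], ["b", "c"], []]

# Flattening (what flatMap with an identity function models):
# every item is kept and empty rows simply contribute nothing.
flattened = [item for row in rows for item in row]

# Taking the first element (what map with x[0] models):
# only the first item of each row survives...
first_items = [row[0] for row in rows if row]

# ...and without the `if row` guard an empty row raises IndexError.
try:
    [row[0] for row in rows]
    raised = False
except IndexError:
    raised = True

print(flattened)
print(first_items)
print(raised)
```

If every row is guaranteed to contain exactly one element, either approach works; otherwise the flattening form is the safer of the two.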

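These lines of code should help. They work on a plain Python list, such as the data after it has been collected: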

>>> msg = [["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
...  ['WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'],
...  ['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']]
>>> msg = [x[0] for x in msg]
>>> msg
["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.', 'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030']
