Error when converting a pandas DataFrame to a Spark DataFrame

Posted 2024-04-27 00:29:27


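I created a pandas DataFrame from some Stack Overflow posts, using lxml.etree to separate the code blocks from the text blocks. The code below shows the basic outline: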

import lxml.etree
import nltk
import pandas as pd

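# Join each record's tokens into one string, then unescape the HTML entities.
# Indexing into the pair avoids tuple-unpacking lambdas, which are Python 2 only.
a1 = tokensentRDD.map(lambda kv: (kv[0], ''.join(map(str, kv[1]))))
a2 = a1.map(lambda kv: (kv[0], kv[1].replace("&lt;", "<")))
a3 = a2.map(lambda kv: (kv[0], kv[1].replace("&gt;", ">")))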

def parsefunc (x):
    html = lxml.etree.HTML(x)
    code_block = html.xpath('//code/text()')
    text_block = html.xpath('// /text()') 

    a4 =  code_block
    a5 =  len(code_block)
    a6 =  text_block
    a7 =  len(text_block)
    a8 = ''.join(map(str,text_block)).split(' ')
    a9 =  len(a8)
    a10 = nltk.word_tokenize(''.join(map(str,text_block)))

    numOfI = 0
    numOfQue = 0
    numOfExclam = 0

    for x in a10:
        if x == 'I':
            numOfI +=1
        elif x == '?':
            numOfQue +=1
        elif x == '!':
            numOfExclam += 1
    return (a4,a5,a6,a7,a9,numOfI,numOfQue, numOfExclam)

a11 = a3.take(6)
# Build lists (not lazy map objects) so a12 can be traversed more than once.
a12 = [(a, parsefunc(b)) for a, b in a11]

columns = ['code_block', 'len_code', 'text_block', 'len_text', 'words@text_block', 'numOfI', 'numOfQ', 'numOfExclam']
index = [x[0] for x in a12]
data = [x[1] for x in a12]

df = pd.DataFrame(data = data, columns = columns, index = index)
df.index.name = 'Id'
df

    code_block  len_code  text_block                                         len_text  words@text_block  numOfI  numOfQ  numOfExclam
Id
4   [decimal    3         [I want to use a track-bar to change a form's ...  18        72                5       1       0
6   [div, ]     5         [I have an absolutely positioned , div, conta...   22        96                4       4       0
9   [DateTime]  1         [Given a , DateTime, representing a person's ...   4         21                2       2       0
11  [DateTime]  1         [Given a specific , DateTime, value, how do I...   12        24                2       1       0

I need to create a Spark DataFrame so that I can apply machine learning algorithms to the output. I tried:

[code block missing from the source]

The error I get is:

[error message missing from the source]

Can someone tell me the correct way to convert a pandas DataFrame into a Spark DataFrame?


1 Answer
User
#1 · Posted 2024-04-27 00:29:27

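Your problem is not related to pandas. Both code_block (a4) and text_block (a6) contain lxml-specific objects that cannot be encoded using Spark SQL types. Converting them to strings should be enough: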

a4 = [str(x) for x in code_block]
a6 = [str(x) for x in text_block]
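
To see why the str() call matters, here is a minimal sketch that needs neither lxml nor Spark. It assumes the XPath text results behave like a subclass of the built-in string type (lxml calls these "smart strings"), so they print like ordinary text while their exact type is not str. FakeXPathResult and to_plain_strings are hypothetical names invented for this illustration; they are not part of lxml or PySpark.

```python
class FakeXPathResult(str):
    """Hypothetical stand-in for an lxml XPath text result (a str subclass)."""


def to_plain_strings(items):
    """Return a new list in which every element is an exact built-in str."""
    return [str(x) for x in items]


# Simulated output of html.xpath('//code/text()'); the values are assumptions.
code_block = [FakeXPathResult("decimal"), FakeXPathResult("DateTime")]
a4 = to_plain_strings(code_block)

# Before conversion the elements are instances of the subclass.
print([type(x).__name__ for x in code_block])
# After conversion every element is a plain str carrying the same text.
print([type(x).__name__ for x in a4])
```

The text is unchanged by the conversion; only the element type differs. That is presumably what lets Spark infer a string column, since the answer attributes the failure to the lxml-specific type rather than to the data itself.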
