Convert my data into a DataFrame where each row contains a list of tuples for one sentence

1 Should O
2 students O
3 be O
4 taught O
5 to O
6 compete O
7 or O
8 to O
9 cooperate O
10 ? O
------------------> THIS SHOWS, STARTING OF THE NEXT SENTENCES
1 It O
2 is O
3 always O
4 said O
5 that O
6 competition O
7 can O
8 effectively O
9 promote O
10 the O
11 development O
12 of O
13 economy O
14 . O

['1', 'Should', 'O']
['2', 'students', 'O']
['3', 'be', 'O']
['4', 'taught', 'O']
['5', 'to', 'O']
['6', 'compete', 'O']
['7', 'or', 'O']
['8', 'to', 'O']
['9', 'cooperate', 'O']
['10', '?', 'O']
[]
['1', 'It', 'O']
['2', 'is', 'O']
['3', 'always', 'O']
['4', 'said', 'O']
['5', 'that', 'O']
['6', 'competition', 'O']
['7', 'can', 'O']
['8', 'effectively', 'O']
['9', 'promote', 'O']
['10', 'the', 'O']
['11', 'development', 'O']
['12', 'of', 'O']
['13', 'economy', 'O']
['14', '.', 'O']

2 Answers

User

#1 · Edited 2024-05-15 02:29:46

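In short, the solution is to collect the needed fields of each line into a temporary list, append that temporary list to MyList whenever a sentence ends, and finally build the DataFrame from MyList, as shown below: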

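import pandas as pd

MyList = []    # one entry per sentence
tmp_list = []  # (token, tag) tuples of the sentence currently being read

# The with-block closes the file even if parsing fails.
with open("..\\data\\train.dat.abs", 'r', encoding='utf-8') as datContent:
    for line in datContent:
        a = line.split()
        if len(a) == 0:  # blank line between sentences
            if tmp_list:  # ignore repeated blank lines
                MyList.append(tmp_list)
                tmp_list = []
            continue
        tmp_list.append((a[1], a[2]))  # keep token and tag, drop the index

if len(tmp_list) > 0:  # append the last sentence if the file does not end with a blank line
    MyList.append(tmp_list)

df = pd.DataFrame({'sentence': MyList})

print(df)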
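The same blank-line grouping can also be written with itertools.groupby from the standard library. This is only a sketch of an alternative, not the method used above: the SAMPLE string and the read_sentences helper are made up for illustration, and it assumes every non-blank line has at least three whitespace-separated fields.

```python
from itertools import groupby

# Hypothetical in-memory sample standing in for the real file.
SAMPLE = """1 Should O
2 students O

1 It O
2 is O
3 always O
"""

def read_sentences(lines):
    """Group consecutive non-blank lines into lists of (token, tag) tuples."""
    rows = (line.split() for line in lines)
    # key=bool is False for the empty list produced by a blank line,
    # so each run of non-blank lines becomes one group (one sentence).
    return [
        [(fields[1], fields[2]) for fields in group]
        for has_fields, group in groupby(rows, key=bool)
        if has_fields
    ]

sentences = read_sentences(SAMPLE.splitlines())
print(sentences)
```

The resulting list can be passed to pd.DataFrame({'sentence': sentences}) exactly like MyList.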

User

#2 · Edited 2024-05-15 02:29:46

Try this:

For more information, see the regex demo

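import re

# Result form: rows['row1'], rows['row2'], ...
def getRowContainer(data):
    rowContainer={}
    rowData=[]
    rowCount=1
    # Matches either an "index token tag" line (two captured groups)
    # or a "---->" separator line (both groups come back empty).
    dataSet=re.findall(r'(?:^\d{1,14}\s+([a-zA-Z0-9?!.,]{1,20})\s+([^\s]+))|^-{1,20}>',data,flags=re.MULTILINE)
    for item in dataSet:
        if item[0]=='':  # separator line: start the next row
            rowCount+=1
            rowData=[]
            continue
        rowData.append(item)
        rowContainer[f'row{rowCount}']=rowData
    return rowContainer

# data is the input string defined further down in this answer
rows=getRowContainer(data)

for x in range(1,len(rows)+1):
    print(f'row {x}')
    print(rows[f'row{x}'])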

My snapshot of your input data is as follows:

data='''
1   Should  O
2   students    O
3   be  O
4   taught  O
5   to  O
6   compete O
7   or  O
8   to  O
9   cooperate   O
10  ?   O

------------------> THIS SHOWS, STARTING OF THE NEXT SENTENCES

1   It  O
2   is  O
3   always  O
4   said    O
5   that    O
6   competition O
7   can O
8   effectively O
9   promote O
10  the O
11  development O
12  of  O
13  economy O
14  .   O'''

The output I get:

row 1
[('Should', 'O'), ('students', 'O'), ('be', 'O'), ('taught', 'O'), ('to', 'O'), ('compete', 'O'), ('or', 'O'), ('to', 'O'), ('cooperate', 'O'), ('?', 'O')]
row 2
[('It', 'O'), ('is', 'O'), ('always', 'O'), ('said', 'O'), ('that', 'O'), ('competition', 'O'), ('can', 'O'), ('effectively', 'O'), ('promote', 'O'), ('the', 'O'), ('development', 'O'), ('of', 'O'), ('economy', 'O'), ('.', 'O')]
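Since the question asks for a DataFrame, the row dictionary still has to be converted. A minimal sketch, assuming pandas is installed: the rows dict below is a shortened, hypothetical stand-in for the value returned by getRowContainer(data), and the column name 'sentence' is borrowed from the first answer.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the dict returned by getRowContainer(data).
rows = {
    "row1": [("Should", "O"), ("students", "O"), ("be", "O")],
    "row2": [("It", "O"), ("is", "O")],
}

# One DataFrame row per sentence; each cell holds that sentence's list of tuples.
# Dicts preserve insertion order, so row1 comes before row2.
df = pd.DataFrame({"sentence": list(rows.values())})
print(df)
```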
