Rearranging a dialogue document into a DataFrame

file = open('the document','r')
Name = []
sentence = []
for line in file:
    if line.find("Column") != -1:
        continue
    if line.find("Section") or line.find("Index") or line.find("Home Page"):
        continue
    if line.find(':') != -1:
        tokens = line.split(":")
        Name.append(tokens[0])
    else:
        sentence.append(line + " ")

1 answer

User

#1 · posted 2024-05-13 07:29:23

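Here is a simple solution. It handles each line according to three cases:

When the line is empty
When the line ends with :
Otherwise

The code is as follows: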

import re
from collections import defaultdict


def clean_speaker(sp):
    sp = re.sub(r"(\(\w+\))", "", sp) #remove single words within parentheses
    sp = re.sub(r"(\d+\.?)", "", sp) #remove digits such as 1. or 2.
    return sp.strip()



document = []
with open('the document', 'r') as fin:
    foundSpeaker = False
    dialogue = defaultdict(str)
    for line in fin:
        line = line.strip()  # remove surrounding whitespace
        #  - When line is empty   -
        if not line:
            if dialogue:  # save the dialogue collected so far instead of dropping it
                document.append(dialogue)
            dialogue = defaultdict(str)
            foundSpeaker = False
        #  - When line ends with :   -
        elif line.endswith(":"):
            if dialogue:
                document.append(dialogue)
                dialogue = defaultdict(str)
            foundSpeaker = True
            dialogue["Speaker"] = clean_speaker(line[:-1])
        #  - Otherwise   -
        elif foundSpeaker:
            dialogue["Sentence"] += " " + line
        # any other line sits outside a speaker block and is skipped
    if dialogue:  # save the last dialogue, which no later line would flush
        document.append(dialogue)
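To try the three rules without a file on disk, here is a minimal sketch that runs the same logic over an in-memory sample. The function name parse_dialogue and the sample text are made up for illustration, and the real document's layout may differ. This sketch also saves the open dialogue on an empty line and at the end of the input, and it trims the leading space from each sentence.

```python
import re
from collections import defaultdict


def clean_speaker(sp):
    sp = re.sub(r"\(\w+\)", "", sp)  # drop single words in parentheses
    sp = re.sub(r"\d+\.?", "", sp)   # drop digits such as "1." or "2."
    return sp.strip()


def parse_dialogue(lines):
    """Hypothetical helper: apply the three rules to an iterable of lines."""
    document = []
    dialogue = defaultdict(str)
    found_speaker = False
    for line in lines:
        line = line.strip()
        if not line:                      # case 1: empty line closes the block
            if dialogue:
                document.append(dict(dialogue))
            dialogue = defaultdict(str)
            found_speaker = False
        elif line.endswith(":"):          # case 2: speaker line
            if dialogue:
                document.append(dict(dialogue))
                dialogue = defaultdict(str)
            found_speaker = True
            dialogue["Speaker"] = clean_speaker(line[:-1])
        elif found_speaker:               # case 3: sentence text
            dialogue["Sentence"] = (dialogue["Sentence"] + " " + line).strip()
        # lines outside a speaker block (page headers and so on) are skipped
    if dialogue:                          # flush the last open block
        document.append(dict(dialogue))
    return document


# Assumed sample layout, not taken from the asker's file.
sample = """Home Page
1. Alice (smiling):
Hello there.
How are you?
2. Bob:
Fine, thanks.

Section 2
"""
document = parse_dialogue(sample.splitlines())
for d in document:
    print(d)
```

With this sample, "Home Page" and "Section 2" are skipped because no speaker line precedes them, and the two speakers come out as "Alice" and "Bob" once clean_speaker has removed the numbering and the parenthesised word.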

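Now the variable document holds all the dialogues in the given file. It is a list of dictionaries, where each dictionary has only two keys (Speaker and Sentence). We can inspect the contents of document like this: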


for d in document:
    for key, value in d.items():
        print(key+":", value)

Alternatively, you can do something smarter: convert that list to a pandas.DataFrame and write the DataFrame to a CSV file, as follows:

import pandas as pd

pd.DataFrame.from_dict(document).to_csv('document.csv')

Now open document.csv and you will find everything neatly organized. I hope this helps.
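If pandas is not installed, the standard library's csv module can probably do the same job. The following is a minimal sketch, assuming rows shaped like document (a list of dicts with the keys Speaker and Sentence). The sample rows are invented, and an in-memory buffer stands in for the real file. Unlike the pandas call, it writes no index column.

```python
import csv
import io

# Hypothetical sample in the same shape as `document` (a list of dicts).
rows = [
    {"Speaker": "Alice", "Sentence": "Hello there."},
    {"Speaker": "Bob"},  # a speaker with no sentence recorded
]

# io.StringIO stands in for open('document.csv', 'w', newline='').
buffer = io.StringIO()
# restval fills in the columns that a row is missing.
writer = csv.DictWriter(buffer, fieldnames=["Speaker", "Sentence"], restval="")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Passing newline='' when opening a real file is what the csv module documentation recommends, so that the writer controls line endings itself.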
