Generating a heatmap from nested JSON

Posted 2024-04-24 14:17:41


I am trying to generate a heatmap from the SQuAD v1.1 dataset given here.


The SQuAD dataset is structured as follows:

Document/
├── Paragraph1/
│   ├── Question
│   ├── Answer1
│   ├── Answer2
│   └── Answer3
├── Paragraph2/
│   ├── Question
│   └── Answer1

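A document can have multiple paragraphs (contexts). Each paragraph (context) can have multiple questions, and each question can have multiple answers. It is depicted here.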
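To make the hierarchy concrete, here is a minimal sketch of the nesting as a Python dict, plus a generator that walks it. The key names (`data`, `paragraphs`, `context`, `qas`, `question`, `answers`, `text`) are the ones read by the conversion code in this post; the sample values and the helper name `iter_rows` are made up for illustration and are not taken from the dataset.

```python
# Hypothetical miniature of the SQuAD v1.1 layout; values are illustrative only.
sample = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "Context1",
                    "qas": [
                        {
                            "question": "Question1",
                            "answers": [{"text": "Answer1"}, {"text": "Answer2"}],
                        },
                        {
                            "question": "Question2",
                            "answers": [{"text": "Answer1"}],
                        },
                    ],
                }
            ]
        }
    ]
}


def iter_rows(squad):
    """Yield one (context, question, answer) tuple per answer (hypothetical helper)."""
    for document in squad["data"]:
        for paragraph in document["paragraphs"]:
            for qa_pair in paragraph.get("qas", []):
                for answer in qa_pair.get("answers", []):
                    yield paragraph["context"], qa_pair["question"], answer["text"]


rows = list(iter_rows(sample))
for row in rows:
    print(row)
```

Each level of the tree becomes one loop, so a question with several answers produces several rows that share the same context and question.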

I plan to denormalize the JSON into a CSV like the one below, which may be the wrong approach:

Context,Question,Answer
Context1,Question1,Answer1
Context1,Question1,Answer2
Context1,Question2,Answer1
...

So far, I have used the following code to flatten the nested JSON into a CSV file:

import json
import csv

# Read the nested SQuAD JSON; the encoding is set explicitly so non-ASCII text loads on any platform
with open(r'SQUAD v1.json', encoding='UTF-8') as squad_data_file_handle:
    squad_data = json.load(squad_data_file_handle)

with open('SQUAD_11_CSV.csv', 'w', newline='', encoding='UTF-8') as squad_csv_handle:
    writer = csv.writer(squad_csv_handle, dialect='excel', delimiter=',')
    writer.writerow(["Context", "Question", "Answer"])
    # One document per entry in "data", each holding a list of paragraphs
    for data in squad_data["data"]:
        for paragraph in data["paragraphs"]:
            context = str(paragraph["context"])
            question_answer_pairs = paragraph.get("qas", [])

            for qa_pair in question_answer_pairs:
                question = str(qa_pair["question"])
                # Drop repeated answer texts for the same question
                answers = list(set([str(answer.get("text")) for answer in qa_pair.get("answers", [])]))
                # Write one row per distinct answer
                for answer in answers:
                    writer.writerow([context, question, answer])
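One detail in the loop above: `list(set(...))` removes repeated answer texts but does not keep them in their original order, so the row order for a question can differ between runs. A minimal sketch of an order-preserving alternative using only the standard library follows; the helper name `unique_answers` and the sample `qa_pair` are hypothetical.

```python
def unique_answers(qa_pair):
    """Return distinct answer texts in first-seen order (hypothetical helper)."""
    texts = (str(answer.get("text")) for answer in qa_pair.get("answers", []))
    # dict keys are unique and keep insertion order, unlike set elements
    return list(dict.fromkeys(texts))


# Illustrative input only, shaped like one entry of a paragraph's "qas" list
qa_pair = {
    "question": "Question1",
    "answers": [{"text": "Answer2"}, {"text": "Answer1"}, {"text": "Answer2"}],
}
distinct = unique_answers(qa_pair)
print(distinct)
```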

This produces a CSV like the following (header and first two rows):

Context,Question,Answer
"Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the ""golden anniversary"" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as ""Super Bowl L""), so that the logo could prominently feature the Arabic numerals 50.",Which NFL team represented the AFC at Super Bowl 50?,Denver Broncos
"Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the ""golden anniversary"" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as ""Super Bowl L""), so that the logo could prominently feature the Arabic numerals 50.",Which NFL team represented the NFC at Super Bowl 50?,Carolina Panthers

This is the code I use to generate the heatmap:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv(r"SQUAD_11_CSV.csv")
df = df.pivot(index="Context", columns="Question", values="Answer")
sns.heatmap(df)
plt.show()

When I try to generate the heatmap, it raises an exception:

ValueError: Index contains duplicate entries, cannot reshape
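The error comes from `pivot` requiring each (index, column) pair to appear at most once, while the denormalized CSV repeats a (Context, Question) pair once per distinct answer. The sketch below reproduces this on a three-row frame that mirrors the CSV layout, then builds one possible numeric matrix by counting distinct answers per pair. The toy values and the names `duplicated_pairs`, `pivot_error`, and `answer_counts` are assumptions for illustration; counting answers is only one choice of cell value and may not be what the expected plot shows.

```python
import pandas as pd

# Toy frame in the same Context/Question/Answer layout (illustrative values)
df = pd.DataFrame(
    [
        ("Context1", "Question1", "Answer1"),
        ("Context1", "Question1", "Answer2"),
        ("Context1", "Question2", "Answer1"),
    ],
    columns=["Context", "Question", "Answer"],
)

# Rows sharing a (Context, Question) pair cannot be placed in a single pivot cell
duplicated_pairs = df[df.duplicated(subset=["Context", "Question"], keep=False)]

try:
    df.pivot(index="Context", columns="Question", values="Answer")
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)

# Aggregate first, then reshape: distinct answers per (Context, Question)
answer_counts = (
    df.groupby(["Context", "Question"])["Answer"]
    .nunique()
    .unstack(fill_value=0)
)

print(duplicated_pairs)
print(pivot_error)
print(answer_counts)
```

`sns.heatmap` needs numeric cells, so a frame like `answer_counts` can be passed to it directly, whereas a frame of answer strings cannot.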

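Any hints or pointers on how to generate the heatmap, and on any obvious mistakes in how I am modeling the SQuAD JSON data as a CSV, would be appreciated.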


Update:

The heatmap should look like this:

[Image: expected heatmap]

