Reading a table column whose values span multiple lines

Posted 2024-06-07 05:34:02


I am working with a text file (ClassTest.txt) and pandas. The text file has 3 tab-separated columns: Title, Description, and Category. Title and Description are ordinary strings, and Category is a (non-zero) integer.

I load the data as follows:

data = pd.read_table('ClassTest.txt')

feature_names = ['Title', 'Description']
X = data[feature_names]
y = data['Category']

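However, because values in the Description column can themselves contain newlines, "y" ends up with too many rows: most entries in the Description column span several lines, and each of those lines is parsed as a separate record. I tried to work around this by regenerating the file with "|" as the line terminator and reading it with: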

data = pd.read_table('ClassTest.txt', lineterminator='|')
X = data[feature_names]
y = data['Category']

This time, I get an error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 20, saw 5

Can anyone help me solve this problem?

Edit: added the earlier code that writes the file

con = lite.connect('JobDetails.db')
cur = con.cursor()

cur.execute('''SELECT Title, Description, Category FROM ReviewJobs''')

results = [list(each) for each in cur.fetchall()]

cur.execute('''SELECT Title, Description, Category FROM Jobs''')

for each in cur.fetchall():
    results.append(list(each))

a = open('ClassTest.txt', 'ab')

newLine = "|"

a.write(u''.join(c for c in 'Title\tDescription\tCategory' + newLine).encode('utf-8'))

for r in results:
    toWrite = "".encode('utf-8')
    title = u''.join(c for c in r[0].replace("\n", " ")).encode('utf-8') + "\t".encode('utf-8')
    description = u''.join(c for c in r[1]).encode('utf-8') + "\t".encode('utf-8')
    toWrite += title + description
    toWrite += str(r[2]).encode('utf-8') + newLine.encode('utf-8')
    a.write(toWrite)

a.close()

1 answer
Answer 1 · posted 2024-06-07 05:34:02

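pandas.read_table() was deprecated in pandas 0.24 (the deprecation was reversed in 0.25); read_csv() with a tab delimiter does the same job. More importantly, actually use the CSV format instead of writing a lot of code that produces something similar but cannot cope with record or field delimiters inside fields. The Python standard library has the csv module for this.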

Opening the file as a text file and passing the encoding to open() avoids having to encode everything in several different places.

#!/usr/bin/env python3
from contextlib import closing
import csv
import sqlite3


def main():
    with sqlite3.connect("JobDetails.db") as connection:
        with closing(connection.cursor()) as cursor:
            #
            # TODO Having two tables with the same columns for essentially
            #   the same kind of records smells like a broken DB design.
            #
            rows = []
            for table_name in ["reviewjobs", "jobs"]:
                cursor.execute(
                    f"SELECT title, description, category FROM {table_name}"
                )
                # Fetch inside the loop so the rows of both tables are kept.
                rows.extend(cursor.fetchall())

    # newline="" lets the csv module control the line endings itself.
    # Mode "a" appends, so rerunning the script adds another header row;
    # use "w" instead to overwrite the file.
    with open("ClassTest.txt", "a", encoding="utf8", newline="") as csv_file:
        writer = csv.writer(csv_file, delimiter="\t")
        writer.writerow(["Title", "Description", "Category"])
        for title, description, category in rows:
            # Embedded newlines or tabs in the description get quoted.
            writer.writerow([title.replace("\n", " "), description, category])


if __name__ == "__main__":
    main()

In the other program:

data = pd.read_csv("ClassTest.txt", delimiter="\t")
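
As a rough illustration of why the csv module helps here, the sketch below writes two made-up rows (one with a newline and a tab inside the description) to an in-memory buffer and parses them back. The sample rows, the round_trip() helper, and the use of io.StringIO in place of ClassTest.txt are assumptions for demonstration only, not part of the original code.

```python
import csv
import io

# Hypothetical sample rows: the first description contains a newline and a tab.
rows = [
    ("Engineer", "Builds things.\nFixes\tthings.", 3),
    ("Analyst", "Single-line description", 7),
]


def round_trip(records):
    """Write records as tab-separated CSV, then parse them back."""
    # newline="" keeps the line endings written by csv.writer untouched.
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter="\t")
    writer.writerow(["Title", "Description", "Category"])
    writer.writerows(records)

    buffer.seek(0)
    reader = csv.reader(buffer, delimiter="\t")
    header = next(reader)
    # csv.reader returns every field as a string, including Category.
    return header, [tuple(row) for row in reader]


header, parsed = round_trip(rows)
print(len(parsed), "records parsed")
print(repr(parsed[0][1]))
```

The writer puts quotes around any field that contains the delimiter or a line break, so the multi-line description comes back as a single field and the record count stays at two. pd.read_csv() with delimiter="\t" honours the same double-quote convention by default, so a file written this way should load with one row per record.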
