如何修复“UnicodeDecodeError:”charmap“编解码器无法解码位置29815中的字节0x9d:字符映射到<undefined>”?

2024-03-29 13:07:02 发布

您现在位置:Python中文网/ 问答频道 /正文

目前,我正试图让一个Python 3程序通过Spyder IDE/GUI对一个充满信息的文本文件进行一些操作。但是,在尝试读取文件时,出现以下错误:

  File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

程序代码如下:

import os

os.getcwd()

import glob
import re
import sqlite3
import csv

def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes except changing 'IN' to 'INC' because it is an invalid name
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)

Tags: totextinposimportrefieldsfor
3条回答

我通过将文件的格式从.csv更改为.xlsx来解决了这个问题。

https://en.wikipedia.org/wiki/Windows-1252中可以看到,代码0x9D未在CP1252中定义。

“错误”例如在open函数中:您没有指定编码,因此python(仅在windows中)将使用一些系统编码。一般来说,如果您读取的文件可能不是在同一台计算机中创建的,那么最好指定编码。

我建议在您的open上也添加一个用于编写csv的代码。说得明白点确实更好。

我不知道原始文件格式,但是添加到open, encoding='utf-8'通常是一件好事(在Linux和MacOs中是默认的)。

以上对我不起作用,请改为尝试:, errors='ignore' 创造奇迹!

相关问题 更多 >