Reading Russian-language data from a CSV file

2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы
2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы
2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы

import csv

def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"), delimiter=";")
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [str(x) for x in dataset[i]]
    return dataset

[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]

3 answers

User

Answer 1 · edited 2024-05-16 03:03:02

\xea is the windows-1251/cp5347 encoding of к (Python's codec name for it is cp1251). So you need to decode with windows-1251, not UTF-8.
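A quick way to check this claim, sketched in Python 3 (the thread's own code is Python 2): take one of the byte strings from the question's output and decode it with both codecs. The variable names here are illustrative only.

```python
# Bytes copied from the question's output (the second column of each row).
raw = b"\xc0\xeb\xec\xe0\xf2\xfb"

# Decoding as Windows-1251 yields readable Cyrillic text.
text = raw.decode("cp1251")
print(text)

# The same bytes are not valid UTF-8, so decoding them that way fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)
```

If the first print shows a city name and the second reports False, the file is almost certainly Windows-1251 rather than UTF-8.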

In Python 2.7, the csv library does not properly support Unicode; see the "Unicode" part of the examples at https://docs.python.org/2/library/csv.html

They suggest a simple workaround using:

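import codecs
import csv

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    (helper from the same docs page; UnicodeReader depends on it)
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self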

This lets you write:

def loadCsv(filename):
    lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251")
    # if you really need a list, uncomment the next line;
    # that lets you pick a specific row, e.g. `line_12 = lines[12]`
    # return list(lines)

    # otherwise return the iterator, which reads the file lazily, row by row;
    # use this if you only need `for line in lines`
    return lines

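If you print dataset itself, you get the representation of a list of lists, where the outer list holds the rows and each inner list holds the columns. Any encoded bytes or text in it are shown as \x or \u escapes. To print the actual values, do this: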

for csv_line in loadCsv("myfile.csv"):
    print u", ".join(csv_line)

If you need to write the results to another file (fairly typical), you can do this:

import io

with io.open("my_output.txt", "w", encoding="utf-8") as my_output:
    for csv_line in loadCsv("myfile.csv"):
        my_output.write(u", ".join(csv_line) + u"\n")

This automatically encodes your output as UTF-8.
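For readers on Python 3, the same read-then-write flow needs no wrapper class, because `open()` accepts an encoding directly. The following is a minimal sketch rather than the answerer's code: the function names `load_csv` and `write_utf8`, the two-row sample data, and the temporary file paths are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def load_csv(filename, encoding="cp1251", delimiter=";"):
    """Read a delimited text file and return its rows as lists of str."""
    # newline="" is what the csv module documentation recommends for reader input.
    with open(filename, encoding=encoding, newline="") as source:
        return list(csv.reader(source, delimiter=delimiter))


def write_utf8(rows, filename):
    """Write each row as one comma-joined line, encoded as UTF-8."""
    with open(filename, "w", encoding="utf-8") as target:
        for row in rows:
            target.write(", ".join(row) + "\n")


# Build a small Windows-1251 sample file so the sketch is self-contained.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "myfile.csv")
dst = os.path.join(workdir, "my_output.txt")
with open(src, "w", encoding="cp1251", newline="") as f:
    f.write("мкр Тастак-3;Алматы\r\nмкр Таугуль;Алматы\r\n")

rows = load_csv(src)
write_utf8(rows, dst)
print(rows)
```

Unlike the iterator-returning version above, `load_csv` returns a full list and closes the file before returning, which is simpler but holds the whole file in memory.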

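User

Answer 2 · edited 2024-05-16 03:03:02

Could your .csv be in some encoding other than UTF-8? (Given the error message, it almost has to be.) Try other Cyrillic encodings, such as Windows-1251, CP866, or KOI8.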
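One way to try several encodings is to decode the same bytes with each candidate and look at the results. The sketch below is Python 3 and makes some assumptions: `decode_candidates` is a hypothetical helper, and the sample bytes are taken from the question's output. Note that CP866 and KOI8-R assign a character to every byte value, so they never raise an error; the wrong ones simply produce gibberish, and the final judgment is visual.

```python
def decode_candidates(raw, encodings=("cp1251", "cp866", "koi8_r", "utf-8")):
    """Return {encoding: decoded text, or None if decoding failed}."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Bytes from the question's output; for a real file, read a chunk in "rb" mode.
sample = b"\xc0\xeb\xec\xe0\xf2\xfb"
candidates = decode_candidates(sample)
for name, text in candidates.items():
    print(name, repr(text))
```

The candidate whose output reads as real Russian words is the one to pass as the file's encoding.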

User

Answer 3 · edited 2024-05-16 03:03:02

Can't you try:

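import pandas as pd
# the second positional argument of read_csv is the separator, so pass the encoding by keyword
pd.read_csv(path_file, sep=";", encoding="cp1251")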

or

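import csv
# Python 3 only: open() takes the encoding directly
# note: errors='ignore' silently drops any bytes that cannot be decoded
with open(path_file, encoding="cp1251", errors='ignore', newline="") as source_file:
    reader = csv.reader(source_file, delimiter=";")
    rows = list(reader)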
