Python: converting UTF-8 special characters (accented letters) to their extended-ASCII equivalents

Posted 2024-06-07 04:59:02


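I want to use Python to convert UTF-8 special characters (accented letters and so on) to their extended-ASCII equivalents (purists will say there is no such thing, so here is a link to what I mean).

So basically I want to read in a UTF-8 file and write out an extended-ASCII file (something like Latin-1; I am on Windows, if that matters. I have read all the Unicode blog posts and still don't understand a word of them), while keeping as much of the information as possible. For each UTF-8 character I want the extended-ASCII equivalent. I don't want to ignore or lose characters, and I don't want to replace them with a placeholder such as ?. For characters with no extended-ASCII equivalent I just want to use a character of my choosing, such as ~, although for some characters such as ß I would want to convert to ss if ß does not exist in extended ASCII.

Is there anything in Python 3 that can do this, or can you give me some examples of how I should go about it?

Does anyone know of a site that lists the UTF-8 equivalents of the extended-ASCII characters?

Based on the comments below, I came up with this code. Sadly it does not work very well, because most special characters come back as ? instead of the intended character (not sure why):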

# -*- coding: utf-8 -*-

f_in = open(r'E:/work/python/lyman.txt', 'r', encoding='utf8')
raw = f_in.read()

f_out = open(r'E:/work/python/lyman_ascii.txt', 'w', encoding='cp1252', errors='replace')

retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80: # Basic ASCII
        retval.append(str(char))
        continue
    elif codepoint > 0xeffff:
        continue # Characters in Private Use Area and above are ignored
    # ë
    elif codepoint == 235:
        retval.append(chr(137))
        continue
    # ê
    elif codepoint == 234:
        retval.append(chr(136))
        continue
    # ’
    elif codepoint == 8217:
        retval.append(chr(39)) # 146 gives ? for some reason
        continue
    else:
        print(char)
        print(codepoint)

print(''.join(retval))
f_out.write(''.join(retval))
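
The `?` output can be reproduced in isolation. `chr()` takes a Unicode code point, not a code-page byte value, so `chr(137)` is U+0089 (a C1 control character) rather than an accented letter, and `chr(146)` is U+0092 rather than a quote mark. Python's `cp1252` codec has no encoding for those code points, so `errors='replace'` writes `?`. The snippet below is only a small check of that behaviour; the variable names are illustrative and not part of the original code.

```python
# chr() takes a Unicode code point, not a code-page byte value.
from_codepoint = chr(137)      # U+0089, a C1 control character
e_diaeresis = chr(235)         # U+00EB, the letter e with diaeresis
right_quote = chr(8217)        # U+2019, right single quotation mark

# U+0089 has no cp1252 encoding, so errors='replace' yields b'?'.
replaced = from_codepoint.encode('cp1252', errors='replace')

# The real characters encode directly to single cp1252 bytes.
e_bytes = e_diaeresis.encode('cp1252')
quote_bytes = right_quote.encode('cp1252')

print(replaced, e_bytes, quote_bytes)
```

In other words, appending the original character unchanged and letting the `cp1252` file encoding do the conversion avoids the `?` for these three cases.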

Tags: file, in, ascii, utf8, character, utf, print, retval
1 answer
User
#1 · Posted 2024-06-07 04:59:02

This seems to work:

# -*- coding: utf-8 -*-

# Don't use codecs in Python 3.
# The 'U' mode flag was removed in Python 3.11; 'r' already uses universal newlines.
f_in = open(r'af_massaged.txt', 'r', encoding='utf8')
raw = f_in.read()
f_in.close()

f_out = open(r'af_massaged_ascii.txt', 'w', encoding='cp1252', errors='replace')

retval = []
for char in raw:
    codepoint = ord(char)
    if codepoint < 0x80:    # Basic ASCII.
        retval.append(str(char))
        continue
    elif codepoint > 0xeffff:
        continue    # Characters in Private Use Area and above are ignored.
    elif codepoint >= 128 and codepoint <= 159:
        continue    # Ignore control characters in Latin-1.
    # Don't use unichr in Python 3, chr uses unicode. Get character codes from here: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin-1_Supplement
    # This was written on Windows 7 32 bit
    # For 160 to 255 Latin-1 matches unicode.
    elif codepoint >= 160 and codepoint <= 255:
        retval.append(str(char))
        continue
    # –
    elif codepoint == 8211:
        retval.append(chr(45))
        continue
    # ’
    elif codepoint == 8217:
        retval.append(chr(180)) # 39
        continue
    # “
    elif codepoint == 8220:
        retval.append(chr(34))
        continue
    # ”
    elif codepoint == 8221:
        retval.append(chr(34))
        continue
    # €
    elif codepoint == 8364:
        retval.append('Euro')
        continue
    # Find missing mappings.
    else:
        print(char)
        print(codepoint)

# Uncomment for debugging.
#for i in range(128, 256):
#    retval.append(str(i) + ': ' + chr(i) + chr(13))

#print(''.join(retval))
f_out.write(''.join(retval))
f_out.close()    # Flush the buffered output to disk.
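
A possible alternative to the hand-written code point table, offered as a sketch rather than as the method used above: the `cp1252` codec already contains characters such as the en dash, the curly quotes and the euro sign, so only the characters it cannot encode need special handling. `codecs.register_error` lets a custom handler supply the replacement for exactly those characters. The names `CUSTOM`, `FALLBACK`, `tilde_fallback` and `encode_with_fallback` are hypothetical, and the entries in `CUSTOM` are example choices only.

```python
import codecs
import unicodedata

# Hypothetical hand-picked replacements, consulted only when the target
# encoding cannot represent the character (cp1252 does contain 'ß').
CUSTOM = {'\u00df': 'ss', '\u2192': '->'}
FALLBACK = '~'

def _ascii_fallback(error):
    """Codec error handler: called only for unencodable characters."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    for ch in error.object[error.start:error.end]:
        if ch in CUSTOM:
            pieces.append(CUSTOM[ch])
            continue
        # Decompose the character and drop combining marks,
        # e.g. a-with-macron becomes plain 'a'.
        base = ''.join(c for c in unicodedata.normalize('NFKD', ch)
                       if not unicodedata.combining(c))
        # Keep the stripped form only if it is plain ASCII.
        pieces.append(base if base and base.isascii() else FALLBACK)
    # Return the replacement text and the position to resume encoding at.
    return ''.join(pieces), error.end

codecs.register_error('tilde_fallback', _ascii_fallback)

def encode_with_fallback(text, encoding='cp1252'):
    """Encode text, substituting only what the encoding cannot hold."""
    return text.encode(encoding, errors='tilde_fallback')

sample = 'caf\u00e9 \u2013 \u20ac \u0101 \u4e2d'
print(encode_with_fallback(sample))
```

Once the handler is registered, the same name can be passed to `open(..., 'w', encoding='cp1252', errors='tilde_fallback')`, so the file can be written without building the output list by hand.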
