Extracting Unicode characters within a given range from a string

def remove_junk(word):
    mylist = list()
    for i in word:
        if b'9' in (i.encode('ascii', 'backslashreplace')):
            mylist.append(i)
    return (''.join(mylist))

with open('sample2a.txt', 'w') as nf:
    with open('sample.txt') as f:
        for i in f:
            nf.write(remove_junk(i) + '\n')

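2 Answers

User

#1 · edited 2024-04-19 14:50:24

I don't know Python, but I would expect that Unicode properties can be used in regular expressions much as in JavaScript, so the following script could be adapted in some way by using the Devanagari script property: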

var text =
`‘भूमी
‘भूमी’
‘भूमी’ला
‘भैय्यासाहेब
‘भैरवनाथ
‘भैरवी
‘भैरव’
ﻇﻬﻴﺮ
（ページを閲覧しているビジターの使用言語）。
（缺少文字）
गावापासून
�गा`;
console.log (text.replace (/[^\r\n\p{Script=Devanagari}]/gu, ""));

which produces:

भूमी
भूमी
भूमीला
भैय्यासाहेब
भैरवनाथ
भैरवी
भैरव



गावापासून
गा
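
For readers working in Python: the built-in `re` module does not accept `\p{Script=...}` classes, so the script above cannot be ported literally with the standard library alone. The sketch below is one rough stand-in that relies on `unicodedata` character names instead. Matching on the `DEVANAGARI` name prefix is my own approximation of the script property, not an exact equivalent (for instance, it keeps the danda U+0964, which Unicode assigns to the Common script), and `keep_devanagari` is a hypothetical helper name.

```python
import unicodedata

def keep_devanagari(text):
    """Keep line breaks and characters whose Unicode name starts with DEVANAGARI."""
    kept = []
    for ch in text:
        if ch in "\r\n":
            # Preserve line structure, as the JavaScript pattern does
            kept.append(ch)
        elif unicodedata.name(ch, "").startswith("DEVANAGARI"):
            # name() falls back to "" for unnamed characters, so they are dropped
            kept.append(ch)
    return "".join(kept)

sample = "‘भूमी’ला\n（缺少文字）\n\ufffdगा"
print(keep_devanagari(sample))
```

Lines that contain no Devanagari at all come out empty rather than being dropped, which matches the blank lines in the output above.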

User

#2 · edited 2024-04-19 14:50:24

You can use a regex to remove every character outside the Unicode range U+0900-U+097F (the Devanagari block).

import re

p = re.compile(r'[^\u0900-\u097F\n]')   # preserve the trailing newline
with open('sample.txt') as f, open('sample2a.txt', 'w') as nf:
    for line in f:
        cleaned = p.sub('', line)
        if cleaned.strip():
            nf.write(cleaned)

Minimal code example

import re

text = '''
‘भूमी
‘भूमी’
‘भूमी’ला
‘भैय्यासाहेब
‘भैरवनाथ
‘भैरवी
‘भैरव’
ﻇﻬﻴﺮ
（ページを閲覧しているビジターの使用言語）。
（缺少文字）
गावापासून
गा
'''

p = re.compile(r'[^\u0900-\u097F\n]')
for line in text.splitlines():
    cleaned = p.sub('', line)
    if cleaned.strip():
        print(cleaned)

# भूमी
# भूमी
# भूमीला
# भैय्यासाहेब
# भैरवनाथ
# भैरवी
# भैरव
# गावापासून
# गा
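
The same range filter can also be written without a regex by comparing code points directly, which makes the bounds easy to swap for another block. A minimal sketch, assuming the hypothetical helper name `keep_range` and the same U+0900-U+097F bounds as above:

```python
def keep_range(text, lo=0x0900, hi=0x097F):
    """Return text with every character outside [lo, hi] removed."""
    return "".join(ch for ch in text if lo <= ord(ch) <= hi)

for line in "‘भैरव’\nﻇﻬﻴﺮ\nगावापासून".splitlines():
    cleaned = keep_range(line)
    if cleaned:  # skip lines that end up empty
        print(cleaned)
```

Because `splitlines()` already strips the line breaks, the newline does not need to be part of the allowed range here.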
