Stripping Unicode-formatted Japanese strings from Python output

0 votes

1 answer

2082 views

Asked 2025-04-18 05:10

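I have a script that collects some text from the web. The text is machine translated, so the result is a mix of the original language and English. I want to strip out all non-Latin characters, but I have not been able to find a good way to do it. For example, I want to remove this: \u30e6\u30fc\u30ba\u30c9, but keep everything else.

Below is my current code, which demonstrates the problem: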

import requests
from lxml import html
from pprint import pprint
import os
import re
import logging

header = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36', 'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Language' : 'en-US,en;q=0.8', 'Cookie' : 'search_layout=grid; search.ab=test-A' }
# necessary to perform the HTTP GET request

def main():
    # get page content
    response = requests.get('http://global.rakuten.com/en/store/wanboo/item/w690-3/', headers=header)
    # return parsed body for the lxml module to process
    parsed_body = html.fromstring(response.text)
    # get the title tag
    dirtyname = unicode(parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()"))
    # test that this tag returns undesired unicode output for the japanese characters
    print dirtyname
    # attempt to clean the unicode using a custom filter to remove any characters in this particular range
    clean_name = ''.join(filter(lambda character:ord(character) < 0x3000, unicode(dirtyname)))
    # output of the filter should return no unicode characters but currently does not
    print clean_name
    # the remainder of the script is unnecessary for the problem in question so I have removed it

if __name__ == '__main__':
    main()

text-processing unicode string-manipulation text-cleaning machine-translation

1 Answer

Filtering out the range you want is simple. I think this should work:

''.join(filter(lambda character: ord(character) < 0x3000, my_unicode_string))

Or you may want to restrict the output to byte-sized characters (in fact, keeping only code points < 0x100 is probably safe):

''.join(filter(lambda character: ord(character) < 0x100, my_unicode_string))

For example:

>>> test_text = u'\u30e62\u30fcX\u30ba\u30c9T'
>>> ''.join(filter(lambda character: ord(character) < 0x3000, test_text))
u'2XT'

As for the case in your question:

dirtyname = parsed_body.xpath(...)  # this returns a list, not a string, so we will use our own list as a stand-in to demonstrate the issue

dirtyname = [u"Hello\u30e6world"]

You were then calling unicode on that list:

dirtyname = unicode(dirtyname)

Now if you print the repr, as I suggested in the comments, you will see:

>>> print repr(dirtyname)
u"[u'Hello\\u30e6world']"
>>> for item in dirtyname:
...     print item
[
u
'
H
# and so on

Notice that it is now just a string, not a list, and that it contains no unicode characters, because the backslash has been escaped. The filter therefore has nothing to remove.

You can fix this easily by taking the element from the array instead of the whole array, i.e. parsed_body.xpath(...)[0]:

>>> dirtyname = parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()")[0]
>>> # notice that we got the unicode element that is in the array
>>> print repr(dirtyname)
u'Hello\u30e6world'
>>> clean_name = ''.join(filter(lambda character: ord(character) < 0x3000, dirtyname))
>>> print repr(clean_name)
u'Helloworld'
>>> # notice that everything is correctly filtered
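
For anyone on Python 3, where unicode() and the print statement no longer exist, the same idea might look roughly like the sketch below. The helper name strip_above and the sample title are made up for illustration; they are not part of the question's script. One difference to be aware of: in Python 3, str() of a list does not escape the katakana, so the filter does remove them, but the list brackets and quotes still leak into the result. Taking the first element is still the right fix.

```python
def strip_above(text, limit=0x3000):
    """Return text without characters whose code point is >= limit."""
    return ''.join(ch for ch in text if ord(ch) < limit)


# Made-up stand-in for the list that parsed_body.xpath(...) returns.
titles = ['Used \u30e6\u30fc\u30ba\u30c9 camera']

# Pitfall: converting the whole list to a string keeps the list syntax.
from_list = strip_above(str(titles))

# Fix: filter the first element of the list instead.
clean_name = strip_above(titles[0])

print(repr(from_list))
print(repr(clean_name))
```

Passing limit=0x100 instead keeps only Latin-1 characters, which matches the stricter variant mentioned above.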

Answered 2025-04-18 by Python大师
