Removing Unicode-escaped Japanese strings from Python output

0 votes
1 answer
2082 views
Asked 2025-04-18 05:10

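I have a script that collects some text content from the web. The content is machine translated, so the result mixes the original language with English. I want to strip out all non-Latin characters, but I haven't found a good way to do it. For example, I want to remove this: \u30e6\u30fc\u30ba\u30c9, but keep everything else.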

Here is my current code, which demonstrates the problem:

import requests
from lxml import html
from pprint import pprint
import os
import re
import logging

header = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36', 'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Language' : 'en-US,en;q=0.8', 'Cookie' : 'search_layout=grid; search.ab=test-A' }
# necessary to perform the HTTP GET request

def main():
    # get page content
    response = requests.get('http://global.rakuten.com/en/store/wanboo/item/w690-3/', headers=header)
    # return parsed body for the lxml module to process
    parsed_body = html.fromstring(response.text)
    # get the title tag
    dirtyname = unicode(parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()"))
    # test that this tag returns undesired unicode output for the Japanese characters
    print dirtyname
    # attempt to clean the unicode using a custom filter to remove any characters in this particular range
    clean_name = ''.join(filter(lambda character:ord(character) < 0x3000, unicode(dirtyname)))
    # output of the filter should return no unicode characters but currently does not
    print clean_name
    # the remainder of the script is unnecessary for the problem in question so I have removed it

if __name__ == '__main__':
    main()

1 Answer

1
I think this should work:

''.join(filter(lambda character: ord(character) < 0x3000, my_unicode_string))

Or you might want to keep only byte-sized characters:

''.join(filter(lambda character: ord(character) < 0x100, my_unicode_string))

Basically, it is easy to filter down to whatever range you want (in fact, keeping only code points < 0x100 is probably safe).

For example:

>>> test_text = u'\u30e62\u30fcX\u30ba\u30c9T'
>>> ''.join(filter(lambda character: ord(character) < 0x3000, test_text))
u'2XT'

As for the problem in your question: parsed_body.xpath(...) returns a list, not a string, so we will use our own list as a stand-in to demonstrate the issue. You were calling unicode on that list:

dirtyname = [u"Hello\u30e6world"]
dirtyname = unicode(dirtyname)

Now if you print the repr, as I suggested in the comments, you will see:

>>> print repr(dirtyname)
u"[u'Hello\\u30e6world']"
>>> for item in dirtyname:
...    print item
[
u
'
H
#and so on

Notice that it is now just a string, not a list, and that it contains no non-ASCII characters, because the backslash is escaped: \u30e6 has become six literal ASCII characters, all of them below 0x3000, so the filter removes nothing.

You can fix this easily by taking the element out of the list instead of converting the whole list, i.e. parsed_body.xpath(...)[0]:

>>> dirtyname = parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()")[0]
>>> # notice that we got the unicode element that is in the list
>>> print repr(dirtyname)
u'Hello\u30e6world'
>>> clean_name = ''.join(filter(lambda character: ord(character) < 0x3000, dirtyname))
>>> print repr(clean_name)
u'Helloworld'
>>> # notice that everything is correctly filtered
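For anyone on Python 3, here is a minimal sketch of the same idea. It is an illustration rather than part of the original answer: the helper name strip_above and the sample titles list are hypothetical, and the list only stands in for what parsed_body.xpath(...) would return. Each element is cleaned separately instead of converting the whole list to a string, and the whitespace left behind by removed characters is collapsed.

```python
# Python 3 sketch. strip_above and the sample data below are hypothetical
# names chosen for illustration; they do not come from the original post.

def strip_above(text, limit=0x3000):
    """Return text without the characters whose code point is >= limit."""
    return ''.join(ch for ch in text if ord(ch) < limit)

# Stand-in for the list of strings that parsed_body.xpath(...) returns.
titles = ['Used \u30e6\u30fc\u30ba\u30c9 camera', '\u30e62\u30fcX\u30ba\u30c9T']

# Clean each element on its own, then collapse any runs of whitespace
# left where the Japanese characters used to be.
cleaned = [' '.join(strip_above(title).split()) for title in titles]
print(cleaned)
```

Passing limit=0x100 instead would keep only Latin-1 characters, which matches the stricter "byte-sized" variant above.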
