Python: removing Unicode characters
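import sys
import nltk
import unicodedata
import pymongo

# MongoClient replaces pymongo.Connection, which was removed in PyMongo 3
conn = pymongo.MongoClient('mongodb://localhost:27017')
# NOTE: `collection` is never defined in the posted snippet; it is assumed
# to be a collection handle obtained from `conn`.

def jd_extract():
    # Return the 'jd' field of the first document found
    cursor = collection.find({}, limit=1)
    for item in cursor:
        return item['jd']

res = jd_extract()
print res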
This prints:
[u'Software Engineer II', , u' ', , u' ', , u' ', Skills: C#,WPF,SQL , u' ', , u' ', Experience: 3-4.5 Yrs , u' ', , u' ', Job Location:- Gurgaon/Noida , u' ', , u' ', Job Summary: , u' ', The Software Engineer II's role is to develop and manage the application code for a system or part of a project. The Software Engineer II role typically has skills to work with multiple platforms and/or services. , u' ', , u' ', , u' \xa0', , u' ', , u' ', ][u' ', Salary: , u'\n', Not Disclosed by Recruiter , u'\n', , u'\n'][u' ', Industry: , u'\n', IT-Software / Software Services , u'\n', , u'\n'][u' ', Functional Area: , u'\n', IT Software - Application Programming, Maintenance , u'\n', , u'\n'][u' ', Role Category: , u'\n', Programming & Design , u'\n', , u'\n'][u' ', Role: , u'\n', Software Developer , u'\n', , u'\n'][u' ', Keyskills: , u'\n', wpf C# Sql Programming , u'\n', , u'\n'][u' ', Education: , u'\n',
UG - Any Graduate - Any Specialization, Graduation Not Required
PG - Any Postgraduate - Any Specialization, Post Graduation Not Required
Doctorate - Any Doctorate - Any Specialization, Doctorate Not Required , u'\n', , u'\n']
I want to remove the Unicode characters from res. I tried str(res), but it did not work.
3 Answers
0
A list of str, unicode, and int items:
>>> item_list = [ 'a', 3, u'b', 5, u'c', 8, 'd', 13, 'e' ]
>>> print item_list
['a', 3, u'b', 5, u'c', 8, 'd', 13, 'e']
Convert the unicode items to str:
>>> item_list = [ str(item) if isinstance(item, unicode) else item for item in item_list ]
>>> print item_list
['a', 3, 'b', 5, 'c', 8, 'd', 13, 'e']
Convert the str items to unicode:
>>> item_list = [ unicode(item) if isinstance(item, str) else item for item in item_list ]
>>> print item_list
[u'a', 3, u'b', 5, u'c', 8, u'd', 13, u'e']
Both str and unicode are subclasses of basestring.
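One caveat with the str(item) conversion above: on Python 2, str() encodes with the ASCII codec, so it raises UnicodeEncodeError for items such as the u'\xa0' (non-breaking space) that appears in the question's output. A minimal sketch of a conversion with an explicit encoding follows. The helper name encode_text_items and the sample list are made up for illustration, and the code is written to run on both Python 2 and Python 3.

```python
# type(u'') is `unicode` on Python 2 and `str` on Python 3
text_type = type(u'')

def encode_text_items(items, encoding='utf-8'):
    """Return a new list with every text item encoded to bytes; other items are kept."""
    return [item.encode(encoding) if isinstance(item, text_type) else item
            for item in items]

# Hypothetical sample that includes the non-breaking space from the question
item_list = [u'a', 3, u'b', 5, u'\xa0', 8]

# The ASCII codec cannot represent u'\xa0', which is why a plain str() fails on Python 2
try:
    u'\xa0'.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

encoded = encode_text_items(item_list)
print(ascii_failed)
print(encoded)
```

With UTF-8 the non-breaking space becomes the two bytes '\xc2\xa0' instead of raising an error. How the list is displayed by print differs between Python 2 and Python 3.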
0
As I understand it, you want to get rid of the u'' prefix when printing res (a list of Unicode strings). You can print each string individually:
for unicode_string in res:
    print unicode_string
You see u'' because print some_list calls repr(item) on each item in the list, and u'..' is how a Unicode string literal is written in Python:
>>> print [u'a']
[u'a']
>>> print repr(u'a')
u'a'
>>> print u'a'
a
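Since the u'' prefix only comes from the repr() of each list item, another option is to join the strings into a single string before printing. The following is a rough sketch, assuming res really is a list of Unicode strings. The helper name join_for_display is hypothetical, and the sample values are copied from the question's output.

```python
def join_for_display(strings, separator=u', '):
    """Join the non-blank strings into one Unicode string for printing."""
    stripped = (s.strip() for s in strings)
    return separator.join(s for s in stripped if s)

# A few items taken from the output shown in the question
sample = [u'Skills: C#,WPF,SQL', u' ', u'Experience: 3-4.5 Yrs', u'\n']

line = join_for_display(sample)
print(line)
```

Printing the joined string goes through str-style formatting rather than repr(), so no u'' prefixes or quotes appear, and the whitespace-only items are skipped.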
0
Try encoding the Unicode strings as UTF-8:
res = [s.encode('utf-8') for s in res]
Or, if you prefer a for loop:
# UTF-8 output is not necessarily ASCII, so the list is named accordingly
utf8_strings = []
for s in res:
    utf8_strings.append(s.encode('utf-8'))
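Encoding as UTF-8 keeps every character; it only changes the type. If the goal is literally to drop non-ASCII characters such as u'\xa0', one possible approach is to normalize with unicodedata and encode with errors set to 'ignore'. This is a lossy sketch, not necessarily what the asker needs. The helper names to_ascii and clean_items are made up, and the u'Caf\xe9 role' item is an invented example rather than data from the question.

```python
import unicodedata

def to_ascii(text):
    """Best-effort ASCII version of a Unicode string (lossy)."""
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility characters such as the non-breaking space to a space
    decomposed = unicodedata.normalize('NFKD', text)
    # Anything still outside ASCII is silently dropped
    return decomposed.encode('ascii', 'ignore').decode('ascii')

def clean_items(items):
    """ASCII-fold each string, trim whitespace, and drop empty results."""
    cleaned = (to_ascii(item).strip() for item in items)
    return [item for item in cleaned if item]

# First four items mirror the question's output; the last one is hypothetical
sample = [u'Software Engineer II', u' ', u' \xa0', u'\n', u'Caf\xe9 role']

result = clean_items(sample)
print(result)
```

On Python 2 the final decode('ascii') returns unicode objects again, so omit it there if plain byte strings are wanted.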