How do I handle accented letters, German umlauts, and other special characters?
My Python script runs now, but I've hit a small snag. Here is the script, followed by the output it prints:
from BeautifulSoup import BeautifulSoup
import urllib

# Maps lowercase language names to the codes used in the "langpair" parameter.
langCode = {
    "arabic": "ar", "bulgarian": "bg", "chinese": "zh-CN",
    "croatian": "hr", "czech": "cs", "danish": "da", "dutch": "nl",
    "english": "en", "finnish": "fi", "french": "fr", "german": "de",
    "greek": "el", "hindi": "hi", "italian": "it", "japanese": "ja",
    "korean": "ko", "norwegian": "no", "polish": "pl", "portuguese": "pt",
    "romanian": "ro", "russian": "ru", "spanish": "es", "swedish": "sv"}


def setUserAgent(userAgent):
    # Override the User-Agent string that urllib.urlopen() sends.
    urllib.FancyURLopener.version = userAgent


def translate(text, fromLang, toLang):
    setUserAgent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008070400 SUSE/3.0.1-0.1 Firefox/3.0.1")
    try:
        postParameters = urllib.urlencode({
            "langpair": "%s|%s" % (langCode[fromLang.lower()], langCode[toLang.lower()]),
            "text": text, "ie": "UTF8", "oe": "UTF8"})
    except KeyError, error:
        print "Currently we do not support %s" % (error.args[0])
        return
    page = urllib.urlopen("http://translate.google.com/translate_t", postParameters)
    content = page.read()
    page.close()
    htmlSource = BeautifulSoup(content)
    # The translated text sits in a <span> whose title is the original text.
    translation = htmlSource.find('span', title=text)
    return translation.renderContents()


print translate("Good morning to you friend!", "English", "German")
print translate("Good morning to you friend!", "English", "Italian")
print translate("Good morning to you friend!", "English", "Spanish")
Guten Morgen, du Freund!
Buongiorno a te amico!
Buenos dÃas a ti amigo!
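How should I handle letters that aren't part of the basic English alphabet? Do you have any suggestions for fixing this? I was thinking of using a dictionary to replace certain characters with others, but I'm sure Python already has something like this built in. :P
Thanks.
2 Answers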
0
Extract the correct charset from the headers returned by urlopen(), then pass that charset to the BeautifulSoup constructor as the fromEncoding parameter.
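As a rough sketch of that idea, the snippet below pulls the charset out of a Content-Type header value and uses it to decode the body. It is written for Python 3 with only the standard library so that it runs without a network connection. The helper names charset_from_content_type and decode_body, the sample header values, and the UTF-8 fallback are assumptions made for illustration, not part of the original script. With the Python 2 urllib used in the question, the equivalent lookup should be page.info().getparam('charset'), and the result would then be handed to BeautifulSoup as fromEncoding.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset from its Content-Type header."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Hypothetical response body: the Spanish translation encoded as UTF-8.
raw = "Buenos d\u00edas a ti amigo!".encode("utf-8")

# Decoding with the charset the server declares gives the intended text.
good = decode_body(raw, "text/html; charset=UTF-8")

# Decoding the same bytes as ISO-8859-1 reproduces the garbled output.
bad = decode_body(raw, "text/html; charset=ISO-8859-1")

# ascii() keeps the printed result independent of the terminal encoding.
print(ascii(good))  # 'Buenos d\xedas a ti amigo!'
print(ascii(bad))   # 'Buenos d\xc3\xadas a ti amigo!'
```

The second result shows where the stray "Ã" comes from: the two UTF-8 bytes of "í" are each read as a separate Latin-1 character.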
1
Don't scrape http://translate.google.com/translate_t, because Google already provides an AJAX service for this. The translatedText value in the JSON returned by ajax.googleapis.com is already a unicode string.
import urllib2
import urllib
import json

# Maps lowercase language names to the codes used in the "langpair" parameter.
LANG = {
    "arabic": "ar", "bulgarian": "bg", "chinese": "zh-CN",
    "croatian": "hr", "czech": "cs", "danish": "da", "dutch": "nl",
    "english": "en", "finnish": "fi", "french": "fr", "german": "de",
    "greek": "el", "hindi": "hi", "italian": "it", "japanese": "ja",
    "korean": "ko", "norwegian": "no", "polish": "pl", "portuguese": "pt",
    "romanian": "ro", "russian": "ru", "spanish": "es", "swedish": "sv"}


def translate(text, lang1, lang2):
    base_url = 'http://ajax.googleapis.com/ajax/services/language/translate?'
    # Accept either a language name found in LANG or a raw language code.
    langpair = '%s|%s' % (LANG.get(lang1.lower(), lang1),
                          LANG.get(lang2.lower(), lang2))
    params = urllib.urlencode((('v', 1.0),
                               ('q', text.encode('utf-8')),
                               ('langpair', langpair),))
    url = base_url + params
    content = urllib2.urlopen(url).read()
    # Fallbacks, presumably for older json modules that lack loads().
    try:
        trans_dict = json.loads(content)
    except AttributeError:
        try:
            trans_dict = json.load(content)
        except AttributeError:
            trans_dict = json.read(content)
    return trans_dict['responseData']['translatedText']


print translate("Good morning to you friend!", "English", "German")
print translate("Good morning to you friend!", "English", "Italian")
print translate("Good morning to you friend!", "English", "Spanish")
which prints:
Guten Morgen, du Freund!
Buongiorno a te amico!
Buenos días a ti amigo!
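To illustrate why the JSON route sidesteps the encoding problem, the sketch below parses a hand-written sample payload shaped like the responseData / translatedText structure used in the code above. The payload is an assumption for illustration, not captured output from the service, and the names sample_response and extract_translation are hypothetical. It is written for Python 3 with only the standard library.

```python
import json

# Hand-written sample, assumed to mirror the shape of the service's reply.
# The accented letter arrives as the JSON escape \u00ed.
sample_response = b'{"responseData": {"translatedText": "Buenos d\\u00edas a ti amigo!"}}'


def extract_translation(raw_bytes):
    """Parse a JSON reply and return the translated text as a Unicode string."""
    payload = json.loads(raw_bytes.decode("utf-8"))
    return payload["responseData"]["translatedText"]


text = extract_translation(sample_response)

# The JSON parser has already turned the escape into one character.
# ascii() keeps the printed result independent of the terminal encoding.
print(ascii(text))

# Encode explicitly only at the boundary where bytes are required,
# for example when writing to a file or a socket.
utf8_bytes = text.encode("utf-8")
print(len(text), len(utf8_bytes))
```

Because the parser hands back text rather than raw bytes, there is no guessing about the charset; the only encoding decision left is the explicit one made when the string is written out.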