Python regex and Unicode: extracting data from a UTF-8 file

賓 宾 [bin1] /visitor/guest/object (in grammar)/
賓主 宾主 [bin1 zhu3] /host and guest/
賓利 宾利 [Bin1 li4] /Bentley/
賓士 宾士 [Bin1 shi4] /Taiwan equivalent of 奔馳|奔驰[Ben1 chi2]/
賓夕法尼亞 宾夕法尼亚 [Bin1 xi1 fa3 ni2 ya4] /Pennsylvania/
賓夕法尼亞大學 宾夕法尼亚大学 [Bin1 xi1 fa3 ni2 ya4 Da4 xue2] /University of Pennsylvania/
賓夕法尼亞州 宾夕法尼亚州 [Bin1 xi1 fa3 ni2 ya4 zhou1] /Pennsylvania/

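3 Answers

User

#1 · Edited 2024-06-11 06:22:27

This builds a dictionary for looking up a translation by either its simplified or its traditional characters, and it works in both Python 2.7 and 3.3: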

# coding: utf8
import re
import codecs

# Process the whole file decoding from UTF-8 to Unicode
with codecs.open('cedict_ts.u8',encoding='utf8') as datafile:
    D = {}
    for line in datafile:
        # Skip comment lines
        if line.startswith('#'):
            continue
        trad,simp,pinyin,trans = re.match(r'(.*?) (.*?) \[(.*?)\] /(.*)/',line).groups()
        D[trad] = (simp,pinyin,trans)
        D[simp] = (trad,pinyin,trans)

Output (Python 3.3):

>>> D['马克']
('馬克', 'Ma3 ke4', 'Mark (name)')

Output (Python 2.7; you must print the string to see the non-ASCII characters):

>>> D[u'马克']
(u'\u99ac\u514b', u'Ma3 ke4', u'Mark (name)')
>>> print D[u'马克'][0]
馬克
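
The loop above raises an AttributeError on any line the pattern does not match, because re.match returns None there. Below is a minimal sketch of the same idea that skips such lines instead. It assumes Python 3, and ENTRY_RE, build_dict and the in-memory sample lines are hypothetical names standing in for the real file.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
ENTRY_RE = re.compile(r'(.*?) (.*?) \[(.*?)\] /(.*)/')

def build_dict(lines):
    """Map both traditional and simplified headwords to their entry."""
    d = {}
    for line in lines:
        if line.startswith('#'):
            continue  # comment line
        m = ENTRY_RE.match(line)
        if m is None:
            continue  # blank or malformed line: skip rather than crash
        trad, simp, pinyin, trans = m.groups()
        d[trad] = (simp, pinyin, trans)
        d[simp] = (trad, pinyin, trans)
    return d

# Illustrative input; in practice this would be the opened data file.
sample = [
    '# sample comment line\n',
    '賓 宾 [bin1] /visitor/guest/object (in grammar)/\n',
    '賓士 宾士 [Bin1 shi4] /Taiwan equivalent of 奔馳|奔驰[Ben1 chi2]/\n',
    '\n',
]
D = build_dict(sample)
print(D['宾'])
```

The final greedy group keeps everything up to the last slash, so definitions that contain brackets or further slashes stay in one string and can be split on '/' afterwards.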

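User

#2 · Edited 2024-06-11 06:22:27

I have done the same thing before. Basically you just need a regex with groups. Unfortunately I don't know Python's regex flavor very well (I did this in C#), but you should do something like this: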

matcher = r"(\b\w+\b) (\b\w+\b) \[(.*?)\] /(.*)/"

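Basically, you match the whole line with a single expression, and use ( ) to separate each item into its own regex group. Then you just read the groups, and voilà!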
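
In Python, reading the groups might look like the sketch below. It assumes Python 3, where \w matches Chinese characters in str patterns by default (on Python 2 you would need unicode strings and the re.UNICODE flag). The group names are illustrative, not part of the answer.

```python
import re

# Named groups make it clear which field is which; the names are illustrative.
pattern = re.compile(
    r'(?P<trad>\w+) (?P<simp>\w+) \[(?P<pinyin>.*?)\] /(?P<defs>.*)/'
)

line = '賓主 宾主 [bin1 zhu3] /host and guest/'
m = pattern.match(line)
if m:
    print(m.group('trad'))
    print(m.group('simp'))
    print(m.group('pinyin'))
    # Each definition is separated by a slash.
    print(m.group('defs').split('/'))
```

One caveat: \w+ only matches word characters, so a headword that contains punctuation would not match, and the looser (.*?) groups from the first answer may be the safer choice.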

User

#3 · Edited 2024-06-11 06:22:27

I would use split with a maximum split count rather than a regular expression. Whether that works depends on how consistent the input file format is.

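# translation holds one line of the file, e.g. the first sample entry
elements = translation.split(' ', 2)
traditional = elements[0]
simplified = elements[1]
rest = elements[2]
print("Traditional: " + traditional)
print("Simplified: " + simplified)
# Split on the first ']' only; definitions may themselves contain brackets
elems = rest.split(']', 1)
tr = elems[0].strip('[')
print("Pronunciation: " + tr)

Output (for the first sample entry):

Traditional: 賓
Simplified: 宾
Pronunciation: bin1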

Edit: to split the last field into a list, split on /:

translations = elems[1].strip().strip('/').split('/')
#strip the spaces, then the first and last slash, 
#then split on the slashes

Output (for the first line of the input):

['visitor', 'guest', 'object (in grammar)']
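
The split-based steps can be gathered into a single function, sketched below under the assumption that every line follows the "traditional simplified [pinyin] /definitions/" layout. parse_entry is a hypothetical name, and the code is written for Python 3.

```python
def parse_entry(line):
    """Split one dictionary line into its four fields using str.split only."""
    # At most two splits: headwords never contain spaces, but pinyin does.
    trad, simp, rest = line.strip().split(' ', 2)
    # Split on the first ']' only: definitions may contain brackets too.
    pinyin, defs = rest.split(']', 1)
    pinyin = pinyin.strip('[')
    # Drop surrounding spaces and slashes, then split the definitions.
    translations = defs.strip().strip('/').split('/')
    return trad, simp, pinyin, translations

entry = parse_entry('賓 宾 [bin1] /visitor/guest/object (in grammar)/')
print(entry)
```

Limiting the second split to one occurrence matters for entries such as 賓士, whose definition contains its own bracketed pinyin.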
