Phrase extraction algorithm for statistical machine translation
I wrote some code to extract phrases for statistical machine translation.
# -*- coding: utf-8 -*-

def phrase_extraction(srctext, trgtext, alignment):
    """
    Phrase extraction algorithm.
    """
    def extract(f_start, f_end, e_start, e_end):
        phrases = set()
        # return { } if f end == 0
        if f_end == 0:
            return
        # for all (e,f) ∈ A do
        for e,f in alignment:
            # return { } if e < e start or e > e end
            if e < e_start or e > e_end:
                return
        fs = f_start
        # repeat-
        while True:
            fe = f_end
            # repeat-
            while True:
                # add phrase pair ( e start .. e end , f s .. f e ) to set E
                trg_phrase = " ".join(trgtext[i] for i in range(fs,fe))
                src_phrase = " ".join(srctext[i] for i in range(e_start,e_end))
                phrases.add("\t".join([src_phrase, trg_phrase]))
                fe += 1  # fe++
                # -until fe aligned
                if fe in f_aligned or fe > trglen:
                    break
            fs -= 1  # fs--
            # -until fs aligned
            if fs in f_aligned or fs < 0:
                break
        return phrases

    # Calculate no. of tokens in source and target texts.
    srctext = srctext.split()
    trgtext = trgtext.split()
    srclen = len(srctext)
    trglen = len(trgtext)
    # Keeps an index of which source/target words are aligned.
    e_aligned = [i for i,_ in alignment]
    f_aligned = [j for _,j in alignment]
    bp = set()  # set of phrase pairs BP
    # for e start = 1 ... length(e) do
    for e_start in range(srclen):
        # for e end = e start ... length(e) do
        for e_end in range(e_start, srclen):
            # // find the minimally matching foreign phrase
            # (f start , f end ) = ( length(f), 0 )
            f_start, f_end = trglen, 0
            # for all (e,f) ∈ A do
            for e,f in alignment:
                # if e start ≤ e ≤ e end then
                if e_start <= e <= e_end:
                    f_start = min(f, f_start)
                    f_end = max(f, f_end)
            # add extract (f start , f end , e start , e end ) to set BP
            phrases = extract(f_start, f_end, e_start, e_end)
            if phrases:
                bp.update(phrases)
    return bp

srctext = "michael assumes that he will stay in the house"
trgtext = "michael geht davon aus , dass er im haus bleibt"
alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9), (5,9), (6,7), (7,7), (8,8)]
phrases = phrase_extraction(srctext, trgtext, alignment)
for i in phrases:
    print i
This phrase extraction algorithm comes from Philip Koehn's book "Statistical Machine Translation", page 133, reproduced below:
The output I expect is:
But with my code I only get this output:
michael assumes that he will stay in the - michael geht davon aus , dass er im haus
michael assumes that he will stay in the - michael geht davon aus , dass er im haus bleibt
Can anyone spot the problem in my implementation?
My code does extract phrases, but the output is incomplete and does not match the translation table above.
1 Answer
I looked at this question and can now reproduce the desired result. It turns out there were several issues:
- The algorithm as printed is incomplete. In the online version of the book (3rd edition, 2012), line 4 of the extract function has been updated. (There may be an errata list.)
- The algorithm assumes 1-based indexing that runs up to and including the end position.
- Python, by contrast, uses 0-based indexing, and ranges run up to but not including the stop value.
- This particularly affects the stop values of some range calls and some of the comparisons.
- The items stored in the set have been modified to make the desired output easier to reproduce.
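To see the indexing issue concretely, here is a quick sketch (not from the book) of how a 1-based inclusive span translates into a 0-based Python range, which is why the corrected code needs the +1 on the stop values:

```python
e = "michael assumes that".split()

# Book notation (1-based, inclusive): span [1, 3] covers all three words.
book_span = (1, 3)

# Python (0-based, half-open): shift down by 1, then add 1 to the stop
# value so the end-point is included.
py_start, py_end = book_span[0] - 1, book_span[1] - 1
phrase = " ".join(e[i] for i in range(py_start, py_end + 1))
print(phrase)  # michael assumes that
```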
Sample output (19 phrases matched, 5 of them with duplicate matches, 24 extractions in total):
$ python2.7 phrase_extract_new.py
( 1) (0, 1) michael — michael
( 2) (0, 2) michael assumes — michael geht davon aus ; michael geht davon aus ,
( 3) (0, 3) michael assumes that — michael geht davon aus , dass
( 4) (0, 4) michael assumes that he — michael geht davon aus , dass er
( 5) (0, 9) michael assumes that he will stay in the house — michael geht davon aus , dass er im haus bleibt
( 6) (1, 2) assumes — geht davon aus ; geht davon aus ,
( 7) (1, 3) assumes that — geht davon aus , dass
( 8) (1, 4) assumes that he — geht davon aus , dass er
( 9) (1, 9) assumes that he will stay in the house — geht davon aus , dass er im haus bleibt
(10) (2, 3) that — dass ; , dass
(11) (2, 4) that he — dass er ; , dass er
(12) (2, 9) that he will stay in the house — dass er im haus bleibt ; , dass er im haus bleibt
(13) (3, 4) he — er
(14) (3, 9) he will stay in the house — er im haus bleibt
(15) (4, 6) will stay — bleibt
(16) (4, 9) will stay in the house — im haus bleibt
(17) (6, 8) in the — im
(18) (6, 9) in the house — im haus
(19) (8, 9) house — haus
$ python2.7 phrase_extract_new.py | grep -c ';'
5
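As a sanity check on the input, the alignment points can be printed as the match matrix the book uses (an illustrative helper, not part of the solution below):

```python
src = "michael assumes that he will stay in the house".split()
trg = "michael geht davon aus , dass er im haus bleibt".split()
alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9), (5,9),
             (6,7), (7,7), (8,8)]

# One row per source word, one column per target word; '*' marks a link.
grid = [["." for _ in trg] for _ in src]
for e, f in alignment:
    grid[e][f] = "*"
for word, row in zip(src, grid):
    print("{0:8} {1}".format(word[:8], " ".join(row)))
```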
Below is a suggested interpretation of the algorithm. The code needs a fair amount of refactoring, but before that it needs more testing with different examples; reproducing the book's example is only a start:
# -*- coding: utf-8 -*-

def phrase_extraction(srctext, trgtext, alignment):
    """
    Phrase extraction algorithm.
    """
    def extract(f_start, f_end, e_start, e_end):
        if f_end < 0:  # 0-based indexing.
            return {}
        # Check if alignment points are consistent.
        for e,f in alignment:
            if ((f_start <= f <= f_end) and
                (e < e_start or e > e_end)):
                return {}

        # Add phrase pairs (incl. additional unaligned f)
        # Remark: how to interpret "additional unaligned f"?
        phrases = set()
        fs = f_start
        # repeat-
        while True:
            fe = f_end
            # repeat-
            while True:
                # add phrase pair ([e_start, e_end], [fs, fe]) to set E
                # Need to +1 in range to include the end-point.
                src_phrase = " ".join(srctext[i] for i in range(e_start,e_end+1))
                trg_phrase = " ".join(trgtext[i] for i in range(fs,fe+1))
                # Include more data for later ordering.
                phrases.add(((e_start, e_end+1), src_phrase, trg_phrase))
                fe += 1  # fe++
                # -until fe aligned or out-of-bounds
                if fe in f_aligned or fe == trglen:
                    break
            fs -= 1  # fs--
            # -until fs aligned or out-of-bounds
            if fs in f_aligned or fs < 0:
                break

        return phrases

    # Calculate no. of tokens in source and target texts.
    srctext = srctext.split()   # e
    trgtext = trgtext.split()   # f
    srclen = len(srctext)       # len(e)
    trglen = len(trgtext)       # len(f)

    # Keeps an index of which source/target words are aligned.
    e_aligned = [i for i,_ in alignment]
    f_aligned = [j for _,j in alignment]

    bp = set()  # set of phrase pairs BP
    # for e start = 1 ... length(e) do
    # Index e_start from 0 to len(e) - 1
    for e_start in range(srclen):
        # for e end = e start ... length(e) do
        # Index e_end from e_start to len(e) - 1
        for e_end in range(e_start, srclen):
            # // find the minimally matching foreign phrase
            # (f start , f end ) = ( length(f), 0 )
            # f_start ∈ [0, len(f) - 1]; f_end ∈ [0, len(f) - 1]
            f_start, f_end = trglen-1, -1  # 0-based indexing
            # for all (e,f) ∈ A do
            for e,f in alignment:
                # if e start ≤ e ≤ e end then
                if e_start <= e <= e_end:
                    f_start = min(f, f_start)
                    f_end = max(f, f_end)
            # add extract (f start , f end , e start , e end ) to set BP
            phrases = extract(f_start, f_end, e_start, e_end)
            if phrases:
                bp.update(phrases)

    return bp

# Refer to match matrix.
#             0       1     2   3  4    5   6  7   8
srctext = "michael assumes that he will stay in the house"
#             0      1    2     3  4 5    6  7  8    9
trgtext = "michael geht davon aus , dass er im haus bleibt"
alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9), (5,9), (6,7), (7,7), (8,8)]

phrases = phrase_extraction(srctext, trgtext, alignment)

# Keep track of translations of each phrase in srctext and its
# alignment using a dictionary with keys as phrases and values being
# a list [e_alignment pair, [f_extractions, ...] ]
dlist = {}
for p, a, b in phrases:
    if a in dlist:
        dlist[a][1].append(b)
    else:
        dlist[a] = [p, [b]]

# Sort the list of translations based on their length. Shorter phrases first.
for v in dlist.values():
    v[1].sort(key=lambda x: len(x))

# Function to help sort according to book example.
def ordering(p):
    k,v = p
    return v[0]

for i, p in enumerate(sorted(dlist.items(), key=ordering), 1):
    k, v = p
    print "({0:2}) {1} {2} — {3}".format(i, v[0], k, " ; ".join(v[1]))
The final output-preparation part can probably be improved... and the algorithm code could certainly be written in a more Pythonic style.
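For example, the grouping and ordering at the end could be sketched with a defaultdict; this is a hypothetical rewrite operating on the same (span, src_phrase, trg_phrase) tuples the extractor emits, shown here on a tiny hand-made sample rather than the full extraction:

```python
from collections import defaultdict

def format_phrases(phrases):
    # Group target phrases by their (source-span, source-phrase) key,
    # then emit lines in span order with shorter translations first.
    grouped = defaultdict(list)
    for span, src, trg in phrases:
        grouped[(span, src)].append(trg)
    lines = []
    for n, ((span, src), trgs) in enumerate(sorted(grouped.items()), 1):
        lines.append("({0:2}) {1} {2} — {3}".format(
            n, span, src, " ; ".join(sorted(trgs, key=len))))
    return lines

# Hand-made sample mimicking the extractor's output tuples.
sample = {((0, 1), "michael", "michael"),
          ((1, 2), "assumes", "geht davon aus"),
          ((1, 2), "assumes", "geht davon aus ,")}
for line in format_phrases(sample):
    print(line)
```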