Need some help with code translation (Python to C#)

################ Word Segmentation (p. 223)

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

#### Support functions (p. 224)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

2 Answers

User

Answer 1 · edited 2024-05-23 19:25:18

Let's start with the first function:

def segment(text): 
    "Return a list of words that is the best segmentation of text." 
    if not text: return [] 
    candidates = ([first]+segment(rem) for first,rem in splits(text)) 
    return max(candidates, key=Pwords)

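It takes a string and returns the most probable list of words that the string could be split into, so its signature will be static IEnumerable<string> segment(string text). Obviously, if text is an empty string, the result should be an empty list. Otherwise, it uses a recursive generator expression to build the candidate word lists and returns the one with the highest probability.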

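static IEnumerable<string> segment(string text)
{
    if (text == "") return new string[0]; // C# idiom for empty list of strings
    var candidates = from pair in splits(text)
                     select new[] {pair.Item1}.Concat(segment(pair.Item2));
    return candidates.OrderByDescending(Pwords).First();
}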

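Of course, now we have to translate the splits function. Its job is to return a list of all possible tuples of the beginning and the remainder of a word. It is fairly straightforward to translate: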

static IEnumerable<Tuple<string, string>> splits(string text, int L = 20)
{
    return from i in Enumerable.Range(1, Math.Min(text.Length, L))
           select Tuple.Create(text.Substring(0, i), text.Substring(i));
}
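To sanity-check the C# translation, it can help to see what the Python original returns on a small input. A minimal sketch, reusing the splits definition from the question unchanged:

```python
def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) for i in range(min(len(text), L))]

pairs = splits("abc")
print(pairs)   # [('a', 'bc'), ('ab', 'c'), ('abc', '')]

# L caps the length of the first piece.
capped = splits("abcdef", L=2)
print(capped)  # [('a', 'bcdef'), ('ab', 'cdef')]
```

A correct port should yield the same pairs in the same order. Note that the last pair has an empty remainder, which is what lets the recursion in segment reach its base case.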

Next is Pwords, which just calls the product function on the result of Pw for each word in its input list:

static double Pwords(IEnumerable<string> words)
{
    return product(from w in words select Pw(w));
}

And product is very simple:

static double product(IEnumerable<double> nums)
{
    // The seed of 1.0 mirrors reduce(operator.mul, nums, 1) and handles an empty sequence.
    return nums.Aggregate(1.0, (a, b) => a * b);
}
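One detail worth checking in a port is the empty case: the Python product passes an initial value of 1 to reduce, so an empty sequence yields 1 rather than an error. A small Python 3 sketch of that behavior (the listing in the question is Python 2; in Python 3, reduce has to be imported from functools):

```python
import operator
from functools import reduce

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

empty = product([])
halves = product([0.5, 0.5, 0.5])
print(empty)   # 1
print(halves)  # 0.125
```

In C#, Aggregate without a seed throws on an empty sequence, while the overload that takes a seed returns the seed, so the seeded overload is the closer match to the Python code.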

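Addendum:

Looking at the full source code, it is clear that Norvig intends the results of the segment function to be memoized for speed. Here is a version that provides this speedup: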

static Dictionary<string, IEnumerable<string>> segmentTable =
   new Dictionary<string, IEnumerable<string>>();

static IEnumerable<string> segment(string text)
{
    if (text == "") return new string[0]; // C# idiom for empty list of strings
    if (!segmentTable.ContainsKey(text))
    {
        var candidates = from pair in splits(text)
                         select new[] {pair.Item1}.Concat(segment(pair.Item2));
        // Python's max picks the highest-probability candidate, so sort descending.
        segmentTable[text] = candidates.OrderByDescending(Pwords).First().ToList();
    }
    return segmentTable[text];
}
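A quick way to test a port like this is to run the Python logic on a tiny vocabulary and compare results. The sketch below is a Python 3 adaptation, not Norvig's listing: the counts in COUNTS are made up for illustration, functools.lru_cache stands in for @memo, and segment returns a tuple so that cached results cannot be mutated by callers.

```python
from functools import lru_cache
from math import prod

# Hypothetical toy counts, for illustration only (not from count_1w.txt).
COUNTS = {"choose": 50, "spain": 30, "chooses": 5, "pain": 20, "a": 100, "in": 80}
N = sum(COUNTS.values())

def Pw(word):
    "Probability of a word; unknown words are penalized by their length."
    if word in COUNTS:
        return COUNTS[word] / N
    return 10.0 / (N * 10 ** len(word))

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return prod(Pw(w) for w in words)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:]) for i in range(min(len(text), L))]

@lru_cache(maxsize=None)          # plays the role of @memo
def segment(text):
    "Return a tuple of words that is the best segmentation of text."
    if not text:
        return ()
    candidates = ((first,) + segment(rem) for first, rem in splits(text))
    return max(candidates, key=Pwords)

result = segment("choosespain")
print(result)  # ('choose', 'spain')
```

With these toy counts, "choose spain" outscores "chooses pain" because both of its words are more frequent. Feeding the same counts to the C# version should give the same split.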

User

Answer 2 · edited 2024-05-23 19:25:18

I don't know C# at all, but I can explain how the Python code works.

@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

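The first line

@memo

is a decorator. It causes the function (defined on the following lines) to be wrapped in another function. Decorators are often used to filter inputs and outputs. In this case, judging by the name and the role of the function it wraps, I believe it memoizes calls to segment.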
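The memo decorator itself is not shown in the question. Here is a minimal sketch of what such a decorator typically looks like, assuming a single hashable argument. The names slow_square and calls are hypothetical and exist only to show the cache at work:

```python
import functools

def memo(f):
    "Cache f's results, keyed by its single argument."
    table = {}

    @functools.wraps(f)
    def wrapper(arg):
        if arg not in table:
            table[arg] = f(arg)   # compute once, on the first call
        return table[arg]
    return wrapper

calls = []

@memo
def slow_square(n):
    calls.append(n)   # record each real (non-cached) invocation
    return n * n

first = slow_square(4)
second = slow_square(4)   # served from the cache
print(first, second, calls)  # 16 16 [4]
```

Writing @memo above a def is equivalent to rebinding the name after the definition, as in slow_square = memo(slow_square). Because the recursive calls inside segment go through the same rebound name, they hit the cache as well.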

Next:

def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []

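declares the function proper, gives it a docstring, and sets the termination condition for the function's recursion.

Next comes the most complicated line, and probably the one that gave you trouble: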

    candidates = ([first]+segment(rem) for first,rem in splits(text))

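The outer parentheses, combined with the for..in construct, create a generator expression. This is an efficient way of iterating over a sequence, in this case splits(text). A generator expression is a kind of compact for loop that yields values. In this case, those values become the elements of the iterable candidates. "Genexps" are similar to list comprehensions, but they are more memory-efficient because they do not retain every value they produce.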
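The difference from a list comprehension is easiest to see by watching when the work happens. A small sketch, where noisy_double and log are hypothetical names introduced for the demonstration:

```python
log = []

def noisy_double(x):
    log.append(x)      # record that this element was actually computed
    return 2 * x

as_list = [noisy_double(x) for x in range(3)]   # runs immediately
after_list = list(log)

log.clear()
as_gen = (noisy_double(x) for x in range(3))    # nothing runs yet
after_gen_created = list(log)

first_value = next(as_gen)                      # computes one element on demand
after_one_next = list(log)

print(after_list)         # [0, 1, 2]
print(after_gen_created)  # []
print(first_value)        # 0
print(after_one_next)     # [0]
```

In segment, max pulls candidates from the generator one at a time, so the full collection of candidate segmentations never has to exist in memory at once.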

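So, for each value in the iterable returned by splits(text), the generator expression produces a list.

Each value from splits(text) is a (first, rem) pair.

Each produced list starts with the object first; this is expressed by putting first inside a list literal, i.e. [first]. Then another list is added to it; that second list is determined by a recursive call to segment. Adding lists in Python concatenates them, i.e. [1, 2] + [3, 4] gives [1, 2, 3, 4].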

Finally, in

    return max(candidates, key=Pwords)

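the recursively determined iterable of lists and a key function are passed to max. The key function is called on each value in the iterable to obtain the score used to decide which list is the maximum.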
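To make the role of key concrete, here is a sketch with a stand-in scoring function. toy_score and the probabilities in TOY_P are invented for the example and are not the Pwords from the question:

```python
# Hypothetical per-word probabilities, for illustration only.
TOY_P = {"now": 0.4, "here": 0.3, "no": 0.2, "where": 0.1, "nowhere": 0.05}

def toy_score(words):
    "Product of per-word probabilities; unknown words score 0."
    score = 1.0
    for w in words:
        score *= TOY_P.get(w, 0.0)
    return score

candidates = [["nowhere"], ["now", "here"], ["no", "where"]]
best = max(candidates, key=toy_score)
print(best)  # ['now', 'here']

# On ties, max returns the first maximal element it encounters.
tie = max(["bb", "aa", "c"], key=len)
print(tie)   # bb
```

max returns the original element, not its score, which is why segment gets back a list of words rather than a probability. A port built on sorting has to take the element with the highest key to match this.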

