将文本与模板比较以检测异常（反向模板）

8 投票

3 回答

1380 浏览

提问于 2025-04-17 14:51

我在寻找一种算法，或者说一种算法的思路，来解决验证短文本（比如电子邮件）是否符合已知模板的问题。编程语言可能会用Python或Perl，但这方面是可以灵活调整的。

问题是这样的：

需要访问生产数据的服务器能够发送电子邮件，这些邮件能够到达互联网：

Dear John Smith,

We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
      $12.34 Spuznitz, LLC on 4/1
      $43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal. 
Thank you for your business!

显然，电子邮件的内容会有所不同，比如称呼（“约翰·史密斯”）、金额（“$123.45，日期：2/4/13”）以及打印出来的交易记录等。而有些部分（比如“我们收到了您上次的付款”）则是非常固定的。我想要能够匹配文本中固定的部分，并且量化动态部分在某个合理的范围内（比如我可能知道最多打印的交易记录行数是5行）。

因为我担心数据泄露，所以我想确保不符合这个模板的电子邮件不会被发送出去——我想检查电子邮件，并对任何看起来不符合预期的内容进行隔离。因此，我需要自动化这个模板匹配，并阻止任何与模板相差太远的邮件。

所以问题是，我该去哪里寻找过滤机制呢？贝叶斯过滤器试图验证特定消息与非特定内容之间的相似性，这其实是相反的问题。像Perl的模板模块是一个很好的匹配——但它是用于输出，而不是输入或比较。简单的“差异”类型比较对有限的动态信息处理得并不好。

我该如何测试这些即将发送的电子邮件是否“像鸭子一样叫”呢？

自动化处理文本验证数据安全模板匹配信息隔离电子邮件过滤异常检测动态内容分析

3 个回答

我推荐使用“最长公共子序列”这个方法。你可以在这里找到一个标准的实现：

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence

如果你需要一个更好的算法，或者想要了解更多关于字符串不完全匹配的想法，推荐你看看这本书：

http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198

别被书名迷惑了。计算生物学主要是关于如何匹配大量长字符串的数据（也就是DNA序列）。

回答于 2025-04-17 由 Python大师

分享举报

如果你想把一个已经存在的模板，比如说里面有控制流程的元素像 {% for x in y %}，和它应该输出的结果进行对比，你就得解析这个模板语言——这听起来工作量很大。

另一方面，如果你愿意为验证目的写一个第二个模板——类似于：

Dear {{customer}},

We received your last payment for {{currency}} on {{full-date}}\. We'd like you to be aware of the following charges:
(      {{currency}} {{supplier}} on {{short-date}}
){,5}As always, you can view these transactions in our portal\.

... 这其实只是正则表达式语法的一个简单扩展，弄出一个能和它进行验证的东西其实挺简单的：

import re

FIELDS = {
    "customer": r"[\w\s\.-]{,50}",
    "supplier": r"[\w\s\.,-]{,30}",
    "currency": r"[$€£]\d+\.\d{2}",
    "short-date": r"\d{,2}/\d{,2}",
    "full-date": r"\d{,2}/\d{,2}/\d{2}",
}

def validate(example, template_file):
    with open(template_file) as f:
        template = f.read()
    for tag, pattern in FIELDS.items():
        template = template.replace("{{%s}}" % tag, pattern)
    valid = re.compile(template + "$")
    return (re.match(valid, example) is not None)

上面的例子绝对不是史上最好的Python代码，但它足够让你明白大概的意思。

回答于 2025-04-17 由 Python大师

分享举报

你可以使用语法来进行精确匹配。可以把正则表达式组织成语法，这样更容易理解和使用。具体可以参考这个链接：http://www.effectiveperlprogramming.com/blog/1479

或者你也可以使用一个专门的语法引擎，比如Marpa。

如果你想用更统计的方法，可以考虑n-grams。首先，把文本分割成小块，然后用有意义的占位符替换这些小块，比如用CURRENCY和DATE。接着，构建这些n-grams。这样你就可以用Jaccard指数来比较两段文本。

这里有一个用纯Perl实现的三元组（trigrams）示例：

#!/usr/bin/env perl
use strict;
use utf8;
use warnings;

my $ngram1 = ngram(3, tokenize(<<'TEXT1'));
Dear John Smith,

We received your last payment for $123.45 on 2/4/13. We'd like you to be aware of the following charges:
      $12.34 Spuznitz, LLC on 4/1
      $43.21 1-800-FLOWERS on 4/2
As always, you can view these transactions in our portal. 
Thank you for your business!
TEXT1

my $ngram2 = ngram(3, tokenize(<<'TEXT2'));
Dear Sally Bates,

We received your last payment for $456.78 on 6/9/12. We'd like you to be aware of the following charges:
      $123,43 Gnomovision on 10/1
As always, you can view these transactions in our portal. 
Thank you for your business!
TEXT2

my %intersection =
    map { exists $ngram1->[2]{$_} ? ($_ => 1) : () }
    keys %{$ngram2->[2]};
my %union =
    map { $_ => 1 }
    keys %{$ngram1->[2]}, keys %{$ngram2->[2]};

printf "Jaccard similarity coefficient: %0.3f\n", keys(%intersection) / keys(%union);

sub tokenize {
    my @words = split m{\s+}x, lc shift;

    for (@words) {
        s{\d{1,2}/\d{1,2}(?:/\d{2,4})?}{ DATE }gx;
        s{\d+(?:\,\d{3})*\.\d{1,2}}{ FLOAT }gx;
        s{\d+}{ INTEGER }gx;
        s{\$\s(?:FLOAT|INTEGER)\s}{ CURRENCY }gx;
        s{^\W+|\W+$}{}gx;
    }

    return @words;
}

sub ngram {
    my ($size, @words) = @_;
    --$size;

    my $ngram = [];
    for (my $j = 0; $j <= $#words; $j++) {
        my $k = $j + $size <= $#words ? $j + $size : $#words;
        for (my $l = $j; $l <= $k; $l++) {
            my @buf;
            for my $w (@words[$j..$l]) {
                push @buf, $w;
            }
            ++$ngram->[$#buf]{join(' ', @buf)};
        }
    }

    return $ngram;
}

你可以把一段文本当作模板，然后用它来匹配你的邮件。可以查看String::Trigram，这是一个高效的实现。还有Google Ngram Viewer，这是一个很好的资源，可以帮助你理解n-gram匹配。

回答于 2025-04-17 由 Python大师

分享举报

将文本与模板比较以检测异常（反向模板）

3 个回答

撰写回答