An efficient way to find the token (word) index after a regex string search

3 votes

2 answers

2582 views

数据工程师

Asked 2025-04-16 17:56

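I need to search a string y for another string x using a regular expression, and I also need to know where the first character of the match falls once y has been tokenized (with another regular expression, for example on whitespace). Because the first regex may match a substring, I cannot guarantee that the match starts at the beginning of a word.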

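So what is the best way to implement this? A naive approach would be:

1. Use the first regex to find x in y and get the character position z
2. Use the second regex to split y into an array of elements
3. Iterate over the array, adding each element's length to a variable LENGTH and incrementing a counter COUNTER
4. Stop the loop when LENGTH is greater than or equal to z
5. The index of the word containing the first character of the match is the value of COUNTER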

(This assumes that the split function also returns the separators, such as spaces, as array elements, which is wasteful.)
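The steps above can be sketched roughly as follows. This is only an illustration in Python, not code from the question: the function name `word_index_naive` and the default separator `\s+` are assumptions, and the separator pattern is assumed to contain no capturing groups of its own. Note that the sketch stops as soon as the running length strictly exceeds z, because stopping at "greater than or equal" would land one piece too early when the match begins exactly at the start of a word.

```python
import re


def word_index_naive(text, pattern, separator=r"\s+"):
    """Return the 0-based index of the word in which the first match starts.

    Hypothetical helper: `separator` must not contain capturing groups.
    Returns None when `pattern` does not match.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    z = match.start()

    # Wrapping the separator in a group makes re.split keep the separators,
    # so the pieces alternate: word, separator, word, separator, ...
    pieces = re.split(f"({separator})", text)

    length = 0
    for i, piece in enumerate(pieces):
        length += len(piece)
        if length > z:
            # Words sit at even positions, so i // 2 is the word index.
            # A match starting inside a separator is attributed to the
            # word just before it.
            return i // 2

    # Only reached for a zero-length match at the very end of the text.
    return (len(pieces) - 1) // 2


print(word_index_naive("The moon is made of cheese", "ade"))  # prints 3
```

Every call re-splits the whole string, which is the waste the question mentions; it is acceptable for a one-off search but not for many searches over the same text.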

A simple example: suppose I want to know the word index when searching for "ade" in the string "The moon is made of cheese". The function should return 3 (since arrays are zero-indexed).

==Edit==
The algorithm also needs to work when the regex match spans word boundaries. For example, searching for "de of ch" in "The moon is made of cheese" should also return index 3.
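Since only the position of the first matched character matters, a match that spans several words needs no special handling. One possible sketch (Python; `token_starts` and `word_index_of_match` are hypothetical names, and tokens are assumed to be runs of non-whitespace characters) records where each token starts once and then binary-searches the match offset, which avoids re-splitting the string for every search.

```python
import bisect
import re


def token_starts(text, token_pattern=r"\S+"):
    """Return the start offset of every token, in ascending order."""
    return [m.start() for m in re.finditer(token_pattern, text)]


def word_index_of_match(text, pattern, starts):
    """Return the 0-based index of the token in which the match begins.

    `starts` is the list produced by token_starts(text).
    Returns None when `pattern` does not match.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    # bisect_right counts the tokens that start at or before the match,
    # so subtracting 1 gives the index of the last such token.
    # max(..., 0) covers a match that begins in leading whitespace.
    return max(bisect.bisect_right(starts, match.start()) - 1, 0)


text = "The moon is made of cheese"
starts = token_starts(text)
print(word_index_of_match(text, "de of ch", starts))  # prints 3
```

A match that begins inside a separator is attributed to the preceding word here; whether that is the desired behavior depends on the application.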


2 Answers

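Find the first pattern in the string, then count how many times the second (separator) pattern occurs in the part of the string before that match.

Here is a Perl script that does this: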

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string    = 'The moon is made of cheese';
    my $lookedfor = 'de of che';    # interpolated as a regex below
    my $separator = q/\W+/;

    my $count;
    if ($string =~ /(.*?)$lookedfor/s) {
        # The non-greedy (.*?) captures the shortest prefix before the match.
        my $firstpart = $1;

        $count = 0;
        # Count the separators in that prefix; each one starts a new word.
        $count++ while $firstpart =~ m/$separator/g;
    }

    if (defined $count) {
        printf "index of '%s' in '%s' is %d\n", $lookedfor, $string, $count;
    } else {
        printf "No occurrence of '%s' in '%s'\n", $lookedfor, $string;
    }

Answered 2025-04-16 by Python大师


Based on your update:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = "The moon is made of cheese";
    my $search = 'de of ch';
    my $pos = index($string, $search);    # literal substring search
    if ($pos != -1) {
        my $prefix = substr($string, 0, $pos);
        # Count the whitespace runs before the match. Unlike split, this is
        # also correct when the match starts exactly at a word boundary.
        my $index = () = $prefix =~ /\s+/g;
        print "found in word #$index\n";
    } else {
        print "not found\n";
    }

Output:

found in word #3

Answered 2025-04-16 by Python大师
