从R或Python中的列获取经常出现的字符串模式

2024-05-16 12:41:41 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一列字符串名称,我想找到经常出现的模式(单词)。 有没有一种方法可以返回长度大于X(或等于X)的字符串,并且在整列中出现的次数比Y还多?在

column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns <- function(.....) 
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
>     what   times 
[1]   bla    5
[2]   okay   5

引用Tim的评论:

我想删除嵌套的,所以如果有“aaaaaaa”和“aaaa”,并且这两个都会出现在输出中,那么只有“aaaaaaa”和其中一个出现的次数才算数。在

如果atleaststringsize=3和{},则两个输出都将相同。假设atleasttimes=10,“aaaaaaaa”出现15次,“aaaaaa”出现15次,然后:

^{pr2}$

以及

getOftenOccurringPatterns(column, atleaststringsize=4, atleasttimes=10) 
>    what      times
[1]  aaaaaaaa    15

停留时间最长的一个,并且至少=3和至少=4都是一样的。在


Tags: 方法字符串名称模式column单词次数what
3条回答

这将创建所有子字符串的所有出现的向量;它这样做很天真,迭代输入字符串max(nchar(x))的最大长度,并查找长度为1、2、。。。max(nchar(x)),所以在多项式时间内缩放它对于超大型问题是无效的。在

本修订版包含以下更改:

  1. 在前一个版本的内部和外部循环中,.accumulate实现了可怕的“复制并追加”模式;现在我们在预先分配的列表answer0中累积结果,然后在内部循环之后累积这些结果。

  2. allSubstrings()有参数min_occurmin_nchar(和max_nchar)来限制搜索空间。尤其是,min_occur(子串必须出现以保留的最小次数)有助于减少搜索较长子串的字符向量的长度。

  3. 函数.filter()可用于更积极地删除不包含长度为i的子字符串的字符串;这可能代价高昂,因此可以设置一个启发式和参数useFilter。过滤器的使用使整个解决方案看起来更像是一个黑客而不是一个算法-关于子串的信息已经被提取出来了,所以我们不必再回去搜索它们的出现。

这是修改后的主要功能

allSubstrings <-
    function(x, min_occur=1L, min_nchar=1L, max_nchar=max(nchar(x)),
             ..., useFilter=max(nchar(x)) > 100L)
{
    len <- nchar(x)
    x <- x[len >= min_nchar]; len <- len[len >= min_nchar]
    answer <- vector("list", max_nchar - min_nchar + 1L)
    for (i in seq(min_nchar, max_nchar)) {
        ## suffix of length i, starting at character j
        x0 <- x; len0 <- len; n <- max(len0) - i + 1L
        answer0 <- vector("list", n)
        for (j in seq_len(n)) {
            end <- j + i - 1L
            f <- factor(substr(x0, j, end))
            answer0[[j]] <- setNames(tabulate(f), levels(f))
            x0 <- x0[len0 != end]; len0 <- len0[len0 != end]
        }
        answer0 <- unlist(answer0)        # accumulate across start positions
        answer0 <- vapply(split(answer0, names(answer0)), sum, integer(1))
        answer0 <- answer0[answer0 >= min_occur]
        if (length(answer0) == 0L)
            break
        answer[[i - min_nchar + 1L]] <- answer0

        idx <- len != i                   # no need to process some strings
        if (useFilter)
            idx[idx] <- .filter(x[idx], names(answer0))
        x <- x[idx]; len <- len[idx]
        if (length(x) == 0L)
            break
    }
    unlist(answer[seq_len(i)])
}

以及.filter函数

^{pr2}$

与之前一样,result是一个命名向量,其中名称是字符串,值是它们出现的次数。在

> column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
> xx <- allSubstrings(column)
> head(sort(xx, decreasing=TRUE))
 a  b  o  k  l  y 
10  5  5  5  5  5 
> xtabs(~nchar(names(xx)) + xx)
                xx
nchar(names(xx))  1  2  3  5 10
              1   2  1  1  5  1
              2   8  2  0  5  0
              3  15  1  0  3  0
              4  20  1  0  1  0
              5  22  0  0  0  0
....

类似于原始问题的查询就很容易执行,例如,>;=3个字符的所有子字符串出现次数超过4次:

> (ok <- xx[nchar(names(xx)) >= 3 & xx > 4])
 bla  oka  kay okay 
   5    5    5    5 

代码不能完全回答问题,例如,存在嵌套子字符串,但可能会替换@user1609452答案的嵌套lapply部分。后处理-消除嵌套子序列的结果有点不雅观,但由于后处理的结果不是很大,因此可能足够快,例如,消除嵌套子串

> fun <- function(p, q) length(grep(p, q, fixed=TRUE))
> ok[ sapply(names(ok), fun, names(ok)) == 1L ]   
 bla okay 
   5    5 

在这里,我们使用笔记本电脑上的99k单词字典进行输入,并为修改后的算法提供一些基本的时间安排

> timer <- function(n, x, ...)
    system.time(allSubstrings(head(x, n), ...))[[3]]
> n <- c(100, 1000, 10000, 20000)
> data.frame(n=n, elapsed=sapply(n, timer, words))
      n elapsed
1   100   0.050
2  1000   0.074
3 10000   0.490
4 20000   1.031

这大约比原始算法快10倍,这完全是因为修订版1(使用预分配和填充,然后是累加)。在

这里有一个较长句子的语料库

shakes <- readLines("http://www.gutenberg.org/cache/epub/100/pg100.txt")
shakes <- paste(shakes[nchar(shakes) != 0], collapse=" ")
shakes <- gsub(" +", " ", shakes)
shakes <- strsplit(shakes, "\\. +",)[[1]]

还有一些时间安排。这从指定min_occur参数和过滤器的使用中受益匪浅。在

> n <- c(100, 1000, 2000, 5000)
> data.frame(n=n, elapsed=sapply(n, timer, shakes, min_occur=10))
     n elapsed
1  100   1.725
2 1000   7.724
3 2000  12.415
4 5000  60.914

由于需要使用过滤器,并且在较长字符串上的性能较差,因此人们希望得到更好的算法,如suffix array;“Rlibstree”包也可能有用,尽管我不确定从何处获得当前版本,或者接口的公开部分是否足以回答最初的问题。在

它没有经过任何测试,也不会赢得任何速度比赛:

getOftenOccuringPatterns <- function(column, atleaststringsize, atleasttimes, uniqueInColumns = FALSE){

  res <- 
  lapply(column,function(x){
    lapply(atleaststringsize:nchar(x),function(y){
      if(uniqueInColumns){
        unique(substring(x, 1:(nchar(x)-y+1), y:nchar(x)))
      }else{
        substring(x, 1:(nchar(x)-y+1), y:nchar(x))
      }
    })
  })

  orderedRes <- unlist(res)[order(unlist(res))]
  encodedRes <- rle(orderedRes)

  partRes <- with(encodedRes, {check = (lengths >= atleasttimes);
                               list(what = values[check], times = lengths[check])})
  testRes <- sapply(partRes$what, function(x){length(grep(x, partRes$what)) > 1})

  lapply(partRes, '[', !testRes)

}


column <- c("bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla")
getOftenOccuringPatterns(column, atleaststringsize=3, atleasttimes=4)
$what

 "bla" "okay" 

$times

5 5 


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4)
$what
[1] "aaaaaa"

$times
[1] 6


getOftenOccuringPatterns(c("aaaaaaaa", "aaaaaaa", "aaaaaa", "aaaaa", "aaaa", "aaa"), atleaststringsize=3, atleasttimes=4, uniqueInColumn = TRUE)
$what
[1] "aaaaa"

$times
[1] 4

好的,我用Python编写了一个解决方案。对不起,我不能给你一个工作的R程序,但你应该能够实现一个从这个。正如您所看到的,这是一个非常暴力的解决方案,但是我真的看不出一种方法来从输入中的所有字符串构建所有可能的子字符串。在

我把这个问题分解成了简单、独立的步骤。这些应该很容易翻译成R。我确信R中有可比较的列表、集合和计数器的数据结构。在

from collections import Counter
strings = ["bla1okay", "okay1243bla", "blaokay", "bla12okay", "okaybla"]

def substrings(s, minlength=3):
    """Finds all possible unique substrings of s, given a minimum length.

    >>> substrings("12345")
    {'1234', '234', '345', '12345', '123', '2345'}
    >>> substrings("123123")
    {'2312', '123123', '12312', '123', '23123', '1231', '231', '3123', '312'}
    >>> substrings("aaaaa")
    {'aaaaa', 'aaaa', 'aaa'}
    """
    maxsize = current = len(s)
    result = []
    while current >= minlength:
        result.extend([s[start:start+current] 
                       for start in range(maxsize-current+1)])
                                  # range(5) is [0,1,2,3,4]
        current -= 1
    return set(result) # set() removes duplicates

def all_substrings(strings, minlength=3):
    """Returns the union of all the sets of substrings of a list of strings.

    >>> all_substrings(["abcd", "1234"])
    {'123', 'abc', 'abcd', '1234', 'bcd', '234'}
    >>> all_substrings(["abcd", "bcde"])
    {'abc', 'bcd', 'cde', 'abcd', 'bcde'}
    """
    result = set()
    for s in strings:
        result |= substrings(s, minlength)
        # "|=" is the set union operator
    return result

def count(strings, minlength=3):
    """Counts the occurrence of each substring within the provided list of strings,
    given a minimum length for each substring.

    >>> count(["abcd", "bcde"])
    Counter({'bcd': 2, 'bcde': 1, 'abc': 1, 'abcd': 1, 'cde': 1})
    """
    substrings = all_substrings(strings, minlength)
    counts = Counter()
    for substring in substrings:       # Check each substring
         for string in strings:        # against each of the original strings
             if substring in string:   # to see whether it is contained there
                 counts[substring] += 1
    return counts

def prune(counts, mincount=4):
    """Returns only the longest substrings whose count is >= mincount.
    First, all the substrings with a count < mincount are eliminated.
    Then, only those that aren't substrings of a longer string are kept.
    >>> prune(Counter({'bla': 5, 'kay': 5, 'oka': 5, 'okay': 5, 'la1': 2, 'bla1': 2}))
    [('okay', 5), ('bla', 5)]
    """
    # Throw out all counts < mincount. Sort result by length of the substrings.
    candidates = sorted(((s,c) for s,c in counts.items() if c >= mincount), 
                        key=lambda l: len(l[0]), reverse=True) # descending sort
    result = []
    seenstrings = set()      # Set of strings already in our result
    # (we could also look directly in the result, but set lookup is faster)
    for item in candidates:
        s = item[0]          # item[0] contains the substring
        # Make sure that s is not already in our result list
        if not any(s in seen for seen in seenstrings): 
            result.append(item)
            seenstrings.add(s)
    return result

counts = count(strings)
print(prune(counts))

输出:

^{pr2}$

相关问题 更多 >