Codility Genomic Range Query

数据工程师

Asked on 2025-04-18 01:09

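I recently discovered Codility and am working through the demo training. I wrote a solution to the Genomic Range Query problem using a dynamic-programming (prefix count) approach. It runs fine, but it scores only 87% instead of the 100% I expected.

Does anyone know why?

You can find the problem in the Prefix Sums lesson of Codility Training. Just start a test to see the problem statement.

Thanks!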

def solution(S, P, Q):
    # write your code in Python 2.6
    S = list(S)
    sol = [[0]*len(S),[0]*len(S),[0]*len(S),[0]*len(S)]

    mapping = {"A":1, "C":2, "G":3, "T":4}

    for i in range(0,len(S)):
        if S[i] == 'A':
            sol[0][i]+= 1

        elif S[i] == 'C':
            sol[1][i] += 1

        elif S[i] == 'G':
            sol[2][i] += 1

        elif S[i] == 'T':
            sol[3][i] += 1

        if i < len(S)-1:
            sol[0][i+1] = sol[0][i]
            sol[1][i+1] = sol[1][i]
            sol[2][i+1] = sol[2][i]
            sol[3][i+1] = sol[3][i]

    for n in range(0, len(P)):

            l = P[n]
            r = Q[n]
            pre_sum = [0,0,0,0]
            if l > 0:
                pre_sum = [sol[0][l],sol[1][l],sol[2][l],sol[3][l]]
            post_sum = [sol[0][r],sol[1][r],sol[2][r],sol[3][r]]
            if post_sum[0]-pre_sum[0] > 0:
                P[n] = 1
            elif post_sum[1]-pre_sum[1] > 0:
                P[n] = 2
            elif post_sum[2]-pre_sum[2] > 0:
                P[n] = 3
            elif post_sum[3]-pre_sum[3] > 0:
                P[n] = 4
            else:
                P[n] = mapping[S[P[n]]];

    return P




9 Answers

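For each position i = 0, 1, ..., N-1 and each nucleotide type, we can compute the distance from i back to the nearest occurrence of that type, counting the current position as well as all earlier ones. A value of -1 means the type has not occurred yet.

For S = CAGCCTA, the distance array pre_dists looks like this: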

    |   C   A    G    C    C    T    A  |
----|-----------------------------------|
 A  |  -1   0    1    2    3    4    0  |
 C  |   0   1    2    0    0    1    2  |
 G  |  -1  -1    0    1    2    3    4  |
 T  |  -1  -1   -1   -1   -1    0    1  |

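From these distances I can get the minimal impact factor of any slice: a nucleotide type occurs in S[p..q] exactly when its distance d at position q is non-negative and q - d >= p. Checking the types in the order A, C, G, T, the first one that occurs gives the answer.

My implementation in Python: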

def solution(S, P, Q):
    
    N = len(S)
    M = len(P)

    # impact factors
    I = {'A': 1, 'C': 2, 'G': 3, 'T': 4}
    
    # distance from current position to the nearest nucleotide
    # for each nucleotide type (previous or current nucleotide are considered) 
    # e.g. current position is 'A' => the distance dist[0] = 0, index 0 for type A
    #                          'C' => the distance dist[1] = 0, index 1 for type C
    pre_dists = [[-1]*N,[-1]*N,[-1]*N,[-1]*N]

    # initial values
    pre_dists[I[S[0]]-1][0] = 0

    for i in range(1, N):
        
        for t in range(4):
            if pre_dists[t][i-1] >= 0:
                # increase the distances
                pre_dists[t][i] = pre_dists[t][i-1] + 1

        # reset distance for current nucleotide type
        pre_dists[I[S[i]]-1][i] = 0
    
    # result keeper
    res = [0]*M

    for k in range(M):
        p = P[k]
        q = Q[k]

        if pre_dists[0][q] >=0 and q - pre_dists[0][q] >= p:
            res[k] = 1
        elif pre_dists[1][q] >=0 and q - pre_dists[1][q] >= p:
            res[k] = 2
        elif pre_dists[2][q] >=0 and q - pre_dists[2][q] >= p:
            res[k] = 3
        else:
            res[k] = 4
    
    return res

Hope this helps. Thanks!

Answered on 2025-04-18 by Python大师

If anyone is still interested in this exercise, here is my Python solution (it scored 100/100 on Codility):

def solution(S, P, Q):

    count = []
    for i in range(3):
        count.append([0]*(len(S)+1))

    for index, i in enumerate(S):
        count[0][index+1] = count[0][index] + ( i =='A')
        count[1][index+1] = count[1][index] + ( i =='C')
        count[2][index+1] = count[2][index] + ( i =='G')

    result = []

    for i in range(len(P)):
      start = P[i]
      end = Q[i]+1

      if count[0][end] - count[0][start]:
          result.append(1)
      elif count[1][end] - count[1][start]:
          result.append(2)
      elif count[2][end] - count[2][start]:
          result.append(3)
      else:
          result.append(4)

    return result

Answered on 2025-04-18 by Python大师

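Here is an algorithm that scores 100, runs in O(N+M) time, and does not rely on language-specific shortcuts such as the in or contains operators:

Let's define prefix as:
 * the last index at which a particular nucleotide occurs at or before the current position. If there is no such occurrence, put -1.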
 * 
 * 
 * indexes:     0   1   2   3   4   5   6
 * factors:     2   1   3   2   2   4   1
 *              C   A   G   C   C   T   A
 *              
 * prefix : A  -1   1   1   1   1   1   6
 *          C   0   0   0   3   4   4   4
 *          G  -1  -1   2   2   2   2   2
 *          T  -1  -1  -1  -1  -1   5   5
 *
 * With prefix defined this way, the minimal factor for a query is easy to compute.
 * The subsequence S[p]S[p+1]...S[q-1]S[q] has the lowest factor (take the first condition that holds):
 * 1 if prefix index [A][q] >= p
 * 2 if prefix index [C][q] >= p
 * 3 if prefix index [G][q] >= p
 * 4 if prefix index [T][q] >= p

This is my implementation of the idea.
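The implementation itself is not reproduced here. As a minimal sketch of the last-occurrence idea described above (the names `letters` and `last` are chosen for illustration and are not from the original answer), it might look like this in Python:

```python
def solution(S, P, Q):
    # Order matters: A, C, G, T carry impact factors 1, 2, 3, 4.
    letters = 'ACGT'
    n = len(S)

    # last[t][i] = greatest index j <= i with S[j] == letters[t], or -1 if none.
    last = [[-1] * n for _ in letters]
    for i, ch in enumerate(S):
        for t, letter in enumerate(letters):
            if ch == letter:
                last[t][i] = i
            elif i > 0:
                last[t][i] = last[t][i - 1]

    result = []
    for p, q in zip(P, Q):
        # The first type whose last occurrence at or before q is inside [p, q] wins.
        for t in range(4):
            if last[t][q] >= p:
                result.append(t + 1)
                break
    return result
```

Building the table takes O(N) time and each query is answered in constant time, which matches the O(N+M) bound stated above. For the example S = CAGCCTA with P = [2, 5, 0] and Q = [4, 5, 6], this sketch returns [2, 4, 1].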

Answered on 2025-04-18 by Python大师

This approach also works perfectly and scores 100/100.

def solution(S, P, Q):
    res = []
    for i in range(len(P)):
        if 'A' in S[P[i]:Q[i]+1]:
            res.append(1)
        elif 'C' in S[P[i]:Q[i]+1]:
            res.append(2)
        elif 'G' in S[P[i]:Q[i]+1]:
            res.append(3)
        else:
            res.append(4)
    return res
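Because the slicing version is so simple, it can also serve as a reference when debugging a faster solution. The sketch below is only an illustration: `brute_force`, `random_case` and `first_mismatch` are hypothetical helper names, not part of Codility or of any answer here. It compares a candidate against the brute-force result on small random inputs and returns the first input where they disagree.

```python
import random

IMPACT = {'A': 1, 'C': 2, 'G': 3, 'T': 4}

def brute_force(S, P, Q):
    # Reference answer: smallest impact factor in each inclusive slice S[p..q].
    return [min(IMPACT[c] for c in S[p:q + 1]) for p, q in zip(P, Q)]

def random_case(rng, max_len=12, max_queries=8):
    # Hypothetical helper: a small random instance with 0 <= p <= q < N.
    n = rng.randint(1, max_len)
    S = ''.join(rng.choice('ACGT') for _ in range(n))
    P, Q = [], []
    for _ in range(rng.randint(1, max_queries)):
        p = rng.randint(0, n - 1)
        q = rng.randint(p, n - 1)
        P.append(p)
        Q.append(q)
    return S, P, Q

def first_mismatch(candidate, trials=1000, seed=0):
    # Returns the first (S, P, Q) on which candidate disagrees, or None.
    rng = random.Random(seed)
    for _ in range(trials):
        S, P, Q = random_case(rng)
        # Pass copies, since some solutions overwrite P in place.
        if candidate(S, list(P), list(Q)) != brute_force(S, P, Q):
            return S, P, Q
    return None
```

Calling `first_mismatch(solution)` on a solution with an off-by-one error in its prefix counts should turn up a short counterexample that is easy to trace by hand.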

Answered on 2025-04-18 by Python大师

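Oh, I worked on this one too. Debugging took me a long time, but in the end I got it and scored 100.

For example, with S='AGT', P=[1] and Q=[2], the function should return 3, because the impact factor of G is 3. Your code (and my first attempt) returns 4, the impact factor of T. The reason is that sol[..][l] already counts the nucleotide at position l, so subtracting it drops the first element of the slice.

I think this change fixes it, since the count before the slice has to be taken at l-1: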

if l > 0: pre_sum = [sol[0][l-1],sol[1][l-1],sol[2][l-1],sol[3][l-1]]
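To show the fix in context, here is a compact sketch of the question's approach with that one change applied. It keeps the inclusive prefix counts `sol` from the question but is restructured with loops for brevity, so it is an illustration and not the asker's exact code:

```python
def solution(S, P, Q):
    letters = 'ACGT'  # impact factors 1, 2, 3, 4 in this order
    n = len(S)

    # sol[t][i] = number of letters[t] in S[0..i], inclusive, as in the question.
    sol = [[0] * n for _ in letters]
    for i, ch in enumerate(S):
        for t, letter in enumerate(letters):
            prev = sol[t][i - 1] if i > 0 else 0
            sol[t][i] = prev + (ch == letter)

    result = []
    for l, r in zip(P, Q):
        for t in range(4):
            # Count strictly before the slice: index l - 1, not l.
            before = sol[t][l - 1] if l > 0 else 0
            if sol[t][r] - before > 0:
                result.append(t + 1)
                break
    return result
```

With S='AGT', P=[1], Q=[2] this returns [3], whereas subtracting the count at index l would cancel the G and yield [4]. Building a new result list also avoids overwriting P while it is still being read.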

Answered on 2025-04-18 by Python大师
