A fast way to one-hot encode in Python

Posted 2024-04-26 14:48:19


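In my project I need to one-hot encode millions of DNA sequences, about 100 times each (billions of encodings of similar sequences in total), so an efficient method matters a lot to me.

Here is my code; encoding 10K sequences takes 4.5 seconds.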

import time

import numpy as np


def dna2onehot(dnaSeq):
    seqLen = len(dnaSeq)
    dnaSeq = dnaSeq.upper()

    # initialize the matrix to seqLen x 4 (columns: A, C, G, T)
    seqMatrix = np.zeros((seqLen, 4))
    # set the column matching each base to 1; other characters stay all-zero
    for i in range(seqLen):
        if dnaSeq[i] == 'A':
            seqMatrix[i, 0] = 1
        elif dnaSeq[i] == 'C':
            seqMatrix[i, 1] = 1
        elif dnaSeq[i] == 'G':
            seqMatrix[i, 2] = 1
        elif dnaSeq[i] == 'T':
            seqMatrix[i, 3] = 1
    # flatten to a 1-D vector of length seqLen * 4
    return seqMatrix.ravel()

sequence = "TCTGAGTCCCAATACACAAGAGGTTCCCTCACCTGTTCTGGTGTCAGACCCTCCCAGATGATCACCTCTCCTATGGCGGGGAAGGTGCCTGGATGTCTAAAGCCTGAAATGGGGATCTATCCCAGAAGCTGTGTAGCTTCTGCCTGTCCCAGAAGCTGTGTTGTTTCTGTATTCAGCTTGCTCACCCTCCGCAGTCCATTGATCTGCACAGACTGTTCTCAGATGGACTCGTGAGACAAGATGGCTCCTTCACCTGCTCTGGGGATCAGAACCCTCCCAGGTGGCCACCTCTCCTGTGGTGGGGAAGGTACCTGGAAGTCTTCAGCCCAAAACAGGGCCTGTCCCAGAAGCTGTGTCTCTTCTGCCTATCCCAGAAGCTGTATTGCTTCTGCTGTCCACTTGCTCACCCTCTGCAGTCTGCATGCTGATCTGCGCAGACTGTTCTCAGAGGGATCTGGCAGACAAGTTGGCTCCCTCACCTGCTCTGGGGCGGGGGGGGGGGGTTCAGAGCCCTCCTGGGCAGCCACCTCTCCTCTAGCAGAGAAGGTGCTGGGATGTCTTGAGCAGGAAACGGGGTATGTCCCAGAAGCTGTCTTGCTTCTGCAATCCACATGCTCAGCCTCTGCAGTCTGTGAGCTAATCTGGGCAGTCTGGTCTCAGGGGACTCTGGAGACAAGATGGCTCCCTCACCTGCTCTGGGGGTCAAAGCCCTCCTTGGCAGCCACCTTTTTCAGGCGGAGAAGGTGCCCGGATGTCTGGAGCCTGAAACAGGGGTATGTCCCAGACACTGTGTAGCTTCTGCCTGCCCCAGAAGATGTGTCACTTCCTCAGTCTGCTTGTTCACCCTCCACAGTCTGCAAGCTGATCTGCACAGACTGGTCTCAGAGGGACCTAGAAGACAAGATCAAGAAAAGTCTTATAGGTATAATGAATCAAGCAGAAAATGAAACATCAGAAGCTTAAGATAAAATACAGGATCTAGTCCAAATTAGCAAGAAGTA"

count = 10000
datalist = []
# time how long it takes to encode the same sequence `count` times
t1 = time.time()
for k in range(count):
    datalist.append(dna2onehot(sequence))
t2 = time.time()
print("time cost:", t2 - t1)

Do you have any suggestions for reducing the run time in Python? (My whole project is built on Python.)


1 answer
User
Reply 1 · Posted 2024-04-26 14:48:19

You can use the OneHotEncoder from scikit-learn.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# create the encoder object
encoder = OneHotEncoder()

sequence = 'TCTGAGTCCCAATACACAAGAGGTTCCCTCACCTGTTCTGGTGTCAGACCCTCCCAGATGATCACCTCTCCTATGGCG'
sequence += 'GGGAAGGTGCCTGGATGTCTAAAGCCTGAAATGGGGATCTATCCCAGAAGCTGTGTAGCTTCTGCCTGTCCCAGAAGC'
sequence += 'TGTGTTGTTTCTGTATTCAGCTTGCTCACCCTCCGCAGTCCATTGATCTGCACAGACTGTTCTCAGATGGACTCGTGA'
sequence += 'GACAAGATGGCTCCTTCACCTGCTCTGGGGATCAGAACCCTCCCAGGTGGCCACCTCTCCTGTGGTGGGGAAGGTACC'
sequence += 'TGGAAGTCTTCAGCCCAAAACAGGGCCTGTCCCAGAAGCTGTGTCTCTTCTGCCTATCCCAGAAGCTGTATTGCTTCT'
sequence += 'GCTGTCCACTTGCTCACCCTCTGCAGTCTGCATGCTGATCTGCGCAGACTGTTCTCAGAGGGATCTGGCAGACAAGTT'
sequence += 'GGCTCCCTCACCTGCTCTGGGGCGGGGGGGGGGGGTTCAGAGCCCTCCTGGGCAGCCACCTCTCCTCTAGCAGAGAAG'
sequence += 'GTGCTGGGATGTCTTGAGCAGGAAACGGGGTATGTCCCAGAAGCTGTCTTGCTTCTGCAATCCACATGCTCAGCCTCT'
sequence += 'GCAGTCTGTGAGCTAATCTGGGCAGTCTGGTCTCAGGGGACTCTGGAGACAAGATGGCTCCCTCACCTGCTCTGGGGG'
sequence += 'TCAAAGCCCTCCTTGGCAGCCACCTTTTTCAGGCGGAGAAGGTGCCCGGATGTCTGGAGCCTGAAACAGGGGTATGTC'
sequence += 'CCAGACACTGTGTAGCTTCTGCCTGCCCCAGAAGATGTGTCACTTCCTCAGTCTGCTTGTTCACCCTCCACAGTCTGC'
sequence += 'AAGCTGATCTGCACAGACTGGTCTCAGAGGGACCTAGAAGACAAGATCAAGAAAAGTCTTATAGGTATAATGAATCAA'
sequence += 'GCAGAAAATGAAACATCAGAAGCTTAAGATAAAATACAGGATCTAGTCCAAATTAGCAAGAAGTA'

# transform sequence to a Nx1 array, pass through fit/transform operation
seq_arr = np.array(list(sequence)).reshape(-1, 1)
seq_1hot = encoder.fit_transform(seq_arr).toarray()

seq_1hot
# returns:
array([[0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       ...,
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.]])

You can see which letter corresponds to which column by inspecting the following:

encoder.categories_
# returns:
[array(['A', 'C', 'G', 'T'], dtype='<U1')]

So in this case the columns are in alphabetical order.
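
Note that this column order is inferred from the data the encoder is fitted on, so a sequence that happens to lack one base, or that contains an extra symbol such as N, would produce a different number of columns. A minimal sketch of one way around this, assuming you want the fixed order A, C, G, T and an all-zero row for anything else (matching the behaviour of the question's dna2onehot): pass the categories explicitly and fit the encoder once. The names fixed_encoder and encode_with_sklearn are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Pin the column order instead of inferring it from the data.
# handle_unknown="ignore" encodes any other character (e.g. "N") as an all-zero row.
fixed_encoder = OneHotEncoder(
    categories=[["A", "C", "G", "T"]],
    handle_unknown="ignore",
)

# Fit once; with explicit categories the fitted columns are always A, C, G, T.
fixed_encoder.fit(np.array(list("ACGT")).reshape(-1, 1))


def encode_with_sklearn(seq):
    """One-hot encode one sequence; returns a (len(seq), 4) float array."""
    seq_arr = np.array(list(seq.upper())).reshape(-1, 1)
    return fixed_encoder.transform(seq_arr).toarray()


example = encode_with_sklearn("GGNA")
print(example.shape)  # one row per character, four columns
```

Because the encoder is fitted only once, each call does just the transform step, and every sequence gets the same four columns regardless of its content.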

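
Since the question is mainly about speed, another option worth measuring is to drop the per-character Python loop and use a NumPy lookup table instead. This is a technique swapped in for illustration, not something from the answer above, and it is only a sketch under stated assumptions: the sequences are ASCII strings, any character other than A, C, G, T should become an all-zero row, and float32 output is acceptable (the question's code produces float64). The names BYTE_TO_INDEX, ONE_HOT_ROWS and dna_to_onehot_fast are hypothetical.

```python
import numpy as np

# Hypothetical lookup table: maps each byte value to a row index of ONE_HOT_ROWS.
# Index 4 is a catch-all for anything other than A, C, G, T.
BYTE_TO_INDEX = np.full(256, 4, dtype=np.uint8)
for column, base in enumerate(b"ACGT"):
    BYTE_TO_INDEX[base] = column

# Rows 0-3 are the one-hot vectors for A, C, G, T; row 4 is all zeros.
ONE_HOT_ROWS = np.vstack(
    [np.eye(4, dtype=np.float32), np.zeros((1, 4), dtype=np.float32)]
)


def dna_to_onehot_fast(dna_seq):
    """Return a flattened one-hot vector of length 4 * len(dna_seq)."""
    # View the upper-cased sequence as raw bytes (assumes ASCII input).
    codes = np.frombuffer(dna_seq.upper().encode("ascii"), dtype=np.uint8)
    # Two array lookups replace the loop: byte -> row index -> one-hot row.
    return ONE_HOT_ROWS[BYTE_TO_INDEX[codes]].ravel()


encoded = dna_to_onehot_fast("acgtn")
print(encoded.reshape(-1, 4))
```

To compare, swap dna_to_onehot_fast for dna2onehot in the question's timing loop. All of the per-base work here happens inside NumPy, so it may be considerably faster, but the actual gain depends on sequence length and hardware and should be measured on your own data.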