Python Django - 将数组与模型比较

Question

我有一个Oracle数据库，无法在里面添加新表，所以我在Django里创建了一个sqlite数据库，主要是为了把Oracle数据库里的数据同步到sqlite。

目前，Oracle数据库里大约有50万条数据。

Oracle数据库里的主键是递增的，但这并没有保证。有时候网络会出现问题，导致Django和Oracle数据库之间的连接中断，这样在同步的时候我就会漏掉一些值。

因此，我在Django里设计了一个模型：

class sequential_missing(models.Model):
    database = models.CharField(max_length=200, primary_key=True)
    row = models.IntegerField(primary_key=True)

基本上，数据库里有一行数据在Oracle那边是缺失的，我会对比sqlite数据库中缺失的连续值，找出在Oracle数据库里实际上是空缺的那个连续数字。这样可以加快处理速度，而不需要检查所有缺失的连续值。

整个功能的实现如下：

def checkMissing(maxValue, databaseObjects, databaseName):
    missingValues = []

    #############SECTION 1##########################
    print "Database:" + databaseName
    print "Checking for Missing Sequential Numbers"
    set_of_pk_values = set(databaseObjects.objects.all().values_list('pk', flat=True))
    set_one_to_max_value = set(xrange(1, maxValue+1))
    missingValues = set_one_to_max_value.difference(set_of_pk_values)
    #############SECTION 1##########################

    #Even though missingValues could be enough, but the problem is that not even Oracle can
    #guarantee the automatic incremented number is sequential, hence we would look up the values
    #we thought it was missing, and remove them from missingValues, which should be faster than
    #checking all of them in the oracle database

    #############SECTION 2##########################
    print "Checking for numbers that are empty, Current Size:" + str(len(missingValues))
    emptyRow = []
    for idx, val in enumerate(missingValues):
        found = False
        for items in sequential_missing.objects.all():
            if(items.row == val and items.database == databaseName):
                found = True
                #print "Database:" + str(items.row) + ", Same as Empty Row:" + str(val)
        if(found == True):
            emptyRow.append(val)
    #############SECTION 2##########################

    #############SECTION 3##########################
    print "Removing empty numbers, Current Size:" + str(len(missingValues)) + ", Empty Row:" + str(len(emptyRow))
    missingValuesCompared = []
    for idx, val in enumerate(missingValues):
        found = False
        for items in emptyRow:
            if(val == items):
                found = True
                #print "Empty Row:" + str(items) + ", same as Missing Values:" + str(val)
        if(found == False):
            missingValuesCompared.append(val)

    print "New Size:" + str(len(missingValuesCompared))
    return missingValuesCompared
    #############SECTION 3##########################

代码分为三个部分：

找出缺失的连续值
检查模型中的值，看看是否有匹配的，并且是相同的
创建一个新的数组，不包括第二部分中包含的行。

问题是第二部分的处理时间很长，复杂度是O(n^2)，因为它需要遍历整个数据库，检查这些行是否原本是空的。

有没有更快的方法来做到这一点，同时又能尽量少占用内存呢？

编辑：

使用ROW IN会好很多，

setItem = []
for items in missingValues:
    setItem.append(items)
print "Items in setItem:" + str(len(setItem))

currentCounter = 0
currentEndCounter = 500
counterIncrement = 500
emptyRowAppend = []
end = False
firstPass = False
while(end == False):
    emptyRow = sequential_missing.objects.filter(database=databaseName, row__in = setItem[currentCounter:currentEndCounter])
    for items in emptyRow:
        emptyRowAppend.append(items.row)
    if(firstPass == True):
        end = True
    if ((currentEndCounter+counterIncrement)>maxValue):
        currentCounter += counterIncrement
        currentEndCounter = maxValue
        firstPass = True
    else:
        currentCounter += counterIncrement
        currentEndCounter += counterIncrement


print "Removing empty numbers," + "Empty Row Append Size:" + str(len(emptyRowAppend)) + ", Missing Value Size:" + str(len(missingValues)) + ", Set Item Size:" + str(len(setItem)) +  ", Empty Row:" + str(len(emptyRowAppend))
missingValuesCompared = []
for idx, val in enumerate(missingValues):
    found = False
    for items in emptyRowAppend:
        if(val == items):
            found = True
            break
    if(found == False):
        missingValuesCompared.append(val)

数据处理数据模型 oracle数据库数据库同步 sqlite数据库缺失值检测行级操作复杂度优化

Python Django - 将数组与模型比较

1 个回答

撰写回答