Do I need multiple cursor objects to loop over a recordset and update at the same time?

Question

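I have a large database that I can't load into memory all at once. I need to process each item in a table one at a time, then put the processed data into another column of the same table.

While I'm looping over the data, if I try to run an update statement the recordset gets truncated (I believe because the cursor object is being reused).

Questions:

If I create a second cursor object to run the update statements, can I keep looping over the original select statement?

Do I need a second connection to the database in order to have a second cursor object that lets me do this?

How does sqlite respond to two connections to the database, one reading from the table and the other writing to it?

My code (simplified):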

import sqlite3

class DataManager():
    """ Manages database (used below). 
        I cut this class way down to avoid confusion in the question.
    """
    def __init__(self, db_path):
        self.connection = sqlite3.connect(db_path)
        self.connection.text_factory = str
        self.cursor = self.connection.cursor()

    def genRecordset(self, str_sql, subs=tuple()):
        """ Generate records as tuples, for str_sql.
        """
        self.cursor.execute(str_sql, subs)
        for row in self.cursor:
            yield row

select = """
            SELECT id, unprocessed_content 
            FROM data_table 
            WHERE processed_content IS NULL
         """

update = """
            UPDATE data_table
            SET processed_content = ?
            WHERE id = ?
         """
data_manager = DataManager(r'C:\myDatabase.db')
subs = []
for row in data_manager.genRecordset(select):
    id, unprocessed_content = row
    # processContent() is my own processing function (not shown here)
    processed_content = processContent(unprocessed_content)
    subs.append((processed_content, id))

    #every n records update the database (whenever I run out of memory)
    if len(subs) >= 1000:
        data_manager.cursor.executemany(update, subs)
        data_manager.connection.commit()
        subs = []
#update remaining records
if subs:
    data_manager.cursor.executemany(update, subs)
    data_manager.connection.commit()

Another approach I tried was to change my select statement to:

select = """
            SELECT id, unprocessed_content 
            FROM data_table 
            WHERE processed_content IS NULL
            LIMIT 1000
         """

And then I would do this:

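# execute() returns the cursor itself, which is always truthy,
# so fetch the rows and loop until a batch comes back empty
recordset = data_manager.cursor.execute(select).fetchall()
while recordset:
    #do update stuff...
    recordset = data_manager.cursor.execute(select).fetchall()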

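The problem I ran into is that my real select statement has a JOIN in it and is slow to execute, so running the JOIN that many times is very time-consuming. I'd like to speed things up by executing the select only once and then using a generator, so that I never have to hold all the data in memory.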

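Solution:

OK, the answer to my first two questions is "No". As for my third question, from what I saw, once a connection is made to the database it locks the entire database, so another connection can't do anything with it until the first connection is closed.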
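For anyone who wants to check the locking behavior on their own setup, here is a minimal sketch of an experiment. It assumes a file-based database in SQLite's default journal mode and Python's default transaction handling; the function name `probe_write_lock` and the table `probe` are made up for illustration. In this setup the second connection is refused only while the first one has an uncommitted write, so the lock seems tied to the open write transaction rather than to the connection merely existing.

```python
import sqlite3


def probe_write_lock(db_path):
    """Return (blocked, row_count).

    blocked   -- True if the second connection's write was refused while
                 the first connection had an uncommitted write
    row_count -- number of rows in the probe table at the end
    """
    first = sqlite3.connect(db_path)
    # timeout=0 makes the second connection fail at once instead of waiting
    second = sqlite3.connect(db_path, timeout=0)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS probe (x INTEGER)")
        first.commit()

        # This INSERT opens a write transaction that is not committed yet
        first.execute("INSERT INTO probe VALUES (1)")

        try:
            second.execute("INSERT INTO probe VALUES (2)")
            blocked = False
        except sqlite3.OperationalError:  # typically "database is locked"
            blocked = True
            second.rollback()

        # Committing ends the first connection's write transaction
        first.commit()

        if blocked:
            # The same write goes through once the lock has been released
            second.execute("INSERT INTO probe VALUES (2)")
        second.commit()

        row_count = first.execute("SELECT COUNT(*) FROM probe").fetchone()[0]
        return blocked, row_count
    finally:
        second.close()
        first.close()
```

Calling `probe_write_lock` on a scratch database file (not your real one) shows whether a concurrent write is refused and whether it succeeds after the commit.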

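I couldn't find the relevant source code, but from experience I believe a connection can only use one cursor object at a time, and the most recently run query takes precedence. This means that while I'm looping over the selected recordset, yielding one row at a time, my generator stops yielding rows as soon as I run the first update statement.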

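My solution is to create a temporary database and put the processed content and its id in there. That way I keep one connection/cursor object per database and can continue looping over the selected recordset while periodically inserting into the temporary database. Once I've gone through the whole recordset, I transfer the data from the temporary database back into the original one.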
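A rough sketch of that approach, as one possible reading of it rather than my exact code. The table and column names for `data_table` come from the question; everything else (the function name `process_via_temp_db`, the staging table `processed`, the alias `staging`, and copying the rows back with `ATTACH DATABASE`) is an assumption made for illustration.

```python
import sqlite3


def process_via_temp_db(main_path, temp_path, process, batch_size=1000):
    """Run the select once, stage results in a second database file,
    then copy them back into data_table. Returns the number of rows handled."""
    select = """
        SELECT id, unprocessed_content
        FROM data_table
        WHERE processed_content IS NULL
    """
    insert = "INSERT OR REPLACE INTO processed VALUES (?, ?)"

    main = sqlite3.connect(main_path)
    temp = sqlite3.connect(temp_path)
    total = 0
    try:
        # Hypothetical staging table: one row per processed id
        temp.execute(
            "CREATE TABLE IF NOT EXISTS processed "
            "(id INTEGER PRIMARY KEY, processed_content TEXT)"
        )
        temp.commit()

        pending = []
        # The main connection only reads here; all writes go to the temp file
        for row_id, content in main.execute(select):
            pending.append((row_id, process(content)))
            total += 1
            if len(pending) >= batch_size:
                temp.executemany(insert, pending)
                temp.commit()
                pending = []
        if pending:
            temp.executemany(insert, pending)
            temp.commit()
        temp.close()

        # The select is finished, so the main connection is free to write.
        # Attach the staging file and copy the results back in one statement.
        main.execute("ATTACH DATABASE ? AS staging", (temp_path,))
        main.execute("""
            UPDATE data_table
            SET processed_content = (
                SELECT p.processed_content
                FROM staging.processed AS p
                WHERE p.id = data_table.id
            )
            WHERE id IN (SELECT id FROM staging.processed)
        """)
        main.commit()
        main.execute("DETACH DATABASE staging")
        return total
    finally:
        temp.close()
        main.close()
```

With this layout the slow select (including any JOIN) runs only once, and at most `batch_size` processed rows are held in memory at a time.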

If anyone has definitive knowledge about connection/cursor objects, please let me know in a comment.
