我应该在数据库还是Python中进行更新?

2024-03-29 07:03:12 发布

您现在位置:Python中文网/ 问答频道 /正文

我最初的问题是关于如何有效地链接大量的记录。你们有些人真的帮了大忙!这是我发现的你的情况可能不同

“最好在数据库里做繁重的工作。”

我听过,也看过很多次错了!这个小小的文字记录了我的产品开发中的一个重大突破

你的应用程序追踪汽车的所有权,并将数据与之关联。以下是一些模型:

# There are ~2M OwnershipRecord models
class OwnershipRecord(models.Model):
  vin = models.CharField(...)
  saledate = models.DateField(...)
  active = models.BooleanField(...)

# There are ~1M MaintenanceRecord models
class MaintenanceRecord(models.Model):
  vin = models.CharField(...)
  ownership_record = models.ForeignKey(OwnershipRecord, ...)
  mileage = models.DecimalField(...)
  date_in_shop = models.DateField(...)

那么如何链接这些记录呢?我们关心日期和车辆识别号,对吗?例如,如果一辆车是:

  • 6月1日售出
  • 5月12日进行了维修
  • 维修于7月15日进行

然后这些MaintenanceRecord链接到两个独立的OwnershipRecord。你可以看到这将是一项密集的任务。我尝试过几种不同的链接方法

  1. 把每一份新的保养记录都放到芹菜上。使用query查找正确的OwnershipRecord并创建ForeignKey关系

    match = OwnershipRecord.objects.filter(vin=mx_vin, saledate__lt=date_in_shop).order_by('-saledate').latest('saledate')
    
  2. 使用新的django1.11工具SubqueryOuterRef对所有记录(或某些批处理子集)执行查询。卸到芹菜上

    possible_matches = OwnershipRecord.objects.filter(vin=OuterRef('vin'), saledate__lt=OuterRef('date_in_shop'))
    
    MaintenanceRecord.objects.filter(<subset_criteria>).update(ownership_record = Subquery(possible_matches)[:1])
    
  3. 在内存中执行该操作,可以根据需要以小批量或大批量进行。把这个放到芹菜上

    # this code performs the "batch chunking"
    id_extremes =     MaintenanceRecord.objects.filter(<subset_criteria>).aggregate(Min('id'),     Max('id')).values
    min_id = min(id_extremes)
    max_id = max(id_extremes)
    chunksize = 2000
    while True:
      # the .delay here is for Celery to pick up the job
      perform_linking_task.delay(min_id, min_id+chunksize)
      min_id += chunksize
      if min_id > max_id:
        break
    
    # this is how I linked
    def perform_linking_task(min_id, max_id):
      from collections import defaultdict  # These are great
    
      # I've ommitted the .values(...) bit of these queries for brevity
      mx_records = MaintenanceRecord.objects.filter(id__lte=min_id, id__gte=max_id)
      vin_list = [mrec.vin for mrec in maintenance_records]
      possible_matches = OwnershipRecord.objects.filter(vin__in=vin_list)
      orecs=defaultdict(list)
    
      # this is the magic - it makes a "hash table" of vins to records
      for possible in possible_matches:
        orec[possible['vin']].append(possible)
    
      for mrec in mx_records:
        current = None
        for orec in orecs:
          # Do matching logic here, put the best into current
        if current is not None:
          mrec.ownership_record = current['id']
          mrec.save(update_fields=['ownership_record'])
    

Tags: theinidforobjectsmodels记录filter