我最初的问题是关于如何有效地链接大量的记录。你们有些人真的帮了大忙!这是我发现的你的情况可能不同强>
“最好在数据库里做繁重的工作。”
我听过,也看过很多次哇错了!这个小小的文字记录了我的产品开发中的一个重大突破
你的应用程序追踪汽车的所有权,并将数据与之关联。以下是一些模型:
# There are ~2M OwnershipRecord models
class OwnershipRecord(models.Model):
vin = models.CharField(...)
saledate = models.DateField(...)
active = models.BooleanField(...)
# There are ~1M MaintenanceRecord models
class MaintenanceRecord(models.Model):
vin = models.CharField(...)
ownership_record = models.ForeignKey(OwnershipRecord, ...)
mileage = models.DecimalField(...)
date_in_shop = models.DateField(...)
那么如何链接这些记录呢?我们关心日期和车辆识别号,对吗?例如,如果一辆车是:
然后这些MaintenanceRecord
链接到两个独立的OwnershipRecord
。你可以看到这将是一项密集的任务。我尝试过几种不同的链接方法
把每一份新的保养记录都放到芹菜上。使用query查找正确的OwnershipRecord
并创建ForeignKey关系
match = OwnershipRecord.objects.filter(vin=mx_vin, saledate__lt=date_in_shop).order_by('-saledate').latest('saledate')
使用新的django1.11工具Subquery
和OuterRef
对所有记录(或某些批处理子集)执行查询。卸到芹菜上
possible_matches = OwnershipRecord.objects.filter(vin=OuterRef('vin'), saledate__lt=OuterRef('date_in_shop'))
MaintenanceRecord.objects.filter(<subset_criteria>).update(ownership_record = Subquery(possible_matches)[:1])
在内存中执行该操作,可以根据需要以小批量或大批量进行。把这个放到芹菜上
# this code performs the "batch chunking"
id_extremes = MaintenanceRecord.objects.filter(<subset_criteria>).aggregate(Min('id'), Max('id')).values
min_id = min(id_extremes)
max_id = max(id_extremes)
chunksize = 2000
while True:
# the .delay here is for Celery to pick up the job
perform_linking_task.delay(min_id, min_id+chunksize)
min_id += chunksize
if min_id > max_id:
break
# this is how I linked
def perform_linking_task(min_id, max_id):
from collections import defaultdict # These are great
# I've ommitted the .values(...) bit of these queries for brevity
mx_records = MaintenanceRecord.objects.filter(id__lte=min_id, id__gte=max_id)
vin_list = [mrec.vin for mrec in maintenance_records]
possible_matches = OwnershipRecord.objects.filter(vin__in=vin_list)
orecs=defaultdict(list)
# this is the magic - it makes a "hash table" of vins to records
for possible in possible_matches:
orec[possible['vin']].append(possible)
for mrec in mx_records:
current = None
for orec in orecs:
# Do matching logic here, put the best into current
if current is not None:
mrec.ownership_record = current['id']
mrec.save(update_fields=['ownership_record'])
以下是结果
显然,在数据库中执行繁重的操作并不总是最好的
相关问题 更多 >
编程相关推荐