如何优化我的SQLAlchemy实例?

0 投票
1 回答
38 浏览
提问于 2025-04-12 15:21

我正在使用SQLAlchemy来获取大量数据。简单来说,我们有一个“交付”表,这个表通过多个其他表相连,而这些表又可以连接到更多的表。我需要从这些相连的表中获取数据,以便创建一个交付的JSON文件。

目前,我访问这些数据的方式非常慢。有没有人能给我一些建议(无论是具体的还是一般性的)来提高性能?下面是我访问数据的代码,我不想把所有的表定义都贴出来,因为它们比较大。所有表的定义都是标准的,没有设置懒加载之类的。

# Get the deliveries for the specified date
    delivery_rs = session.query(Delivery).join(Order) \
        .filter(and_(Delivery.DespatchDateTime.between(start_date, end_date), Order.ProductionSite == site_map.get(site))).all()
    # Setup our rowcount for the metadata later
    rowcount = 0

    # Go through each delivery in the resultset, formulate the full job/delivery/client/customer data and add it to the data array
    for delivery in delivery_rs:
        # Add to our rowcount
        rowcount = rowcount + 1

        # Add the jobs to our job array
        job_deliveries = delivery.JobDeliveries
        jobs = []
        quantity = 0
        for job_delivery in job_deliveries:
            job = job_delivery.Job
            web_ref = job.ClientJobReference
            if web_ref and not re.match(r'^CCW_', web_ref):
                web_ref = "" 
            elif web_ref:
                web_ref = re.sub(r'^CCW_', '', web_ref)
            jobs.append({
                "web_ref":      "CCW_{}".format(web_ref) if web_ref else "",
                "name":         job.JobName,
                # The artwork is stored in S3, so provide a link
                "thumbnail":    "https://example.com/{}.png".format(web_ref) if web_ref else ""
            })

            quantity = quantity + job_delivery.Quantity

        # Format our delivery data
        if delivery.AddressContact:
            address_contact = delivery.AddressContact
            contact_data =  {
                        "title":    title_map.get(address_contact.Title),
                        "name":     address_contact.ContactName,
                        "email":    address_contact.ContactEmail,
                        "phone":    address_contact.ContactNumber
                    }
        else:
            contact_data = {}
        
        order = delivery.Order
        client = order.Client
        delivery_method = delivery.DeliveryMethod
        address = delivery.Address
        
        result["data"].append(
            {
                "order_number": order.OrderSequenceId,
                "quantity":     quantity,
                "method":       delivery_method.Name,
                "client":       client.Name,
                "end_client":   client.EndCustomer,
                "jobs":         jobs,
                "contact":      contact_data,
                "address": {
                    "business": address.BusinessName,
                    "postcode": address.PostCode,
                    "town":     address.Town,
                    "county":   address.County,
                    "country":  address.Country.Name,
                    "lines": [
                        address.AddressLine1,
                        address.AddressLine2
                    ]
                }
            }
        )

我还没有尝试太多,因为网上有很多关于优化SQLAlchemy的信息,但我不太确定在我的情况下什么方法会有效。收集大约100个交付的数据需要15秒(大多数交付都有多个工作)。我尝试过把每个表都连接到查询上,但结果没有返回任何数据。

1 个回答

0

预加载或者说急切加载所有相关的数据,可能会大大提高性能。否则,每次你访问一个没有加载的关系时,就会执行另一个查询,这样会很慢。

关于关系加载的选项

在你的例子中,使用 SQLAlchemy 2.0 的 select 查询方式,可能看起来像这样(未经测试):

    from sqlalchemy.sql import select
    q = select(Delivery).join(Order).where(and_(Delivery.DespatchDateTime.between(start_date, end_date), Order.ProductionSite == site_map.get(site)))
    q = q.options(
        # We are already doing a join, so just use that join to load in the related orders and then join again to get those clients.
        joinedload(Delivery.Order).joinedload(Order.Client)
        # Load all these separately.
        selectinload(Delivery.DeliveryMethod),
        selectinload(Delivery.Address),
        selectinload(Delvery.AddressContact),
        # But here get this association object and the job too.
        selectinload(Delivery.JobDeliveries).joinedload(JobDelivery.Job)
        )
    delivery_rs = session.scalars(q).all()

很难分辨哪些是类,哪些是关系,哪些是值,因为看起来所有的名称都是用大驼峰命名法,所以我不太确定我理解得对不对。

为了检查你是否获取了所有的关系,可以开启 echo=True 并启动你的脚本。所有的查询应该在开始时就执行,而不是在你的循环里面。你可以在里面放一个打印语句之类的。

撰写回答