使用基本库优化python代码

3条回答

网友

1楼 · 编辑于 2024-04-25 08:25:57

我认为这里需要一种不同的方法，因为将每个产品相互比较总是会得到O（n^2）的时间复杂度

我通过升序position_min（和降序position_max，以防万一）对产品列表进行排序，并从上面的答案中反转检查：我没有查看comp是否包含ref，而是做了相反的操作。通过这种方式，可以仅针对具有较高position_min的产品检查每个产品，并且在发现comp高于ref的position_max时停止搜索

为了测试这个解决方案，我生成了100个产品的随机列表，并运行了从上面答案复制的一个函数和基于我建议的一个函数。后者执行大约1000次比较，而不是10000次，并且根据timeit，尽管初始排序带来了开销，但它的速度大约快了4倍

代码如下：

##reference function
def f1(basedata):
    outd={}
    for ref in basedata:
        for comp in basedata:
            if ref == comp:
                continue
            elif ref[1] >= comp[1] and ref[2] <= comp[2]:
                if not outd.get(ref[0], False) or comp[3] < outd[ref[0]][1]:
                    outd[ref[0]] = (comp[0], comp[3])
    return outd

##optimized(?) function
def f2(basedata):
    outd={}
    sorteddata = sorted(basedata, key=lambda x:(x[1],-x[2]))
    runs = 0
    for i,ref in enumerate(sorteddata):
        toohigh=False
        j=i
        while j < len(sorteddata)-1 and not toohigh:
            j+=1
            runs+=1
            comp=sorteddata[j]
            if comp[1] > ref[2]:
                toohigh=True
            elif comp[2] <= ref[2]:
                if not outd.get(comp[0], False) or ref[3] < outd[comp[0]][1]:
                    outd[comp[0]] = (ref[0], ref[3])
    print(runs)
    return outd

网友

2楼 · 编辑于 2024-04-25 08:25:57

使用sqlite3内存数据库，可以将搜索移动到比建议的方法更优化的B树索引。下面的方法比原来的方法快30倍。对于生成的2M行文件，计算每个项目的结果需要44小时（原始方法约1200小时）

import csv
import sqlite3
import sys
import time

with sqlite3.connect(':memory:') as con:
    cursor = con.cursor()
    cursor.execute('CREATE TABLE products (id integer PRIMARY KEY, product text, position_min int, position_max int, count_pos int)')
    cursor.execute('CREATE INDEX idx_products_main ON products(position_max, position_min, count_pos)')

    with open('product.csv', 'r') as products_file:
        reader = csv.reader(products_file, delimiter=';')
        # Omit parsing first row in file
        next(reader)

        for row in reader:
            row_id = row[0][len('A.'):] if row[0].startswith('A.') else row[0];
            cursor.execute('INSERT INTO products VALUES (?, ?, ?, ?, ?)', [row_id] + row)

    con.commit()

    with open('product_p.csv', 'wb') as write_file:
        with open('product.csv', 'r') as products_file:
            reader = csv.reader(products_file, delimiter=';')
            # Omit parsing first row in file
            next(reader)

            for row in reader:
                row_product_id, row_position_min, row_position_max, count_pos = row
                result_row = cursor.execute(
                    'SELECT product, count_pos FROM products WHERE position_min <= ? AND position_max >= ? ORDER BY count_pos, id LIMIT 1',
                    (row_position_min, row_position_max)
                ).fetchone()

                if (result_row and result_row[0] == row_product_id):
                    result_row = cursor.execute(
                        'SELECT product, count_pos FROM products WHERE product != ? AND position_min <= ? AND position_max >= ? ORDER BY count_pos, id LIMIT 1',
                        (row_product_id, row_position_min, row_position_max)
                    ).fetchone()

                if (result_row):
                    write_file.write(f'{row_product_id};{result_row[0]};{result_row[1]}\n'.encode())

如果需要，可以使用线程进行进一步优化，例如，可以使用10个线程将结果过程优化为需要4-5小时

网友

3楼 · 编辑于 2024-04-25 08:25:57

我删除了一些未使用的库，并尽可能简化代码的行为

代码中最重要的对象是listinput_data，它存储来自输入csv文件的数据，以及dictout_dict，它存储比较的输出

简言之，代码的作用是：

将product.csv（无标题）读入列表input_data
通过input_data迭代，将每一行与另一行进行比较
- 如果参考产品范围在比较产品范围内，我们检查一个新条件：out_dict中是否有参考产品？
  如果是，我们将其替换为新的比较产品If它的count_pos
  如果不是，我们将添加比较产品
将out_dict中的信息写入输出文件product_p.csv，但仅针对具有有效比较产品的产品

这是：

import csv

input_data = []
with open('product.csv', 'r') as csv_in:
    reader = csv.reader(csv_in, delimiter=';')
    next(reader)
    for row in reader:
        input_data.append(row)


out_dict = {}
for ref in input_data:
    for comp in input_data:
        if ref == comp:
            continue
        elif int(ref[1]) >= int(comp[1]) and int(ref[2]) <= int(comp[2]):
            if not out_dict.get(ref[0], False) or int(comp[3]) < out_dict[ref[0]][1]:
                out_dict[ref[0]] = (comp[0], int(comp[3]))
                # print(f"In '{ref[0]}': placed '{comp[0]}'")


with open('product_p.csv', "w") as csv_out:
    ecriture = csv.writer(csv_out, delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL)
    for key, value in out_dict.items():
        if value[0]:
            ecriture.writerow([key, value[0]])

另外，我注释掉了一行print，它可以用一个只有几行的示例文件向您展示脚本正在做什么

注: 我相信你的预期产出是错误的。要么这样，要么我在解释中遗漏了什么。如果是这样，一定要告诉我。给出的代码考虑了这一点

从示例输入：

product;position_min;position_max;count_pos
A.16;167804;167870;20
A.18;167804;167838;15
A.15;167896;167768;18
A.20;238359;238361;33
A.35;167835;167837;8

预期产出将是：

"A.18";"A.16"
"A.15";"A.35"
"A.35";"A.18"

因为，对于“A.15”，“A.35”满足与“A.16”和“A.18”相同的条件，并且具有较小的count_pos

相关问题更多 >

编程相关推荐

热门问题

热门文章