Comparing two large files and merging matching information

Published 2024-04-25 21:08:30


I have two fairly large files, a JSON file (185,000 lines) and a CSV (650,000 lines). I need to iterate over each dict in the JSON file, then iterate over every part in that dict's part_numbers and compare it against the CSV to find the three-letter prefix recorded there for that part.

For some reason I'm having a hard time doing this. The first version of my script was far too slow, so I'm trying to speed it up.

Sample JSON:

[
    {"category": "Dryer Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Dryers"},
    {"category": "Washer Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Washers"},
    {"category": "Sink Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Sinks"},
    {"category": "Other Parts", "part_numbers": ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR"], "parent_category": "Others"}
]

CSV:

WCI|ABC
WPL|DEF
BSH|GHI
WCI|JKL

The end result should look like this:

{"category": "Other Parts",
 "part_numbers": ["WCIABC","WPLDEF","BSHGHI","WCIJKL"...]}

Here's a sample of what I have so far; it returns IndexError: list index out of range at the line if (part.rstrip() == row[1]):

import csv
import json
from multiprocessing import Pool

def find_part(item):
    data = {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': []
    }

    for part in item['part_numbers']:
        for row in reader:
            if (part.rstrip() == row[1]):
                data['part_numbers'].append(row[0] + row[1])

    with open('output.json', 'a') as outfile:
        outfile.write('    ')
        json.dump(data, outfile)
        outfile.write(',\n')


if __name__ == '__main__':
    catparts = json.load(open('catparts.json', 'r'))
    partfile = open('partfile.csv', 'r')
    reader = csv.reader(partfile, delimiter='|')


    with open('output.json', 'w+') as outfile:
        outfile.write('[\n')

    p = Pool(50)
    p.map(find_part, catparts)

    with open('output.json', 'a') as outfile:
        outfile.write('\n]')

3 Answers

This will work as long as all the part numbers exist in the CSV.

import json

# read part codes into a dictionary
with open('partfile.csv') as fp:
    partcodes = {}
    for line in fp:
        code, number = line.strip().split('|')
        partcodes[number] = code

with open('catparts.json') as fp:
    catparts = json.load(fp)

# modify the part numbers/codes 
for cat in catparts:
    cat['part_numbers'] = [partcodes[n] + n for n in cat['part_numbers']]

# output
with open('output.json', 'w') as fp:
    json.dump(catparts, fp)
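If a part number is missing from the CSV, the lookup above raises a KeyError. One way to tolerate missing parts is dict.get with a default; this is a minimal sketch of that idea, where the inline sample data is hypothetical and stands in for partfile.csv and catparts.json:

```python
import io

# Hypothetical in-memory stand-ins for partfile.csv and catparts.json
csv_text = "WCI|ABC\nWPL|DEF\nBSH|GHI\n"
catparts = [{"category": "Dryer Parts", "parent_category": "Dryers",
             "part_numbers": ["ABC", "DEF", "XYZ"]}]  # "XYZ" has no CSV entry

# read part codes into a dictionary, as above
partcodes = {}
for line in io.StringIO(csv_text):
    code, number = line.strip().split('|')
    partcodes[number] = code

for cat in catparts:
    # .get(n, '') leaves unmatched parts unprefixed instead of raising KeyError
    cat['part_numbers'] = [partcodes.get(n, '') + n for n in cat['part_numbers']]

print(catparts[0]['part_numbers'])  # ['WCIABC', 'WPLDEF', 'XYZ']
```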

As I said in a comment, your code (as posted) gives me a NameError: name 'reader' is not defined inside find_part(). The fix is to move the creation of the csv.reader into the function. I also changed how the file is opened, using a with context manager and the newline argument. This also fixes the problem of a bunch of separate tasks all trying to read the same CSV file at the same time.

Your approach is very inefficient, because it reads the entire 'partfile.csv' file for every part in item['part_numbers']. Regardless, the following seems to work:

import csv
import json
from multiprocessing import Pool

def find_part(item):
    data = {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': []
    }

    for part in item['part_numbers']:
        with open('partfile.csv', newline='') as partfile:  # open csv in Py 3.x
            for row in csv.reader(partfile, delimiter='|'):
                if part.rstrip() == row[1]:
                    data['part_numbers'].append(row[0] + row[1])

    with open('output.json', 'a') as outfile:
        outfile.write('    ')
        json.dump(data, outfile)
        outfile.write(',\n')

if __name__ == '__main__':
    catparts = json.load(open('catparts.json', 'r'))

    with open('output.json', 'w+') as outfile:
        outfile.write('[\n')

    p = Pool(50)
    p.map(find_part, catparts)

    with open('output.json', 'a') as outfile:
        outfile.write(']')

Here's a more efficient version that reads the entire 'partfile.csv' file only once per call to find_part(), instead of once per part:

import csv
import json
from multiprocessing import Pool

def find_part(item):
    data = {
        'parent_category': item['parent_category'],
        'category': item['category'],
        'part_numbers': []
    }

    with open('partfile.csv', newline='') as partfile:  # open csv for reading in Py 3.x
        partlist = [row for row in csv.reader(partfile, delimiter='|')]

    for part in item['part_numbers']:
        part = part.rstrip()
        for row in partlist:
            if row[1] == part:
                data['part_numbers'].append(row[0] + row[1])

    with open('output.json', 'a') as outfile:
        outfile.write('    ')
        json.dump(data, outfile)
        outfile.write(',\n')

if __name__ == '__main__':
    catparts = json.load(open('catparts.json', 'r'))

    with open('output.json', 'w+') as outfile:
        outfile.write('[\n')

    p = Pool(50)
    p.map(find_part, catparts)

    with open('output.json', 'a') as outfile:
        outfile.write(']')

While you could read the 'partfile.csv' data into memory in the main task and pass it as an argument to the find_part() subtasks, doing so just means the data would have to be pickled and unpickled for every process. You would need to run some timing tests to determine whether that's faster than explicitly reading it with the csv module, as shown above.

Also note that it would be more efficient to preprocess the data loaded from the 'catparts.json' file and strip the trailing whitespace from the part numbers before submitting the tasks to the Pool, since that avoids doing part = part.rstrip() over and over in find_part(). Again, I don't know whether it's worth the effort; only timing tests can answer that.
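That preprocessing step, done once before the Pool is created, could look like this (the sample list here is hypothetical):

```python
# Hypothetical sample standing in for the data loaded from catparts.json
catparts = [{'category': 'Dryer Parts', 'parent_category': 'Dryers',
             'part_numbers': ['ABC ', 'DEF\n', 'GHI']}]

# Strip trailing whitespace once, up front, so find_part() never has to
for item in catparts:
    item['part_numbers'] = [p.rstrip() for p in item['part_numbers']]

print(catparts[0]['part_numbers'])  # ['ABC', 'DEF', 'GHI']
```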

I think I found it. Your CSV reader behaves like many other file-access methods: you read the file sequentially until you hit EOF. When you try to do the same thing for the second part, the file is already at EOF, and the first read attempt returns an empty result, which has no second element.

If you want to visit all the records again, you need to reset the file pointer. The simplest way is to use

partfile.seek(0)

The alternative is to close and reopen the file.
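The exhausted-reader behaviour and the seek(0) fix can be demonstrated with an in-memory file (io.StringIO here stands in for the real partfile.csv):

```python
import csv
import io

partfile = io.StringIO("WCI|ABC\nWPL|DEF\n")  # stands in for partfile.csv

reader = csv.reader(partfile, delimiter='|')
first_pass = list(reader)    # reads every row, leaving the file at EOF
second_pass = list(reader)   # empty: the reader is exhausted

partfile.seek(0)             # rewind the underlying file
third_pass = list(csv.reader(partfile, delimiter='|'))

print(first_pass, second_pass, third_pass)
# [['WCI', 'ABC'], ['WPL', 'DEF']] [] [['WCI', 'ABC'], ['WPL', 'DEF']]
```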

Does that get you moving again?
