How do I aggregate records in Python?

**001** Math **02/20/2013** A
**001** Literature **03/02/2013** B
**002** Biology **01/01/2013** A
**003** Biology **04/08/2013** A
**001** Biology **05/01/2013** B
**002** Math **03/10/2013** C

1 answer

User

#1 · Posted on 2024-04-25 10:29:52

content = '''
   **001**     Math        **02/20/2013**  A
   **001**     Literature  **03/02/2013**  B
   **002**     Biology     **01/01/2013**  A
   **003**     Biology     **04/08/2013**  A
   **001**     Biology     **05/01/2013**  B
   **002**     Math        **03/10/2013**  C
'''

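from collections import defaultdict

# Split the raw text into lines and lazily tokenize the non-empty ones.
lines = content.split("\n")
items_iter = (line.split() for line in lines if line.strip())

# Map each student ID to a list of (class, grade, date) tuples.
aggregated = defaultdict(list)

for items in items_iter:
    # Strip the asterisks that surround the ID and the date.
    stud, class_, date, grade = (t.strip('*') for t in items)
    aggregated[stud].append((class_, grade, date))

# Print one line per student: ID, then '#'-separated "class;grade;date" entries.
for stud, data in aggregated.items():
    full_grades = [';'.join(items) for items in data]
    print('{},#{}'.format(stud, '#'.join(full_grades)))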

Output:

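001,#Math;A;02/20/2013#Literature;B;03/02/2013#Biology;B;05/01/2013
002,#Biology;A;01/01/2013#Math;C;03/10/2013
003,#Biology;A;04/08/2013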

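Of course, this is an ugly hack, meant only to show you how it can be done in Python. When processing large data streams, use generators and iterators: don't call file.readlines(), just iterate over the file object. An iterator does not read all the data at once; it reads it piece by piece as you iterate, and no earlier.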
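As a rough illustration of the "just iterate" advice, here is a minimal sketch that aggregates from any iterable of lines. The function name `aggregate_stream` is hypothetical, and the record layout (four whitespace-separated fields with asterisks around the ID and date) is assumed from the sample data above.

```python
from collections import defaultdict
import io


def aggregate_stream(lines):
    """Group (class, grade, date) tuples by student ID from an iterable of lines.

    `lines` can be any iterable of strings, including an open file object,
    so the input is consumed one line at a time.
    """
    aggregated = defaultdict(list)
    for line in lines:  # one line at a time; readlines() is never called
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines (an assumption of this sketch)
        stud, class_, date, grade = (f.strip('*') for f in fields)
        aggregated[stud].append((class_, grade, date))
    return aggregated


# An open file is itself a line iterator, so the same call would work as:
#     with open('all_records.txt') as fh:
#         result = aggregate_stream(fh)
# Here an in-memory stream stands in for the file.
sample = io.StringIO(
    "**001** Math **02/20/2013** A\n"
    "\n"
    "**002** Biology **01/01/2013** A\n"
    "**001** Biology **05/01/2013** B\n"
)
result = aggregate_stream(sample)
```

Note that only the input is read lazily here. The aggregated dictionary still grows with the number of records.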

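If you are worried about whether 200M records will fit in memory, do the following:

Sort the records into separate "buckets" by student ID (as in bucket sort)
cat all_records.txt | grep 001 > stud_001.txt # do it for the other students as well
Process each bucket
Merge the results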

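grep is just an example. Write a nicer script (awk or Python) that filters by student ID: for example, first take everything with ID < 1000, then 1000 <= ID < 2000, and so on. You can do this safely because each student's records are disjoint from every other student's.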
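The bucket idea can be sketched in Python roughly as follows. This is only one possible shape, not the answer's own script: the names `split_into_buckets`, `aggregate_bucket` and `BUCKET_SIZE` are hypothetical, student IDs are assumed to be numeric, and the sketch keeps one open file handle per bucket, which assumes the number of ID ranges is modest.

```python
import os
import tempfile
from collections import defaultdict

BUCKET_SIZE = 1000  # hypothetical width of each ID range: 0-999, 1000-1999, ...


def bucket_path(out_dir, bucket):
    return os.path.join(out_dir, 'bucket_{:05d}.txt'.format(bucket))


def split_into_buckets(lines, out_dir, bucket_size=BUCKET_SIZE):
    """Stream records into one file per student-ID range; return the file paths."""
    handles = {}
    try:
        for line in lines:
            fields = line.split()
            if len(fields) != 4:
                continue  # skip blank or malformed lines
            bucket = int(fields[0].strip('*')) // bucket_size
            if bucket not in handles:
                handles[bucket] = open(bucket_path(out_dir, bucket), 'w')
            handles[bucket].write(' '.join(fields) + '\n')
    finally:
        for fh in handles.values():
            fh.close()
    return [bucket_path(out_dir, b) for b in sorted(handles)]


def aggregate_bucket(path):
    """Aggregate a single bucket; only this bucket's records are held in memory."""
    aggregated = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            stud, class_, date, grade = (t.strip('*') for t in line.split())
            aggregated[stud].append((class_, grade, date))
    return aggregated


# Small demonstration with made-up records in the same layout as the sample data.
records = [
    "**001** Math **02/20/2013** A",
    "**1500** Math **03/10/2013** C",
    "**001** Biology **05/01/2013** B",
]
merged = []
with tempfile.TemporaryDirectory() as tmp:
    paths = split_into_buckets(records, tmp)
    for path in paths:  # process each bucket, then merge by appending
        for stud, data in sorted(aggregate_bucket(path).items()):
            merged.append('{},#{}'.format(stud, '#'.join(';'.join(d) for d in data)))
```

Because every record for a given student lands in exactly one bucket, the per-bucket results can simply be concatenated and no cross-bucket merging of a student's records is needed.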
