Grouping and subgrouping CSV data in Python
Here is a sample of my data set, in CSV format:
Column[1], Column[2], Account, CostCentre, Rate, Ex VAT, VAT
000000000, 00000000, 4200213, G1023, 0, 10.50, 0.0
000000000, 00000000, 4200213, G1023, 20, 10.50, 2.1
000000000, 00000000, 4200213, G1023, 0, 10.50, 0.0
000000000, 00000000, 4200213, G1023, 20, 10.50, 2.1
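I want to create an output file that focuses on the account number and groups the records by cost centre and tax rate. Any record with the account number 4200213 must be included in the output; all other records can be ignored.
Secondly, if a cost centre, say G1023, appears more than once, I want the Python script to check whether those records have the same tax rate. If they do, the output file should group them by rate and add up the total costs excluding and including VAT, so that the final result looks like this: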
Cost Centre, Rate, Ex VAT, VAT, In VAT
G1023, 0, 21, 0, 21
G1023, 20, 21, 4.2, 25.20
I have been trying to work this out, without success. My current code is as follows:
import os
import sys
import csv
os.path.dirname = "./"
InputFile_name = "Book1.csv"
InputFile = csv.reader(open(InputFile_name, "r"))
OutputFile_name = "Journal.csv"
OutputFile = open(OutputFile_name, "w")
mydict = []
OutputFile.write("Cost Centre, Tax Rate, Total Ex VAT, VAT, Total In VAT\n")
for line in InputFile:
    if line[2] == "4200213":
        Cost_Centre = line[3]
        Rate = line[4]
        Ex_VAT = line[5]
        VAT = line[6]
        if Cost_Centre in mydict:
            continue
        else:
            mydict.append(Cost_Centre)
        for item in mydict:
            if item in Cost_Centre and Rate == "0":
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
            elif item in Cost_Centre and Rate == "20":
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
        OutputFile.write(",".join([Cost_Centre,Rate,Ex_VAT,VAT,In_VAT+"\n"]))
OutputFile.close()
print "Finished."
sys.exit()
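The script runs, but the result is far from what I want. As you have probably gathered, I am not very familiar with Python, so I would appreciate it if you could not only point out my mistakes but also amend the script and give me the complete code, along with an explanation of what I did wrong.
3 Answers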
0
I have added some comments to your code:
import os
import sys  # not necessary (see comment below)
import csv
os.path.dirname = "./"  # not necessary (current directory is always $PWD)
# I would do:
InputFile = csv.reader(open("Book1.csv", "r"))
OutputFile = open("Journal.csv", "w")
mydict = []  # Okay, but you can also use set() (that's the structure you want in the end)
             # the name "mydict" is confusing (it's a list)
OutputFile.write("Cost Centre, Tax Rate, Total Ex VAT, VAT, Total In VAT\n")
for line in InputFile:
    if line[2] == "4200213":
        Cost_Centre = line[3]
        Rate = line[4]
        Ex_VAT = line[5]  # you mean float(line[5])
        VAT = line[6]  # you mean float(line[6])
        if Cost_Centre in mydict:
            continue
        else:
            mydict.append(Cost_Centre)
        for item in mydict:
            # Why do you have an if-else statement here? Both branches do exactly the same thing!
            # Why not delete this if-else statement?
            if item in Cost_Centre and Rate == "0":  # I guess you mean: item == Cost_Centre
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
            elif item in Cost_Centre and Rate == "20":  # I guess you mean: item == Cost_Centre
                Ex_VAT += Ex_VAT
                VAT+= VAT
                In_VAT = Ex_VAT + VAT
        # I would write
        # OutputFile.write(",".join([Cost_Centre,Rate,Ex_VAT,VAT,In_VAT]) +"\n")
        OutputFile.write(",".join([Cost_Centre,Rate,Ex_VAT,VAT,In_VAT+"\n"]))
OutputFile.close()
print "Finished."
sys.exit()  # not necessary
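In Python, names are usually written in lowercase (see this link or this link for the naming conventions for variables and functions).
As for your problem: you need to read all the rows first, and only then compute and write the final CSV file. The error is here (for example):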
if line[2] == "4200213":
    ...
    Ex_VAT = float(line[5])  # a new value is read from the current row
    ...
    Ex_VAT += Ex_VAT  # this always gives Ex_VAT * 2
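To make the doubling concrete, here is a minimal sketch (written for Python 3; the names `rows`, `doubled` and `total` are illustrative and do not come from the question). It contrasts the statement above with a running total kept in its own variable:

```python
# Ex VAT amounts as they might arrive from the CSV, already converted to float
rows = [10.50, 4.00, 2.25]

# What the original loop effectively does: the variable is overwritten
# by each new row and then added to itself.
for value in rows:
    doubled = value
    doubled += doubled  # value * 2; everything from earlier rows is lost

# Keeping the running total in a separate variable accumulates across rows.
total = 0.0
for value in rows:
    total += value

print(doubled)  # 4.5   (last row times two)
print(total)    # 16.75 (sum of all rows)
```

The same idea carries over to the real script: the accumulator has to live outside the per-row assignment, for example in a dictionary keyed by cost centre and rate.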
Update: here is my code:
import csv
from collections import defaultdict
from operator import add
class vector(tuple):
    def __add__(self, other):
        return vector(other) if len(self) == 0 else vector(map(add, self, other))
mydict = defaultdict(vector)
with open("data.csv", "r") as fd:
    for line in csv.reader(fd):
        line = [cell.strip() for cell in line]
        if line[2] == "4200213":
            mydict[line[3], line[4]] += float(line[5]), float(line[6])
with open("journal.csv", "w") as fd:
    writer = csv.writer(fd)
    writer.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])
    for k, v in mydict.items():
        writer.writerow(list(k) + list(v) + [sum(v)])
With comments:
import csv
from collections import defaultdict  # see https://docs.python.org/2/library/collections.html#collections.defaultdict
from operator import add  # add(x,y) == x + y
# for having vector( (1,2,3) ) + vector( (4,5,6) ) = vector( (5,7,9) )
# see https://stackoverflow.com/questions/2576296/using-python-tuples-as-vectors
# and look up operator overloading for Python on the internet
class vector(tuple):
    def __add__(self, other):
        # an empty vector acts as the neutral element, otherwise add element-wise
        return vector(other) if len(self) == 0 else vector(map(add, self, other))
# in the end this will be
# mydict = {
#     ("G1023","20"): vector((21.0,4.2)),
#     ("G1023","0"): vector((21.0,0.0))
# }
mydict = defaultdict(vector)
# === read csv file ===
with open("data.csv", "r") as fd:  # no need to call fd.close() at the end -> very handy ;-) + exception safe!
    for line in csv.reader(fd):
        line = [cell.strip() for cell in line]  # strip whitespace from all cells
        if line[2] == "4200213":
            mydict[line[3], line[4]] += float(line[5]), float(line[6])
# === write final csv file ===
with open("journal.csv", "w") as fd:
    writer = csv.writer(fd)
    writer.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])
    for k, v in mydict.items():
        writer.writerow(list(k) + list(v) + [sum(v)])  # output each line in the csv
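I suggest you read the code above slowly, line by line, until you understand how each part works (I have used some of Python's neat features). If there is anything you do not understand, look it up online. If you have questions, feel free to ask me in the comments or post a question on Stack Overflow.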
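If the combination of `defaultdict` and the `vector` class is the hard part, the following sketch isolates it (Python 3; `totals` and `key` are illustrative names, and the two tuples stand in for the Ex VAT and VAT values of two CSV rows):

```python
from collections import defaultdict
from operator import add

class vector(tuple):
    # Element-wise addition; an empty vector acts as the neutral element.
    def __add__(self, other):
        return vector(other) if len(self) == 0 else vector(map(add, self, other))

totals = defaultdict(vector)  # a missing key starts out as vector(), i.e. ()
key = ("G1023", "20")         # (cost centre, rate)

totals[key] += (10.50, 2.1)   # () + (10.5, 2.1) gives (10.5, 2.1)
totals[key] += (10.50, 2.1)   # element-wise sum with the stored vector

ex_vat, vat = totals[key]
print(ex_vat, vat)            # 21.0 4.2
```

Because `tuple` has no in-place addition, `+=` falls back to `__add__`, so every update stores a fresh `vector` under the key.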
1
Life is too short. This is exactly the kind of thing that libraries like pandas excel at. Here is the complete code:
import pandas as pd
df = pd.read_csv("tax.csv", skipinitialspace=True)
df = df[df["Account"] == 4200213]  # keep only the account the question asks about
d2 = df.groupby(["CostCentre", "Rate"])[["Ex VAT", "VAT"]].sum()
d2["IN VAT"] = d2["Ex VAT"] + d2["VAT"]
d2.reset_index().to_csv("taxout.csv", index=False)
This produces a new csv file that looks something like this:
CostCentre,Rate,Ex VAT,VAT,IN VAT
G1023,0,21.0,0.0,21.0
G1023,20,21.0,4.2,25.2
1
You can use itertools.groupby. I wrote the following code, although it is not very readable.
import csv
import itertools
csvreader = csv.reader(open("Book1.csv", "r"))
lines = [line for line in csvreader]
# Sort
lines = sorted(lines[1:], key = lambda x: (x[4], x[3], x[2]))
# Grouping
newRows = []
for grp in itertools.groupby(lines, key = lambda x: (x[2], x[3], x[4])):
    newRow = [0, 0] + list(grp[0]) + [0.0, 0.0, 0.0]
    for col in grp[1]:
        newRow[5] += float(col[5])
        newRow[6] += float(col[6])
        newRow[7] += float(col[5]) + float(col[6])
    newRows.append(newRow)
# Filtering and write csv
with open("Journal.csv", "w") as fp:
    csvwriter = csv.writer(fp)
    csvwriter.writerow(["Cost Centre", "Tax Rate", "Total Ex VAT", "VAT", "Total In VAT"])
    for r in filter(lambda x:x[2].strip() == "4200213", newRows):
        csvwriter.writerow(r[3:])
Hope this helps.
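A note on why the rows are sorted before grouping: `itertools.groupby` only merges items that sit next to each other, so unsorted input yields one group per run rather than one per key. A small sketch (Python 3; the `keys` list is made-up data mirroring the sample file's cost centre and rate columns):

```python
import itertools

# (cost centre, rate) pairs in file order; equal keys are not adjacent
keys = [("G1023", "0"), ("G1023", "20"), ("G1023", "0"), ("G1023", "20")]

# Without sorting, every change of key starts a new group.
unsorted_groups = [k for k, _ in itertools.groupby(keys)]

# After sorting, equal keys are adjacent and collapse into one group each.
sorted_groups = [k for k, _ in itertools.groupby(sorted(keys))]

print(len(unsorted_groups))  # 4
print(sorted_groups)         # [('G1023', '0'), ('G1023', '20')]
```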