嗨,我正在运行一个在github中找到的实用程序脚本 https://gist.github.com/Athmailer/4cdb424f03129248fbb7ebd03df581cd
更新1: 嗨,我修改了一点逻辑,这样就不再把csv分割成多个csv,而是创建一个包含多个拆分表的excel文件。下面是我的代码
import os
import csv
import openpyxl
import argparse
def find_csv_filenames( path_to_dir, suffix=".csv" ):
filenames = os.listdir(path_to_dir)
return [ filename for filename in filenames if filename.endswith( suffix ) ]
def is_binary(filename):
"""
Return true if the given filename appears to be binary.
File is considered to be binary if it contains a NULL byte.
FIXME: This approach incorrectly reports UTF-16 as binary.
"""
with open(filename, 'rb') as f:
for block in f:
if '\0' in block:
return True
return False
def split(filehandler, delimiter=',', row_limit=5000,
output_name_template='.xlsx', output_path='.', keep_headers=True):
class MyDialect(csv.excel):
def __init__(self, delimiter=','):
self.delimiter = delimiter
lineterminator = '\n'
my_dialect = MyDialect(delimiter=delimiter)
reader = csv.reader(filehandler, my_dialect)
index = 0
current_piece = 1
# Create a new Excel workbook
# Create a new Excel sheet with name Split1
current_out_path = os.path.join(
output_path,
output_name_template
)
wb = openpyxl.Workbook()
ws = wb.create_sheet(index=index, title="Split" + str(current_piece))
current_limit = row_limit
if keep_headers:
headers = reader.next()
ws.append(headers)
for i, row in enumerate(reader):
if i + 1 > current_limit:
current_piece += 1
current_limit = row_limit * current_piece
ws = wb.create_sheet(index=index, title="Split" + str(current_piece))
if keep_headers:
ws.append(headers)
ws.append(row)
wb.save(current_out_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Splits a CSV file into multiple pieces.',
prefix_chars='-+')
parser.add_argument('-l', '--row_limit', type=int, default=5000,
help='The number of rows you want in each output file. (default: 5000)')
args = parser.parse_args()
#Check if output path exists else create new output folder
output_path='Output'
if not os.path.exists(output_path):
os.makedirs(output_path)
with open('Logger.log', 'a+') as logfile:
logfile.write('Filename --- Number of Rows\n')
logfile.write('#Unsplit\n')
#Get list of all csv's in the current folder
filenames = find_csv_filenames(os.getcwd())
filenames.sort()
rem_filenames = []
for filename in filenames:
if is_binary(filename):
logfile.write('{} --- binary -- skipped\n'.format(filename))
rem_filenames.append(filename)
else:
with open(filename, 'rb') as infile:
reader_file = csv.reader(infile,delimiter=";",lineterminator="\n")
value = len(list(reader_file))
logfile.write('{} --- {} \n'.format(filename,value))
filenames = [item for item in filenames if item not in rem_filenames]
filenames.sort()
logfile.write('#Post Split\n')
for filename in filenames:
#try:
with open(filename, 'rb') as infile:
name = filename.split('.')[0]
split(filehandler=infile,delimiter=';',row_limit=args.row_limit,output_name_template= name + '.xlsx',output_path='Output')
我有一个名为“CSV文件”的文件夹,其中包含许多需要拆分的CSV文件。 我将这个实用程序脚本保存在同一个文件夹中
运行脚本时出现以下错误:
^{pr2}$有人能告诉我,如果我必须添加另一个for循环,去行中的每个单元格并将其附加到工作表中,还是可以一次性完成。而且我似乎已经使这个逻辑变得非常笨拙,这可以进一步优化。在
供参考的文件夹结构
必须将文件名作为命令行参数传递:
另外,我试着运行上面的脚本,不管我在什么CSS上运行它,它都不返回第一行(通常应该是头),尽管
keep_headers = True
。设置keep_headers = False
也会打印出标题行,这有点违反直觉。在此脚本用于读取单个CSV。如果要读取目录中的每个CSV,则需要创建另一个脚本来循环该目录中的所有文件。在
^{pr2}$相关问题 更多 >
编程相关推荐