Optimizing a Python script that reads a CSV file into a DataFrame

Published 2024-03-28 10:16:36


I wrote a Python script that reads a CSV file into a DataFrame, sorts it by two columns (date and incident), and then generates two columns X | Y from the data. The data consists of different incidents that must be processed independently.
For each Incident ID, X is a growing list of activities: it grows from index=0 up to index=len(listOfIncidents), and Y is listOfIncidents(index+1).
Suppose this is the data, already sorted and ready to be split into X and Y (irrelevant columns omitted for clarity):
[Sample data image]
The image below shows the resulting X and Y:
[Sample result image]
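In plain Python, the X/Y pairing described above looks like this (a minimal sketch with a made-up activity list):

```python
# Hypothetical list of activity types for one incident, in time order
incidents = ["Open", "Assign", "Resolve", "Close"]

# X grows one prefix at a time; Y is the next activity, or "NULL" at the end
pairs = [(incidents[:i + 1], incidents[i + 1] if i + 1 < len(incidents) else "NULL")
         for i in range(len(incidents))]

for x, y in pairs:
    print(x, "->", y)
```

The last pair has Y set to "NULL" because there is no activity after the final one.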

The CSV file has close to 500k rows, so first I convert the date column to a Unix timestamp (to make sorting easy) and then sort the DataFrame. I wrote a function create_input_output that builds X and Y and appends them to maindfObj.
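For reference, the per-row strptime/mktime conversion can be done in one vectorized call. A minimal sketch ('DateTime' is a hypothetical column name, in the same day/month/year format the script parses):

```python
import pandas as pd

# Tiny example frame; 'DateTime' is an assumed column name
df = pd.DataFrame({'DateTime': ['28/03/2024 10:16', '01/01/2020 00:00']})

# One vectorized call replaces the per-row strptime/mktime loop;
# the resulting datetime column can be sorted directly
df['Parsed'] = pd.to_datetime(df['DateTime'], format='%d/%m/%Y %H:%M')

# If a Unix timestamp is still wanted (note: this treats the naive
# timestamps as UTC, whereas time.mktime uses local time; for sorting
# the constant offset does not matter)
df['UnixEquiv'] = df['Parsed'].astype('int64') // 10**9
```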

import csv
import pandas as pd
import time
import datetime


# Function takes the computed IncidentActivity_Type and IncidentID as arguments
# Function Writes the IncidentActivity_Type into X and Y and create a CSV file using the IncidentID as file name
# Define the column Names
_ncols = ('X', 'Y')
# Define empty dataframe
maindfObj = pd.DataFrame([], columns=_ncols)
def create_input_output(xy, incident_id):
    global maindfObj
    # Build the (X, Y) pairs: X is the prefix xy[0:i+1], Y is element i+1
    values = []
    i = 0
    while i < len(xy):
        xList = xy[0:i + 1]
        if i + 1 < len(xy):
            # Insert into X and Y so that X gets indices 0..i and Y gets index i+1
            values.append((xList, xy[i + 1]))
        else:
            # We have reached the end of the list. Complete with NULL
            values.append((xList, "NULL"))
        i = i + 1
    # Write to dataframe
    subdfObj = pd.DataFrame(values, columns=_ncols)
    # DataFrame.append was removed in pandas 2.0; concat does the same job
    maindfObj = pd.concat([maindfObj, subdfObj])
    print("Appended "+ incident_id + ". New size: ") #print the size
    print(maindfObj.shape)
    # Create a filename
    #filename = incident_id + ".csv"
    # Write dataframe to file
    #subdfObj.to_csv(filename, sep=',')
    #print("Appended " + incident_id)

# Point to CSV file to be read. If file is in the same folder as python source just write filename
from google.colab import drive
drive.mount('/content/drive')
with open('/content/drive/My Drive/public/Incident_Activity.csv') as csv_file:
    # The file is comma-delimited; reading with ';' returns each line as a
    # single field, which is then split on ',' below
    csv_reader = csv.reader(csv_file, delimiter=';')
    line_count = 0
    # Create a new list to hold the read data rows
    _newarr = []
    # Create Column
    _columns = []
    for row in csv_reader:
        # First Row is the header
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
            _columns = row[0].split(',')
            _columns = list(_columns)
            # Appending new column UnixEquiv to hold computed date
            _columns.append('UnixEquiv')
        # Other Rows are the data
        else:
            # Split
            rowstr = row[0].split(',')
            # Convert to list
            rowstr = list(rowstr)
            # Convert dateTime to Unix Timestamp and append to last column
            rowstr.append(time.mktime(datetime.datetime.strptime(rowstr[1], "%d/%m/%Y %H:%M").timetuple()))
            # Append new row to _newarr
            _newarr.append(tuple(rowstr))
            line_count += 1
    # Convert List to dataframe
    dfObj = pd.DataFrame(_newarr, columns=_columns)
    # Sort dataframe by the IncidentID and Unix Timestamp calculated earlier
    dfObj = dfObj.sort_values(by=['Incident ID', 'UnixEquiv'])

    # Create temporary XY List
    _tempxy = []
    # Initialize active Incident ID to empty
    _activeIncident = ""
    # Routine separates the big data based on Incident ID
    for index, row in dfObj.iterrows():
        if row['Incident ID'] == _activeIncident or _activeIncident == "":
            # Incident ID has not changed
            _tempxy.append(row['IncidentActivity_Type'])
            _activeIncident = row['Incident ID']
        else:
            # Incident ID has changed.
            # Write the finished batch under its own ID before starting the next one
            create_input_output(_tempxy, _activeIncident)
            _activeIncident = row['Incident ID']
            # Clear old batch
            _tempxy.clear()
            # Start new batch
            _tempxy.append(row['IncidentActivity_Type'])
    # Flush the final batch, which the loop above never reaches
    if _tempxy:
        create_input_output(_tempxy, _activeIncident)
    # Dataframe is ready here. Use
    print(maindfObj.shape)  # print the size

The code works, but it takes far too long to finish. I'm sure people here who know better can help me optimize the script.
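One way to speed this up is to let pandas do the reading, parsing, and grouping instead of the row-by-row loops. A hedged sketch (the column names 'Incident ID' and 'IncidentActivity_Type' come from the script above; 'DateStamp' is an assumed name for the unnamed date column):

```python
import pandas as pd

def build_xy(path):
    # Read the whole file in one vectorized call instead of csv.reader + split
    df = pd.read_csv(path, sep=',')
    # Vectorized date parsing replaces the per-row strptime/mktime loop
    df['DateStamp'] = pd.to_datetime(df['DateStamp'], format='%d/%m/%Y %H:%M')
    df = df.sort_values(['Incident ID', 'DateStamp'])

    rows = []
    # groupby keeps the sorted order within each incident and avoids
    # the manual "did the ID change?" bookkeeping
    for _, group in df.groupby('Incident ID', sort=False):
        acts = group['IncidentActivity_Type'].tolist()
        for i in range(len(acts)):
            rows.append((acts[:i + 1],
                         acts[i + 1] if i + 1 < len(acts) else "NULL"))
    # Build the result DataFrame once instead of appending repeatedly
    return pd.DataFrame(rows, columns=('X', 'Y'))
```

Building the result list first and constructing one DataFrame at the end avoids the quadratic cost of repeatedly appending to maindfObj.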

