I wrote a Python script that reads a CSV file into a dataframe, sorts it by two columns (date and incident), and then generates two columns X | Y from the data. The data consists of distinct incidents that must be processed independently.
For each incident ID, the X column is a growing list of that incident's activities: it grows from index=0 up to index=len(listOfIncidents), and Y is listOfIncidents(index+1).
Assume this is the data, already sorted and ready to be split into X and Y (irrelevant columns omitted for clarity):
Sample Data Image
The image below shows the X and Y result:
Sample Result Image here
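To make the transformation concrete, here is a minimal sketch of the X/Y construction described above for one incident's ordered activity list (the activity names are made up for illustration). The last row's Y is "NULL" because there is no next activity:

```python
# Hypothetical activity types for a single incident, already in time order
activities = ["Open", "Assign", "Resolve", "Close"]

rows = []
for i in range(len(activities)):
    x = activities[:i + 1]  # X is the growing prefix up to index i
    # Y is the next activity, or "NULL" at the end of the list
    y = activities[i + 1] if i + 1 < len(activities) else "NULL"
    rows.append((x, y))

for x, y in rows:
    print(x, "->", y)
# ['Open'] -> Assign
# ['Open', 'Assign'] -> Resolve
# ['Open', 'Assign', 'Resolve'] -> Close
# ['Open', 'Assign', 'Resolve', 'Close'] -> NULL
```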
The CSV file has close to 500k rows, so first I convert the date column to a Unix timestamp (to make sorting easy) and then sort the dataframe. I created a function create_input_output that builds X and Y and appends them to maindfObj:
import csv
import pandas as pd
import time
import datetime

# The function takes the incident's ordered IncidentActivity_Type list and its
# IncidentID as arguments, writes the activity types into X and Y rows, and
# appends them to the main dataframe (optionally also writing a CSV file named
# after the IncidentID).

# Define the column names
_ncols = ('X', 'Y')
# Define an empty dataframe
maindfObj = pd.DataFrame([], columns=_ncols)

def create_input_output(xy, incident_id):
    global maindfObj
    # Collected (X, Y) rows for this incident
    values = []
    for i in range(len(xy)):
        # X gets the prefix up to index i; Y gets the element at index i + 1
        xList = xy[0:i + 1]
        if i + 1 < len(xy):
            values.append((xList, xy[i + 1]))
        else:
            # We have reached the end of the list: complete with NULL
            values.append((xList, "NULL"))
    # Write to dataframe
    subdfObj = pd.DataFrame(values, columns=_ncols)
    maindfObj = maindfObj.append(subdfObj)
    print("Appended " + incident_id + ". New size:")  # print the size
    print(maindfObj.shape)
    # Create a filename
    #filename = incident_id + ".csv"
    # Write dataframe to file
    #subdfObj.to_csv(filename, sep=',')
    #print("Appended " + incident_id)
# Point to the CSV file to be read. If the file is in the same folder as the
# Python source, just write the filename.
from google.colab import drive
drive.mount('/content/drive')

with open('/content/drive/My Drive/public/Incident_Activity.csv') as csv_file:
    # Read the file. Possible delimiters are , ; and whitespace
    csv_reader = csv.reader(csv_file, delimiter=';')
    line_count = 0
    # List to hold the data rows read from the file
    _newarr = []
    # Column names
    _columns = []
    for row in csv_reader:
        # The first row is the header
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
            _columns = list(row[0].split(','))
            # Append a new column UnixEquiv to hold the computed date
            _columns.append('UnixEquiv')
        # The other rows are data
        else:
            # Split the line into fields
            rowstr = list(row[0].split(','))
            # Convert the datetime to a Unix timestamp and append it as the last column
            rowstr.append(time.mktime(datetime.datetime.strptime(rowstr[1], "%d/%m/%Y %H:%M").timetuple()))
            # Append the new row to _newarr
            _newarr.append(tuple(rowstr))
            line_count += 1

# Convert the list to a dataframe
dfObj = pd.DataFrame(_newarr, columns=_columns)
# Sort the dataframe by the Incident ID and the Unix timestamp computed earlier
dfObj = dfObj.sort_values(by=['Incident ID', 'UnixEquiv'])

# Temporary XY list
_tempxy = []
# Initialize the active Incident ID to empty
_activeIncident = ""
# This routine separates the big dataframe into batches by Incident ID
for index, row in dfObj.iterrows():
    if row['Incident ID'] == _activeIncident or _activeIncident == "":
        # Incident ID has not changed
        _tempxy.append(row['IncidentActivity_Type'])
        _activeIncident = row['Incident ID']
    else:
        # Incident ID has changed. Write the finished batch under its own ID
        # before starting the next one
        create_input_output(_tempxy, _activeIncident)
        # Clear the old batch
        _tempxy.clear()
        # Start the new batch
        _activeIncident = row['Incident ID']
        _tempxy.append(row['IncidentActivity_Type'])
# Flush the last incident's batch
if _tempxy:
    create_input_output(_tempxy, _activeIncident)

# The dataframe is ready here. Use it:
print(maindfObj.shape)  # print the size
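For comparison, pandas can do the parsing, date conversion, and sort directly, without the per-row csv.reader loop. A minimal sketch, using an in-memory file and hypothetical column names ("Incident ID", "DateTime", "IncidentActivity_Type") standing in for the real header:

```python
import io
import pandas as pd

# Stand-in for the real CSV file; the column names are assumptions
csv_text = io.StringIO(
    "Incident ID,DateTime,IncidentActivity_Type\n"
    "IM0002,02/01/2016 10:00,Assign\n"
    "IM0001,01/01/2016 09:30,Assign\n"
    "IM0001,01/01/2016 09:00,Open\n"
)

df = pd.read_csv(csv_text)
# Vectorized datetime parsing replaces the per-row strptime/mktime calls
df["UnixEquiv"] = pd.to_datetime(df["DateTime"], format="%d/%m/%Y %H:%M")
df = df.sort_values(by=["Incident ID", "UnixEquiv"]).reset_index(drop=True)

print(df["Incident ID"].tolist())  # -> ['IM0001', 'IM0001', 'IM0002']
```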
The code works, but it takes far too long to finish. I know there are people here who know better and can help me optimize the script. Any ideas?
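For what it's worth, I suspect the slow parts are iterrows() and the repeated maindfObj.append (which copies the whole dataframe each time). A sketch of what a groupby-based version might look like, with made-up data in place of the real file, one concat at the end instead of repeated appends:

```python
import pandas as pd

# Made-up data standing in for the sorted dataframe
df = pd.DataFrame({
    "Incident ID": ["A", "A", "A", "B", "B"],
    "IncidentActivity_Type": ["Open", "Assign", "Close", "Open", "Close"],
})

parts = []
# groupby replaces the manual "active incident" batching; within-group
# order is preserved, so the activities stay time-sorted
for _, g in df.groupby("Incident ID", sort=False):
    acts = g["IncidentActivity_Type"].tolist()
    parts.append(pd.DataFrame({
        "X": [acts[: i + 1] for i in range(len(acts))],  # growing prefixes
        "Y": acts[1:] + ["NULL"],                        # next activity, NULL at end
    }))

# One concat instead of appending inside the loop
maindf = pd.concat(parts, ignore_index=True)
print(maindf.shape)  # -> (5, 2)
```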