Reading a Teradata table in Python is slow

Posted 2024-06-16 13:02:06


I am trying to read a big table from Teradata, and it is taking a long time. The table has 5 million rows and 60 columns, and it takes 30 minutes to load into memory. I am using the teradatasql package, whereas loading the same table in R with the RJDBC package takes about 5 minutes.

Python code (this takes 30 minutes):

import teradatasql
import pandas as pd

conn = teradatasql.connect(host=host, user=user_name, password=password, database=database)
df = pd.read_sql("SELECT * FROM big_table", conn)

R code (only about 3 minutes):

library(RJDBC)

# Teradata connection
con_tera <- dbConnect(drv_tera, "jdbc:teradata://{ip_host}/DATABASE=DBI_MIN,DBS_PORT=1025",
                      Sys.getenv("TERA_DB_USER"), Sys.getenv("TERA_DB_PASS"))

# create query
final_query <- 'select * from big_table'

# get data
dataset_caribu <- dbGetQuery(con_tera,final_query)

I tried increasing the cursor's arraysize in Python, but it did not improve the load time much.
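For reference, "increasing the arraysize" usually means setting the DB-API cursor's `arraysize` attribute, which `fetchmany()` uses as its default batch size. The sketch below is not from the original post; it demonstrates the pattern with `sqlite3` standing in for `teradatasql`, since both expose the same DB-API 2.0 cursor interface, and the batch size of 10 is illustrative only:

```python
import sqlite3

def fetch_in_batches(cur, query, batch_size):
    """Stream rows from any DB-API cursor in fixed-size batches."""
    cur.arraysize = batch_size        # default row count for fetchmany()
    cur.execute(query)
    while True:
        batch = cur.fetchmany()       # returns up to cur.arraysize rows
        if not batch:
            return
        yield batch

# Demo: sqlite3 stands in for a teradatasql connection here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (c1 INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(25)])
total = sum(len(b) for b in fetch_in_batches(con.cursor(), "SELECT * FROM t", 10))
print(total)  # prints 25
```

With a real driver the same loop lets you process rows as they arrive instead of materializing all 5 million at once, though as the answer below shows, the bigger cost here is elsewhere.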


1 Answer

pandas.read_sql is slower than fetching the rows directly with the teradatasql driver.

Below is a simple Python script I used to test with 5 million rows and 60 columns, where 80% of the column values are non-null and 20% are null:

import time

import pandas
import teradatasql

with teradatasql.connect(host="whomooz", user="guest", password="please") as con:
    with con.cursor() as cur:
        # Volatile table with 60 integer columns.
        cur.execute("create volatile table voltab ("
                    + ",".join(["c{} integer".format(n) for n in range(1, 61)])
                    + ") on commit preserve rows")
        # Seed 62,500 rows, then repeatedly double to reach 5,000,000 rows.
        cur.execute("insert into voltab(c1) select row_number() over (order by calendar_date) as c1 from sys_calendar.calendar qualify c1 <= 62500")
        cur.execute("insert into voltab(c1) select c1 + 62500 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 125000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 250000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 500000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 1000000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 2000000 from voltab")
        cur.execute("insert into voltab(c1) select c1 + 4000000 from voltab where c1 <= 1000000")
        # Populate c2..c48 so 48 of 60 columns (80%) are non-null;
        # c49..c60 stay null (the remaining 20%).
        cur.execute("update voltab set " + ",".join(["c{} = c1".format(n) for n in range(2, 49)]))

        # Time a plain cursor fetchall.
        cur.execute("select * from voltab")
        print("beginning fetchall")
        dStartTime = time.time()
        rows = cur.fetchall()
        dElapsed = time.time() - dStartTime
        print("fetchall took {} seconds, or {} minutes, and returned {} rows".format(dElapsed, dElapsed / 60, len(rows)))

        # Time the same query through pandas.read_sql.
        dStartTime = time.time()
        df = pandas.read_sql("select * from voltab", con)
        dElapsed = time.time() - dStartTime
        print("read_sql took {} seconds, or {} minutes, and returned {} rows".format(dElapsed, dElapsed / 60, len(df)))

My results were:

fetchall took 638.6090559959412 seconds, or 10.64348426659902 minutes, and returned 5000000 rows
read_sql took 2293.84486413002 seconds, or 38.23074773550034 minutes, and returned 5000000 rows
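Since the raw fetchall is several times faster here, one workaround (a sketch, not part of the original answer) is to fetch the rows with the cursor yourself and build the DataFrame from them, taking the column names from the DB-API `cursor.description` tuples. The demo uses `sqlite3` as a stand-in so it runs without a Teradata host; the same helper works with a `teradatasql` cursor:

```python
import sqlite3

import pandas as pd

def df_from_cursor(cur):
    """Build a DataFrame from an already-executed DB-API cursor,
    taking column names from cursor.description."""
    columns = [d[0] for d in cur.description]
    return pd.DataFrame.from_records(cur.fetchall(), columns=columns)

# Demo: sqlite3 stands in for a teradatasql connection here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE voltab (c1 INTEGER, c2 INTEGER)")
con.executemany("INSERT INTO voltab VALUES (?, ?)", [(i, i) for i in range(5)])
cur = con.cursor()
cur.execute("SELECT * FROM voltab")
df = df_from_cursor(cur)
print(df.shape)  # prints (5, 2)
```

This avoids the per-row overhead that pandas.read_sql adds on top of the driver, at the cost of pandas inferring the dtypes from the fetched Python objects rather than from the database schema.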
