Recently I have been using Pandas for a project and saving the calculation results to MS SQL Server with SQLAlchemy. Reading from the database and doing the calculation are fast; however, saving the results back with to_sql is extremely slow. It takes minutes to save about 50,000 rows. I googled and found a few discussions like this one:
http://stackoverflow.com/questions/29706278/python-pandas-to-sql-with-sqlalchemy-how-to-speed-up-exporting-to-ms-sql
Bulk insert is not an option because I don't have the necessary permissions on the server.
As discussed above, the slow insert is the problem. Since the bottleneck is I/O rather than CPU, we can use multithreading to work around it.
Here is the code that wraps Pandas's to_sql:
# Begin code
import threading

def tosql(df, *args, **kargs):
    CHUNKSIZE = 1000
    INITIAL_CHUNK = 100
    if len(df) <= CHUNKSIZE:
        # Small frame: no point in spawning threads.
        df.to_sql(*args, **kargs)
        return
    # Write a small initial chunk first so the table is created (or replaced)
    # exactly once, before the worker threads start writing to it.
    df.iloc[:INITIAL_CHUNK, :].to_sql(*args, **kargs)
    # After the first chunk the table exists, so every later chunk must append.
    kargs['if_exists'] = 'append'
    workers = []
    n_chunks = (len(df) - INITIAL_CHUNK) // CHUNKSIZE
    for i in range(n_chunks):
        # Slice the chunk before creating the thread; a bare lambda would read
        # the loop variable i only when the thread runs, so several threads
        # could end up writing the same chunk while others are skipped.
        chunk = df.iloc[INITIAL_CHUNK + i * CHUNKSIZE:
                        INITIAL_CHUNK + (i + 1) * CHUNKSIZE, :]
        t = threading.Thread(target=chunk.to_sql, args=args, kwargs=kargs)
        t.start()
        workers.append(t)
    # Save the remainder (rows left over when the row count is not an exact
    # multiple of CHUNKSIZE) in the main thread, then wait for the workers.
    df.iloc[INITIAL_CHUNK + n_chunks * CHUNKSIZE:, :].to_sql(*args, **kargs)
    [t.join() for t in workers]
# end code
source file: https://github.com/haolu23/misc.git
The logic is straightforward. CHUNKSIZE can be changed to fit your needs or passed in as an argument. The time drops from about 5 minutes to about 1 minute when saving to a remote MS SQL Server. I also tested on my local PostgreSQL server at home; the time saved there is less significant: from 45 seconds to 33 seconds for a table of about 42,000 rows.
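For reference, here is a minimal usage sketch. The connection string, engine, and table name are placeholders for illustration, not from the original post:
# Begin code
from sqlalchemy import create_engine
import pandas as pd

# Hypothetical connection string and table name, for illustration only.
engine = create_engine('mssql+pyodbc://user:password@my_dsn')
df = pd.DataFrame({'a': range(50000), 'b': range(50000)})

# Same positional and keyword arguments you would normally pass to
# DataFrame.to_sql; the wrapper forwards them to every chunk.
tosql(df, 'results_table', engine, if_exists='replace', index=False)
# end code
Because a SQLAlchemy Engine maintains a connection pool, each worker thread's to_sql call should check out its own connection, which is what allows the chunks to be written concurrently.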
Great use of multithreading. Thumbs up!
For a dataset of dimension 5218x20, CHUNKSIZE = 500, a remote Oracle DB, and a target table with no constraints, below are my observations.
Time taken:
------------
Without multithreading : 734 s
With multithreading : 69 s
Improvement : 90%
For me it is really slow. Any idea why?
Save the remainder of the rows if the number of rows is not a multiple of CHUNKSIZE.
ReplyDeleteThis comment has been removed by the author.
ReplyDelete