Tuesday, October 13, 2015

speed up pandas to_sql with multithreading

Recently I have been using pandas for a project, saving calculation results to an MS SQL Server database with SQLAlchemy. Reading from the database and doing the calculation are fast; however, saving the results back with to_sql is extremely slow. It takes minutes to save about 50,000 rows. I googled and found a few discussions, like:

http://stackoverflow.com/questions/29706278/python-pandas-to-sql-with-sqlalchemy-how-to-speed-up-exporting-to-ms-sql

Bulk insert is not an option for me because I don't have the necessary permission on the server.
As discussed in that thread, the slow row-by-row insert is the problem. Since the bottleneck is I/O, we can use multithreading: even with Python's GIL, threads help here because each insert spends most of its time blocked on network round trips, during which the GIL is released.

Here is the code to wrap Pandas's to_sql:

# Begin code

import threading

def tosql(df, *args, **kwargs):
    # Tune these to your needs, or pass them in as arguments.
    CHUNKSIZE = 1000
    INITIAL_CHUNK = 100
    if len(df) <= CHUNKSIZE:
        df.to_sql(*args, **kwargs)
        return
    # Write a small chunk synchronously first, so the table is created
    # ('replace' or 'fail' happens exactly once) before the threads run.
    df.iloc[:INITIAL_CHUNK, :].to_sql(*args, **kwargs)
    # The table now exists, so every remaining chunk must append.
    kwargs['if_exists'] = 'append'
    workers = []
    n_chunks = (len(df) - INITIAL_CHUNK) // CHUNKSIZE
    for i in range(n_chunks):
        start = INITIAL_CHUNK + i * CHUNKSIZE
        # Bind start as a default argument; a bare closure over the
        # loop variable would see its final value in every thread.
        t = threading.Thread(
            target=lambda start=start:
                df.iloc[start:start + CHUNKSIZE, :].to_sql(*args, **kwargs))
        t.start()
        workers.append(t)
    # Write the remainder (rows left over when the row count is not a
    # multiple of CHUNKSIZE) in the current thread.
    df.iloc[INITIAL_CHUNK + n_chunks * CHUNKSIZE:, :].to_sql(*args, **kwargs)
    for t in workers:
        t.join()

# end code

source file: https://github.com/haolu23/misc.git
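
For reference, a call might look like the sketch below. The connection string, table name, and DataFrame are placeholders of mine; substitute your own. Note that passing a SQLAlchemy engine (rather than a single raw connection) matters for the threaded version: the engine's connection pool lets each thread's to_sql check out its own connection.

# Begin code

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and table name -- substitute your own.
engine = create_engine('mssql+pyodbc://user:password@my_dsn')
results = pd.DataFrame({'a': range(50000), 'b': range(50000)})

# Same call signature as DataFrame.to_sql, with the DataFrame in front.
tosql(results, 'results_table', engine, if_exists='replace', index=False)

# end code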

The logic is straightforward. CHUNKSIZE can be changed to fit your needs, or passed in as an argument. The time to save to the remote MS SQL Server drops from about 5 minutes to about 1 minute. I also tested against my local PostgreSQL server at home; the saving there is less significant: from 45 seconds to 33 seconds for a table of about 42,000 rows.
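
For instance, here is a minimal sketch of the second option, with the sizes exposed as keyword-only arguments; the parameter names chunksize and initial_chunk are my own, not part of the original function.

# Begin code

def tosql(df, *args, chunksize=1000, initial_chunk=100, **kwargs):
    # Same body as above, with the CHUNKSIZE and INITIAL_CHUNK
    # constants replaced by the chunksize and initial_chunk parameters.
    ...

# Callers can then tune the chunk size per table, e.g.:
# tosql(results, 'results_table', engine, chunksize=500,
#       if_exists='replace', index=False)

# end code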


3 comments:

  1. Great use of multithreading. Thumbs up!

    For a dataset of dimensions 5218x20, CHUNKSIZE = 500, an Oracle DB at a remote location, and a target table with no constraints, below are my observations.

    Time taken:
    ------------
    Without multithreading: 734 s
    With multithreading: 69 s

    Improvement: 90%

  2. For me it is really slow. Any idea why?

  3. Save the remainder of the rows if the number of rows is not a multiple of CHUNKSIZE.
