Tuesday, October 13, 2015

speed up pandas to_sql with multithreading

Recently I have been using pandas for a project, saving calculation results to an MS SQL Server database with SQLAlchemy. Reading from the database and doing the calculation are fast; however, saving the results back with to_sql is extremely slow. It takes minutes to save about 50,000 rows. I googled and found a few discussions, like:

http://stackoverflow.com/questions/29706278/python-pandas-to-sql-with-sqlalchemy-how-to-speed-up-exporting-to-ms-sql

Bulk insert is not an option for me because I don't have the necessary permission on the server.
As discussed in that thread, the slow row-by-row insert is the problem. Since the bottleneck is I/O, we can use multithreading: even with Python's GIL, threads help here because each insert spends most of its time blocked on network round trips, during which the GIL is released.

Here is the code to wrap Pandas's to_sql:

# Begin code

import threading

def tosql(df, *args, **kwargs):
    # Tune these to your needs, or pass them in as arguments.
    CHUNKSIZE = 1000
    INITIAL_CHUNK = 100
    if len(df) <= CHUNKSIZE:
        df.to_sql(*args, **kwargs)
        return
    # Write a small chunk synchronously first, so the table is created
    # ('replace' or 'fail' happens exactly once) before the threads run.
    df.iloc[:INITIAL_CHUNK, :].to_sql(*args, **kwargs)
    # The table now exists, so every remaining chunk must append.
    kwargs['if_exists'] = 'append'
    workers = []
    n_chunks = (len(df) - INITIAL_CHUNK) // CHUNKSIZE
    for i in range(n_chunks):
        start = INITIAL_CHUNK + i * CHUNKSIZE
        # Bind start as a default argument; a bare closure over the
        # loop variable would see its final value in every thread.
        t = threading.Thread(
            target=lambda start=start:
                df.iloc[start:start + CHUNKSIZE, :].to_sql(*args, **kwargs))
        t.start()
        workers.append(t)
    # Write the remainder (rows left over when the row count is not a
    # multiple of CHUNKSIZE) in the current thread.
    df.iloc[INITIAL_CHUNK + n_chunks * CHUNKSIZE:, :].to_sql(*args, **kwargs)
    for t in workers:
        t.join()

# end code

source file: https://github.com/haolu23/misc.git
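
For reference, a call might look like the sketch below. The connection string, table name, and DataFrame are placeholders of mine; substitute your own. Note that passing a SQLAlchemy engine (rather than a single raw connection) matters for the threaded version: the engine's connection pool lets each thread's to_sql check out its own connection.

# Begin code

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and table name -- substitute your own.
engine = create_engine('mssql+pyodbc://user:password@my_dsn')
results = pd.DataFrame({'a': range(50000), 'b': range(50000)})

# Same call signature as DataFrame.to_sql, with the DataFrame in front.
tosql(results, 'results_table', engine, if_exists='replace', index=False)

# end code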

The logic is straightforward. CHUNKSIZE can be changed to fit your needs, or passed in as an argument. The time to save to the remote MS SQL Server drops from about 5 minutes to about 1 minute. I also tested against my local PostgreSQL server at home; the saving there is less significant: from 45 seconds to 33 seconds for a table of about 42,000 rows.
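
For instance, here is a minimal sketch of the second option, with the sizes exposed as keyword-only arguments; the parameter names chunksize and initial_chunk are my own, not part of the original function.

# Begin code

def tosql(df, *args, chunksize=1000, initial_chunk=100, **kwargs):
    # Same body as above, with the CHUNKSIZE and INITIAL_CHUNK
    # constants replaced by the chunksize and initial_chunk parameters.
    ...

# Callers can then tune the chunk size per table, e.g.:
# tosql(results, 'results_table', engine, chunksize=500,
#       if_exists='replace', index=False)

# end code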


3 comments:

  1. Great use of multithreading. Thumbs up!

    For a dataset of dimensions 5218x20, CHUNKSIZE = 500, an Oracle DB at a remote location, and a target table with no constraints, below are my observations.

    Time taken:
    ------------
    Without multithreading: 734 s
    With multithreading: 69 s

    Improvement: 90%

  2. For me it is really slow. Any idea why?

  3. Save the remainder of the rows if the number of rows is not a multiple of CHUNKSIZE.
