
[Python] Writing Pandas dataframe to MS SQL Server is too slow even with fast parameter options

Discussion in 'Python' started by Stack, October 3, 2024 at 22:32.

  1. Stack

    Stack (Participating Member)

    I've been trying to insert a relatively small Pandas Dataframe (~200K records) to Azure Synapse. For this purpose I've tried a bunch of different methods and approaches, revolving around PYODBC and SQLAlchemy APIs.

    However, no matter the approach, I notice that the write speed of my dataframe to SQL Server is only ~10 records/second. This is obviously not viable and completely unexpected, but above all, its cause is a mystery to me.

    I've spent a good amount of time searching the web for resources on how to increase the write speed, and found a bunch of people raising similar issues and asking for advice.

    I'm aware of the fast_executemany option when creating the SQLAlchemy engine, which is supposed to speed up writes considerably, but as you can see, even with it the results are not desirable.

    I've also tried using executemany on a cursor, but it gives exactly the same performance.
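
    A sketch of that cursor route (not my exact code; it assumes the conn connection and df dataframe from the full example below, and the same two-column example table):

    cursor = conn.cursor()
    cursor.fast_executemany = True  # batch parameters client-side instead of one round trip per row

    rows = list(df.itertuples(index=False, name=None))
    cursor.executemany(
        "INSERT INTO schema.databaseTable (userid, col1) VALUES (?, ?)",
        rows,
    )
    conn.commit()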

    Another attempt I made was with turbodbc, which also yielded the same performance.
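
    Roughly what the turbodbc route looked like (again a sketch, not my exact code; it assumes a configured DSN named MY_DSN and the same example df; executemanycolumns takes one NumPy array per column, with the string column passed as an object array):

    import turbodbc

    tconn = turbodbc.connect(dsn="MY_DSN", uid="USER", pwd="PASS")
    tcur = tconn.cursor()
    tcur.executemanycolumns(
        "INSERT INTO schema.databaseTable (userid, col1) VALUES (?, ?)",
        [df["userid"].to_numpy(dtype=object), df["col1"].to_numpy()],
    )
    tconn.commit()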

    My dataframe has 11 columns, all being float except one.

    From what I've read in the documentation, the fast_executemany option is supposed to attempt to fit the entire dataframe in memory; however, during the operation I notice that my computer's memory consumption doesn't change.

    I'm truly at a loss here. Any ideas on the cause of this, and some remedy?

    Here's my code:

    import pandas as pd
    import sqlalchemy
    import pyodbc
    from sqlalchemy.engine import URL, create_engine

    conn = pyodbc.connect('Driver=ODBC Driver 17 for SQL Server;'
                          'Server=SERVERNAME;'
                          'Database=DATABASENAME;'
                          'MARS_Connection=yes;'
                          'UID=USER;'
                          'PWD=PASS;')

    connstring = "Driver={ODBC Driver 17 for SQL Server};Server=SERVERNAME;Database=DATABASENAME;UID=USER;PWD=PASS;"

    conurl = URL.create("mssql+pyodbc", query={"odbc_connect":connstring,'autocommit':'True'})

    dbEngine = sqlalchemy.create_engine(conurl, fast_executemany=True)

    #for the sake of this example I only have 2 columns
    d = {'userid': ['21005395', '20101499'], 'col1': [0.1, 0.25]}
    df = pd.DataFrame(data=d)

    #Even when only inserting 1K rows, the time to complete is ~130secs
    df.head(1000).to_sql(
        name='databaseTable',
        con=dbEngine,
        schema='schema',
        method=None,
        index=False,
        chunksize=500,
        dtype={'userid': sqlalchemy.types.NVARCHAR(10)},
        if_exists='replace',
    )


    My environment, using Anaconda 2.6.3:

    Python: 3.10.14
    pyodbc: 5.1.0
    pandas: 2.2.3
    sqlalchemy: 2.0.34

