Intro
You have a bunch of data and need to load it into SQL Server using Python. With the abundance of options out there, it's difficult to know which tool is the best fit. Let's approach the problem with two requirements: the solution needs to be easy to implement and blazing fast.
Tooling
Initially, bcpandas looked like a great solution for this task. While it does meet our two requirements, it becomes very difficult to work with when the data contains embedded separators or other odd strings. If the data you work with is anything like mine, it is almost never perfect, so I quickly pivoted away from bcpandas due to its inability to handle imperfect data.
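To illustrate, here is a small made-up example (the column names and values are hypothetical) of the kind of row that trips up a plain delimiter-based bulk copy but that pandas parses without issue:
import io
import pandas as pd

# A quoted field that contains the delimiter itself; delimiter-splitting bulk
# copy tools break this row apart, while pandas honors the quoting.
raw = 'id,address,price\n1,"12 Main St, Apt B",250000\n'
df = pd.read_csv(io.StringIO(raw))
print(df)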
To get this job done we are going to use pandas, SQLAlchemy, and pyodbc. pandas is the Python go to for most data operations. It allows us to easily read in a large dataset and transform it as needed. pyodbc provides a fast easy way to connect to our SQL Server database. Using SQLAlchemy we have a robust toolset at our disposal for interacting with a database.
Easy to Implement
Getting our tooling up and running is a very simple exercise, especially if SQL Server is installed on the machine you are running the Python script on. All we need from the SQL Server side of things is the Microsoft ODBC Driver for SQL Server (version 11, 13, 13.1, or 17). If SQL Server is installed locally, you will already have this requirement satisfied; if not, download the driver from Microsoft.
For our Python environment, you need the following packages:
1. pandas
2. sqlalchemy
3. pyodbc
Assuming you already have Python installed, the rest of the setup is fairly straightforward:
pip install pandas
pip install sqlalchemy
pip install pyodbc
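With pyodbc installed, a quick way to confirm the ODBC driver requirement from earlier is satisfied is to list the drivers pyodbc can see:
import pyodbc

# Look for an entry such as "ODBC Driver 17 for SQL Server" in this list.
print(pyodbc.drivers())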
Setup
For our test dataset I went in search of something with a reasonable size. To my surprise, one of the most popular datasets on data.gov is real estate sales from my home state of Connecticut. At just shy of one million records and 105 MB in size, it makes a great test dataset for our bulk data load.
Let's write our Python script:
Import pandas and SQLAlchemy
import pandas as pd
from sqlalchemy.engine import create_engine, URL
Create a SQLAlchemy connection URL and engine
connect_url = URL.create(
    'mssql+pyodbc',
    username="<your username>",
    password="<your password>",
    host="<your server name>",
    database="<your database name>",
    query=dict(driver='ODBC Driver 17 for SQL Server'),
)
engine = create_engine(
    url=connect_url,
    fast_executemany=True,
)
Test the connection
try:
    with engine.connect() as connection:
        print("Connection successful!")
except Exception as e:
    print(f"Error occurred: {e}")
Load the real estate dataset we downloaded
df = pd.read_csv('C:\\Real_Estate_Sales_2001-2020_GL.csv', low_memory=False)
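As a quick, optional sanity check, you can confirm the frame is roughly the expected size before loading it:
# Expect just shy of one million rows for this dataset.
print(df.shape)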
Load the data into SQL Server
df.to_sql(con=engine, schema="dbo", name="<target table>", if_exists="replace", index=False, chunksize=1000)
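If you want to verify the load completed, a minimal follow-up query (using the same placeholder table name as above) looks like this:
from sqlalchemy import text

# "<target table>" is the same placeholder used in to_sql above.
with engine.connect() as connection:
    row_count = connection.execute(
        text("SELECT COUNT(*) FROM dbo.[<target table>]")
    ).scalar()
    print(f"Rows loaded: {row_count}")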
Blazing Fast
Putting this all together, I executed the script five times to see what our average performance looks like. Below are the executions and their times in seconds; the average is 71.6 seconds. Not bad for one million records being read from a 105 MB CSV.
- 63
- 72
- 70
- 83
- 70
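If you want to reproduce this kind of measurement yourself, a minimal sketch that wraps the to_sql call from the script above with Python's timer looks like this:
import time

start = time.perf_counter()
df.to_sql(con=engine, schema="dbo", name="<target table>",
          if_exists="replace", index=False, chunksize=1000)
print(f"Load took {time.perf_counter() - start:.1f} seconds")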