Efficiently Executing DDL-Statements & Stored Procedures on SQL Server using PySpark in Databricks.

Joshy Jonckheere · Published in delaware
Mar 22, 2021 · 3 min read


There are many cases where it’s desirable to create or truncate a table from within Databricks before pushing data to it. Executing a stored procedure can also be helpful in a variety of data projects (e.g. performing a merge on the SQL-database). Although this might seem like basic functionality, it isn’t as straightforward when using PySpark…


Introduction

The PySpark JDBC-connector doesn’t support executing DDL-statements or stored procedures. The PyODBC library does support this, but it requires the installation of ODBC-drivers, which has a big impact on cluster startup times.

A lesser-known (and less documented) option is to use the native Java JDBC-driver from the Spark context. This makes it possible to execute DDL-statements and stored procedures without that overhead. In this article we provide some code examples to get you started with it.

The PyODBC library

The most straightforward way to execute DDL-statements and stored procedures is by using the PyODBC library. Although it’s possible, it has some major drawbacks in our context:

  • You have to install separate drivers in order for it to work
  • It takes longer to install on cluster startup (>2 min)

Using this library therefore causes a lot of overhead, which we want to avoid as much as possible. More information on the library can be found on its GitHub page.
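For reference, a minimal sketch of the PyODBC approach could look like the snippet below. The server, database, table and secret-scope names are placeholders, and the ODBC driver still has to be installed on the cluster for this to work:

```python
import pyodbc

# Hypothetical connection details; read the credentials from a
# Key Vault-backed secret scope instead of hard-coding them.
sql_user = dbutils.secrets.get(scope="keyvault-scope", key="sql-user")
sql_password = dbutils.secrets.get(scope="keyvault-scope", key="sql-password")

connection_string = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydatabase;"
    f"UID={sql_user};PWD={sql_password}"
)

conn = pyodbc.connect(connection_string, autocommit=True)
cursor = conn.cursor()

# Example DDL-statement: truncate a staging table before loading it
cursor.execute("TRUNCATE TABLE dbo.StagingTable")

cursor.close()
conn.close()
```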

The built-in Java JDBC-driver

To circumvent the drawbacks of the PyODBC library, you can instead make use of the built-in Java driver manager from your Spark context. The first step is to fetch a connection object from a JDBC-url by providing a SQL username and password. This object can then be used to execute SQL-statements on the SQL server:

DDL execution through username & password
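A minimal sketch of that flow is shown below. The hostname, database, table and secret-scope names are placeholders; in practice the username and password should come from a Key Vault-backed secret scope rather than plain text:

```python
# Access java.sql.DriverManager through the py4j gateway of the Spark context
driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager

# Hypothetical connection details and secret-scope keys
jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=mydatabase;encrypt=true;loginTimeout=30"
)
sql_user = dbutils.secrets.get(scope="keyvault-scope", key="sql-user")
sql_password = dbutils.secrets.get(scope="keyvault-scope", key="sql-password")

# Fetch a java.sql.Connection object using username & password
connection = driver_manager.getConnection(jdbc_url, sql_user, sql_password)
try:
    # Create a java.sql.Statement and execute an example DDL-statement
    statement = connection.createStatement()
    statement.execute("TRUNCATE TABLE dbo.StagingTable")
    statement.close()
finally:
    connection.close()
```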

The built-in Java JDBC-driver + OAuth2

Since authentication with a username and password is not considered a security best practice, a better option is to use an OAuth2 access token to authenticate with the built-in JDBC-driver.

When using Azure, this can be done with a Service Principal that has sufficient permissions on the SQL-server to read, write and execute. The method is similar to the previous one, but requires some extra steps:

DDL execution through access token
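A sketch of those extra steps is shown below. It assumes a Service Principal whose tenant id, client id and client secret are stored in a (hypothetical) Key Vault-backed secret scope, that the SQL Server JDBC-driver on the cluster supports the accessToken connection property (mssql-jdbc 6.4+), and that the token is requested with a plain REST call rather than a dedicated auth library:

```python
import requests

# Hypothetical secret-scope keys for the Service Principal
tenant_id = dbutils.secrets.get(scope="keyvault-scope", key="sp-tenant-id")
client_id = dbutils.secrets.get(scope="keyvault-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="keyvault-scope", key="sp-client-secret")

# Client-credentials flow; the scope for Azure SQL is https://database.windows.net/.default
token_response = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://database.windows.net/.default",
    },
)
access_token = token_response.json()["access_token"]

# Pass the token to the JDBC-driver via the 'accessToken' connection property
jvm = spark.sparkContext._gateway.jvm
properties = jvm.java.util.Properties()
properties.setProperty("accessToken", access_token)

jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=mydatabase;encrypt=true;loginTimeout=30"
)
connection = jvm.java.sql.DriverManager.getConnection(jdbc_url, properties)
try:
    statement = connection.createStatement()
    statement.execute("TRUNCATE TABLE dbo.StagingTable")
    statement.close()
finally:
    connection.close()
```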

Executing stored procedures

Executing stored procedures is similar to executing regular SQL statements. In the following example, a stored procedure is executed that takes an input parameter called ‘Name’:

SPR execution through access token
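A sketch of such a call, reusing a connection object obtained as above and a hypothetical stored procedure dbo.InsertCustomer with a single ‘Name’ parameter:

```python
# Prepare a java.sql.CallableStatement for the stored procedure call
exec_statement = connection.prepareCall("{call dbo.InsertCustomer(?)}")

# Bind the input parameter 'Name' (parameter indices start at 1)
exec_statement.setString(1, "Contoso")

exec_statement.execute()
exec_statement.close()
```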

Output variables

In many cases you will want to get something back from the stored procedure. In that case we need to perform some extra steps:

  • For every expected output parameter, a ‘?’ has to be added to the end of the statement
  • The output parameters need to be registered on the callable statement. This is done by providing the index of the question mark (starting at 1) and its datatype.
  • After execution, the output parameters need to be fetched using the correct get-method for their (Java) datatype.
SPR execution with output parameters through access token
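Continuing with the hypothetical dbo.InsertCustomer procedure, now assumed to also return an integer output parameter (for example a generated ID), the three steps could look like this:

```python
jvm = spark.sparkContext._gateway.jvm

# One '?' per parameter: the first is the input 'Name', the second the output value
exec_statement = connection.prepareCall("{call dbo.InsertCustomer(?, ?)}")
exec_statement.setString(1, "Contoso")

# Register the output parameter by its index (starting at 1) and its java.sql datatype
exec_statement.registerOutParameter(2, jvm.java.sql.Types.INTEGER)

exec_statement.execute()

# Fetch the output with the get-method that matches the registered Java datatype
new_customer_id = exec_statement.getInt(2)
exec_statement.close()
```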

Remarks

  • Never paste credentials in plain text into your notebooks! We recommend using a Key Vault-backed scope in Databricks.
  • The provided code samples are mainly to demonstrate the functionality of the spark JDBC-driver and might need some tweaking before being used in production-ready applications.
  • The described methods of using the built-in Java JDBC-driver are rather simple, but not widely known. Since we are using Java objects in a Python context, it’s harder to find (correct) information online. More information can be found in the Java docs (e.g. this article on executing stored procedures).

If you have any questions or concerns about this topic, do not hesitate to comment on this post or reach out to me or the Data Science & Engineering team at delaware. Some of our specializations are big data engineering, data analytics and Artificial Intelligence applications.

Mail: joshy.jonckheere@delaware.pro
LinkedIn: https://www.linkedin.com/in/joshy-jonckheere/
