SQL Archives | Life Around Data
http://www.lifearounddata.com/category/ai/data-science/sql/
On data science, engineering, humans, teams, and life in general

Advanced SQL Templates In Python with JinjaSql
http://www.lifearounddata.com/advanced-sql-templates-in-python-with-jinjasql/
Sun, 12 Jan 2020 17:38:36 +0000

The post Advanced SQL Templates In Python with JinjaSql by Sergei Izrailev appeared first on Life Around Data.


In A Simple Approach To Templated SQL Queries In Python, I introduced the basics of SQL templates in Python using JinjaSql. This post further demonstrates the power of Jinja2 within JinjaSql templates using presets, loops, and custom functions. Let’s consider an everyday use case when we have a table with some dimensions and some numerical values, and we want to find some metrics for a given dimension or a combination of dimensions. The example data below is tiny and entirely made up, but it suffices for demonstrating the advanced features.

First, I introduce the data set and the questions to be answered with SQL queries. I’ll then build the queries without templates, and finally, show how to use the SQL templates to parameterize and generate these queries. To generate SQL from JinjaSql templates, I’ll use (and modify) the apply_sql_template function introduced in the previous post and available on GitHub in sql_tempates_base.py.

Example data set

Let’s consider a table transactions that contains records about purchases in some stores. The purchase can be made by cash, with a credit or a debit card, which adds an extra dimension to the data. Here is the code for creating and populating the table.

create table transactions (
    transaction_id int,
    user_id int,
    transaction_date date,
    store_id int,
    payment_method varchar(10),
    amount float
)
;

insert into transactions
(transaction_id, user_id, transaction_date, store_id, payment_method, amount)
values
    (1, 1234, '2019-03-02', 1, 'cash', 5.25),
    (2, 1234, '2019-03-01', 1, 'credit', 10.75),
    (3, 1234, '2019-03-02', 2, 'cash', 25.50),
    (4, 1234, '2019-03-03', 2, 'credit', 17.00),
    (5, 4321, '2019-03-01', 2, 'cash', 20.00),
    (6, 4321, '2019-03-02', 2, 'debit', 30.00),
    (7, 4321, '2019-03-03', 1, 'cash', 3.00)
;

Metrics to compute

When exploring a data set, it is common to look at the main performance metrics across all dimensions. In this example, we want to compute the following metrics:

  • number of transactions
  • average transaction amount
  • total transaction amount

We want these metrics for each user, store, and payment method. We also want to look at these metrics by store and payment method together.

Template for a single dimension

The query to obtain the metrics for each store is:

select
    store_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    store_id
order by total_amount desc
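To see what this query returns on the example data, here is a minimal sketch that loads the rows into an in-memory SQLite database using Python's standard sqlite3 module. The choice of SQLite, the names ROWS, STORE_STATS_SQL, and load_transactions, and the unique transaction_id values 1 through 7 are assumptions made for this illustration only.

```python
import sqlite3

# Example rows; the unique transaction_id values 1..7 are an assumption
# made for this sketch so that every row has its own id.
ROWS = [
    (1, 1234, '2019-03-02', 1, 'cash', 5.25),
    (2, 1234, '2019-03-01', 1, 'credit', 10.75),
    (3, 1234, '2019-03-02', 2, 'cash', 25.50),
    (4, 1234, '2019-03-03', 2, 'credit', 17.00),
    (5, 4321, '2019-03-01', 2, 'cash', 20.00),
    (6, 4321, '2019-03-02', 2, 'debit', 30.00),
    (7, 4321, '2019-03-03', 1, 'cash', 3.00),
]

STORE_STATS_SQL = '''
select
    store_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    store_id
order by total_amount desc
'''


def load_transactions():
    '''Creates an in-memory database holding the example table.'''
    conn = sqlite3.connect(':memory:')
    conn.execute('''
        create table transactions (
            transaction_id int,
            user_id int,
            transaction_date date,
            store_id int,
            payment_method varchar(10),
            amount float
        )
    ''')
    conn.executemany(
        'insert into transactions values (?, ?, ?, ?, ?, ?)', ROWS)
    return conn


conn = load_transactions()
store_stats = conn.execute(STORE_STATS_SQL).fetchall()
for row in store_stats:
    print(row)
```

Store 2 comes first with 4 transactions totaling 92.5, followed by store 1 with 3 transactions totaling 19.0.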

To get the same metrics for other dimensions, we only need to change the store_id into user_id or payment_method in both the select and group by clauses. So the JinjaSql template may look like

_BASIC_STATS_TEMPLATE = '''
select
    {{ dim | sqlsafe }}
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dim | sqlsafe }}
order by total_amount desc
'''

with parameters, for example, as

params = {
    'dim': 'payment_method'
}
sql = apply_sql_template(_BASIC_STATS_TEMPLATE, params)

The above template works for a single dimension, but what if we have more than one? To generate a generic query that works with any number of dimensions, let’s create a skeleton of a function that takes a list of dimension column names as an argument and returns the SQL.

def get_basic_stats_sql(dimensions):
    '''
    Returns the sql computing the number of transactions,
    as well as the total and the average transaction amounts
    for the provided list of column names as dimensions.
    '''
    # TODO: construct params
    return apply_sql_template(_BASIC_STATS_TEMPLATE, params)

Essentially, we want to transform a list of column names, such as ['payment_method'] or ['store_id', 'payment_method'] into a single string containing the column names as a comma-separated list. Here we have some options, as it can be done either in Python or in the template.

Passing a string generated outside the template

The first option is to generate the comma-separated string before passing it to the template. We can do it simply by joining the members of the list together:

def get_basic_stats_sql(dimensions):
    '''
    Returns the sql computing the number of transactions,
    as well as the total and the average transaction amounts
    for the provided list of column names as dimensions.
    '''
    params = {
      'dim': '\n    , '.join(dimensions)
    }
    return apply_sql_template(_BASIC_STATS_TEMPLATE, params)

It so happens that the template parameter dim is in the right place, so the resulting query is

>>> print(get_basic_stats_sql(['store_id', 'payment_method']))
select
    store_id
    , payment_method
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    store_id
    , payment_method
order by total_amount desc

Now we can quickly generate SQL queries for all desired dimensions using

dimension_lists = [
    ['user_id'],
    ['store_id'],
    ['payment_method'],
    ['store_id', 'payment_method'],
]

dimension_queries = [get_basic_stats_sql(dims) for dims in dimension_lists]

Preset variables inside the template

An alternative to passing a pre-built string as a template parameter is to move the column list SQL generation into the template itself by setting a new variable at the top:

_PRESET_VAR_STATS_TEMPLATE = '''
{% set dims = '\n    , '.join(dimensions) %}
select
    {{ dims | sqlsafe }}
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dims | sqlsafe }}
order by total_amount desc
'''

This template is more readable than the previous version since all transformations happen in one place in the template, and at the same time, there’s no clutter. The function should change to

def get_stats_sql(dimensions):
    '''
    Returns the sql computing the number of transactions,
    as well as the total and the average transaction amounts
    for the provided list of column names as dimensions.
    '''
    params = {
      'dimensions': dimensions
    }
    return apply_sql_template(_PRESET_VAR_STATS_TEMPLATE, params)

Loops inside the template

We can also use loops inside the template to generate the columns.

_LOOPS_STATS_TEMPLATE = '''
select
    {{ dimensions[0] | sqlsafe }}\
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}{% endfor %}
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dimensions[0] | sqlsafe }}\
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}{% endfor %}
order by total_amount desc
'''

This example may not be the best use of the loops because a preset variable does the job just fine without the extra complexity. However, loops are a powerful construct, especially when there is additional logic inside the loop, such as conditions ({% if ... %} - {% endif %}) or nested loops.

So what is happening in the template above? The first element of the list dimensions[0] stands alone because it doesn’t need a comma in front of the column name. We wouldn’t need that if the query had a fixed first column; the for-loop would then simply be

    {% for dim in dimensions %}
    , {{ dim | sqlsafe }}
    {% endfor %}

Then, the for-loop construct goes over the remaining elements dimensions[1:]. The same code appears in the group by clause, which is also not ideal and only serves the purpose of showing the loop functionality.

One may wonder why the formatting of the loop is so strange. The reason is that the flow elements of the SQL template, such as {% endfor %}, generate a blank line if they appear on a separate line. To avoid that, in the template above, both {% for ... %} and {% endfor %} are technically on the same line as the previous code (hence the backslash \ after the first column name). SQL, of course, doesn’t care about whitespace, but humans who read SQL may (and should) care. An alternative to fighting with formatting within the template that would make it more readable is to strip the blank lines from the generated query before printing or logging it. A useful function for that purpose is

import os
def strip_blank_lines(text):
    '''
    Removes blank lines from the text, including those containing only spaces.
    https://stackoverflow.com/questions/1140958/whats-a-quick-one-liner-to-remove-empty-lines-from-a-python-string
    '''
    return os.linesep.join([s for s in text.splitlines() if s.strip()])
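As a quick check of what this helper does, the sketch below applies it to a small hand-written string that mimics the blank lines a template leaves behind. The variable names raw_sql and clean_sql are made up for this illustration; the function body is repeated so that the snippet runs on its own.

```python
import os


def strip_blank_lines(text):
    '''Removes blank lines, including those containing only spaces.'''
    return os.linesep.join([s for s in text.splitlines() if s.strip()])


# One empty line and one line containing only spaces, similar to what
# {% for %} and {% endfor %} leave behind when they sit on their own lines.
raw_sql = (
    'select\n'
    '    store_id\n'
    '\n'
    '    , payment_method\n'
    '    \n'
    'from\n'
    '    transactions\n'
)

clean_sql = strip_blank_lines(raw_sql)
print(clean_sql)
```

Only the five non-blank lines remain, and their indentation is untouched.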

A better-formatted template then becomes

_LOOPS_STATS_TEMPLATE = '''
select
    {{ dimensions[0] | sqlsafe }}
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}
    {% endfor %}
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dimensions[0] | sqlsafe }}
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}
    {% endfor %}
order by total_amount desc
'''

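And the call to print the query, where get_loops_stats_sql is assumed to be defined like get_stats_sql above but applying _LOOPS_STATS_TEMPLATE, is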

print(strip_blank_lines(get_loops_stats_sql(['store_id', 'payment_method'])))
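Another way to avoid the blank lines is to let Jinja2 trim the whitespace around block tags through its trim_blocks and lstrip_blocks settings. The sketch below uses plain Jinja2 rather than JinjaSql, so there is no sqlsafe filter and no bind parameters, and the template name TRIMMED_LOOP_TEMPLATE is made up for this illustration. JinjaSql's constructor takes an env argument, which I assume can carry the same settings, but that is not verified here.

```python
from jinja2 import Environment

# trim_blocks removes the newline right after a block tag;
# lstrip_blocks removes the spaces in front of a block tag on its line.
env = Environment(trim_blocks=True, lstrip_blocks=True)

TRIMMED_LOOP_TEMPLATE = '''\
select
    {{ dimensions[0] }}
    {% for dim in dimensions[1:] %}
    , {{ dim }}
    {% endfor %}
    , count(*) as num_transactions
from
    transactions
group by
    {{ dimensions | join(', ') }}
'''

rendered = env.from_string(TRIMMED_LOOP_TEMPLATE).render(
    dimensions=['store_id', 'payment_method'])
print(rendered)
```

With these settings the template can keep each {% for %} and {% endfor %} on its own line, and the rendered query contains no blank lines and needs no post-processing.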

All the SQL templates so far used a list of dimensions to produce precisely the same query.

Custom dimensions with looping over a dictionary

In the loop example above, we see how to iterate over a list. It is also possible to iterate over a dictionary. This comes in handy, for example, when we want to alias or transform some or all of the columns that form dimensions. Suppose we wanted to combine the debit and credit cards as a single value and compare it to cash transactions. We can accomplish that by first creating a dictionary defining a transformation for the payment_method and keeping the store_id unchanged.

custom_dimensions = {
    'store_id': 'store_id',
    'card_or_cash': "case when payment_method = 'cash' then 'cash' else 'card' end",
}

Here, both credit and debit values are replaced with card. Then, the template may look like the following:

_CUSTOM_STATS_TEMPLATE = '''
{% set dims = '\n    , '.join(dimensions.keys()) %}
select
    sum(amount) as total_amount
    {% for dim, def in dimensions.items() %}
    , {{ def | sqlsafe }} as {{ dim | sqlsafe }}
    {% endfor %}
    , count(*) as num_transactions
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dims | sqlsafe }}
order by total_amount desc
'''

Note that I moved total_amount to the first column just to simplify this example and avoid having to deal with the first element in the loop separately. Also, note that the group by clause is using a preset variable and is different from the code in the select clause because it only lists the names of the generated columns. Grouping by a column alias such as card_or_cash is accepted by some databases but not by all of them, so check what your database supports. The resulting SQL query is

>>> print(strip_blank_lines(
...     apply_sql_template(template=_CUSTOM_STATS_TEMPLATE,
...                        parameters={'dimensions': custom_dimensions})))
select
    sum(amount) as total_amount
    , store_id as store_id
    , case when payment_method = 'cash' then 'cash' else 'card' end as card_or_cash
    , count(*) as num_transactions
    , avg(amount) as avg_amount
from
    transactions
group by
    store_id
    , card_or_cash
order by total_amount desc

Calling custom Python functions from within JinjaSql templates

What if we want to use a Python function to generate a portion of the code? Jinja2 allows one to register custom functions and functions from other packages for use within the SQL templates. Let’s start with defining a function that generates the string that we insert into the SQL for transforming custom dimensions.

def transform_dimensions(dimensions: dict) -> str:
    '''
    Generate SQL for aliasing or transforming the dimension columns.
    '''
    return '\n    , '.join([
        '{val} as {key}'.format(val=val, key=key)
        for key, val in dimensions.items()
    ])

The output of this function is what we expect to appear in the select clause:

>>> print(transform_dimensions(custom_dimensions))
store_id as store_id
    , case when payment_method = 'cash' then 'cash' else 'card' end as card_or_cash

Now we need to register this function with Jinja2. To do that, we’ll modify the apply_sql_template function from the previous post as follows.

from jinjasql import JinjaSql

def apply_sql_template(template, parameters, func_list=None):
    '''
    Apply a JinjaSql template (string) substituting parameters (dict) and return
    the final SQL. Use the func_list to pass any functions called from the template.
    '''
    j = JinjaSql(param_style='pyformat')
    if func_list:
        for func in func_list:
            j.env.globals[func.__name__] = func
    query, bind_params = j.prepare_query(template, parameters)
    return get_sql_from_template(query, bind_params)

This version has an additional optional argument func_list that needs to be a list of functions.

Let’s change the template to take advantage of the transform_dimensions function.

_FUNCTION_STATS_TEMPLATE = '''
{% set dims = '\n    , '.join(dimensions.keys()) %}
select
    {{ transform_dimensions(dimensions) | sqlsafe }}
    , sum(amount) as total_amount
    , count(*) as num_transactions
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dims | sqlsafe }}
order by total_amount desc
'''

Now we also don’t need to worry about the first column not having a comma. The following call produces a SQL query similar to that in the previous section.

>>> print(strip_blank_lines(
...     apply_sql_template(template=_FUNCTION_STATS_TEMPLATE,
...                        parameters={'dimensions': custom_dimensions},
...                        func_list=[transform_dimensions])))

select
    store_id as store_id
    , case when payment_method = 'cash' then 'cash' else 'card' end as card_or_cash
    , sum(amount) as total_amount
    , count(*) as num_transactions
    , avg(amount) as avg_amount
from
    transactions
group by
    store_id
    , card_or_cash
order by total_amount desc

Note how we pass transform_dimensions to apply_sql_template as a list [transform_dimensions]. Multiple functions can be passed into SQL templates as a list of functions, for example, [func1, func2].

Conclusion

This tutorial is a follow-up to my first post on the basic use of JinjaSql. It demonstrates the use of preset variables, loops over lists and dictionaries, and custom Python functions within JinjaSql templates for advanced SQL code generation. In particular, the addition of custom function registration to the apply_sql_template function makes templating much more powerful and versatile. Parameterized SQL queries continue to be indispensable for automated report generation and for reducing the amount of SQL code that needs to be maintained. An added benefit is that with reusable SQL code snippets, it becomes easier to use standard Python unit testing techniques to verify that the generated SQL is correct.
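As one possible illustration of the unit testing point, here is a small sketch that tests the transform_dimensions function from this post with the standard unittest module. The test class and test method names are made up for this example, and the function body is repeated so that the snippet runs on its own.

```python
import unittest


def transform_dimensions(dimensions: dict) -> str:
    '''
    Generate SQL for aliasing or transforming the dimension columns.
    '''
    return '\n    , '.join([
        '{val} as {key}'.format(val=val, key=key)
        for key, val in dimensions.items()
    ])


class TransformDimensionsTest(unittest.TestCase):

    def test_alias_and_expression(self):
        dims = {
            'store_id': 'store_id',
            'card_or_cash':
                "case when payment_method = 'cash' then 'cash' else 'card' end",
        }
        expected = (
            "store_id as store_id\n"
            "    , case when payment_method = 'cash' then 'cash' "
            "else 'card' end as card_or_cash"
        )
        self.assertEqual(transform_dimensions(dims), expected)

    def test_single_dimension_has_no_separator(self):
        self.assertEqual(
            transform_dimensions({'uid': 'user_id'}), 'user_id as uid')


# Run the tests explicitly so the snippet also works in an interactive session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TransformDimensionsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the function returns a plain string, the tests need neither a database nor a template engine.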

The code in this post is licensed under the MIT License.

Photo by Sergei Izrailev

Left Join with Pandas Data Frames in Python
http://www.lifearounddata.com/left-join-with-pandas-data-frames-in-python/
Wed, 21 Aug 2019 14:11:48 +0000

The post Left Join with Pandas Data Frames in Python by Sergei Izrailev appeared first on Life Around Data.


Merging Pandas data frames is covered extensively in the Stack Overflow article Pandas Merging 101. However, my experience of grading data science take-home tests leads me to believe that left joins remain a challenge for many people. In this post, I show how to properly handle cases when the right table (data frame) in a Pandas left join contains nulls.

Let’s consider a scenario where we have a table transactions containing transactions performed by some users and a table users containing some user properties, for example, their favorite color. We want to annotate the transactions with the users’ properties. Here are the data frames:

import numpy as np
import pandas as pd

np.random.seed(0)
# transactions
left = pd.DataFrame({'transaction_id': ['A', 'B', 'C', 'D'],
                     'user_id': ['Peter', 'John', 'John', 'Anna'],
                     'value': np.random.randn(4),
                    })

# users
right = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna'],
                      'favorite_color': ['blue', 'blue', 'red', np.nan],
                     })

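Note that Peter is not in the users table and Anna doesn’t have a favorite color. The numbers in the value column are random draws, so a fresh run of the code above may print different values than the ones shown in this post.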

>>> left
  transaction_id user_id     value
0              A   Peter  1.867558
1              B    John -0.977278
2              C    John  0.950088
3              D    Anna -0.151357

>>> right
  user_id favorite_color
0    Paul           blue
1    Mary           blue
2    John            red
3    Anna            NaN

Adding the user’s favorite color to the transaction table seems straightforward using a left join on the user id:

>>> left.merge(right, on='user_id', how='left')
  transaction_id user_id     value favorite_color
0              A   Peter  1.867558            NaN
1              B    John -0.977278            red
2              C    John  0.950088            red
3              D    Anna -0.151357            NaN

We see that Peter and Anna have NaNs in the favorite_color column. However, the missing values are there for two different reasons: Peter’s record didn’t have a match in the users table, while Anna didn’t have a value for the favorite color. In some cases, this subtle difference is important. For example, it can be critical to understanding the data during initial exploration and to improving data quality.

Here are two simple methods to track the differences in why a value is missing in the result of a left join. The first is provided directly by the merge function through the indicator parameter. When set to True, the resulting data frame has an additional column _merge:

>>> left.merge(right, on='user_id', how='left', indicator=True)
  transaction_id user_id     value favorite_color     _merge
0              A   Peter  1.867558            NaN  left_only
1              B    John -0.977278            red       both
2              C    John  0.950088            red       both
3              D    Anna -0.151357            NaN       both

The second method is related to how it would be done in the SQL world and explicitly adds a column representing the user_id in the right table. We note that if the join columns in the two tables have different names, both columns appear in the resulting data frame, so we rename the user_id column in the users table before merging.

>>> left.merge(right.rename({'user_id': 'user_id_r'}, axis=1),
               left_on='user_id', right_on='user_id_r', how='left')
  transaction_id user_id     value user_id_r favorite_color
0              A   Peter  1.867558       NaN            NaN
1              B    John -0.977278      John            red
2              C    John  0.950088      John            red
3              D    Anna -0.151357      Anna            NaN

An equivalent SQL query is

select
    t.transaction_id
    , t.user_id
    , t.value
    , u.user_id as user_id_r
    , u.favorite_color
from
    transactions t
    left join
    users u
    on t.user_id = u.user_id
;

In conclusion, adding an extra column that indicates whether there was a match in the Pandas left join allows us to subsequently treat the missing values for the favorite color differently depending on whether the user was known but didn’t have a favorite color or the user was missing from the users table.
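To make the last step concrete, here is a minimal sketch that uses the _merge indicator to label each row with the reason the favorite color is missing. The column name color_status and its labels are invented for this illustration, and the value column is left out because it plays no role here.

```python
import numpy as np
import pandas as pd

# transactions (the value column is omitted in this sketch)
left = pd.DataFrame({'transaction_id': ['A', 'B', 'C', 'D'],
                     'user_id': ['Peter', 'John', 'John', 'Anna'],
                    })

# users
right = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna'],
                      'favorite_color': ['blue', 'blue', 'red', np.nan],
                     })

merged = left.merge(right, on='user_id', how='left', indicator=True)

# np.select uses the first condition that holds for each row, so a user
# missing from the users table is reported as such before the check
# for a missing color is applied.
conditions = [
    merged['_merge'] == 'left_only',
    merged['favorite_color'].isna(),
]
labels = ['unknown user', 'no favorite color']
merged['color_status'] = np.select(conditions, labels, default='known')

print(merged[['transaction_id', 'user_id', 'favorite_color', 'color_status']])
```

Peter's row is labeled as an unknown user and Anna's as having no favorite color, while both of John's rows are labeled as known.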

Photo by Ilona Froehlich on Unsplash.

A Simple Approach To Templated SQL Queries In Python
http://www.lifearounddata.com/templated-sql-queries-in-python/
Fri, 08 Mar 2019 04:15:47 +0000

The post A Simple Approach To Templated SQL Queries In Python by Sergei Izrailev appeared first on Life Around Data.


There are numerous situations in which one would want to insert parameters in a SQL query, and there are many ways to implement templated SQL queries in Python. Without going into comparing different approaches, this post explains a simple and effective method for parameterizing SQL using JinjaSql. Besides many powerful features of Jinja2, such as conditional statements and loops, JinjaSql offers a clean and straightforward way to parameterize not only the values substituted into the where and in clauses, but also SQL statements themselves, including parameterizing table and column names and composing queries by combining whole code blocks.

Basic parameter substitution

Let’s assume we have a table transactions holding records about financial transactions. The columns in this table could be transaction_id, user_id, transaction_date, and amount. To compute the number of transactions and the total amount for a given user on a given day, a query directly to the database may look something like

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = 1234
    and transaction_date = '2019-03-02'
group by
    user_id

Here, we assume that the database will automatically convert the YYYY-MM-DD format of the string representation of the date into a proper date type.

If we want to run the query above for an arbitrary user and date, we need to parameterize the user_id and the transaction_date values. In JinjaSql, the corresponding template would simply become

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = {{ uid }}
    and transaction_date = {{ tdate }}
group by
    user_id

Here, the values were replaced by placeholders with Python variable names enclosed in double curly braces {{ }}. Note that the variable names uid and tdate were picked only to demonstrate that they are variable names and don’t have anything to do with the column names themselves. A more readable version of the same template stored in a Python variable is

user_transaction_template = '''
select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = {{ user_id }}
    and transaction_date = {{ transaction_date }}
group by
    user_id
'''

Next, we need to set the parameters for the query.

params = {
    'user_id': 1234,
    'transaction_date': '2018-03-01',
}

Now, generating a SQL query from this template is straightforward.

from jinjasql import JinjaSql
j = JinjaSql(param_style='pyformat')
query, bind_params = j.prepare_query(user_transaction_template, params)

If we print query and bind_params, we find that the former is a parameterized string, and the latter is an OrderedDict of parameters:

>>> print(query)
select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = %(user_id)s
    and transaction_date = %(transaction_date)s
group by
    user_id
>>> print(bind_params)
OrderedDict([('user_id', 1234), ('transaction_date', '2018-03-01')])

Running parameterized queries

Many database connections have an option to pass bind_params as an argument to the method executing the SQL query on a connection. For a data scientist, it may be natural to get results of the query in a Pandas data frame. Once we have a connection conn, it is as easy as running read_sql:

import pandas as pd
frm = pd.read_sql(query, conn, params=bind_params)

See the JinjaSql docs for other examples.
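As a self-contained illustration of passing bind parameters to a database driver, here is a sketch using Python's built-in sqlite3 module. It does not call JinjaSql: sqlite3 does not accept the pyformat placeholders shown above, so the query is written by hand with named placeholders (:user_id), and the table contents are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
    create table transactions (
        transaction_id int,
        user_id int,
        transaction_date text,
        amount float
    )
''')
conn.executemany('insert into transactions values (?, ?, ?, ?)', [
    (1, 1234, '2019-03-02', 5.25),
    (2, 1234, '2019-03-02', 25.50),
    (3, 4321, '2019-03-02', 30.00),
])

# Hand-written query with named placeholders; the driver binds the values.
named_query = '''
select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = :user_id
    and transaction_date = :transaction_date
group by
    user_id
'''
named_params = {'user_id': 1234, 'transaction_date': '2019-03-02'}
result_rows = conn.execute(named_query, named_params).fetchall()
print(result_rows)

# A bound value is always treated as data, never as SQL text, so this
# injection attempt simply matches no rows.
attack_params = {'user_id': '1234 or 1=1', 'transaction_date': '2019-03-02'}
attack_rows = conn.execute(named_query, attack_params).fetchall()
print(attack_rows)
```

The first call returns a single row for user 1234 with two transactions, and the second returns an empty list.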

From a template to the final SQL query

It is often desirable to fully expand the query with all parameters before running it. For example, logging the full query is invaluable for debugging batch processes because one can copy-paste the query from the logs directly into an interactive SQL interface. It is tempting to substitute the bind_params into the query using Python’s built-in string substitution. However, we quickly find that string parameters need to be quoted to result in proper SQL. For example, in the template above, the date value must be enclosed in single quotes.

>>> print(query % bind_params)

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = 1234
    and transaction_date = 2018-03-01
group by
    user_id

To deal with this, we need a helper function to correctly quote parameters that are strings. We detect whether a parameter is a string by calling

from six import string_types
isinstance(value, string_types)

This works for both Python 3 and 2.7. The string parameters are converted to the str type, single quotes in the values are escaped by another single quote, and finally, the whole value is enclosed in single quotes.

from six import string_types

def quote_sql_string(value):
    '''
    If `value` is a string type, escapes single quotes in the string
    and returns the string enclosed in single quotes.
    '''
    if isinstance(value, string_types):
        new_value = str(value)
        new_value = new_value.replace("'", "''")
        return "'{}'".format(new_value)
    return value

Finally, to convert the template to proper SQL, we loop over bind_params, quote the strings, and then perform string substitution.

from copy import deepcopy

def get_sql_from_template(query, bind_params):
    if not bind_params:
        return query
    params = deepcopy(bind_params)
    for key, val in params.items():
        params[key] = quote_sql_string(val)
    return query % params

Now we can easily get the final query that we can log or run interactively:

>>> print(get_sql_from_template(query, bind_params))

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = 1234
    and transaction_date = '2018-03-01'
group by
    user_id
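For readers on Python 3 only, the same idea can be sketched without the six dependency. The names quote_sql_string_py3 and expand_pyformat are hypothetical stand-ins for the helpers above, and the example value contains a single quote to show the escaping.

```python
def quote_sql_string_py3(value):
    '''
    Python 3 variant: escapes single quotes in a string and returns it
    enclosed in single quotes; any other value is returned unchanged.
    '''
    if isinstance(value, str):
        return "'{}'".format(value.replace("'", "''"))
    return value


def expand_pyformat(query, bind_params):
    '''
    Substitutes quoted parameters into a pyformat-style query string.
    '''
    quoted = {key: quote_sql_string_py3(val)
              for key, val in bind_params.items()}
    return query % quoted


demo_query = (
    'select * from users '
    'where last_name = %(last_name)s and user_id = %(user_id)s'
)
demo_sql = expand_pyformat(
    demo_query, {'last_name': "O'Brien", 'user_id': 1234})
print(demo_sql)
# select * from users where last_name = 'O''Brien' and user_id = 1234
```

One thing to watch for with this approach is a literal percent sign in the query text, for example in a like pattern: it has to be doubled (%%) for the % substitution to work.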

Putting it all together, another helper function wraps the JinjaSql calls and simply takes the template and a dict of parameters, and returns the full SQL.

from jinjasql import JinjaSql

def apply_sql_template(template, parameters):
    '''
    Apply a JinjaSql template (string) substituting parameters (dict) and return
    the final SQL.
    '''
    j = JinjaSql(param_style='pyformat')
    query, bind_params = j.prepare_query(template, parameters)
    return get_sql_from_template(query, bind_params)

Compute statistics on a column

Computing statistics on the values stored in a particular database column is handy both when first exploring the data and for data validation in production. Since we only want to demonstrate some features of the templates, for simplicity, let’s just work with integer columns, such as the column user_id in the table transactions above. For integer columns, we are interested in the number of unique values, min and max values, and the number of nulls. Some columns may have a default value of, say, -1. The drawbacks of such defaults are beyond the scope of this post; however, we do want to capture them by reporting the number of default values.

Consider the following template and function. The function takes the table name, the column name and the default value as arguments, and returns the SQL for computing the statistics.

COLUMN_STATS_TEMPLATE = '''
select
    {{ column_name | sqlsafe }} as column_name
    , count(*) as num_rows
    , count(distinct {{ column_name | sqlsafe }}) as num_unique
    , sum(case when {{ column_name | sqlsafe }} is null then 1 else 0 end) as num_nulls
    {% if default_value is not none %}
    , sum(case when {{ column_name | sqlsafe }} = {{ default_value }} then 1 else 0 end) as num_default
    {% else %}
    , 0 as num_default
    {% endif %}
    , min({{ column_name | sqlsafe }}) as min_value
    , max({{ column_name | sqlsafe }}) as max_value
from
    {{ table_name | sqlsafe }}
'''


def get_column_stats_sql(table_name, column_name, default_value):
    '''
    Returns the SQL for computing column statistics.
    Passing None for the default_value results in zero output for the number
    of default values.
    '''
    # The default value is a bind parameter, so a string default is quoted
    # later by get_sql_from_template and must not be quoted here.
    params = {
        'table_name': table_name,
        'column_name': column_name,
        'default_value': default_value,
    }
    return apply_sql_template(COLUMN_STATS_TEMPLATE, params)

This function is straightforward and very powerful because it applies to any column in any table. Note the {% if default_value is not none %} syntax in the template. If the default value that is passed to the function is None, the SQL returns zero in the num_default field; any other value, including 0, is counted. The default is passed as is because get_sql_from_template already quotes string parameters.

The function and template above will also work with strings, dates, and other data types if the default_value is set to None. However, to handle different data types more intelligently, it is necessary to extend the function to also take the data type as an argument and build the logic specific to different data types. For example, one might want to know the min and max of the string length instead of the min and max of the value itself.

Let’s look at the output for the transactions.user_id column.

>>> print(get_column_stats_sql('transactions', 'user_id', None))

select
    user_id as column_name
    , count(*) as num_rows
    , count(distinct user_id) as num_unique
    , sum(case when user_id is null then 1 else 0 end) as num_nulls

    , 0 as num_default

    , min(user_id) as min_value
    , max(user_id) as max_value
from
    transactions

Note that blank lines appear in place of the {% %} clauses and could be removed.

Summary

With the helper functions above, creating and running templated SQL queries in Python is very easy. Because the details of parameter substitution are hidden, one can focus on building the template and the set of parameters, and then call a single function to get the final SQL.

One important caveat is the risk of code injection. For batch processes, it should not be an issue, but using the sqlsafe construct in web applications could be dangerous. The sqlsafe keyword indicates that the user (you) is confident that no code injection is possible and takes responsibility for simply putting whatever string is passed in the parameters directly into the query.

On the other hand, the ability to put an arbitrary string in the query allows one to pass whole code blocks into a template. For example, instead of passing table_name='transactions' above, one could pass "(select * from transactions where transaction_date = '2018-03-01') t", and the query would still work.
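When table or column names do come from outside the code, one common mitigation is to check them against a fixed allow list before they reach a sqlsafe placeholder. The sketch below shows one way to do that; check_identifiers and ALLOWED_DIMENSIONS are hypothetical names introduced for this example and are not part of JinjaSql.

```python
import re

# A conservative pattern for plain, unquoted SQL identifiers.
_IDENTIFIER_RE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def check_identifiers(names, allowed):
    '''
    Returns the names as a list if each one is a plain identifier found in
    the allow list; raises ValueError otherwise.
    '''
    for name in names:
        if not _IDENTIFIER_RE.fullmatch(name):
            raise ValueError('Not a plain identifier: {!r}'.format(name))
        if name not in allowed:
            raise ValueError('Identifier not allowed: {!r}'.format(name))
    return list(names)


ALLOWED_DIMENSIONS = {'user_id', 'store_id', 'payment_method'}

safe_dims = check_identifiers(['store_id', 'payment_method'],
                              ALLOWED_DIMENSIONS)
print(safe_dims)

try:
    check_identifiers(['store_id; drop table transactions'],
                      ALLOWED_DIMENSIONS)
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

A check like this deliberately rules out passing whole code blocks, so it fits dimension or column parameters rather than parameters meant to carry subqueries.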

The code in this post is licensed under the MIT License.

Photo and image by Sergei Izrailev
