Faster Data Science: From Big To Small Data

Business leaders often ask how to accelerate data science projects. It is well established that data scientists spend as much as 80% of their time on data wrangling. Reducing this time leads to faster data science project turnaround and allows data scientists to spend a larger fraction of their time on high-value activities that cannot be performed by others. Economic and productivity benefits grow quickly with the size of the data science team. This post makes a case for a moderate investment in data engineering that drastically reduces the time spent on interactive data exploration by satisfying the need for smaller representative data sets.

Reduce the turnaround time of analytics queries

The key to reducing the time spent in data exploration and code development is minimizing the time it takes to get an answer to basic questions about the data. The dynamics of the development process change drastically as the typical run time of an analytics query drops from several hours, to an hour, to 10 minutes, to under a minute, to under 10 seconds. Once the query response time gets into the range of a few seconds, one can, and does, ask many questions that test different hypotheses, and ultimately arrives at a better result in a significantly shorter time.

Big Data is too big

During research projects as well as the development of data pipelines, data scientists necessarily run their queries multiple times, refining and improving them to get the data they need in the format they want. However, working with Big Data is time-consuming because processing a large data set can take a long time whether you are using Hadoop or a modern distributed data warehouse such as Snowflake, Redshift, or Vertica. This problem is often exacerbated by a pattern of using shared computational resources where interactive queries compete with larger batch processes.

Data scientists, frustrated by how long it takes to move their research forward, invent shortcuts that avoid long wait times and often sacrifice the accuracy or applicability of the results of their analyses. For example, working with a single hour of log data makes the query time bearable so that one can make progress. However, the results cannot be used to draw conclusions about the whole data set.

Data engineers can alleviate the pain of data wrangling by employing several simple approaches that reduce the query time with only a moderate amount of work and little extra cost. These approaches, reviewed in detail below, aim at reducing the size and the scope of the data to a subset that is relevant to the questions asked.

Use columnar data stores

The way the data is stored makes a massive difference to how easily it is accessible for analysis. Columnar data formats, such as Parquet, are widely used and provide many benefits. In addition to limiting data scans to just the columns present in the query and allowing better compression, these formats store the data in a binary form rather than text. Even with a highly efficient CSV reader, parsing text into binary data types can consume about 80% of the time needed to load the data into memory. In addition, parsing text fields correctly can itself be a challenge. Pre-processing once and storing the data in a binary format avoids the additional computation every time the data is accessed. Data warehouses typically provide binary columnar data storage out of the box.
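As an illustration only (not from the original post), a minimal sketch of this workflow with pandas and pyarrow could look like the following; the file and column names are placeholders.

import pandas as pd

# One-time cost: parse the text file and write a compressed, columnar copy.
df = pd.read_csv('events.csv')
df.to_parquet('events.parquet', index=False)

# Subsequent reads skip text parsing and scan only the requested columns.
events = pd.read_parquet('events.parquet', columns=['user_id', 'amount'])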

Create always-on sampled data sets

Many data questions can be answered by using a representative sample of the data. Sampled data sets are much better than a small time slice of data because they are representative of the whole data set and are sufficient for answering a variety of questions. Even though extracting and updating a sample is tedious, every data scientist sooner or later is enticed to do it merely to shorten the turnaround time of their queries. A much more efficient solution that doesn’t cost much is providing an automatically generated sample of the data at regular intervals. Generation of sampled data should ideally be tied to the pipeline that produces the full data set.

In a typical situation where the data represents events (e.g., ad impressions, purchase transactions, site visits) related to some entities (e.g., online users, companies), it is beneficial to create data sets with different types of sampling. An obvious one is sampling the events, where a given percentage of events is randomly chosen to be included in the data set. Another sampling strategy, potentially more valuable and harder for a data scientist or analyst to implement, is covering all records related to a sample of entities. For example, we may want to include all purchases for 5% of our site users. Such a sample allows one to perform a user-based analysis efficiently. Sampling strategies, including adaptive sampling, are a topic for another post.
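To make the two flavors concrete, here is a hypothetical sketch with the SQL kept in python strings; the table, column, and sample-table names are made up, and the exact random and hashing functions vary from one warehouse to another.

event_sample_sql = '''
-- Event sampling: keep roughly 5% of individual events.
create table events_sample_5pct as
select *
from events
where random() < 0.05
'''

user_sample_sql = '''
-- Entity sampling: keep all events for roughly 5% of users, so that
-- user-level analyses see complete histories for the sampled users.
create table events_user_sample_5pct as
select *
from events
where mod(abs(hash(user_id)), 100) < 5
'''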

Extract smaller sets of relevant data

When events of interest are rare, sampling may not be an option. For example, in the digital advertising setting, one may be interested in extracting all available data for a specific advertising campaign within a limited time frame. Such a data set, while complete with every necessary field, is a small subset of all of the data. Analysts and data scientists interacting with this data are likely to issue hundreds of queries while they work on a project. The process of extracting such a data set and keeping it up-to-date can be automated if the data engineering team builds tools that allow data scientists to create a query for the initial extract, with another query responsible for periodic updates. After the project is complete, the data is no longer needed and can be archived or deleted.
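A sketch of what such tooling might generate is shown below; the campaign id, table names, and time window are placeholders rather than anything from the original post.

initial_extract_sql = '''
-- One-time extract for the project, complete with every necessary field.
create table campaign_1234_extract as
select *
from ad_events
where campaign_id = 1234
  and event_time >= '2019-03-01'
'''

incremental_update_sql = '''
-- Scheduled update: append only the rows that arrived since the last run.
insert into campaign_1234_extract
select *
from ad_events
where campaign_id = 1234
  and event_time > (select max(event_time) from campaign_1234_extract)
'''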

Put smaller data sets in efficient SQL stores

SQL is undoubtedly the most common language of data analytics. Thus, making data sets available for querying with SQL expands the number of people who can interact with the data. Democratization of the data aside, making smaller data sets available in efficient analytics SQL query engines further reduces the query time and, therefore, the time wasted by data scientists and analysts. Such query engines range, depending on the size of the data and the infrastructure requirements, from MySQL and PostgreSQL, to Snowflake and Redshift, to fully managed services such as Amazon's Athena and Google's BigQuery.
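For example, a sampled extract could be pushed into PostgreSQL with a few lines of pandas and SQLAlchemy (plus a PostgreSQL driver such as psycopg2); the connection string, file name, and table name below are placeholders.

import pandas as pd
from sqlalchemy import create_engine

# Connection details are placeholders for a real analytics database.
engine = create_engine('postgresql://user:password@host:5432/analytics')

# Load the sampled extract and write it to a SQL-queryable table.
sample = pd.read_parquet('events_user_sample_5pct.parquet')
sample.to_sql('events_user_sample_5pct', engine, if_exists='replace', index=False)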

Provide dedicated computational resources

One of the biggest frustrations in data exploration is having to wait in a shared queue for results of a query that eventually runs in under a minute. Separate high-priority resource pools for interactive queries on a shared cluster, possibly limited to business hours, go a long way in improving the overall query turnaround time. In addition, providing data science and analytics teams with dedicated computational resources that have sufficient CPU and memory capacity allows loading of the smaller data sets into a tool of their choice to perform development and in-depth analyses.

The investment is worth it!

Human time is an order of magnitude more expensive than computer time and data storage. One can see this easily by comparing the effective hourly rate of a data scientist to the hourly cost of a powerful computer. Moreover, human time creates net new value for the company. Thus, reducing the typical analytics query completion time to under a minute is likely the most impactful investment in technology available to accelerate data science and analytics projects.
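As a back-of-envelope comparison with purely illustrative numbers (assumptions, not figures from this post):

# All figures below are illustrative assumptions.
data_scientist_hourly_cost = 100.0   # fully loaded cost of one hour of human time
instance_hourly_cost = 5.0           # hourly cost of a large dedicated cloud instance

hours_saved_per_week = 5             # time no longer spent waiting on slow queries
weekly_benefit = hours_saved_per_week * data_scientist_hourly_cost   # 500.0
weekly_compute_cost = 40 * instance_hourly_cost                      # 200.0 for business hours

print(weekly_benefit > weekly_compute_cost)   # True for even a single data scientist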

Photos by Kelly Sikkema and Glen Noble on Unsplash combined by Sergei Izrailev

A Simple Approach To Templated SQL Queries In Python

There are numerous situations in which one would want to insert parameters in a SQL query, and there are many ways to implement templated SQL queries in python. Without going into comparing different approaches, this post explains a simple and effective method for parameterizing SQL using JinjaSql. Besides many powerful features of Jinja2, such as conditional statements and loops, JinjaSql offers a clean and straightforward way to parameterize not only the values substituted into the where and in clauses, but also SQL statements themselves, including parameterizing table and column names and composing queries by combining whole code blocks.

Basic parameter substitution

Let’s assume we have a table transactions holding records about financial transactions. The columns in this table could be transaction_id, user_id, transaction_date, and amount. To compute the number of transactions and the total amount for a given user on a given day, a query directly to the database may look something like

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = 1234
    and transaction_date = '2019-03-02'
group by
    user_id

Here, we assume that the database will automatically convert the YYYY-MM-DD format of the string representation of the date into a proper date type.

If we want to run the query above for an arbitrary user and date, we need to parameterize the user_id and the transaction_date values. In JinjaSql, the corresponding template would simply become

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = {{ uid }}
    and transaction_date = {{ tdate }}
group by
    user_id

Here, the values were replaced by placeholders with python variable names enclosed in double curly braces {{ }}. Note that the variable names uid and tdate were picked only to demonstrate that they are variable names and don’t have anything to do with the column names themselves. A more readable version of the same template stored in a python variable is

user_transaction_template = '''
select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = {{ user_id }}
    and transaction_date = {{ transaction_date }}
group by
    user_id
'''

Next, we need to set the parameters for the query.

params = {
    'user_id': 1234,
    'transaction_date': '2019-03-02',
}

Now, generating a SQL query from this template is straightforward.

from jinjasql import JinjaSql
j = JinjaSql(param_style='pyformat')
query, bind_params = j.prepare_query(user_transaction_template, params)

If we print query and bind_params, we find that the former is a parameterized string, and the latter is an OrderedDict of parameters:

>>> print(query)
select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = %(user_id)s
    and transaction_date = %(transaction_date)s
group by
    user_id
>>> print(bind_params)
OrderedDict([('user_id', 1234), ('transaction_date', '2019-03-02')])

Running parameterized queries

Many database connections have an option to pass bind_params as an argument to the method executing the SQL query on a connection. For a data scientist, it may be natural to get results of the query in a Pandas data frame. Once we have a connection conn, it is as easy as running read_sql:

import pandas as pd
frm = pd.read_sql(query, conn, params=bind_params)

See the JinjaSql docs for other examples.

From a template to the final SQL query

It is often desired to fully expand the query with all parameters before running it. For example, logging the full query is invaluable for debugging batch processes because one can copy-paste the query from the logs directly into an interactive SQL interface. It is tempting to substitute the bind_params into the query using python built-in string substitution. However, we quickly find that string parameters need to be quoted to result in proper SQL. For example, in the template above, the date value must be enclosed in single quotes.

>>> print(query % bind_params)

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = 1234
    and transaction_date = 2019-03-02
group by
    user_id

To deal with this, we need a helper function to correctly quote parameters that are strings. We detect whether a parameter is a string by calling

from six import string_types
isinstance(value, string_types)

This works for both python 3 and 2.7. String parameters are converted to the str type, single quotes in the string are escaped with another single quote, and finally, the whole value is enclosed in single quotes.

from six import string_types

def quote_sql_string(value):
    '''
    If `value` is a string type, escapes single quotes in the string
    and returns the string enclosed in single quotes.
    '''
    if isinstance(value, string_types):
        new_value = str(value)
        new_value = new_value.replace("'", "''")
        return "'{}'".format(new_value)
    return value

Finally, to convert the template to proper SQL, we loop over bind_params, quote the strings, and then perform string substitution.

from copy import deepcopy

def get_sql_from_template(query, bind_params):
    if not bind_params:
        return query
    params = deepcopy(bind_params)
    for key, val in params.items():
        params[key] = quote_sql_string(val)
    return query % params

Now we can easily get the final query that we can log or run interactively:

>>> print(get_sql_from_template(query, bind_params))

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = 1234
    and transaction_date = '2019-03-02'
group by
    user_id

Putting it all together, another helper function wraps the JinjaSql calls and simply takes the template and a dict of parameters, and returns the full SQL.

from jinjasql import JinjaSql

def apply_sql_template(template, parameters):
    '''
    Apply a JinjaSql template (string) substituting parameters (dict) and return
    the final SQL.
    '''
    j = JinjaSql(param_style='pyformat')
    query, bind_params = j.prepare_query(template, parameters)
    return get_sql_from_template(query, bind_params)

Compute statistics on a column

Computing statistics on the values stored in a particular database column is handy both when first exploring the data and for data validation in production. Since we only want to demonstrate some features of the templates, for simplicity, let's just work with integer columns, such as the column user_id in the table transactions above. For integer columns, we are interested in the number of unique values, the min and max values, and the number of nulls. Some columns may have a default value of, say, -1 (the drawbacks of which are beyond the scope of this post); however, we do want to capture that by reporting the number of default values.

Consider the following template and function. The function takes the table name, the column name and the default value as arguments, and returns the SQL for computing the statistics.

COLUMN_STATS_TEMPLATE = '''
select
    '{{ column_name | sqlsafe }}' as column_name
    , count(*) as num_rows
    , count(distinct {{ column_name | sqlsafe }}) as num_unique
    , sum(case when {{ column_name | sqlsafe }} is null then 1 else 0 end) as num_nulls
    {% if default_value %}
    , sum(case when {{ column_name | sqlsafe }} = {{ default_value }} then 1 else 0 end) as num_default
    {% else %}
    , 0 as num_default
    {% endif %}
    , min({{ column_name | sqlsafe }}) as min_value
    , max({{ column_name | sqlsafe }}) as max_value
from
    {{ table_name | sqlsafe }}
'''


def get_column_stats_sql(table_name, column_name, default_value):
    '''
    Returns the SQL for computing column statistics.
    Passing None for the default_value results in zero output for the number
    of default values.
    '''
    # Note that a string default needs to be quoted first.
    params = {
        'table_name': table_name,
        'column_name': column_name,
        'default_value': quote_sql_string(default_value),
    }
    return apply_sql_template(COLUMN_STATS_TEMPLATE, params)

This function is straightforward and very powerful because it applies to any column in any table. Note the {% if default_value %} syntax in the template. If the default value that is passed to the function is None, the SQL returns zero in the num_default field.

The function and template above will also work with strings, dates, and other data types if the default_value is set to None. However, to handle different data types more intelligently, it is necessary to extend the function to also take the data type as an argument and build the logic specific to different data types. For example, one might want to know the min and max of the string length instead of the min and max of the value itself.

Let’s look at the output for the transactions.user_id column.

>>> print(get_column_stats_sql('transactions', 'user_id', None))

select
    'user_id' as column_name
    , count(*) as num_rows
    , count(distinct user_id) as num_unique
    , sum(case when user_id is null then 1 else 0 end) as num_nulls

    , 0 as num_default

    , min(user_id) as min_value
    , max(user_id) as max_value
from
    transactions

Note that blank lines appear in place of the {% %} clauses and could be removed.
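As a hypothetical extension of get_column_stats_sql (not from the original post), one could branch on a dtype argument and keep a separate template per data type; the string version below reuses apply_sql_template and reports lengths instead of values.

STRING_COLUMN_STATS_TEMPLATE = '''
select
    '{{ column_name | sqlsafe }}' as column_name
    , count(*) as num_rows
    , count(distinct {{ column_name | sqlsafe }}) as num_unique
    , sum(case when {{ column_name | sqlsafe }} is null then 1 else 0 end) as num_nulls
    , min(length({{ column_name | sqlsafe }})) as min_length
    , max(length({{ column_name | sqlsafe }})) as max_length
from
    {{ table_name | sqlsafe }}
'''


def get_typed_column_stats_sql(table_name, column_name, dtype, default_value=None):
    '''
    Returns column statistics SQL appropriate for the column data type.
    Only the string branch is sketched here; other types fall back to the
    integer version defined above.
    '''
    if dtype == 'string':
        params = {
            'table_name': table_name,
            'column_name': column_name,
        }
        return apply_sql_template(STRING_COLUMN_STATS_TEMPLATE, params)
    return get_column_stats_sql(table_name, column_name, default_value)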

Summary

With the helper functions above, creating and running templated SQL queries in python is very easy. Because the details of parameter substitution are hidden, one can focus on building the template and the set of parameters, and then call a single function to get the final SQL.

One important caveat is the risk of code injection. For batch processes, it should not be an issue, but using the sqlsafe construct in web applications could be dangerous. The sqlsafe keyword indicates that the user (you) is confident that no code injection is possible and takes responsibility for simply putting whatever string is passed in the parameters directly into the query.

On the other hand, the ability to put an arbitrary string in the query allows one to pass whole code blocks into a template. For example, instead of passing table_name='transactions' above, one could pass "(select * from transactions where transaction_date = '2019-03-02') t", and the query would still work.
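For instance, a hypothetical call reusing the functions defined above could compute the same statistics restricted to a single day:

day_scope = "(select * from transactions where transaction_date = '2019-03-02') t"
print(get_column_stats_sql(day_scope, 'user_id', None))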

The code in this post is licensed under the MIT License.

Photo and image by Sergei Izrailev
