Life Around Data http://www.lifearounddata.com/ On data science, engineering, humans, teams, and life in general

Advanced SQL Templates In Python with JinjaSql http://www.lifearounddata.com/advanced-sql-templates-in-python-with-jinjasql/ Sun, 12 Jan 2020 17:38:36 +0000

The post Advanced SQL Templates In Python with JinjaSql by Sergei Izrailev appeared first on Life Around Data.


In A Simple Approach To Templated SQL Queries In Python, I introduced the basics of SQL templates in Python using JinjaSql. This post further demonstrates the power of Jinja2 within JinjaSql templates using presets, loops, and custom functions. Let’s consider an everyday use case when we have a table with some dimensions and some numerical values, and we want to find some metrics for a given dimension or a combination of dimensions. The example data below is tiny and entirely made up, but it suffices for demonstrating the advanced features. First, I introduce the data set and the questions to be answered with SQL queries. I’ll then build the queries without templates, and finally, show how to use the SQL templates to parameterize and generate these queries. To generate SQL from JinjaSql templates, I’ll use (and modify) the apply_sql_template function introduced in the previous blog and available on GitHub in sql_tempates_base.py.

Example data set

Let’s consider a table transactions that contains records about purchases in some stores. A purchase can be made with cash, a credit card, or a debit card, which adds an extra dimension to the data. Here is the code for creating and populating the table.

create table transactions (
    transaction_id int,
    user_id int,
    transaction_date date,
    store_id int,
    payment_method varchar(10),
    amount float
)
;

insert into transactions
(transaction_id, user_id, transaction_date, store_id, payment_method, amount)
values
    (1, 1234, '2019-03-02', 1, 'cash', 5.25),
    (2, 1234, '2019-03-01', 1, 'credit', 10.75),
    (3, 1234, '2019-03-02', 2, 'cash', 25.50),
    (4, 1234, '2019-03-03', 2, 'credit', 17.00),
    (5, 4321, '2019-03-01', 2, 'cash', 20.00),
    (6, 4321, '2019-03-02', 2, 'debit', 30.00),
    (7, 4321, '2019-03-03', 1, 'cash', 3.00)
;

Metrics to compute

When exploring a data set, it is common to look at the main performance metrics across all dimensions. In this example, we want to compute the following metrics:

  • number of transactions
  • average transaction amount
  • the total amount of transactions

We want these metrics for each user, store, and payment method. We also want to look at these metrics by store and payment method together.

Template for a single dimension

The query to obtain the metrics for each store is:

select
    store_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    store_id
order by total_amount desc

To get the same metrics for other dimensions, we only need to change the store_id into user_id or payment_method in both the select and group by clauses. So the JinjaSql template may look like

_BASIC_STATS_TEMPLATE = '''
select
    {{ dim | sqlsafe }}
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dim | sqlsafe }}
order by total_amount desc
'''

with parameters, for example, as

params = {
    'dim': 'payment_method'
}
sql = apply_sql_template(_BASIC_STATS_TEMPLATE, params)

The above template works for a single dimension, but what if we have more than one? To generate a generic query that works with any number of dimensions, let’s create a skeleton of a function that takes a list of dimension column names as an argument and returns the SQL.

def get_basic_stats_sql(dimensions):
    '''
    Returns the sql computing the number of transactions,
    as well as the total and the average transaction amounts
    for the provided list of column names as dimensions.
    '''
    # TODO: construct params
    return apply_sql_template(_BASIC_STATS_TEMPLATE, params)

Essentially, we want to transform a list of column names, such as ['payment_method'] or ['store_id', 'payment_method'] into a single string containing the column names as a comma-separated list. Here we have some options, as it can be done either in Python or in the template.

Passing a string generated outside the template

The first option is to generate the comma-separated string before passing it to the template. We can do it simply by joining the members of the list together:

def get_basic_stats_sql(dimensions):
    '''
    Returns the sql computing the number of transactions,
    as well as the total and the average transaction amounts
    for the provided list of column names as dimensions.
    '''
    params = {
      'dim': '\n    , '.join(dimensions)
    }
    return apply_sql_template(_BASIC_STATS_TEMPLATE, params)

It so happens that the template parameter dim is in the right place, so the resulting query is

>>> print(get_basic_stats_sql(['store_id', 'payment_method']))
select
    store_id
    , payment_method
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    store_id
    , payment_method
order by total_amount desc

Now we can quickly generate SQL queries for all desired dimensions using

dimension_lists = [
    ['user_id'],
    ['store_id'],
    ['payment_method'],
    ['store_id', 'payment_method'],
]

dimension_queries = [get_basic_stats_sql(dims) for dims in dimension_lists]

Preset variables inside the template

An alternative to passing a pre-built string as a template parameter is to move the column list SQL generation into the template itself by setting a new variable at the top:

_PRESET_VAR_STATS_TEMPLATE = '''
{% set dims = '\n    , '.join(dimensions) %}
select
    {{ dims | sqlsafe }}
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dims | sqlsafe }}
order by total_amount desc
'''

This template is more readable than the previous version since all transformations happen in one place in the template, and at the same time, there’s no clutter. The function should change to

def get_stats_sql(dimensions):
    '''
    Returns the sql computing the number of transactions,
    as well as the total and the average transaction amounts
    for the provided list of column names as dimensions.
    '''
    params = {
      'dimensions': dimensions
    }
    return apply_sql_template(_PRESET_VAR_STATS_TEMPLATE, params)

Loops inside the template

We can also use loops inside the template to generate the columns.

_LOOPS_STATS_TEMPLATE = '''
select
    {{ dimensions[0] | sqlsafe }}\
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}{% endfor %}
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dimensions[0] | sqlsafe }}\
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}{% endfor %}
order by total_amount desc
'''

This example may not be the best use of the loops because a preset variable does the job just fine without the extra complexity. However, loops are a powerful construct, especially when there is additional logic inside the loop, such as conditions ({% if ... %} - {% endif %}) or nested loops.

So what is happening in the template above? The first element of the list, dimensions[0], stands alone because it doesn’t need a comma in front of the column name. We wouldn’t need that special case if the query had a defined first column; the for-loop would then simply look like

    {% for dim in dimensions %}
    , {{ dim | sqlsafe }}
    {% endfor %}

Then, the for-loop construct goes over the remaining elements dimensions[1:]. The same code appears in the group by clause, which is also not ideal and only serves the purpose of showing the loop functionality.

One may wonder why the formatting of the loop is so strange. The reason is that flow-control elements of the template, such as {% endfor %}, generate a blank line when they appear on a separate line. To avoid that, in the template above, both {% for ... %} and {% endfor %} are technically on the same line as the preceding code (hence the backslash \ after the first column name). SQL, of course, doesn’t care about whitespace, but humans who read SQL may (and should). Rather than fighting the formatting inside the template, a more readable alternative is to strip the blank lines from the generated query before printing or logging it. A useful function for that purpose is

import os
def strip_blank_lines(text):
    '''
    Removes blank lines from the text, including those containing only spaces.
    https://stackoverflow.com/questions/1140958/whats-a-quick-one-liner-to-remove-empty-lines-from-a-python-string
    '''
    return os.linesep.join([s for s in text.splitlines() if s.strip()])
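A quick check of the helper on a made-up query fragment shows that both empty lines and whitespace-only lines are removed:

```python
import os

def strip_blank_lines(text):
    '''Removes blank lines from the text, including those containing only spaces.'''
    return os.linesep.join([s for s in text.splitlines() if s.strip()])

# A query with a stray empty line and a whitespace-only line:
query = 'select\n\n    store_id\n   \nfrom\n    transactions'
print(strip_blank_lines(query))
```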

A better-formatted template then becomes

_LOOPS_STATS_TEMPLATE = '''
select
    {{ dimensions[0] | sqlsafe }}
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}
    {% endfor %}
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dimensions[0] | sqlsafe }}
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}
    {% endfor %}
order by total_amount desc
'''

And the call to print the query is

print(strip_blank_lines(get_loops_stats_sql(['store_id', 'payment_method'])))
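The function get_loops_stats_sql isn’t spelled out in the post; it follows the same pattern as get_stats_sql, passing {'dimensions': dimensions} to apply_sql_template. Here is a self-contained sketch, using plain Jinja2 with a no-op sqlsafe filter standing in for JinjaSql, purely for illustration:

```python
import os
from jinja2 import Environment

_LOOPS_STATS_TEMPLATE = '''
select
    {{ dimensions[0] | sqlsafe }}
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}
    {% endfor %}
    , count(*) as num_transactions
    , sum(amount) as total_amount
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dimensions[0] | sqlsafe }}
    {% for dim in dimensions[1:] %}
    , {{ dim | sqlsafe }}
    {% endfor %}
order by total_amount desc
'''

def strip_blank_lines(text):
    # Repeated here so the sketch runs standalone.
    return os.linesep.join([s for s in text.splitlines() if s.strip()])

def get_loops_stats_sql(dimensions):
    env = Environment()
    env.filters['sqlsafe'] = lambda value: value  # no-op stand-in for JinjaSql's filter
    return env.from_string(_LOOPS_STATS_TEMPLATE).render(dimensions=dimensions)

sql = strip_blank_lines(get_loops_stats_sql(['store_id', 'payment_method']))
print(sql)
```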

All the SQL templates so far used a list of dimensions to produce precisely the same query.

Custom dimensions with looping over a dictionary

In the loop example above, we see how to iterate over a list. It is also possible to iterate over a dictionary. This comes in handy, for example, when we want to alias or transform some or all of the columns that form dimensions. Suppose we wanted to combine the debit and credit cards as a single value and compare it to cash transactions. We can accomplish that by first creating a dictionary defining a transformation for the payment_method and keeping the store_id unchanged.

custom_dimensions = {
    'store_id': 'store_id',
    'card_or_cash': "case when payment_method = 'cash' then 'cash' else 'card' end",
}

Here, both credit and debit values are replaced with card. Then, the template may look like the following:

_CUSTOM_STATS_TEMPLATE = '''
{% set dims = '\n    , '.join(dimensions.keys()) %}
select
    sum(amount) as total_amount
    {% for dim, def in dimensions.items() %}
    , {{ def | sqlsafe }} as {{ dim | sqlsafe }}
    {% endfor %}
    , count(*) as num_transactions
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dims | sqlsafe }}
order by total_amount desc
'''

Note that I moved total_amount to be the first column just to simplify this example and avoid dealing with the first element of the loop separately. Also, note that the group by clause uses a preset variable and differs from the code in the select clause because it lists only the names of the generated columns. The resulting SQL query is

>>> print(strip_blank_lines(
...     apply_sql_template(template=_CUSTOM_STATS_TEMPLATE,
...                        parameters={'dimensions': custom_dimensions})))
select
    sum(amount) as total_amount
    , store_id as store_id
    , case when payment_method = 'cash' then 'cash' else 'card' end as card_or_cash
    , count(*) as num_transactions
    , avg(amount) as avg_amount
from
    transactions
group by
    store_id
    , card_or_cash
order by total_amount desc

Calling custom Python functions from within JinjaSql templates

What if we want to use a Python function to generate a portion of the code? Jinja2 allows one to register custom functions and functions from other packages for use within the SQL templates. Let’s start with defining a function that generates the string that we insert into the SQL for transforming custom dimensions.

def transform_dimensions(dimensions: dict) -> str:
    '''
    Generate SQL for aliasing or transforming the dimension columns.
    '''
    return '\n    , '.join([
        '{val} as {key}'.format(val=val, key=key)
        for key, val in dimensions.items()
    ])

The output of this function is what we expect to appear in the select clause:

>>> print(transform_dimensions(custom_dimensions))
store_id as store_id
    , case when payment_method = 'cash' then 'cash' else 'card' end as card_or_cash

Now we need to register this function with Jinja2. To do that, we’ll modify the apply_sql_template function from the previous blog as follows.

from jinjasql import JinjaSql

def apply_sql_template(template, parameters, func_list=None):
    '''
    Apply a JinjaSql template (string) substituting parameters (dict) and return
    the final SQL. Use the func_list to pass any functions called from the template.
    '''
    j = JinjaSql(param_style='pyformat')
    if func_list:
        for func in func_list:
            j.env.globals[func.__name__] = func
    query, bind_params = j.prepare_query(template, parameters)
    return get_sql_from_template(query, bind_params)

This version has an additional optional argument func_list that needs to be a list of functions.
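For completeness, get_sql_from_template comes from the previous post’s sql_tempates_base.py; a minimal sketch of what it does (inlining bind parameters into the pyformat-style query, quoting string values) might look like the following. The exact implementation in the repository may differ, so treat this as an assumption:

```python
from copy import deepcopy

def quote_sql_string(value):
    '''
    If value is a string, escape single quotes and wrap the result in quotes
    so it can be inlined into SQL; other types are returned unchanged.
    '''
    if isinstance(value, str):
        return "'{}'".format(value.replace("'", "''"))
    return value

def get_sql_from_template(query, bind_params):
    '''
    Substitute bind parameters into the query produced by
    JinjaSql.prepare_query (pyformat style, e.g. %(name)s placeholders).
    '''
    if not bind_params:
        return query
    params = deepcopy(bind_params)
    for key, val in params.items():
        params[key] = quote_sql_string(val)
    return query % params

print(get_sql_from_template("select * from users where name = %(name)s",
                            {'name': "O'Brien"}))
```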

Let’s change the template to take advantage of the transform_dimensions function.

_FUNCTION_STATS_TEMPLATE = '''
{% set dims = '\n    , '.join(dimensions.keys()) %}
select
    {{ transform_dimensions(dimensions) | sqlsafe }}
    , sum(amount) as total_amount
    , count(*) as num_transactions
    , avg(amount) as avg_amount
from
    transactions
group by
    {{ dims | sqlsafe }}
order by total_amount desc
'''

Now we also don’t need to worry about the first column not having a comma. The following call produces a SQL query similar to that in the previous section.

>>> print(strip_blank_lines(
...     apply_sql_template(template=_FUNCTION_STATS_TEMPLATE,
...                        parameters={'dimensions': custom_dimensions},
...                        func_list=[transform_dimensions])))

select
    store_id as store_id
    , case when payment_method = 'cash' then 'cash' else 'card' end as card_or_cash
    , sum(amount) as total_amount
    , count(*) as num_transactions
    , avg(amount) as avg_amount
from
    transactions
group by
    store_id
    , card_or_cash
order by total_amount desc

Note how we pass transform_dimensions to apply_sql_template as a list [transform_dimensions]. Multiple functions can be passed into SQL templates as a list of functions, for example, [func1, func2].

Conclusion

This tutorial is a follow-up to my first post on the basic use of JinjaSql. It demonstrates the use of preset variables, loops over lists and dictionaries, and custom Python functions within JinjaSql templates for advanced SQL code generation. In particular, adding custom function registration to the apply_sql_template function makes templating much more powerful and versatile. Parameterized SQL queries continue to be indispensable for automated report generation and for reducing the amount of SQL code that needs to be maintained. An added benefit is that with reusable SQL code snippets, it becomes easier to apply standard Python unit-testing techniques to verify that the generated SQL is correct.

The code in this post is licensed under the MIT License.

Photo by Sergei Izrailev

Left Join with Pandas Data Frames in Python http://www.lifearounddata.com/left-join-with-pandas-data-frames-in-python/ Wed, 21 Aug 2019 14:11:48 +0000

The post Left Join with Pandas Data Frames in Python by Sergei Izrailev appeared first on Life Around Data.


Merging Pandas data frames is covered extensively in the StackOverflow article Pandas Merging 101. However, my experience grading data science take-home tests leads me to believe that left joins remain a challenge for many people. In this post, I show how to properly handle cases when the right table (data frame) in a Pandas left join contains nulls.

Let’s consider a scenario where we have a table transactions containing transactions performed by some users and a table users containing some user properties, for example, their favorite color. We want to annotate the transactions with the users’ properties. Here are the data frames:

import numpy as np
import pandas as pd

np.random.seed(0)
# transactions
left = pd.DataFrame({'transaction_id': ['A', 'B', 'C', 'D'],
                     'user_id': ['Peter', 'John', 'John', 'Anna'],
                     'value': np.random.randn(4),
                    })

# users
right = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna'],
                      'favorite_color': ['blue', 'blue', 'red', np.nan],
                     })

Note that Peter is not in the users table and Anna doesn’t have a favorite color.

>>> left
  transaction_id user_id     value
0              A   Peter  1.867558
1              B    John -0.977278
2              C    John  0.950088
3              D    Anna -0.151357

>>> right
  user_id favorite_color
0    Paul           blue
1    Mary           blue
2    John            red
3    Anna            NaN

Adding the user’s favorite color to the transaction table seems straightforward using a left join on the user id:

>>> left.merge(right, on='user_id', how='left')
  transaction_id user_id     value favorite_color
0              A   Peter  1.867558            NaN
1              B    John -0.977278            red
2              C    John  0.950088            red
3              D    Anna -0.151357            NaN

We see that Peter and Anna have NaNs in the favorite_color column. However, the missing values are there for two different reasons: Peter’s record didn’t have a match in the users table, while Anna didn’t have a value for the favorite color. In some cases, this subtle difference is important. For example, it can be critical to understanding the data during initial exploration and to improving data quality.

Here are two simple methods to track the differences in why a value is missing in the result of a left join. The first is provided directly by the merge function through the indicator parameter. When set to True, the resulting data frame has an additional column _merge:

>>> left.merge(right, on='user_id', how='left', indicator=True)
  transaction_id user_id     value favorite_color     _merge
0              A   Peter  1.867558            NaN  left_only
1              B    John -0.977278            red       both
2              C    John  0.950088            red       both
3              D    Anna -0.151357            NaN       both

The second method mirrors how this would be done in the SQL world and explicitly adds a column representing the user_id from the right table. Note that if the join columns in the two tables have different names, both columns appear in the resulting data frame, so we rename the user_id column in the users table before merging.

>>> left.merge(right.rename({'user_id': 'user_id_r'}, axis=1),
               left_on='user_id', right_on='user_id_r', how='left')
  transaction_id user_id     value user_id_r favorite_color
0              A   Peter  1.867558       NaN            NaN
1              B    John -0.977278      John            red
2              C    John  0.950088      John            red
3              D    Anna -0.151357      Anna            NaN

An equivalent SQL query is

select
    t.transaction_id
    , t.user_id
    , t.value
    , u.user_id as user_id_r
    , u.favorite_color
from
    transactions t
    left join
    users u
    on t.user_id = u.user_id
;

In conclusion, adding an extra column that indicates whether there was a match in the Pandas left join allows us to subsequently treat the missing favorite-color values differently, depending on whether the user was known but had no favorite color or the user was missing from the users table entirely.
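For example, the indicator-based approach can be used to fill the two kinds of missing values differently. The fill values 'unknown user' and 'no preference' below are hypothetical placeholders, and the frames are trimmed-down versions of the ones above:

```python
import numpy as np
import pandas as pd

# transactions
left = pd.DataFrame({'transaction_id': ['A', 'B', 'C', 'D'],
                     'user_id': ['Peter', 'John', 'John', 'Anna']})
# users
right = pd.DataFrame({'user_id': ['Paul', 'Mary', 'John', 'Anna'],
                      'favorite_color': ['blue', 'blue', 'red', np.nan]})

merged = left.merge(right, on='user_id', how='left', indicator=True)

# Distinguish "user unknown" from "user known, but color missing".
unknown_user = merged['_merge'] == 'left_only'
missing_color = (merged['_merge'] == 'both') & merged['favorite_color'].isna()

merged.loc[unknown_user, 'favorite_color'] = 'unknown user'
merged.loc[missing_color, 'favorite_color'] = 'no preference'
print(merged)
```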

Photo by Ilona Froehlich on Unsplash.

Summary of “How We Learn” http://www.lifearounddata.com/summary-of-how-we-learn/ Mon, 13 May 2019 00:18:39 +0000

The post Summary of “How We Learn” by Sergei Izrailev appeared first on Life Around Data.


“How We Learn” by Benedict Carey covers a vast body of research on learning science and challenges some commonly accepted learning practices. In this series of posts, I summarize my interpretation of the practical advice presented in the book. I am looking at the contents from several perspectives: are there better ways for me to learn, are there better ways to teach, and how can I help my kids study more efficiently?

The book is entertaining, informative, and provides specific recipes for making learning more effective that one can easily incorporate into everyday life. A common thread throughout the book is about the positive effects of alternating different aspects of learning: physical places, noise level, context, types of problems, sleep, distractions, focus periods, etc. A condensed list of findings includes the following (however, read more to understand the context!):

  • Studying in the same quiet place at the same time of day is overrated. Varying the setting (where and how we study, what sounds we hear while we study, etc.) can help retention.
  • Spacing the study time instead of cramming it all at once before the test gives a “free” boost to results.
  • Testing is learning, so self-examination and flash cards are good, as long as we don’t cheat by looking at the answers.
  • Fluency illusion is when, looking at the material, we think we know it. Self-examination can bring the gaps to light.
  • Teaching someone else what you just learned is in itself a learning technique.
  • Distraction is not all that bad when we are stuck. Short and long-term distractions work slightly differently, and both can get us “unstuck.”
  • Interrupting an activity keeps it top-of-mind and attunes our brains to relevant information, which can be used on purpose to keep thinking about the problem in the background.
  • Start on complicated long-term projects as soon as possible: it prevents the project from growing on you, and because you keep thinking about it, it may not be as difficult as it seems at first.
  • Mixing different skills during practice enhances the benefit of repetition by making what we learn applicable to a previously unseen or a real-life situation. This approach works for sports, music, math, and almost any other subject.
  • Sleep improves learning by consolidating and processing what we learned, and it’s good to have at least a 1-1.5 hour nap before the test.
  • Perceptual learning, which is when we learn automatically without thinking by discerning patterns among closely related subjects (paintings, equations, etc.), is a subject of its own and has exciting implications for creators of educational apps and video games.

In the three follow-up posts, I summarize what’s behind these findings, and the book itself has much, much more.

Photo by Green Chameleon on Unsplash.

Summary of “How We Learn” – Part 2: The Basics http://www.lifearounddata.com/summary-of-how-we-learn-part-2-the-basics/ Mon, 13 May 2019 00:13:11 +0000

The post Summary of “How We Learn” – Part 2: The Basics by Sergei Izrailev appeared first on Life Around Data.


This is Part 2 of 4 of my summary of “How We Learn” by Benedict Carey. Other parts: Part 1 – Summary, Part 3 – Distraction, and Part 4 – Interleaving and Perceptual Learning.

Memory strengths: storage and retrieval

The theory of learning states, among other things, that memory has two strengths: storage and retrieval. Once we store something in memory, it is there forever (more or less); the size of the storage in the brain is more than enough to save every second of one’s life. The storage strength “builds up steadily with studying, and more sharply with use,” and it can only increase. On the other hand, much of learning is about retrieval strength: how easily we can remember what we learned or experienced previously. The retrieval strength also increases with studying and with use; however, it drops off quickly without reinforcement. Also, the “capacity [of retrieval] is relatively small, compared to storage. At any given time, we can pull up only a limited number of items in connection with any given cue or reminder.” Improvements to the learning process are then aimed at increasing the retrieval strength, especially long-term and as applied to situations and problems that we didn’t study directly (generalization).

The power of forgetting

Think of learning as building a muscle: forgetting is important because “the harder we have to work to retrieve a memory, the greater the subsequent spike in retrieval and storage strength (learning).” The practical use cases of this finding include (more details to follow):

  • Spacing of study sessions: let yourself forget some of what you studied to make it harder (but still possible) to retrieve by breaking your study time in multiple sessions.
  • Self-testing and testing in general – unaided retrieval of what you’ve already learned is in itself a learning process. In my opinion, this speaks in favor of more tests in school, but only if they are low-pressure, i.e., not carrying the weight of future admissions to the next level of education.

The effect of context on learning

Studies show that changing the context of where and when we learn helps retrieval. For example, contextual cues, such as the place where we study, the background music, the light and color of the environment, weave themselves into the learning and when present again help trigger the memory of what we learned. “We can easily multiply the number of perceptions connected to a given memory — most simply, by varying where we study.”

Breaking up the study time

Spacing out study sessions has been shown to improve retention of memorized material. “If the test is in a week, and you want to split your study time in two, then do a session today and tomorrow, or today and the day after tomorrow. If you want to add a third, study the day before the test (just under a week later).” The book provides a specific recipe for varying the time intervals when preparing for a test. This technique works well for memorizing facts such as foreign-language vocabulary, names, places, etc.

Testing is learning

Many of us have used the technique where someone asks us questions about the material we expect to be on the test, and we try to answer from memory. It turns out that this process not only checks our knowledge: by recalling the answers without looking at the notes or textbook, we are also improving our retrieval strength for the material. Such testing can be a test given by a teacher or a self-examination, for example, trying to recite a poem by heart. An essential aspect of this process is providing the correct answer immediately or shortly after the test. Many online corporate training programs use this technique (on purpose or not) by providing short quizzes and highlighting the correct answers afterward.

Another interesting concept is pre-testing, i.e., testing for something you haven’t learned yet — the mere process of guessing the right answer wires our brains to the material taught after the pre-testing. As a result, we have much better recall during the tests performed after the learning is complete.

To summarize, “testing does not = studying, after all. In fact, testing > studying, and by a country mile, on delayed tests.”

Teaching someone is an effective learning technique

There is a saying to the effect of “you haven’t learned the subject until you can teach it to others.” Trying to explain what you’ve just learned to someone triggers better learning and reveals the gaps in your understanding. A long time ago, I read a book about a woman who made a living teaching foreign languages in Poland after World War II. She only knew Polish when she started. She then achieved fluency in seven languages by studying them herself, lesson by lesson, from books, and then teaching the same lessons to her students, always staying just a couple of lessons ahead of them in her own learning. In my own experience, explaining a concept to someone brings clarity that is rarely achieved by just reading about it. By the way, writing this article also has this effect!

Fluency illusion

The fluency illusion is “the belief that because facts or formulas or arguments are easy to remember right now, they’ll remain that way tomorrow or the next day.” When we look at notes, highlighted sections, or outlines, we feel that we know the material. However, when the notes are not available, we can’t remember anything. The illusion is so strong that it is easy to convince yourself that you are ready for the test if you can recall something you just read. This illusion is a consequence of the fact that “the easier it is to call a fact to mind, the smaller the effect on learning.” The self-examination mentioned above is an ample antidote to the fluency illusion.

The role of sleep

Sleep has a consolidating effect on learning. Experiments showed that people who had a 1-1.5 hour nap or slept overnight between studying and testing performed better on the tests. The book explains the five phases of sleep and makes an interesting point that a full-cycle nap of 1-1.5 hours was shown to affect learning similarly to a full 8-hour overnight sleep. I conjecture that for those pulling all-nighters before an exam, even a couple of hours of sleep can improve the results. “Unconscious downtime clarifies memory and sharpens skills — that it’s a necessary step to lock in both. In a fundamental sense, that is, sleep is learning.”

Next: Summary of “How We Learn” – Part 3: Distraction

All quotes above are from the book “How We Learn”.

Photo by Alexis Brown on Unsplash.

Summary of “How We Learn” – Part 3: Distraction http://www.lifearounddata.com/summary-of-how-we-learn-part-3-distraction/ Mon, 13 May 2019 00:10:10 +0000

The post Summary of “How We Learn” – Part 3: Distraction by Sergei Izrailev appeared first on Life Around Data.

]]>

This is Part 3 of 4 of my summary of “How We Learn” by Benedict Carey. Other parts:

Distraction

Have you ever felt a solution to a problem spontaneously come to you after you’ve been distracted from focusing on the problem? Any great ideas in the shower? It is a common phenomenon, which is the primary subject of another book, “The Net and the Butterfly” by Olivia Fox Cabane and Judah Pollack. They argue that there are two networks in our brain: the top-of-the-mind “executive” network and a default “background” network. The book “How We Learn” breaks the interruptions that allow the background network to kick in into two kinds: short-term (incubation) and long-term (percolation). Incubation is a distraction lasting a number of minutes, such as playing a video game, spacing out, or reading a book, whereas percolation is a longer-term repeated interruption spanning days, weeks, and months.

Short-term distraction: Incubation

Incubation refers to distraction periods of 5-20 minutes that work best for problems that have a single solution that is not readily apparent. The process involving incubation consists of three stages. The first is preparation that can last hours or days (or longer) when we struggle to solve a problem at hand. The second is incubation that starts when we temporarily abandon the problem. It is essential that at this point we’ve reached an impasse and got stuck rather than just experienced a bump. “Knock off and play a videogame too soon and you get nothing.” The third stage is illumination, or the “aha” moment. The final stage is verifying that the idea that came during illumination works.

“The Net and the Butterfly” tells us that the “aha” moments are fleeting and that we need to write down the ideas and insights before they fade away. Also, several practical techniques are useful to increase the frequency of the “aha” moments by understanding how and when they occur.

There are “two mental operations that aid incubation: picking up clues from the environment, and breaking fixed assumptions.” An example of a fixed assumption is the “puzzle” of “A doctor in Boston has a brother who is a doctor in Chicago, but the doctor in Chicago doesn’t have a brother at all.” The fixed assumption is that a doctor must be a male, and the answer is, of course, that the doctor in Boston is a woman. Numerous puzzles and experiments investigate the fixed assumptions. Incubation helps break the fixed assumptions and therefore solve problems from a different angle.

Long-term distraction: Percolation

Percolation is a long-term cumulative process that is distinct from the short-term incubation. “Percolation is for building something that was not there before, whether it’s a term paper, a robot, an orchestral piece, or some other labyrinthine project.” In my mind, percolation is a process of thinking about a project or a problem on and off, keeping it in mind the whole time. The “off” time is when percolation happens, often, subconsciously. Again, the trick is to catch the moment and write down the ideas coming from percolation before they disappear.

“Quitting before I’m ahead doesn’t put the project to sleep; it keeps it awake. That’s Phase 1, and it initiates Phase 2, the period of … casual data collecting. Phase 3 is listening to what I think about all those incoming bits and pieces. Percolation depends on all three elements, and in that order.”

Interruptions

Part of the discussion of percolation in the book deals with interruptions. Studies show that an activity that was interrupted, especially in the worst possible moment, remains top-of-mind for some time because we tend to think of unfinished tasks as goals. This finding leads to two distinct use cases, one of which is described in “How We Learn.” Deliberate self-interruption causes the brain to keep being attuned to information that may be relevant to the problem at hand, and make connections with already stored information, while in the “background” mode.

A different use case is described in the book “Deep Work” by Cal Newport and has to do with purging unfinished tasks from your mind at the end of the workday so that you can relax at night and not think about work.

Next: Summary of “How We Learn” – Part 4: Interleaving and Perceptual Learning

All quotes above are from the book “How We Learn“.

Photo by JESHOOTS.COM on Unsplash.

The post Summary of “How We Learn” – Part 3: Distraction by Sergei Izrailev appeared first on Life Around Data.

]]>
Summary of “How We Learn” – Part 4: Interleaving and Perceptual Learning http://www.lifearounddata.com/summary-of-how-we-learn-4-interleaving-and-perceptual-learning/ Mon, 13 May 2019 00:05:32 +0000 http://www.lifearounddata.com/?p=159 This is Part 4 of 4 of my summary of “How We Learn” by Benedict Carey. Other parts: Part 1 – Summary Part 2 – The Basics Part 3 – Distraction Interleaving We’ve all heard the advice of using repetition when practicing a skill. Without a doubt, the repetition of

The post Summary of “How We Learn” – Part 4: Interleaving and Perceptual Learning by Sergei Izrailev appeared first on Life Around Data.

]]>

This is Part 4 of 4 of my summary of “How We Learn” by Benedict Carey. Other parts:

Interleaving

We’ve all heard the advice of using repetition when practicing a skill. Without a doubt, the repetition of a single skill works. As it turns out, practicing mixed skills works much better. “Varied practice produces a slower apparent rate of improvement in each single practice session but a greater accumulation of skill and learning over time.” For example, badminton players who practiced three different types of serves in random order did better in a slightly different setting (serving to the other side of the court) than players who practiced the same serves in blocks, one type of serve per training session. Another example comes from learning about art. “Counterintuitive as it may be to art history teachers … interleaving paintings by different artists was more effective than massing all of an artist’s paintings together.”

The discussion of mixed practice also touches on a phenomenon that I also observed in the past, when “kids who do great on unit tests — the weekly, or biweekly reviews — often do terribly on cumulative exams on the same material.” The same happens in sports when someone who performs very well during practice seems to lose it during an actual game. The reason is thought to be the inability to choose a strategy for solving a problem on a test. In unit tests, we typically practice a single approach that we just learned. On a cumulative test (and in real life!), one needs first to decide which strategy is appropriate, and then apply it. Interleaving different types of problems during learning helps the skills to be more applicable under varying conditions.

Interleaving increases our ability to generalize and apply learnings in different situations. “The science suggests that interleaving is, essentially, about preparing the brain for the unexpected.” In practical terms, the advice is to mix learning new material with a dose of “stuff you already know but haven’t revisited in a while.”

Perceptual learning

What is a “good eye”? How can a chess grandmaster understand the position on the board in a few seconds, a professional baseball player decide to hit the ball, and an experienced airplane pilot quickly make sense of the navigation panel with so many dials? Experience is critical here, but apparently, there is a type of learning that happens automatically, without thinking, and can help us develop a “good eye” for specific situations. Moreover, we can do it “cheap, quick and dirty.”

Perceptual learning happens automatically, i.e., without our conscious participation, when we are repeatedly exposed to whatever we want to learn to distinguish from one another – painting styles, airplane control readings, similar squiggles, sounds, pictures of birds – and given correct answers. “We have to pay attention, of course, but we don’t need to turn it on or tune it in.” During this process, “the brain doesn’t solely learn to perceive by picking up on tiny differences in what it sees, hears, smells, or feels. … it also perceives to learn. It takes the differences it has detected between similar-looking notes or letters or figures, and uses those to help decipher new, previously unseen material.”

An example of a practical application of perceptual learning is a “perceptual learning module” (PLM) – a computer program that trains pilots to read the airplane instrument panels. It displays the instrument panels with a choice of 7 possible answers describing the state of the plane, such as “Straight & Level” or “Descending Turn.” When the trainee gives a wrong answer, it flashes and provides the right one. Initially, novices are merely guessing. However, “after one hour [of training] they could read the panels as well as pilots with an average of one thousand flying hours.” Note that “it’s a supplement to experience, not a substitute.” The pilots still need to fly the plane.

In another example, the author describes a training system that he devised for himself to learn 12 painting styles by using a PLM loaded with images of 120 paintings, 10 per style, with interleaving of different styles. After one hour of practicing, he was able to identify the styles of previously unseen paintings with an 80% accuracy.

The implications of the success of PLMs are enormous. One can imagine a whole plethora of learning apps that build the “good eye” from learning Chinese characters and math equations to radiology and chemistry. It follows that in situations when an experienced professional would say that something is not right and then investigate, we can train ourselves to recognize such circumstances, given labeled data, without having years and years of experience.

Being a data scientist, I find perceptual learning to be fascinating since on the surface the process resembles so much how AI algorithms “learn” patterns in the data by “looking” at examples with correct labels. In deep neural networks, just like in our brains, we don’t necessarily understand exactly how the connections are made and why the system can recognize pictures of cats or human faces.

Conclusion

In this series of posts, I attempted to summarize the book’s findings and suggestions that I found interesting. There’s a lot more in the book itself, so I encourage you to read the original.

All quotes above are from the book “How We Learn“.

Photos by Tadas Mikuckis and Arie Wubben on Unsplash.

The post Summary of “How We Learn” – Part 4: Interleaving and Perceptual Learning by Sergei Izrailev appeared first on Life Around Data.

]]>
Faster Data Science: From Big To Small Data http://www.lifearounddata.com/faster-data-science/ Sun, 31 Mar 2019 05:29:32 +0000 http://www.lifearounddata.com/?p=129 Business leaders often ask how to accelerate data science projects. It is well established that data scientists spend as much as 80% of their time on data wrangling. Reducing this time leads to faster data science project turnaround and allows data scientists to spend a larger fraction of their time

The post Faster Data Science: From Big To Small Data by Sergei Izrailev appeared first on Life Around Data.

]]>

Business leaders often ask how to accelerate data science projects. It is well established that data scientists spend as much as 80% of their time on data wrangling. Reducing this time leads to faster data science project turnaround and allows data scientists to spend a larger fraction of their time on high-value activities that cannot be performed by others. Economic and productivity benefits grow quickly with the size of the data science team. This post makes a case for a moderate investment in data engineering that drastically reduces the time spent on interactive data exploration by satisfying the need for smaller representative data sets.

Reduce the turnaround time of analytics queries

The key to reducing the time spent in data exploration and code development is minimizing the time it takes to get an answer to basic questions about the data. The dynamics of the development process change drastically when the typical run time of an analytics query goes from several hours to an hour, to 10 minutes, to under a minute, to under 10 seconds. Once the query response time gets into the range of a few seconds, one can and does ask many questions that test different hypotheses, ultimately arriving at a better result in a significantly shorter time.

Big Data is too big

During research projects as well as the development of data pipelines, data scientists necessarily run their queries multiple times, refining and improving them to get the data they need in the format they want. However, working with Big Data is time-consuming because processing a large data set can take a long time whether you are using Hadoop or a modern distributed data warehouse such as Snowflake, Redshift, or Vertica. This problem is often exacerbated by a pattern of using shared computational resources where interactive queries compete with larger batch processes.

Data scientists, frustrated by how long it takes to move their research forward, invent shortcuts that avoid long wait times and often sacrifice the accuracy or applicability of the results of their analyses. For example, working with a single hour of log data makes the query time bearable so that one can make progress. However, the results cannot be used to draw conclusions about the whole data set.

Data engineers can alleviate the pain of data wrangling by employing several simple approaches that reduce the query time with only a moderate amount of work and little extra cost. These approaches, reviewed in detail below, aim at reducing the size and the scope of the data to a subset that is relevant to the questions asked.

Use columnar data stores

The way the data is stored makes a massive difference to how accessible it is for analysis. Columnar data formats, such as Parquet, are widely used and provide many benefits. In addition to limiting the data scans to just the columns that are present in the query and allowing better compression, these formats store the data in a binary format rather than text. Even in a highly efficient CSV reader, parsing the text into binary data types can consume about 80% of the time needed to load the data into computer memory. In addition, parsing text fields correctly can itself be a challenge. Pre-processing once and storing the data in a binary format avoids the additional computation every time the data is accessed. Data warehouses typically provide binary columnar data storage out-of-the-box.

Create always-on sampled data sets

Many data questions can be answered by using a representative sample of the data. Sampled data sets are much better than a small time slice of data because they are representative of the whole data set and are sufficient for answering a variety of questions. Even though extracting and updating a sample is tedious, every data scientist sooner or later is enticed to do it merely to shorten the turnaround time of their queries. A much more efficient solution that doesn’t cost much is providing an automatically generated sample of the data at regular intervals. Generation of sampled data should ideally be tied to the pipeline that produces the full data set.

In a typical situation when the data represents some events (e.g., ad impressions, purchase transactions, site visits, etc.) related to some entities (e.g., online users, companies, etc.), it is beneficial to create data sets with different types of sampling. An obvious one is sampling the events, where a given percentage of events is randomly chosen to be included in the data set. Another sampling strategy, potentially more valuable and harder for a data scientist or analyst to implement, is covering all records related to a sample of entities. For example, we may want to include all purchases for 5% of our site users. Such a sample allows one to perform a user-based analysis efficiently. Sampling strategies, including adaptive sampling, are a topic for another post.
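To make the entity-level strategy concrete, here is a minimal sketch in plain Python. The function name, the tiny made-up transaction list, and the field names are illustrative only; in practice this logic would live in the data pipeline, not in application code.

```python
import random

def sample_by_entity(rows, entity_key, fraction, seed=0):
    """Entity-level sampling: keep ALL rows belonging to a random
    fraction of entities, so that per-entity analyses (e.g., per-user
    purchase histories) remain valid on the sample."""
    rng = random.Random(seed)
    # Collect the distinct entities and pick a random subset of them.
    entities = sorted({row[entity_key] for row in rows})
    n_keep = max(1, int(len(entities) * fraction))
    keep = set(rng.sample(entities, n_keep))
    # Keep every row that belongs to a selected entity.
    return [row for row in rows if row[entity_key] in keep]

# Tiny made-up transaction log: 4 users with several purchases each.
transactions = [
    {'user_id': u, 'amount': a}
    for u, a in [(1, 10), (1, 20), (2, 5), (3, 7), (3, 9), (4, 3)]
]
sample = sample_by_entity(transactions, 'user_id', 0.5)
```

Unlike event-level sampling, every user that makes it into `sample` appears with their complete set of transactions.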

Extract smaller sets of relevant data

When events of interest are rare, sampling may not be an option. For example, in the digital advertising setting, one may be interested in extracting all available data for a specific advertising campaign within a limited time frame. Such a data set, while complete with every necessary field, is a small subset of all of the data. Analysts and data scientists interacting with this data are likely to issue hundreds of queries while they work on a project. The process of extracting such a data set and keeping it up-to-date can be automated if the data engineering team builds tools that allow data scientists to create a query for the initial extract, with another query responsible for periodic updates. After the project is complete, the data is no longer needed and can be archived or deleted.
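A minimal sketch of such tooling follows: one function that produces the pair of queries, one for the initial extract and one for periodic updates. All names here (the function, the `campaign_id` and `event_time` columns, the table names) are hypothetical illustrations, not part of any real system described in the post.

```python
def campaign_extract_queries(source_table, dest_table, campaign_id):
    """Return (initial extract, periodic update) SQL for a
    campaign-specific data set. All names are illustrative."""
    where = "campaign_id = {0}".format(campaign_id)
    # One-time extract creating the smaller, dedicated table.
    initial = (
        "create table {0} as\n"
        "select * from {1} where {2}"
    ).format(dest_table, source_table, where)
    # Periodic update appending only events newer than what is stored.
    update = (
        "insert into {0}\n"
        "select * from {1} where {2}\n"
        "  and event_time > (select max(event_time) from {0})"
    ).format(dest_table, source_table, where)
    return initial, update

initial_sql, update_sql = campaign_extract_queries(
    'impressions', 'campaign_123_extract', 123)
```

A data scientist would supply only the filtering condition; the tooling would schedule the update query and archive the table when the project ends.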

Put smaller data sets in efficient SQL stores

SQL is undoubtedly the most common language of data analytics. Thus, making data sets available for querying using SQL expands the number of people who can interact with the data. Democratization of the data aside, making smaller data sets available in efficient analytics SQL query engines further reduces the query time and, therefore, the wasted time for data scientists and analysts. Such query engines range, depending on the size of the data and the requirements to infrastructure, from MySQL and PostgreSQL to Snowflake and Redshift to fully managed services such as Amazon’s Athena and Google’s BigQuery.

Provide dedicated computational resources

One of the biggest frustrations in data exploration is having to wait in a shared queue for results of a query that eventually runs in under a minute. Separate high-priority resource pools for interactive queries on a shared cluster, possibly limited to business hours, go a long way in improving the overall query turnaround time. In addition, providing data science and analytics teams with dedicated computational resources that have sufficient CPU and memory capacity allows loading of the smaller data sets into a tool of their choice to perform development and in-depth analyses.

The investment is worth it!

Human time is an order of magnitude more expensive than computer time and data storage costs. One can observe this easily by comparing an effective hourly rate of a data scientist to the hourly cost of a powerful computer. Also, human time creates net new value for the company. Thus, reducing the typical analytics query completion time to under a minute is likely the most impactful investment in technology available to accelerate data science and analytics projects.

Photos by Kelly Sikkema and Glen Noble on Unsplash combined by Sergei Izrailev

The post Faster Data Science: From Big To Small Data by Sergei Izrailev appeared first on Life Around Data.

]]>
A Simple Approach To Templated SQL Queries In Python http://www.lifearounddata.com/templated-sql-queries-in-python/ http://www.lifearounddata.com/templated-sql-queries-in-python/#respond Fri, 08 Mar 2019 04:15:47 +0000 http://www.lifearounddata.com/?p=72 There are numerous situations in which one would want to insert parameters in a SQL query, and there are many ways to implement templated SQL queries in python. Without going into comparing different approaches, this post explains a simple and effective method for parameterizing SQL using JinjaSql. Besides many powerful

The post A Simple Approach To Templated SQL Queries In Python by Sergei Izrailev appeared first on Life Around Data.

]]>

There are numerous situations in which one would want to insert parameters in a SQL query, and there are many ways to implement templated SQL queries in python. Without going into comparing different approaches, this post explains a simple and effective method for parameterizing SQL using JinjaSql. Besides many powerful features of Jinja2, such as conditional statements and loops, JinjaSql offers a clean and straightforward way to parameterize not only the values substituted into the where and in clauses, but also SQL statements themselves, including parameterizing table and column names and composing queries by combining whole code blocks.

Basic parameter substitution

Let’s assume we have a table transactions holding records about financial transactions. The columns in this table could be transaction_id, user_id, transaction_date, and amount. To compute the number of transactions and the total amount for a given user on a given day, a query directly to the database may look something like

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = 1234
    and transaction_date = '2019-03-02'
group by
    user_id

Here, we assume that the database will automatically convert the YYYY-MM-DD format of the string representation of the date into a proper date type.

If we want to run the query above for an arbitrary user and date, we need to parameterize the user_id and the transaction_date values. In JinjaSql, the corresponding template would simply become

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = {{ uid }}
    and transaction_date = {{ tdate }}
group by
    user_id

Here, the values were replaced by placeholders with python variable names enclosed in double curly braces {{ }}. Note that the variable names uid and tdate were picked only to demonstrate that they are variable names and don’t have anything to do with the column names themselves. A more readable version of the same template stored in a python variable is

user_transaction_template = '''
select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = {{ user_id }}
    and transaction_date = {{ transaction_date }}
group by
    user_id
'''

Next, we need to set the parameters for the query.

params = {
    'user_id': 1234,
    'transaction_date': '2019-03-02',
}

Now, generating a SQL query from this template is straightforward.

from jinjasql import JinjaSql
j = JinjaSql(param_style='pyformat')
query, bind_params = j.prepare_query(user_transaction_template, params)

If we print query and bind_params, we find that the former is a parameterized string, and the latter is an OrderedDict of parameters:

>>> print(query)
select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = %(user_id)s
    and transaction_date = %(transaction_date)s
group by
    user_id
>>> print(bind_params)
OrderedDict([('user_id', 1234), ('transaction_date', '2019-03-02')])

Running parameterized queries

Many database connections have an option to pass bind_params as an argument to the method executing the SQL query on a connection. For a data scientist, it may be natural to get results of the query in a Pandas data frame. Once we have a connection conn, it is as easy as running read_sql:

import pandas as pd
frm = pd.read_sql(query, conn, params=bind_params)

See the JinjaSql docs for other examples.

From a template to the final SQL query

It is often desired to fully expand the query with all parameters before running it. For example, logging the full query is invaluable for debugging batch processes because one can copy-paste the query from the logs directly into an interactive SQL interface. It is tempting to substitute the bind_params into the query using python built-in string substitution. However, we quickly find that string parameters need to be quoted to result in proper SQL. For example, in the template above, the date value must be enclosed in single quotes.

>>> print(query % bind_params)

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = 1234
    and transaction_date = 2019-03-02
group by
    user_id

To deal with this, we need a helper function to correctly quote parameters that are strings. We detect whether a parameter is a string by calling

from six import string_types
isinstance(value, string_types)

This works for both python 3 and 2.7. The string parameters are converted to the str type, single quotes in the value are escaped by another single quote, and finally, the whole value is enclosed in single quotes.

from six import string_types

def quote_sql_string(value):
    '''
    If `value` is a string type, escapes single quotes in the string
    and returns the string enclosed in single quotes.
    '''
    if isinstance(value, string_types):
        new_value = str(value)
        new_value = new_value.replace("'", "''")
        return "'{}'".format(new_value)
    return value

Finally, to convert the template to proper SQL, we loop over bind_params, quote the strings, and then perform string substitution.

from copy import deepcopy

def get_sql_from_template(query, bind_params):
    if not bind_params:
        return query
    params = deepcopy(bind_params)
    for key, val in params.items():
        params[key] = quote_sql_string(val)
    return query % params

Now we can easily get the final query that we can log or run interactively:

>>> print(get_sql_from_template(query, bind_params))

select
    user_id
    , count(*) as num_transactions
    , sum(amount) as total_amount
from
    transactions
where
    user_id = 1234
    and transaction_date = '2019-03-02'
group by
    user_id

Putting it all together, another helper function wraps the JinjaSql calls and simply takes the template and a dict of parameters, and returns the full SQL.

from jinjasql import JinjaSql

def apply_sql_template(template, parameters):
    '''
    Apply a JinjaSql template (string) substituting parameters (dict) and return
    the final SQL.
    '''
    j = JinjaSql(param_style='pyformat')
    query, bind_params = j.prepare_query(template, parameters)
    return get_sql_from_template(query, bind_params)

Compute statistics on a column

Computing statistics on the values stored in a particular database column is handy both when first exploring the data and for data validation in production. Since we only want to demonstrate some features of the templates, for simplicity, let’s just work with integer columns, such as the column user_id in the table transactions above. For integer columns, we are interested in the number of unique values, the min and max values, and the number of nulls. Some columns may have a default value of, say, -1. The drawbacks of such defaults are beyond the scope of this post; however, we do want to capture them by reporting the number of default values.

Consider the following template and function. The function takes the table name, the column name and the default value as arguments, and returns the SQL for computing the statistics.

COLUMN_STATS_TEMPLATE = '''
select
    {{ column_name | sqlsafe }} as column_name
    , count(*) as num_rows
    , count(distinct {{ column_name | sqlsafe }}) as num_unique
    , sum(case when {{ column_name | sqlsafe }} is null then 1 else 0 end) as num_nulls
    {% if default_value %}
    , sum(case when {{ column_name | sqlsafe }} = {{ default_value }} then 1 else 0 end) as num_default
    {% else %}
    , 0 as num_default
    {% endif %}
    , min({{ column_name | sqlsafe }}) as min_value
    , max({{ column_name | sqlsafe }}) as max_value
from
    {{ table_name | sqlsafe }}
'''


def get_column_stats_sql(table_name, column_name, default_value):
    '''
    Returns the SQL for computing column statistics.
    Passing None for the default_value results in zero output for the number
    of default values.
    '''
    # Note that a string default needs to be quoted first.
    params = {
        'table_name': table_name,
        'column_name': column_name,
        'default_value': quote_sql_string(default_value),
    }
    return apply_sql_template(COLUMN_STATS_TEMPLATE, params)

This function is straightforward and very powerful because it applies to any column in any table. Note the {% if default_value %} syntax in the template. If the default value that is passed to the function is None, the SQL returns zero in the num_default field.

The function and template above will also work with strings, dates, and other data types if the default_value is set to None. However, to handle different data types more intelligently, it is necessary to extend the function to also take the data type as an argument and build the logic specific to different data types. For example, one might want to know the min and max of the string length instead of the min and max of the value itself.
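As a sketch of that extension, the helper below builds the type-specific part of the statistics. The type labels and the length-based statistics for strings are my own illustration of the idea, not code from the original post.

```python
def type_specific_stat_columns(column_name, data_type):
    """Return the SQL expressions for min/max statistics appropriate
    to the column's type. For strings we measure the length of the
    value rather than the value itself; the 'string'/'integer' type
    labels here are illustrative."""
    if data_type == 'string':
        return [
            'min(length({0})) as min_length'.format(column_name),
            'max(length({0})) as max_length'.format(column_name),
        ]
    # Numeric (and other orderable) types: min/max of the value itself.
    return [
        'min({0}) as min_value'.format(column_name),
        'max({0}) as max_value'.format(column_name),
    ]
```

The resulting expressions could then be passed into the template with the sqlsafe filter, just like column_name is above.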

Let’s look at the output for the transactions.user_id column.

>>> print(get_column_stats_sql('transactions', 'user_id', None))

select
    user_id as column_name
    , count(*) as num_rows
    , count(distinct user_id) as num_unique
    , sum(case when user_id is null then 1 else 0 end) as num_nulls

    , 0 as num_default

    , min(user_id) as min_value
    , max(user_id) as max_value
from
    transactions

Note that blank lines appear in place of the {% %} clauses and could be removed.
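One simple way to drop those blank lines is a small helper like the following (my own addition, not part of JinjaSql), applied to the generated SQL:

```python
def strip_blank_lines(text):
    '''Removes blank lines, including whitespace-only ones, left
    behind by {% %} template tags.'''
    return '\n'.join(
        line for line in text.splitlines() if line.strip()
    )
```

For example, `strip_blank_lines(get_column_stats_sql(...))` would produce the query above without the gaps around num_default.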

Summary

With the helper functions above, creating and running templated SQL queries in python is very easy. Because the details of parameter substitution are hidden, one can focus on building the template and the set of parameters, and then call a single function to get the final SQL.

One important caveat is the risk of code injection. For batch processes, it should not be an issue, but using the sqlsafe construct in web applications could be dangerous. The sqlsafe keyword indicates that the user (you) is confident that no code injection is possible and takes responsibility for simply putting whatever string is passed in the parameters directly into the query.

On the other hand, the ability to put an arbitrary string in the query allows one to pass whole code blocks into a template. For example, instead of passing table_name='transactions' above, one could pass "(select * from transactions where transaction_date = '2019-03-02') t", and the query would still work.

The code in this post is licensed under the MIT License.

Photo and image by Sergei Izrailev

The post A Simple Approach To Templated SQL Queries In Python by Sergei Izrailev appeared first on Life Around Data.

]]>
Over-the-Wall Data Science and How to Avoid Its Pitfalls http://www.lifearounddata.com/over-the-wall-data-science-avoid-pitfalls/ Sat, 03 Nov 2018 13:27:24 +0000 http://www.lifearounddata.com/?p=50 Over-the-wall data science is a common organizational pattern for deploying data science team output to production systems. A data scientist develops an algorithm, a model, or a machine learning pipeline, and then an engineer, often from another team, is responsible for putting the data scientist’s code in production. Such a

The post Over-the-Wall Data Science and How to Avoid Its Pitfalls by Sergei Izrailev appeared first on Life Around Data.

]]>
Over-the-wall data science is a common organizational pattern for deploying data science team output to production systems. A data scientist develops an algorithm, a model, or a machine learning pipeline, and then an engineer, often from another team, is responsible for putting the data scientist’s code in production.

Such a pattern of development attempts to solve for the following:

  • Quality: We want production code to be of high quality and maintained by engineering teams. Since most data scientists are not great software engineers, they are not expected to write end-to-end production-quality code.

  • Resource Allocation: Building and maintaining production systems requires special expertise, and data scientists can contribute more value solving problems for which they were trained rather than spend the time acquiring such expertise.

  • Skills: The programming language used in production may be different from what the data scientist is normally using.

However, there are numerous pitfalls in the over-the-wall development pattern that can be avoided with proper planning and resourcing.

What is over-the-wall data science?

A data scientist writes some code and spends a lot of time getting it to behave correctly. For example, the code may assemble data in a certain way and build a machine learning model that performs well on test data. Getting to this point is where data scientists spend most of their time, iterating over the code and the data. The work product could be a set of scripts, or a Jupyter or RStudio notebook containing code snippets, documentation, and reproducible test results. In the extreme, the data scientist produces a document detailing the algorithm, using mathematical formulas and references to library calls, without handing any code to the engineering team at all.

At this point, the code is thrown over the wall to Engineering.

An engineer is then tasked with productionizing the data scientist’s code. If the data scientist used R, and the production applications use Java, that could be a real challenge that in the worst case leads to rewriting everything in a different language. Even in a common and much simpler case of Python on both sides, the engineer may want to rewrite the code to satisfy coding standards, add tests, optimize it for performance, etc. As a result, the ownership of the production code lies with the engineer, and the data scientist can’t modify it.

This is, of course, an oversimplification, and there are many variations of such a process.

What is wrong with having a wall?

Let’s assume that the engineer successfully built the new code, the data scientist compared its results to the results of their own code, and the new code is released to production. Time goes by, and the data scientist needs to change something in the algorithm. The engineer, in the meantime, has moved on to other projects. Changing the algorithm in production becomes a lengthy process that involves waiting for an engineer (hopefully the same one) to become available. In many cases, after going through the process a couple of times, the data scientist simply gives up, and only critical updates are ever released.

Such interaction between data science and engineering frustrates data scientists because it makes it hard to make changes and strips them of ownership of the final code. It also makes it very difficult to troubleshoot production issues. It is also frustrating for engineers because they feel that they are excluded from the original design, don’t participate in the most interesting part of the project, and have to fix someone else’s code. The frustration on both sides makes the whole process even more difficult.

Breaking down the wall between data science and engineering

The need for over-the-wall data science can be eliminated entirely if data scientists are self-sufficient and can safely deploy their own code to production. This can be achieved by minimizing the footprint of data scientists’ code on production systems and by making engineers part of the AI system design and development process up front. AI system development is a team sport, and both engineers and data scientists are required for success. Hiring and resource allocation must take that into account.

Make the teams cross-functional

Involving engineering early in the data science projects avoids the “us” and “them” mentality, makes the product much better, and encourages knowledge sharing. Even when a full cross-functional team of engineers and data scientists is not practical, forming a project team working together towards a common goal solves most of the problems of over-the-wall data science.

Expect data scientists to become better engineers

In the end, data scientists should own the logic of the AI code in the production application, and that logic needs to be isolated in the application so that data scientists can modify it themselves. In order to do so, data scientists must follow the same best practices as engineers. For example, writing unit and integration tests may feel like a lot of overhead for data scientists at first; however, the value of knowing that your code still works after you’ve made a change soon outweighs that cost. Also, engineers must be part of the data scientists’ code review process to make sure the code is of production quality and there are no scalability or other issues.
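As an illustration of how small the overhead can be, here is a hypothetical example: a tiny feature-engineering function and a pytest-style unit test for it. The function and its name are made up for this sketch, not taken from any real pipeline.

```python
def normalize_amounts(amounts):
    """Scale a list of transaction amounts to the [0, 1] range."""
    lo, hi = min(amounts), max(amounts)
    if hi == lo:
        # All values identical: avoid division by zero, map everything to 0.
        return [0.0 for _ in amounts]
    return [(a - lo) / (hi - lo) for a in amounts]

def test_normalize_amounts():
    assert normalize_amounts([10.0, 20.0, 30.0]) == [0.0, 0.5, 1.0]
    assert normalize_amounts([5.0, 5.0]) == [0.0, 0.0]

test_normalize_amounts()  # passes silently when the logic is correct
```

A test like this takes minutes to write, and it is exactly what catches a silent behavior change the next time the function is "improved."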

Provide production tooling for data scientists

Engineers should build production-ready reusable components and wrappers; testing, deployment, and monitoring tools; and infrastructure and automation for data-science-related code. Data scientists can then focus on a much smaller portion of the code containing the main logic of the AI application. When such tooling is not in place, data scientists tend to spend much of their time building the tools themselves.

Avoid rewriting the code in another language

The production environment is one of the constraints on the types of acceptable machine learning packages, algorithms, and languages. This constraint has to be enforced at the beginning of the project to avoid rewrites. A number of companies are offering production-oriented data science platforms and AI model deployment strategies both in open source and commercial products. These products, such as TensorFlow and H2O.ai, help solve the problem of a production environment being very different from that normally used by data scientists.

Images by MabelAmber and Wokadandapix on Pixabay

The post Over-the-Wall Data Science and How to Avoid Its Pitfalls by Sergei Izrailev appeared first on Life Around Data.

AI Systems Development Cycle And How It’s Different From Other Software http://www.lifearounddata.com/ai-systems-development-cycle-is-different/ Thu, 27 Sep 2018 13:45:20 +0000

Most software development projects go through the same four phases: discovery, research, prototype, and production. Usually, the research and prototype stages are fairly light because experienced engineers can design a solution and, when necessary, test their ideas with a quick proof-of-concept (PoC). The AI systems development cycle, on the other hand, depends heavily on research to determine whether we can actually build a machine learning model that performs well. In addition, putting an AI system in production operationally involves much more than building the models. Therefore, a working prototype is typically required for AI systems in order to have the confidence that the system will work end-to-end.

Let’s look at each of the four stages of the AI systems development cycle in more detail.

Discovery

The discovery phase is responsible for defining the project: what are its goals, what is the business problem it is solving, why solving it is important, what is the value of solving it, what are the constraints, and how will we know that we’ve succeeded. Frequently, such information is captured in a Product Requirements Document (PRD) or a similar document, defining the “what” of the project. Some aspects of discovery are described in another article on reducing the risk of machine learning projects.

For AI systems, the feasibility and quality of a solution to the problem at hand are usually not obvious from the start. Carefully defining the constraints can dramatically narrow down the choice of technology and algorithms. However, creating new machine learning models still largely remains a task for experts. As a result, a research stage is needed to find out whether an AI solution is possible, as well as to estimate its value and cost.

Research

The research phase answers in detail how we are going to solve the business problem. Relevant documentation of a typical software project may include a system design, various options considered during design and their trade-offs, specifications, etc., with enough information for an engineering team to build the software.

The research phase of the AI systems development cycle is highly iterative, often manual, and heavy on visualizations and analytics. First, we need to check whether we can solve the problem with machine learning given the available data and the constraints established in the discovery phase. We collect the data, extract it, and transform it into inputs to a machine learning algorithm. We usually build many variants of a model, experiment with input data and algorithms, and test and evaluate the models. Then we frequently go back to collecting and transforming the data. This cycle stops when, after a few (and sometimes many) iterations at every step, there’s a model that makes predictions with acceptable accuracy. Information gathered during this process is fed back into the discovery phase.
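The model-variant part of this loop can be sketched in a few lines. This is only an illustration, assuming scikit-learn is available; the synthetic data and the two candidate models are placeholders for whatever the real project would try.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the extracted and transformed inputs.
X, y = make_classification(n_samples=500, random_state=0)

# Try several model variants and keep the evaluation results around to
# decide whether another iteration on the data or algorithms is needed.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In a real project, each pass through this loop would also revisit feature extraction and data collection, which is what makes the research phase so iterative.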

Prototype

A prototype for an AI system is proof that a system reflecting the production design, without all the bells and whistles, can run end-to-end as code and produce predictions within the predefined constraints. Sometimes, the output of the research phase is close to a prototype, after a little clean-up and converting some manual steps into scripts. As we are getting closer to production, it is better to keep the prototype code at production quality and involve engineers who will be working on the production AI system.

Note that the goal of the prototype stage is not for a data scientist to create something that will then be rewritten by an engineer in a different language. Often referred to as “over the wall” development, such a pattern is extremely inefficient and should be avoided.

Production

The production stage of the AI systems development cycle is responsible for the final system that is able to reliably build, deploy and operate machine learning models. The reliability requirements lead to a plethora of components that can easily take much of the time and effort of the whole project. Such components include testing, validation, model tracking and versioning, deployment, automation, logging, monitoring, alerting, and error handling, to name a few.
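To make one of these components concrete, here is a hypothetical, minimal model-tracking record of the kind a production system might persist alongside each deployed model artifact. The function and field names are illustrative, not a reference to any particular platform.

```python
import hashlib
import json
from datetime import datetime, timezone

def model_record(model_bytes, version, metrics):
    """Build a metadata record for a serialized model artifact."""
    return {
        "version": version,
        "deployed_at": datetime.now(timezone.utc).isoformat(),
        # Hash of the artifact lets us verify exactly which model is running.
        "artifact_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "metrics": metrics,  # e.g., validation metrics captured at training time
    }

record = model_record(b"serialized-model-bytes", "1.0.3", {"auc": 0.91})
print(json.dumps(record, indent=2))
```

Even a record this small answers the questions that come up first in production incidents: which model version is live, when it was deployed, and how it performed when it was trained.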

Summary

The AI systems development cycle has the same stages as most other software. It differs in the much higher proportion of effort allocated to the research and prototype stages. The operational components of AI systems at the production stage may also require significant effort, especially in the first iteration of the whole cycle. Once the first AI system is in production, the frameworks used for operationalizing machine learning can be reused and improved in subsequent cycles.

The post AI Systems Development Cycle And How It’s Different From Other Software by Sergei Izrailev appeared first on Life Around Data.
