Teams Archives | Life Around Data

Over-the-Wall Data Science and How to Avoid Its Pitfalls

Sergei Izrailev — Sat, 03 Nov 2018 13:27:24 +0000

Over-the-wall data science is a common organizational pattern for deploying data science team output to production systems. A data scientist develops an algorithm, a model, or a machine learning pipeline, and then an engineer, often from another team, is responsible for putting the data scientist’s code in production.

This development pattern attempts to address the following concerns:

Quality: We want production code to be of high quality and maintained by engineering teams. Since most data scientists are not great software engineers, they are not expected to write end-to-end production-quality code.
Resource Allocation: Building and maintaining production systems requires special expertise, and data scientists can contribute more value solving problems for which they were trained rather than spend the time acquiring such expertise.
Skills: The programming language used in production may be different from what the data scientist is normally using.

However, there are numerous pitfalls in the over-the-wall development pattern that can be avoided with proper planning and resourcing.

What is over-the-wall data science?

A data scientist writes some code and spends a lot of time getting it to behave correctly. For example, the code may assemble data in a certain way and build a machine learning model that performs well on test data. Getting to this point takes most of the data scientist’s time, which is spent iterating over the code and the data. The work product could be a set of scripts, or a Jupyter or RStudio notebook containing code snippets, documentation, and reproducible test results. In the extreme, the data scientist produces a document detailing the algorithm, using mathematical formulas and references to library calls, and doesn’t give any code to the engineering team at all.

At this point, the code is thrown over the wall to Engineering.

An engineer is then tasked with productionizing the data scientist’s code. If the data scientist used R, and the production applications use Java, that could be a real challenge that in the worst case leads to rewriting everything in a different language. Even in a common and much simpler case of Python on both sides, the engineer may want to rewrite the code to satisfy coding standards, add tests, optimize it for performance, etc. As a result, the ownership of the production code lies with the engineer, and the data scientist can’t modify it.

This is, of course, an oversimplification, and there are many variations of such a process.

What is wrong with having a wall?

Let’s assume that the engineer successfully built the new code, the data scientist compared its results to the results of their own code, and the new code was released to production. Time goes by, and the data scientist needs to change something in the algorithm. The engineer has in the meantime moved on to other projects. Changing the algorithm in production becomes a lengthy process that involves waiting for an engineer (hopefully the same one) to become available. In many cases, after going through the process a couple of times, the data scientist simply gives up, and only critical updates are ever released.

Such interaction between data science and engineering frustrates data scientists because it makes it hard to make changes and strips them of ownership of the final code. It also makes it very difficult to troubleshoot production issues. It is also frustrating for engineers because they feel that they are excluded from the original design, don’t participate in the most interesting part of the project, and have to fix someone else’s code. The frustration on both sides makes the whole process even more difficult.

Breaking down the wall between data science and engineering

The need for over-the-wall data science can be eliminated entirely if data scientists are self-sufficient and can safely deploy their own code to production. This can be achieved by minimizing the footprint of the data scientists’ code on production systems and by making engineers part of the AI system design and development process upfront. AI system development is a team sport, and both engineers and data scientists are required for success. Hiring and resource allocation must take that into account.

Make the teams cross-functional

Involving engineering early in the data science projects avoids the “us” and “them” mentality, makes the product much better, and encourages knowledge sharing. Even when a full cross-functional team of engineers and data scientists is not practical, forming a project team working together towards a common goal solves most of the problems of over-the-wall data science.

Expect data scientists to become better engineers

In the end, data scientists should own the logic of the AI code in the production application, and that logic needs to be isolated in the application so that data scientists can modify it themselves. In order to do so, data scientists must follow the same best practices as engineers. For example, writing unit and integration tests may feel like a lot of overhead for data scientists at first; however, the value of knowing that your code still works after you’ve made a change soon overcomes that feeling. Also, engineers must be part of the data scientists’ code review process to make sure the code is of production quality and there are no scalability or other issues.
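As a rough illustration of isolated, testable model logic, here is a minimal Python sketch. The function name, the feature names, and the logistic form are all assumptions made up for this example, not something taken from a real system. The point is only that the logic lives in one small pure function, and that the tests sit right next to it.

```python
import math


def predict_click_probability(features, weights, bias=0.0):
    """Hypothetical model logic: a logistic score over named features.

    Features without a matching weight are ignored (weight 0.0).
    """
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Unit tests written as plain functions with assert statements.
# A test runner such as pytest would collect functions named test_*.

def test_zero_score_gives_one_half():
    # With no signal at all, the logistic function returns exactly 0.5.
    assert predict_click_probability({"visits": 0.0}, {"visits": 1.0}) == 0.5


def test_positive_weight_is_monotonic():
    # A larger feature value with a positive weight must not lower the output.
    weights = {"visits": 0.3}
    low = predict_click_probability({"visits": 1.0}, weights)
    high = predict_click_probability({"visits": 5.0}, weights)
    assert high > low


def test_unknown_feature_is_ignored():
    # A feature the model has no weight for leaves the prediction unchanged.
    weights = {"visits": 0.3}
    base = predict_click_probability({"visits": 2.0}, weights)
    extra = predict_click_probability({"visits": 2.0, "age": 40.0}, weights)
    assert base == extra
```

With tests like these in place, a data scientist who changes the weights or the formula finds out right away whether the basic expectations still hold, without waiting for an engineer.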

Provide production tooling for data scientists

Engineers should build production-ready reusable components and wrappers, testing, deployment, and monitoring tools, as well as infrastructure and automation for data science related code. Data scientists can then focus on a much smaller portion of the code containing the main logic of the AI application. When the tooling is not in place, data scientists tend to spend much of their time on building the tools themselves.
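One possible shape for such a reusable component is a wrapper that engineers own and data scientists simply plug their function into. The Python sketch below is only an assumption about what this tooling could look like. The names `with_production_guards` and `model_service` are hypothetical, and a real system would likely add metrics, input validation, and alerting.

```python
import logging
import time

# Hypothetical logger name; a real service would configure handlers elsewhere.
logger = logging.getLogger("model_service")


def with_production_guards(predict, fallback):
    """Wrap a prediction function with timing, error logging, and a fallback.

    `predict` is the small piece of logic owned by the data scientist.
    Everything else in this wrapper is owned by engineering.
    """
    def guarded(features):
        start = time.perf_counter()
        try:
            result = predict(features)
        except Exception:
            # Log the full traceback and return a safe default
            # instead of failing the calling application.
            logger.exception("prediction failed; returning fallback")
            result = fallback
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        logger.info("prediction took %.2f ms", elapsed_ms)
        return result

    return guarded


# Usage: the data scientist supplies only the core logic.
def double_the_signal(features):
    return features["signal"] * 2.0


safe_predict = with_production_guards(double_the_signal, fallback=-1.0)
```

Here `safe_predict({"signal": 2.0})` returns the model output, while a malformed input such as `safe_predict({})` is logged and answered with the fallback value, so the data scientist’s code can change without touching the surrounding production plumbing.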

Avoid rewriting the code in another language

The production environment constrains which machine learning packages, algorithms, and languages are acceptable. This constraint has to be enforced at the beginning of the project to avoid rewrites. A number of companies offer production-oriented data science platforms and AI model deployment options, both as open source and as commercial products. Such products, for example TensorFlow and H2O.ai, help solve the problem of a production environment being very different from the one data scientists normally use.

Images by MabelAmber and Wokadandapix on Pixabay


Four Machine Learning Skills of a Successful AI Team

Sergei Izrailev — Fri, 14 Sep 2018 12:38:47 +0000

In the past few years, several trends accelerated adoption of AI for business applications. The abundance of data, cheap computing, advances in AI algorithms, and the advent of platforms that facilitate implementation of AI systems are making AI ever more accessible. Still, extracting business value from AI remains elusive for many companies. The solution lies in part in the ability to assemble a cross-functional team with skills appropriate for the task. Which skills are needed is largely determined by which business problems are worth solving using AI and what creates a competitive advantage or is strategic for the company.

Kitchen builders, chefs, cooks, and microwave builders

In her blog post Why businesses fail at machine learning, Cassie Kozyrkov, Chief Decision Scientist at Google, points out two main types of machine learning, research and applied, and draws a compelling analogy with cooking. In this analogy, machine learning researchers who develop new algorithms are likened to engineers who build microwaves and other appliances. The applied machine learning specialists, on the other hand, are cooks who use appliances to produce tasty dishes.

Extending this line of thought, just as individual dishes don’t necessarily make a meal, predictions coming from machine learning systems are usually not the end goal. Someone needs to define how to use them in a product that solves a specific customer or business problem. For example, if the AI system predicts a user’s music preferences, there’s still work to do on how this information is best used in a music streaming app and how to measure its impact on key performance indicators. The person who does this work is similar to a chef in a restaurant, who is capable of selecting and combining dishes to serve a full dinner.

Another important aspect of AI systems is their day-to-day operation. The people who develop and integrate machine learning platforms are the ones who build automated and scalable kitchens. Kitchen builders make the cooks efficient so that the chefs are able to deliver more meals.

To summarize, there are four broad AI skillsets:

Development of new and improvement of existing algorithms (microwave and other appliance builders).
Application of existing algorithms to build machine learning models and produce predictions that are useful in solving a business problem (cooks, who prepare individual dishes).
Defining how predictions are used to solve specific business problems (chefs who create a full meal and dining experience).
Building platforms that facilitate and automate machine learning (kitchen builders).

It is important to keep these skillsets in mind when building a team that is tasked with developing AI-driven products. So which of them are necessary, which are optional, and which can be outsourced?

How to assemble a successful AI team

Given the large number of readily available machine learning algorithms, one would not normally start by developing a new algorithm or modifying an existing one. Therefore, we usually do not need microwave builders at the start (unless it’s a microwave design business). Unfortunately, algorithm development is the skill most commonly taught in machine learning and data science courses.

Cooking skills are definitely required, and sometimes cooks specializing in certain dishes (pastries, sauces) may be needed. For example, if solving the business problem involves text analysis, knowledge of NLP could come in handy.

Having a chef, whose skills cross into the product management area, is absolutely critical to making AI valuable for the business. It doesn’t have to be a separate role, and the chef’s skills can be spread across more than one person.

Finally, if there is a lot of cooking to be done, kitchen builders are also required in order for the whole process to scale. While one can certainly rent a ready-to-go kitchen, some of the kitchen builder skills are usually needed in-house to make it operational. The main reason is that while platforms make certain aspects of machine learning easier, integration with existing production systems and processes remains a challenge.

To assemble a successful AI team, one first needs to evaluate what business problems need solving, which skill sets are needed, and which of them are required in-house. In some cases, it is sufficient to rent a kitchen, hire a chef who can also cook various dishes, and provide adequate engineering support to make the chef effective. In others, the competitive advantage comes from the ability to cook simple meals at scale and serve many customers. Then kitchen builders become a key to success. In yet other cases, none of the existing algorithms can solve the business problem well, and one has to develop better microwaves and hire microwave builders.

Image by olafBroeker on Pixabay
