On the Transferability of Software Engineering Knowledge
Proekspert has 25 years of experience in software development. While building a team of data scientists in our company, we learned the hard way that some communication practices software developers use daily may not be so obvious to machine learning experts.
Data science packaged as machine learning and artificial intelligence
Recent years have witnessed a new rise in data science packaged as machine learning and artificial intelligence. This time the revolution has finally broken out of academia and research labs to the applied business sector. We see practical ML-based applications bettering traditional algorithmic solutions in their ability to tackle a wide array of problems.
The talent shortage is here to stay in machine learning and artificial intelligence-based products
We should still acknowledge that AI- and ML-based products are in the embryonic stage. Early adopters — tech giants with deep pockets — are hoovering up the talent and pouring lots of money into practical application. The majority of organizations, however, are only taking their first steps in the field.
Data science barriers
According to a 2017 survey by Kaggle[1], close to half of data science practitioners see the “lack of data science talent” as one of the most important barriers in their work. As experience from the IT sector shows, this problem is not diminishing. According to a recent MemSQL and O’Reilly Media survey[2], 88 percent of respondents said their companies have already implemented, or plan to implement, AI and ML technologies. Soon this lack of talent will become even more serious.
The current situation in data science appears very similar to the one companies faced during the 1990s and 2000s, when a wide array of new problems that could be solved with software created a huge demand for software developers. The world soon understood that the deficit of software engineering talent could be dealt with only by improving the principles of software development.
Not an engineering problem
Until the 1990s, software development had been viewed as similar to production. Subsequent decades proved this analogy not to be the best. There are few serious problems with engineering and production in software development — or at least the technical problems are much easier to solve. The harder problem to tackle, in my opinion, has been communication.
I believe communication in a broad context is the main reason successful teams perform better than less successful ones. I do not mean only communication used to understand the problems to be solved, but rather the communication of knowledge and progress inside a team and the practices that allow you to communicate with your “future self.”
Our experience with building data science teams shows that results-oriented people who are not used to working in teams often need training in communication. The most important change in mentality when shifting from one-person projects to teamwork is understanding that most of the work you do affects others on your team, or people who will join your team in the future.
Lessons from software development
Here are some practices originating from software development that our ML team has found useful in their everyday work.
1. Reduce your projects’ cognitive load.
Work in software engineering and data science means constant learning. To reduce that load it makes sense to have consistent structure in your projects. This includes both the process and how code, documents, and data are organized. Agree on the formats within your team (or organization) and refrain from changing them too often. We’ve found Cookiecutter Data Science to be a good source for best practices on work item organization for data science projects.
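As a rough illustration of an agreed project structure, the sketch below creates the same directory skeleton for every new project. The directory names and the `create_skeleton` helper are assumptions made for this example. They are loosely inspired by the idea behind Cookiecutter Data Science and are not its exact template.

```python
from pathlib import Path

# Hypothetical layout agreed on by the team (an assumption for this sketch).
PROJECT_DIRS = [
    "data/raw",        # original, immutable data
    "data/processed",  # cleaned data ready for modelling
    "notebooks",       # exploratory work
    "src",             # reusable code
    "models",          # trained models
    "reports",         # figures and write-ups
]


def create_skeleton(root):
    """Create the agreed directories under root and return their paths."""
    root = Path(root)
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # safe to run repeatedly
        created.append(path)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# Project\n\nDescribe the goal, the data and how to run it.\n")
    return created
```

Calling `create_skeleton("my-project")` at the start of each project means everyone on the team finds data, code and reports in the same place.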
2. Make experiment set-up easy.
Setting up the environment for an experiment is often complicated. Sure, over time you’ve installed all imaginable Python libraries and the scripts run on your machine without problems. But have you considered how much time you’ll spend setting up your environment when you get a new computer? How much time will you need to get “the new guy” up and running? There are good tools available to record and recreate the environment, for example Conda or Virtualenv. Depending on the technical skills of your team, packaging your experiment environment into a Docker image could be extremely effective.
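To make the idea of recording an environment concrete, here is a minimal sketch that writes the Python version and the installed package versions to a JSON file using only the standard library. The function names and the file format are assumptions for this example. In practice a dedicated tool such as Conda or pip would normally do this job.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Collect the interpreter version and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def save_snapshot(path):
    """Write the snapshot to a JSON file and return it."""
    snap = snapshot_environment()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snap, fh, indent=2)
    return snap
```

Storing such a file next to the experiment results gives a newcomer, or your future self, a record of what the experiment actually ran on.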
3. Script your pipeline.
Most experiments have to deal with data acquisition, cleaning, splitting and feature definition. Script these transformations. Manipulating the data manually each time new data becomes available, or whenever you want to change the split or the features, can become very time-consuming in the long run. When there are multiple steps in your data preparation process, the scripts also help to reduce the risk of human error. It may even be worth pipelining these scripts when the operations take a long time to complete, so you can run them unattended. You may use a simple shell script, a makefile or, for longer-term projects, a dedicated framework (like Airflow).
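A scripted pipeline can be very small. The sketch below chains cleaning, feature definition and splitting into one function that can be rerun whenever new data arrives. The record fields (`width`, `height`), the derived `area` feature and all function names are made up for this illustration.

```python
import random


def clean(rows):
    """Drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def add_features(rows):
    """Add a hypothetical derived feature: area = width * height."""
    return [{**r, "area": r["width"] * r["height"]} for r in rows]


def split(rows, test_fraction=0.25, seed=42):
    """Shuffle with a fixed seed and return (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def run_pipeline(rows, test_fraction=0.25, seed=42):
    """Run every preparation step in a fixed order."""
    return split(add_features(clean(rows)), test_fraction, seed)


if __name__ == "__main__":
    raw = [
        {"width": 2, "height": 3},
        {"width": 1, "height": None},  # removed by clean()
        {"width": 4, "height": 5},
        {"width": 6, "height": 1},
        {"width": 3, "height": 3},
    ]
    train, test = run_pipeline(raw)
    print(len(train), len(test))
```

Because the seed is fixed, rerunning the script on the same data reproduces the same split, and changing the split or the features is a one-line edit instead of manual work.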
4. Use a single source of truth for your code.
I couldn’t imagine a software team nowadays who would not use version control to manage their files. The same should become standard for data science teams. In addition to a convenient way of safekeeping and sharing your code within the team, it allows safe experimentation — multiple branches of development can exist in parallel.
5. Log and store your results.
Compared with software engineering, data science involves more experimentation. The paths you take are often dead ends in a maze, and it makes sense to map where you’ve been. The best comments in source code often describe alternative solutions that might seem tempting but have proven to be less effective or incorrect. It is best practice to document the architectural decisions implemented in software together with the reasoning why a particular approach was taken. The same applies to ML: record what you tried, with which settings and what results, including the attempts that failed.
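One lightweight way to map where you’ve been is an append-only log with one JSON record per experiment. This is only a sketch: the field names, the file format and both helper functions are assumptions for this example, and a dedicated experiment tracking tool may suit larger teams better.

```python
import json
from datetime import datetime, timezone


def log_experiment(path, name, params, metrics, notes=""):
    """Append one experiment record to a JSON Lines file."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "name": name,
        "params": params,
        "metrics": metrics,
        "notes": notes,  # why it was tried, and why it was kept or dropped
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def load_experiments(path):
    """Read all logged experiments back, oldest first."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

The `notes` field is where dead ends belong, for example "tried a deeper tree, no gain on validation data", so nobody on the team walks the same path twice.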
6. Create tasks with a clear Definition of Done criterion.
What’s the goal of your current task? Have you agreed on what it takes to finish it? In some cases, for example in exploratory data analysis (EDA), it may be hard to define what a task is and what tangible results you expect. In such cases, logging what’s been done and timeboxing the activity will help keep you out of an infinite loop. A better understanding of the goals may emerge from the findings.
7. Visualize your progress.
One cannot overestimate the value of knowing what others in your team are working on. Usually, organizations employing data scientists have working task management systems. Maybe it’s an online system; maybe a scrum- or kanban board at your office. Visualizing tasks and progress allows the team to find out if somebody is stuck on a task and needs help.
8. Communicate your progress and intermediate insights.
Agile practices in software development have proven that shortening the communication loop between the team and stakeholders is crucial to building a good product. The same is true for data science. Demo your intermediate results and discuss your insights with domain experts and other stakeholders to validate the paths you’re taking. This ensures you’re not wasting time doing the wrong thing and increases the likelihood of errors being spotted early.
Years of software development have given us simple yet useful practices that facilitate learning, knowledge sharing, and communication, allowing teams to achieve better results. These practices may be unknown to people with other backgrounds. Building on those practices and putting your scarce data science talent to its best use will help bring to life the awesome power of ML and AI applications much faster.
If you feel like discussing it further, feel free to contact me: andrus.kuus@proekspert.ee
References:
[1] www.kaggle.com/surveys/2017
[2] www.globenewswire.com/news-release/2018/02/07/1335563/0/en/Survey-Finds-Machine-Learning-and-Artificial-Intelligence-are-Top-Business-Priorities.html