Introduction
I’m a data scientist. I’ve seen data scientists uncover important business insights and create artificial intelligence that can transform entire industries. But despite the fact that data science teams can readily produce groundbreaking results, most companies struggle to incorporate those results into business processes and applications. It is all too common for data-science teams to find themselves in a continual cycle of producing proof-of-concept results that fail to create broad impact.
The problem is that data scientists are not magic beanstalk beans; simply embedding them within a business will not quickly take that business to the next level. Hiring data scientists was not the first step for the most successful data-driven companies. Instead, the success of data science is built upon cultural and technological foundations that treat data as a first-class business priority. Historically, the field of data science sprang into existence only after companies like Google and Facebook adopted such a mindset.
Data scientists are, of course, vital for an organization to make the most of its data. In reality, however, data scientists will require support from throughout an organization as well as dedicated engineering effort. In other words, data scientists are just regular beanstalk beans: they will flourish if nurtured properly.
In this blog post, we will outline the essential elements that enable data scientists to deliver their best work. First off, it is important to realize that the fundamentals of data science are not complicated, even if the details require an advanced degree. Once that is recognized, fundamental data-science activities can be enabled by prioritizing data engineering. Good data engineering will also empower product teams to create data-driven processes, which will in turn generate feedback for data scientists. Finally, machine-learning application deployment can be greatly accelerated with dedicated engineering and operations teams.
Data science fundamentals – it’s not that complicated
Fundamentally, data science is simply the art of learning from examples, i.e. data. Where traditional processes are built on established rules and intuition, data scientists look to observed data to create new insights and algorithms. The upshot of this basic view is that learning about anything requires examples of the thing to be learned. This sounds obvious, but a common misconception is that any data can solve any problem. In reality, business applications must be designed to collect relevant data in order to derive significant value. The details of machine learning are indeed difficult – good data scientists need to use a host of techniques to transform data and create machine-learning models – but none of it is possible unless the examples are collected in the first place.
You don’t have to be a data scientist to think about whether your application or business is collecting relevant examples. Examples might represent customer or user interactions with your business and its products. Is your website keeping a record of which products users are browsing? Is it also recording their eventual purchases? Recording both of those things could provide important insight about your products, while recording one without the other would be useless in comparison.
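To make this concrete, here is a minimal sketch of what recording those two kinds of examples might look like as structured events. The field names and event types are illustrative assumptions, not a prescribed schema:

```python
# A hypothetical sketch of recording browse and purchase events with a shared session id.
import json
import time
import uuid

def make_event(session_id, event_type, product_id):
    # Field names here are placeholders; use whatever schema fits your application.
    return {
        "event_id": str(uuid.uuid4()),
        "session_id": session_id,
        "event_type": event_type,   # e.g. "browse" or "purchase"
        "product_id": product_id,
        "timestamp": time.time(),
    }

session = "session-123"
for event in (make_event(session, "browse", "sku-42"),
              make_event(session, "purchase", "sku-42")):
    print(json.dumps(event))  # in practice, write to a log or event stream
```

Because both events share a session id, browsing behavior can later be joined to eventual purchases.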
Akin to the idea of recording purchases along with browsing data, most data is vastly more useful when labeled. Labels are recorded values of attributes that we might want a machine-learning model to learn to predict. If you’re recording data on your manufacturing line, you might want a machine-learning model to predict which units will be defective before they finish production or reach the market. In this case, labels would represent which units were defective and which were not.
Labels are important. While it may be possible for data scientists to perform unsupervised learning without labels, it is vastly less complicated to carry out supervised learning on labeled data. Supervised learning is more straightforward because we can explicitly show the machine the output we want it to produce during the learning process. Unsupervised learning requires significant additional exploratory effort to uncover structure and extract value from data.
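To make the distinction concrete, here is a minimal sketch of supervised learning on the manufacturing example above. The file and column names are hypothetical, and scikit-learn is assumed to be available:

```python
# Supervised-learning sketch: predict defects from labeled examples.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical extract: sensor readings per unit plus an "is_defective" label.
data = pd.read_csv("production_line_measurements.csv")
features = data.drop(columns=["is_defective"])  # recorded attributes of each unit
labels = data["is_defective"]                   # the value we want the model to predict

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)  # the model learns directly from labeled examples
print(classification_report(y_test, model.predict(X_test)))
```

Without the label column, the same data would only support unsupervised techniques such as clustering, and considerably more exploration would be needed to extract value from it.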
Most importantly, it doesn’t take a data scientist to see the value in these basic concepts. Product managers and engineers can put a data-first mindset into practice. Applications should collect examples relevant to business processes and be designed to assign labels to those examples whenever possible. This is a message that can be spread throughout an organization to allow innovation to occur outside of the data-science group.
Build a data culture
As we started to see in the last section, data scientists can only get so far if they’re the only ones thinking about leveraging data. A data scientist might be good at answering questions with data, but often they don’t know which questions to ask or whether the data is relevant at all. To bridge that gap, it is important to encourage interactions with data at all levels of your organization.
Data scientists shouldn’t be the only ones inspecting raw data; there is great value in putting more eyes on what is available. Anyone who has seen an Excel spreadsheet can take a look at data in a SQL database. Skimming a table for data integrity is a basic activity, but one of utmost importance. Anyone can look at a list of columns and think about what patterns and signals might be present within recorded values. While a data scientist is capable of these tasks, collaboration with business units can spark conversations and drive innovation in a dramatic fashion.
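As an illustration, here is the kind of quick skim anyone comfortable with a spreadsheet could run once a table has been pulled into pandas; the table and column names are placeholders:

```python
# A quick data-integrity skim of a hypothetical orders table.
import pandas as pd

orders = pd.read_csv("orders_extract.csv")    # placeholder extract of an orders table

print(orders.dtypes)                          # which columns exist, and their types
print(orders.isnull().mean().sort_values())   # fraction of missing values per column
print(orders["status"].value_counts())        # do the recorded values look sensible?
print(orders["order_date"].min(), orders["order_date"].max())  # date-range sanity check
```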
While data scientists can employ quantitative skills to extract value, employees who are closer to customers and stakeholders understand how the business operates, and they are often best equipped to interpret the data it generates. Digital transformation is all about bringing these skills together. Beyond that, employees with access to data will more easily see the advantages of incorporating it into their day-to-day operations. Shared exposure to data may also reduce the friction that has been observed between established business units and upstart data-science groups.
Employees outside of data science and analytics may also begin to find enough utility in data to interact with it more directly. Basic SQL queries are not terribly hard to write, understand, and adapt – take this one as an example:
```sql
SELECT product_name, price FROM products WHERE price > 100
```
Like many basic queries, this one explains itself. The allure of answering questions with real data can be a powerful motivator for individuals to learn new skills like SQL. Tools such as the Impala editor in Cloudera HUE and Amazon Athena make it easy to issue ad hoc queries and thus provide a good platform for organizational data access.
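For teams that later want to embed such a query in a script or a scheduled report, tools like Athena also expose a programmatic interface. Here is a minimal sketch using boto3; the database name and S3 output location are placeholder assumptions:

```python
# Run the ad hoc query above through Amazon Athena and print the results.
import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT product_name, price FROM products WHERE price > 100",
    QueryExecutionContext={"Database": "retail_db"},                    # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
query_id = execution["QueryExecutionId"]

while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```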
Enable data science with quality data engineering
Perhaps needless to say, building a culture as described above starts with quality data engineering. Data is only valuable if it is stored in a usable format and accessible to data scientists, analysts, and applications. Actionable insights won’t magically spring from a pile of CSVs, and machine-learning applications can’t be built on a bunch of XML files with no schema. Functional data architectures require careful design, implementation, and tuning.
First and foremost, data ingest and storage components must be able to scale with your organization and its applications. If data is produced by many clients for ingestion, tools such as Apache Kafka and Amazon Kinesis can provide a fault-tolerant way to ingest enormous volumes of data. Apache Hadoop and Amazon S3 provide affordable and scalable data storage mechanisms. Choosing the right big-data technologies can mean the difference between lasting, reliable applications and bug-ridden nightmares.
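As a small illustration of the ingest side, here is a hedged sketch of publishing application events to Kafka with the kafka-python client; the broker address and topic name are placeholders:

```python
# Publish an application event to Kafka for downstream ingestion.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",                        # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"session_id": "session-123", "event_type": "purchase", "product_id": "sku-42"}
producer.send("clickstream-events", value=event)             # placeholder topic name
producer.flush()                                             # block until delivery is confirmed
```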
Data integrity is another seemingly obvious component of successful downstream analysis. Maintaining data integrity relies on data pipelines with centralized logging, monitoring, and alerting. When issues do arise, good software design and DevOps practices can ease the operational burden. A practice of writing unit and integration tests can help defend against bugs, and tests can later be expanded to prevent recurrence of previously resolved issues. Implementing an automated continuous-integration and delivery pipeline can help to deploy code with fewer errors and repair issues quickly and easily.
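As an illustration, a unit test for a small (hypothetical) pipeline transformation might look like the following with pytest; expanding such tests whenever an issue is resolved keeps it from coming back:

```python
# A pytest-style unit test for a hypothetical deduplication step in a pipeline.
import pandas as pd

def deduplicate_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Keep only the most recent record for each order_id."""
    return (
        orders.sort_values("updated_at")
              .drop_duplicates(subset="order_id", keep="last")
    )

def test_deduplicate_orders_keeps_latest_record():
    orders = pd.DataFrame({
        "order_id":   [1, 1, 2],
        "updated_at": ["2021-01-01", "2021-01-02", "2021-01-01"],
        "status":     ["pending", "shipped", "pending"],
    })
    result = deduplicate_orders(orders)
    assert len(result) == 2
    assert result.loc[result["order_id"] == 1, "status"].item() == "shipped"
```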
Finally, data is only useful when it is accessible and interpretable. Data should be centralized and cataloged so that users within your organization know where it is and how to access it. Nothing slows data innovation more than delays and confusion in the access process. Tools like AWS Glue and Cloudera SDX can help to catalog data, and both can be integrated with centralized identity-management services to manage user access. Once data is accessible, a good metadata strategy should be in place to ease interpretation.
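As a sketch of what a catalog buys you, the AWS Glue Data Catalog can be queried programmatically to discover which tables exist and what columns they contain; the database name below is a placeholder:

```python
# List the tables and columns registered for one database in the AWS Glue Data Catalog.
import boto3

glue = boto3.client("glue")

response = glue.get_tables(DatabaseName="retail_db")   # placeholder database name
for table in response["TableList"]:
    columns = [col["Name"] for col in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], "->", ", ".join(columns))
```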
Invest in ML engineering and operations
Successful data science teams are of course going to produce groundbreaking machine-learning models and applications – but the average data scientist is not equipped to efficiently deploy those applications at scale. Even if your data scientists do have that expertise, do you really want to spend their energy on deployment? And once an application is deployed, who is going to monitor the output, resolve any issues, and retrain the models? How will you know when it is even time to retrain?
The issues described above are giving rise to the burgeoning fields of machine-learning engineering and operations. Data scientists are not experts in software engineering and DevOps. Moreover, machine-learning applications present deployment challenges that traditional software does not. Typical software behaves in a reliable, repeatable fashion once deployed; the output of machine-learning models, on the other hand, is subject to the distributions of input data, which can change over time. In other words, changes in user or system behavior can cause the algorithms to go haywire. To combat these novel issues, machine-learning engineers and operations specialists are developing new strategies for monitoring applications.
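One simple way to watch for that kind of drift is to compare the distribution of an incoming model input against a reference sample saved at training time. Here is a minimal sketch using a two-sample Kolmogorov-Smirnov test; the threshold and the simulated data are illustrative assumptions:

```python
# Alert when the live distribution of a model input drifts away from the training data.
import numpy as np
from scipy.stats import ks_2samp

def feature_has_drifted(training_values, live_values, alpha=0.01):
    """Return True if the live feature distribution differs significantly."""
    statistic, p_value = ks_2samp(training_values, live_values)
    return p_value < alpha

# Simulated example: live traffic has shifted relative to the training sample.
rng = np.random.default_rng(0)
training_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_sample = rng.normal(loc=0.5, scale=1.0, size=5_000)

if feature_has_drifted(training_sample, live_sample):
    print("Input drift detected - investigate the source or consider retraining.")
```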
To achieve agility in the machine-learning space, organizations will have to make a significant investment in machine-learning engineering. While managed solutions such as Amazon SageMaker are streamlining model development, the associated deployment pipelines may still require careful attention to meet enterprise standards of security and scalability. Good machine-learning engineers will work closely with data engineering teams to deliver robust solutions that automatically alert on issues and enable rapid recovery when necessary.
Conclusion and Takeaways
There has been a lot of focus in the community on flashy AI buzzwords, but investment in algorithms is just the beginning. Our guidance is to:
- Emphasize collection and availability of good data: examples of customer actions and system behavior that drive your business, with a labeling strategy to enable supervised learning.
- Foster a data culture by helping more people see the data, validate its integrity, and uncover insights.
- Build robust data pipelines and storage; streamline access patterns; and develop with tests and automated deployments.
- Anticipate the challenges of ML Operations – don’t expect your data science teams to take their solutions all the way to production.
All of these steps may seem daunting to implement at once, but it is important to start from the ground up. If these steps seem like a challenge to you and your organization, reach out to phData. We’re here to help.