The Next Level of Operationalizing Machine Learning: Real-time Data Streaming into Data Science Environments


New Stack’s Streaming Data and the Future Tech Stack report (2019) shows a 500 percent increase in the number of companies processing data in real time for AI/ML use cases. And experts posit an even larger increase in the number of companies following this trend as we approach 2021.

To retain their competitive edge and increase market share, many companies are looking for ways to achieve faster and more intelligent data analysis. This has become even more important with the increased demand for digitization and automation to confront the realities of the ongoing pandemic and the anticipated needs of a post-coronavirus world.

Eighty-eight percent of enterprise respondents to a Forrester survey expressed the need to perform real-time analytics on streamed data, while over 70 percent stated that they are using or plan to use machine learning with streamed data.

As such, data scientists today need better tools to extract more value from streamed data at lower cost and complexity. They require access to a full-fledged development environment with built-in, fully integrated support for tools that facilitate seamless development, training, management, and deployment of ML models with streamed data. This has become a coveted part of MLOps (machine learning operationalization).

Such an environment will enable data scientists to fully leverage the reams of rapidly arriving real-time data streams to build and train machine learning models that deliver more accurate insights and facilitate smart, fast decisions for real-world impact.

Analytics with stored data vs analytics with streamed data

Traditionally, machine learning models are trained using historical data, or data at rest. Many machine learning applications today aim to identify reliable, repeatable patterns and anomalies in historical data in order to predict what will happen in the future.

The assumption here is that the world will stay the same and that patterns observed in the past will repeat in the future. While this approach renders useful insights in several cases, it’s not applicable in many real-world applications and use cases.

Rather than looking at the past to deliver predictions, cutting-edge data science is focused on “querying the future” by looking at real-time data. Applying data science models to streaming data in real time delivers several advantages. It can supercharge the performance and accuracy of AI-powered applications through the rapid consideration and ingestion of newly identified insights.

As a result, ML applications can immediately take cognizance of dynamic changes in data patterns to support business-critical processes and deliver competitive differentiation for applicable use cases.

Streaming analytics is the linchpin that makes it possible for enterprises to achieve such intelligent analysis and data-driven decision-making in real-time. Incorporating real-time data streaming into their workflow means that companies can achieve adaptive learning and continuous calibration of models based on the newest data flowing into the pipeline to enhance operations and extract further business value. Special algorithms can also be applied to simultaneously improve the prediction models in real-time and avoid concept drift.

However, a different architectural, technological, and analytics approach is required when working with data in motion as opposed to data at rest.

Achieving streaming data science

Businesses looking to enjoy the promise of real-time data analytics by building applications that transform and react to data streams must first build a real-time streaming data pipeline to reliably get data into their data science environment.

Today’s data scientists have access to a plethora of open source tools, frameworks, and libraries to train their ML models. However, many of these tools are geared towards exploring and visualizing data at rest. Working with real-time data streams means presenting the data with very low latency, and very few tools can handle such latency requirements.

Streaming data is usually composed of time-stamped data packets arriving in series, and data scientists looking to work with it must use tools that provide native data types for processing such data. This makes it easier to clean and visualize the data, explore patterns, and extract insights at scale.
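To see what a native time-series data type buys you, consider a hedged sketch using pandas (one common choice; the packet fields below are invented for illustration). Once the timestamps become a DatetimeIndex, windowed aggregation is a one-liner:

```python
# Sketch: parse time-stamped packets and aggregate them per minute.
# The "ts"/"sensor"/"value" fields are illustrative, not from the article.
import pandas as pd

packets = [
    {"ts": "2020-09-01 12:00:00", "sensor": "a", "value": 1.0},
    {"ts": "2020-09-01 12:00:30", "sensor": "a", "value": 3.0},
    {"ts": "2020-09-01 12:01:10", "sensor": "a", "value": 5.0},
]

df = pd.DataFrame(packets)
df["ts"] = pd.to_datetime(df["ts"])  # native timestamp type
df = df.set_index("ts")

# Per-minute mean, courtesy of the DatetimeIndex.
per_minute = df["value"].resample("1min").mean()
print(per_minute)
```

Without a time-aware index, the same aggregation requires manual bucketing of raw timestamp strings — exactly the friction the article is describing.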

For these types of use cases, data science platforms can be beneficial, as they provide a full-fledged data science environment. Many come with the popular Python libraries built in, enabling data scientists to work with fresh, real-time data sets during data exploration, training, and beyond. Data science platforms that are focused on real-time environments will also enable data scientists to easily access both historical and real-time data within their Python environment for exploration and training.

To achieve real-time data streaming into a data science development environment, data can be collected from an existing Kafka stream into a time series table. Kafka is a low-latency, distributed, horizontally scalable open source streaming platform that can handle trillions of events a day.

Currently, Kafka is the most popular framework for ingesting data streams into processing platforms. Essentially, it acts as a data transportation mechanism, alternately serving as a transmission point to a stream or an ingestion point from a stream.
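A hedged sketch of the collection step might look as follows, using the kafka-python client (one of several available; the topic name, broker address, and JSON event fields are all assumptions, not details from the article). The decoding logic is kept separate from the broker connection so it can be tested on its own; here the “table” is a plain list of rows, standing in for a real time series table:

```python
# Sketch: drain a Kafka topic into a time-series table.
# Assumed: messages are JSON like {"ts": <epoch seconds>, "value": ...}.
import json
from datetime import datetime, timezone

def to_row(raw_bytes):
    """Decode one Kafka message payload into a (timestamp, value) row."""
    event = json.loads(raw_bytes)
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return (ts, event["value"])

def consume_into_table(table, bootstrap="localhost:9092", topic="events"):
    # Imported lazily so the decoding logic above stays broker-free.
    from kafka import KafkaConsumer  # pip install kafka-python
    consumer = KafkaConsumer(topic, bootstrap_servers=bootstrap)
    for msg in consumer:  # blocks, appending a row per arriving event
        table.append(to_row(msg.value))

# Broker-free demonstration of the decoding step alone:
row = to_row(b'{"ts": 1609459200, "value": 3.5}')
print(row)
```

In production the rows would land in a purpose-built time series store rather than an in-memory list, but the shape of the loop — consume, decode, append — is the same.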

To speed up data ingestion, the open source technology Nuclio can be used to “listen” to the Kafka stream and ingest its events into the time series table. Nuclio is a high-performance open source serverless framework that is embedded into Kubeflow Pipelines, a Kubernetes-based ML framework. Next, the stream can be visualized with a Grafana dashboard, and finally the time series data can be manipulated in a Jupyter notebook using Python code.
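The “listener” portion can be sketched as a Nuclio Python handler (signature `handler(context, event)`, with the record payload in `event.body`). The Kafka trigger itself is wired up in the function’s deployment configuration rather than in code, and the table below is again a stand-in list, not a real store:

```python
# Illustrative Nuclio handler: each Kafka-triggered invocation appends
# the event to a time-series table. Field names are assumptions.
import json

TABLE = []  # stand-in for a real time-series table

def handler(context, event):
    # Nuclio delivers the Kafka record's payload in event.body.
    record = json.loads(event.body)
    TABLE.append((record["ts"], record["value"]))
    return "ok"

# Local smoke test with a stub standing in for Nuclio's event object.
class _Event:
    def __init__(self, body):
        self.body = body

print(handler(None, _Event(b'{"ts": 1600000000, "value": 7}')))
```

Because the handler is just a Python function, the same code can be exercised locally in a notebook before being deployed behind the stream.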

In this way, data scientists can build, monitor, and manage real-time ML pipelines that automate and automatically scale data science workflows handling real-time streaming data.

About the Author

Adi Hirschtein, VP Product, Iguazio. Adi contributes 20 years of experience as an executive, product manager and entrepreneur building and driving innovation in technology companies. As the VP of Product at Iguazio, the data science platform built for production and real-time use cases, he leads the product roadmap and strategy. Adi holds a B.A. in Business Administration and Information Technology from the College of Management Academic Studies.
