What are data lakes and how are they used for IoT analytics?
An IoT data lake is a way to store your IoT data over time so that you can later access it for historical analytics. Offloading to an IoT data lake enables you to build up and retain a long-term history of your data, store that data cost-effectively, and analyze it without impacting the performance of your IoT solution.
You can use historical analytics in an IoT data lake to:
- View and analyze historical IoT data in aggregate form to identify long-term trends
- Train machine learning (ML) models
- Plan production line changes
With historical analytics, you can answer questions like "When did this last happen?", "Where and how many times have we seen this happen?" and "What’s the average value of this device’s measurement across all factories over a specific period?" Using historical analytics alongside real-time streaming analytics, you can fine-tune your processes. For example, you can improve shipment processes by learning from past shipments, or discover trends that identify devices requiring proactive maintenance to prevent in-operation failure.
An engineering company that develops and tests combustion engines combines real-time IoT data with historical data to provide its customers with better insight. Based on its analysis, the company recommends how its customers can improve operational performance and save money.
Key considerations
When choosing a data lake for IoT analytics, ask:
- Can you store data on-premises and at the edge?
- Will you have only one option for data lake storage, or will you have choices?
- How long can you retain operational endpoint data? Some IoT platforms retain data only for a finite time, as short as two weeks.
- How expensive will storage be?
- Is the data lake built for huge data sets?
- Is the data lake optimized for data analysis or training machine learning (ML) models?
- Can you use a variety of business intelligence, machine learning and SQL-based tools?
- Will data architects have the flexibility and control they need while data consumers have self-service access to their IoT data?
Benefits of a data lake for IoT analytics
Make your IoT data your advantage with Software AG’s Cumulocity IoT DataHub. With Cumulocity IoT DataHub, you can bridge the gap between streaming and historical analytics in a way that simplifies processes for IT administrators and enables the business to gain new insights about operations and performance.
Simplified management of long-term data storage
Cumulocity IoT DataHub periodically takes data from the operational data store, either on-premises or at the edge, transforms it into a compact format that’s highly efficient for analytical queries, and places it in an analytical store in the data lake. Cumulocity IoT DataHub can support a multitude of devices, and each offload moves alarm, event, measurement and inventory data for every device into the data lake.
Lower cost for IoT data storage
The analytical store can be hosted on your choice of Amazon® S3 or Microsoft® Azure® Data Lake Storage. Cloud-based storage dramatically lowers the cost of creating and managing a data lake. Cumulocity IoT DataHub also supports file system data storage and Hadoop® Distributed File System (HDFS).
Scalable SQL querying of long-term IoT data
Cumulocity IoT DataHub is designed to support an IoT solution consisting of any number of devices and can scale to manage the data each produces. For analyzing this onslaught of device data, Cumulocity IoT DataHub offers SQL, the lingua franca of data processing for decades. Unleash the power of SQL, and you will quickly convert raw IoT device data into meaningful information.
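As a rough illustration of the kind of SQL involved, the sketch below answers three typical historical questions with plain SQL. It uses Python's built-in sqlite3 module purely as a stand-in for the query engine; the table name `measurements`, its columns and the sample rows are hypothetical and made up for this example, not the layout Cumulocity IoT DataHub produces.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the analytical store.
# Table name, column names and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements "
    "(device_id TEXT, factory TEXT, time TEXT, temperature REAL)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        ("pump-1", "Berlin", "2023-01-05T10:00:00Z", 70.0),
        ("pump-1", "Berlin", "2023-02-11T10:00:00Z", 95.0),
        ("pump-2", "Madrid", "2023-02-12T10:00:00Z", 80.0),
        ("pump-2", "Madrid", "2023-03-20T10:00:00Z", 99.0),
    ],
)

# "When did this last happen?" -> most recent reading above a threshold.
last_overheat = conn.execute(
    "SELECT MAX(time) FROM measurements WHERE temperature > ?", (90.0,)
).fetchone()[0]

# "Where and how many times have we seen this happen?" -> count per factory.
overheats_per_factory = conn.execute(
    "SELECT factory, COUNT(*) FROM measurements "
    "WHERE temperature > ? GROUP BY factory ORDER BY factory",
    (90.0,),
).fetchall()

# Average value across all factories over a specific time range.
avg_temperature = conn.execute(
    "SELECT AVG(temperature) FROM measurements WHERE time >= ? AND time < ?",
    ("2023-02-01T00:00:00Z", "2023-03-01T00:00:00Z"),
).fetchone()[0]

print(last_overheat, overheats_per_factory, avg_temperature)
```

The same three query shapes (a maximum, a grouped count and a time-bounded average) carry over to any SQL engine; only the connection and the table and column names would change.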
Standard interfaces to BI & data science tools
Cumulocity IoT DataHub acts as an integration layer that enables high-performance SQL queries on historical IoT data. You can run these queries from a wide range of business intelligence or analytics applications, for machine learning training, or from other custom applications that use standards such as Arrow Flight, JDBC®, ODBC, REST and SQL.
How Cumulocity IoT DataHub works
Offloading data
Cumulocity IoT DataHub moves your IoT data from an operational store in Cumulocity IoT to a data lake and, in doing so, turns raw IoT data into the structured, condensed format needed for efficient SQL querying, which is the basis of reporting and intelligence. This “offloading” process allows you to build a low-cost and long-term archive of device data.
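To make the step from raw data to a structured format more concrete, here is a minimal sketch of flattening one nested measurement document into a single tabular row. The input shape (a `source` object plus a fragment such as `c8y_Temperature`), the dotted column names and the helper `flatten` are assumptions for illustration only; the actual offloading layout is defined by Cumulocity IoT DataHub and may differ.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into one row with dotted column names."""
    row = {}
    for key, value in doc.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column prefix.
            row.update(flatten(value, prefix=f"{column}."))
        else:
            row[column] = value
    return row


# Hypothetical measurement document from the operational store.
measurement = {
    "id": "4711",
    "time": "2023-02-11T10:00:00Z",
    "source": {"id": "pump-1"},
    "c8y_Temperature": {"T": {"value": 95.0, "unit": "C"}},
}

row = flatten(measurement)
print(row)
```

Each leaf value becomes its own column, which is what lets a SQL engine filter and aggregate on it directly.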
What does the data in the data lake look like? The transformed, tabular data is stored in the Apache Parquet™ format, which gives you an analysis-friendly and storage-efficient columnar representation of your data. Apache Parquet is the de facto standard data format in “big data” tools, giving you the freedom to process the data with tools such as Apache Spark™ in addition to using Cumulocity IoT DataHub. Taking common analysis patterns into account, Cumulocity IoT DataHub arranges the Parquet files in a temporal folder hierarchy. Additional housekeeping mechanisms in the background regularly compact smaller Parquet files, which boosts overall query performance (think of defragmenting a hard disk back in the day).
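The two ideas in this paragraph, a temporal folder hierarchy and compaction of small files, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `table/year/month/day` folder scheme, the file names, the size threshold and the helpers `partition_path` and `plan_compaction` are all hypothetical and are not taken from Cumulocity IoT DataHub.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_path(table, timestamp):
    """Map a timestamp to a year/month/day folder (assumed layout)."""
    t = timestamp.astimezone(timezone.utc)
    return f"{table}/{t:%Y}/{t:%m}/{t:%d}"


def plan_compaction(files, small_file_bytes):
    """Group small files by folder; folders with 2+ small files get merged."""
    small_by_folder = defaultdict(list)
    for path, size in files:
        if size < small_file_bytes:
            folder = path.rsplit("/", 1)[0]
            small_by_folder[folder].append(path)
    return {f: sorted(p) for f, p in small_by_folder.items() if len(p) > 1}


ts = datetime(2023, 2, 11, 10, 0, tzinfo=timezone.utc)
folder = partition_path("measurements", ts)

# Hypothetical file listing as (path, size in bytes).
files = [
    (f"{folder}/part-0.parquet", 4_000),
    (f"{folder}/part-1.parquet", 6_000),
    (f"{folder}/part-2.parquet", 900_000),
    ("measurements/2023/02/12/part-0.parquet", 5_000),
]
plan = plan_compaction(files, small_file_bytes=100_000)
print(folder, plan)
```

With a layout like this, a query restricted to a time range only needs to read the folders for that range, and merging many small files into fewer large ones reduces the per-file overhead of each scan.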
Once offloading is complete, you can analyze your data at interactive speed using your favorite BI and data science tools. You can then gain business insights by extracting exactly what you need and integrating it with information from other business systems.
Combining insight from business systems with IoT data
Cumulocity IoT DataHub enables you to connect your BI querying and reporting tools to your IoT data so that you can extract business insights from it. It offers SQL, the lingua franca of data processing and analytics, as the query interface. Dremio™ is the internal engine that executes the SQL queries. Because Dremio is highly scalable, it can cope with many analytical queries.
With Cumulocity IoT DataHub, you can quickly connect the tool or application of your choice, including:
- BI tools using JDBC or ODBC
- Data science applications using Python® scripts, which connect via ODBC
- Custom applications using JDBC for the Java® ecosystem, ODBC for .NET, Python, etc., and REST for (Cumulocity IoT) web applications
Training machine learning models
Machine learning is a popular choice for gaining deeper insight into business and production processes. In general, the more data you have, the more reliable the insight from your machine learning models will be. Cumulocity IoT DataHub prepares the ground for training complex machine learning models by making the entirety of your IoT data available in a well-structured and analysis-friendly format. Simply connect your favorite data science tool through ODBC, JDBC or REST, and start processing your data. You can, for example, train a model on the failure states of a valve to learn which factors indicate that the valve will soon fail. Then use these insights, combined with your current live data in Cumulocity IoT, to proactively replace a valve before it breaks. This is the power of combining live data with historical data.
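The valve example can be sketched with a deliberately tiny model. The code below trains a one-feature threshold classifier (a "decision stump") on historical readings and then applies it to a live value. It is a stand-in chosen for brevity, not the method Cumulocity IoT DataHub or any particular data science tool uses; the feature (vibration), the sample readings and the helper names are made up for illustration. In practice you would load the history through ODBC, JDBC or REST and use a full machine learning library.

```python
def train_stump(samples):
    """Find the threshold on one feature that best separates the labels.

    samples: list of (feature_value, failed) pairs, where failed is a bool.
    The model predicts failure when feature_value >= threshold.
    """
    best_threshold, best_correct = None, -1
    for candidate in sorted({value for value, _ in samples}):
        correct = sum((value >= candidate) == failed for value, failed in samples)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold, best_correct / len(samples)


# Made-up historical readings: (vibration in mm/s, valve failed soon after).
history = [
    (1.2, False), (1.5, False), (2.1, False), (2.4, False),
    (3.8, True), (4.1, True), (2.9, False), (4.6, True),
]

threshold, accuracy = train_stump(history)


def will_fail_soon(live_vibration):
    """Apply the learned threshold to a live reading."""
    return live_vibration >= threshold


print(threshold, accuracy, will_fail_soon(4.0))
```

The split between `train_stump` (historical data) and `will_fail_soon` (live data) mirrors the pattern described above: learn from the archive, then score current readings.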
Cumulocity IoT DataHub architecture
Cumulocity IoT DataHub is designed to:
- Automatically move data from the operational data store into a data lake
- Store flattened data in an analysis-friendly layout
- Execute complex analytical queries at high speed
- Easily scale with the amount of IoT data being processed, with the separation of storage and compute capabilities as one pillar of that architecture
Cloud data lakes allow you to scale your data storage easily with the volume of data emitted by your IoT sensors. Cumulocity IoT DataHub ensures that the data is well structured within a temporal hierarchy, complemented by internal housekeeping mechanisms that keep file representations compact. Cumulocity IoT DataHub uses Dremio to move data into the data lake, and Dremio is also in charge of executing queries on that data lake. Dremio delivers leading-edge query performance by using technologies such as Apache Arrow™, reflections and Columnar Cloud Cache. Scaling Dremio nodes allows you to process ever-increasing amounts of IoT data in seconds.
Cumulocity IoT DataHub is designed as a cloud-native application, with all its components running as microservices/containers in Kubernetes® clusters in private or public clouds. Need local processing? On a shop floor, for example, IoT devices are often connected to local computers instead of remote cloud platforms and process data locally instead of moving all of it to the cloud. Cumulocity IoT DataHub serves those use cases by providing an edge edition. Cumulocity IoT DataHub Edge uses the local storage of the edge device as its storage layer. Apart from horizontal scalability, Cumulocity IoT DataHub Edge offers the same capabilities as the cloud edition.