1. Database, Database Management System, and Database System. What's the difference?
To be honest, I did not know the difference between these three terms until graduate school. When I first heard them explained, I found the explanations "too academic" and doubted whether the distinction was necessary.
However, when I told people from different industries that "I'm developing a database", each of them had their own idea of what exactly I was doing. So knowing what your audience understands by "database" helps you communicate with them more efficiently.
Database: a collection of data.
This is what "database" means in the minds of most liberal arts students. An example is a public dataset hosted on Google BigQuery. Such a database can also be called a "dataset": essentially a collection of static information that we search when we need something.
Database Management System (DBMS): a system that manages a database.
A DBMS maintains the collection of data, handling both reads and writes. It may seem to meet the needs of all users, yet it is aimed at software developers rather than end users, because interacting with a DBMS requires specific programming or query languages.
When we say we are database kernel developers, the "database" we mean is exactly a database management system.
Database System: the database, the database management system, and the application system taken together.
The database system is the one for end users. Since most applications use databases, most systems can be called "database systems".
Their relationship is shown in the figure below.
Strictly speaking, we're developing a time-series database management system, but we usually just call it a database for convenience - in the context of computers this generally causes no ambiguity. So every "database" in the following text refers to a "database management system".
2. Categories of databases
The major categories of databases include relational, key-value, document, graph, and time-series databases.
Among them, the time-series database has gained popularity faster than any other member of the family in the last two years.
The following chart made by DB-Engines shows the historical trend of the categories' popularity (https://db-engines.com/en/ranking_categories):
Different categories of databases are used to manage different types of data. For example, relational databases manage relational and transactional data, while time-series databases mainly manage time-series data.
What is time-series data?
Time-series data is everywhere. The trend chart above is itself time-series data: each database category is represented by a curve that changes over time, and that curve is a time series. Other common examples include electrocardiograms and the monitoring of a computer's running state.
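To make this concrete, here is a minimal Python sketch of a time series as a list of timestamped values. The `Point` class, the `make_series` helper, and the CPU-load readings are all made up for illustration; real systems store the same idea far more compactly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Point:
    """One sample of a time series: when it was taken and what was measured."""
    timestamp: datetime
    value: float


def make_series(start, interval_seconds, values):
    """Attach evenly spaced timestamps to a list of readings."""
    return [
        Point(start + timedelta(seconds=i * interval_seconds), v)
        for i, v in enumerate(values)
    ]


# Hypothetical CPU-load readings, sampled every 10 seconds.
cpu_load = make_series(datetime(2022, 1, 1, 0, 0, 0), 10, [0.31, 0.35, 0.90, 0.42])

for point in cpu_load:
    print(point.timestamp.isoformat(), point.value)
```

Each sensor or metric produces one such sequence, ordered by time.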
Why is the time-series database so popular? Well, let's move on.
3. Massive IIoT time series data volume
The aim of IoT is the networking of things, most of which are electronic devices, such as smart bands, wind turbines, and smart cars.
These devices are equipped with various sensors, such as temperature sensors, voltage sensors, wind speed sensors, gyroscopes, and GPS. These sensors generate time-series data - the device status and environmental information collected - at a certain frequency.
These data can be regarded as the electrocardiograms of the devices.
Why is the volume of time series data generated in the IoT area so large?
First, the number of time series is huge. Suppose there are 20,000 cars and each car has 500 sensors (also called measurement points or physical quantities): that yields 10 million time series. Moreover, a single high-end device can have tens of thousands of sensors. For example, an aircraft has 80,000 sensors, and a large generator unit has 10,000+ sensors.
Second, the sampling frequency is high. For precise control and monitoring, data is usually acquired at 100-1,000 Hz, for example by a vibration sensor measuring a bridge's vibration.
According to the IEC 61400-25 wind power standard, a wind turbine produces 6 TB of data during its 7,500 hours of effective operation per year. The time-series data generated in the IoT will keep increasing, and the faster the data is generated, the higher the requirements on databases become.
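The arithmetic behind these numbers can be sketched in a few lines of Python. The car, sensor, and wind turbine figures come from the text above; the single 1,000 Hz sensor running for a full day and the decimal definition of a terabyte (10^12 bytes) are our own assumptions for illustration.

```python
# Figures quoted in the text above.
cars = 20_000
sensors_per_car = 500
series_count = cars * sensors_per_car  # one time series per sensor
print(f"time series: {series_count:,}")

# Assumption: one sensor sampled at 1,000 Hz, running for a full day.
sample_rate_hz = 1_000
points_per_day = sample_rate_hz * 24 * 3600
print(f"points per day from one 1,000 Hz sensor: {points_per_day:,}")

# Wind turbine: 6 TB over 7,500 operating hours.
# Assumption: 1 TB = 10**12 bytes.
total_bytes = 6 * 10**12
operating_seconds = 7_500 * 3600
avg_bytes_per_second = total_bytes / operating_seconds
print(f"average rate per turbine: {avg_bytes_per_second / 1000:.0f} kB/s")
```

Under these assumptions, one fast sensor alone produces 86.4 million points per day, and a single turbine averages roughly 222 kB of data per second of operation.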
4. How do we leverage time-series data?
First, why do people collect so much time-series data?
Time-series data often carries rich industrial semantics. For example, the green part in the figure below represents a type of wind that greatly affects wind turbines: it increases the load on the turbine and shortens its lifetime. By mining typical patterns and knowledge from time-series data, we can adjust the wind turbine (WT) angle in advance to reduce the failure rate.
In industrial production, output quality control (OQC) is usually necessary, and it relies on analyzing the monitoring data from the production process.
In addition, the "carbon peaking and carbon neutrality" policy requires enterprises to monitor carbon emissions, and common problems of industrial equipment, such as leakage and venting, need to be detected through monitoring. If some values are abnormal, timely alarms and responses are necessary.
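The simplest form of such monitoring is a threshold check. The sketch below is only an illustration: the `find_alarms` function, the pressure readings, and the limits are hypothetical, and real alarm rules depend on the equipment and the scenario.

```python
def find_alarms(readings, low, high):
    """Return the (timestamp, value) pairs whose value lies outside [low, high]."""
    return [(ts, value) for ts, value in readings if value < low or value > high]


# Hypothetical pressure readings: (seconds since start, pressure in bar).
readings = [(0, 5.1), (1, 5.0), (2, 3.2), (3, 5.2), (4, 7.9)]

# Assumed normal operating range for this made-up device.
alarms = find_alarms(readings, low=4.0, high=6.0)
for ts, value in alarms:
    print(f"t={ts}s: abnormal pressure {value} bar")
```

Here the readings at t=2 and t=4 fall outside the assumed range and would trigger an alarm.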
To sum up, time-series data serves two major purposes: analysis and forecasting, and monitoring and alarming.
What these two purposes mean differs across industrial enterprises and application scenarios, and has to be defined with domain knowledge. So the development of industrial IoT is a long-term process.
But one thing is clear: data collection is the first step to monitor and control the production process intelligently. There's no subsequent analysis without data.
Therefore, data is an intangible asset, especially the real data generated by a running machine.
5. Why do we need TSDB?
The time-series database (TSDB) is a product of the years after 2010. Before it appeared, there was a kind of product called the "real-time database", or RTDB. The most widely used RTDB is the PI database from the American company OSIsoft. Each power plant has at least one RTDB deployment.
An RTDB has limited data-processing capability. Usually deployed on the device side, it can only perform simple processing of recent real-time data for monitoring and alarming.
If you need to analyze historical data, you have to send it to the cloud and save it in a relational database, whose write performance is also a problem. This works when the data volume is small, but when the volume is large, full storage is no longer possible and only samples can be stored.
In general, what are the drawbacks of this RTDB-based approach?
(1) Duplicate data processing. On its way from the device that generates it to the cloud application, data passes through a real-time database, a relational database, a NoSQL database, and so on. These repeated writes consume extra network bandwidth and computing resources.
(2) Schema limitations. A single device in the industrial Internet of Things may have tens of thousands of sensors. To store its data in a relational database, we have to shard the device vertically across several tables. Sharding complicates write and query logic, and changing the model is expensive.
(3) Low read-write speed. Most existing databases can write only hundreds of thousands to one or two million points per second, and a query takes a few seconds or even tens of seconds.
(4) Low compression ratio. For example, HBase and traditional relational databases adopt row-based storage and do not apply compression tailored to time-series data, so they take up more disk space and make data storage much more expensive.
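To see why point (4) matters, consider how compressible a single column of time-series data can be once its values sit next to each other. The sketch below uses delta encoding followed by run-length encoding, a common textbook combination chosen here purely for illustration; it is not a description of any particular database's storage format, and the function names and timestamps are made up.

```python
def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def run_length_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def decode(runs):
    """Invert run_length_encode(delta_encode(...)) to recover the original values."""
    deltas = [v for v, count in runs for _ in range(count)]
    out = []
    for i, d in enumerate(deltas):
        out.append(d if i == 0 else out[-1] + d)
    return out


# Assumption: 10,000 millisecond timestamps sampled exactly once per second.
timestamps = [1_600_000_000_000 + 1000 * i for i in range(10_000)]

encoded = run_length_encode(delta_encode(timestamps))
print(len(timestamps), "values ->", len(encoded), "runs")
assert decode(encoded) == timestamps  # the encoding is lossless
```

With perfectly regular sampling, the whole timestamp column collapses into two runs: the first timestamp, and the constant 1,000 ms step repeated 9,999 times. Row-based storage interleaves these timestamps with other fields, which makes this kind of regularity much harder to exploit.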
Then, TSDB emerged.
6. TSDB for IoT: Apache IoTDB
After 2010, a group of NoSQL-based time-series databases appeared, such as OpenTSDB and KairosDB. Compared with relational databases, NoSQL databases were generally deployed as clusters and abandoned transaction management, which improved their throughput and horizontal scalability. But their read-write performance still could not fully meet the needs of IIoT, such as writing 10 million points per second.
Though these databases are called time-series databases, they are not actually native time-series databases. "Native" here means originally designed for time-series data management, rather than adapted from an existing system.
Wrapping or modifying an existing system cannot remove that system's baggage, because it was not designed for time-series data management. Bottlenecks always appear.
Then the first native time-series database appeared: InfluxDB, which is also the most popular time-series database at present.
Unfortunately, InfluxDB was mainly designed for monitoring scenarios in data centers. It works well when the amount of data managed is limited, but its read-write performance and memory management show limitations in IIoT scenarios.
That motivated us to build our own time-series database, Apache IoTDB, which is dedicated to IoT scenarios.
IoTDB's features:
(1) "Device-Edge-Cloud" Data collaboration architecture
(2) High-throughput read and write
Apache IoTDB can support high-speed write access for millions of low-power and intelligently networked devices. It also provides lightning-fast read access for retrieving data.
(3) Efficient directory structure
Apache IoTDB can efficiently organize the complex data structures of IoT devices and large volumes of time-series data, and it offers a fuzzy search strategy over the complex directory of time series.
(4) Rich query semantics
Apache IoTDB can support time alignment of time-series data across devices and sensors, computation on time-series fields, and abundant aggregation functions along the time dimension.
(5) Low cost on hardware
Apache IoTDB can reach a high compression ratio of disk storage (it costs less than $0.23 to store 1GB of data on hard disk).
(6) Flexible deployment
Apache IoTDB offers one-click installation on the cloud, a terminal tool on the desktop, and a bridge tool between the cloud platform and on-premises machines (the Data Synchronization Tool).
(7) Tight integration with the open source ecosystem
Apache IoTDB can support analysis ecosystems, for example Hadoop, Spark, Flink, and Grafana (a visualization tool).
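To give a feel for what feature (4) refers to, here is a minimal sketch of time alignment and time-window aggregation in plain Python. It is not IoTDB's query language or API: the `align` and `window_average` functions, the series names, and the readings are all hypothetical and only illustrate the two concepts.

```python
def align(series_by_name):
    """Merge several {timestamp: value} series into rows keyed by timestamp.

    A series with no sample at a given timestamp contributes None to that row.
    Columns follow the alphabetical order of the series names.
    """
    names = sorted(series_by_name)
    all_timestamps = sorted(set().union(*(s.keys() for s in series_by_name.values())))
    return [
        (ts, tuple(series_by_name[name].get(ts) for name in names))
        for ts in all_timestamps
    ]


def window_average(series, window):
    """Average the values that fall into each window [k*window, (k+1)*window)."""
    buckets = {}
    for ts, value in series.items():
        buckets.setdefault(ts // window * window, []).append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}


# Hypothetical readings from two sensors, keyed by seconds since start.
speed = {0: 10.0, 1: 12.0, 2: 14.0, 3: 16.0}
temperature = {0: 20.0, 2: 21.0}

for row in align({"speed": speed, "temperature": temperature}):
    print(row)

print(window_average(speed, window=2))
```

Alignment puts the two sensors on a shared time axis, leaving gaps where the slower temperature sensor has no sample, and the aggregation reduces the speed series to one average per two-second window.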
A detailed introduction will be offered later. Please stay tuned.