Apache TsFile - Time Series Data Storage Redefined

In today's data-driven world, the importance of time series data cannot be overstated. From IoT to financial analytics, its applications are vast and varied. Yet, until recently, the lack of a standardized file format for time series data posed significant challenges in data collection and processing.

Previously, companies resorted to writing time series data in various user-defined formats or using general columnar formats like Parquet and ORC, resulting in complications during data collection and processing. Apache TsFile streamlines this process, offering a unified and standardized format for time series data.

TsFile is designed around the specific requirements of time series data, with an emphasis on efficiency and flexibility. As the underlying storage format for Apache IoTDB, it integrates directly with that database and supports managing and analyzing time series data. Its distinctive features include:

  • Efficient Storage and Compression: TsFile employs advanced compression techniques, minimizing storage requirements and enhancing system efficiency. This results in reduced disk space consumption and optimized data management.

  • Flexible Schema and Metadata Management: Unlike traditional approaches, TsFile allows for flexible schema management, enabling direct data writing without predefined schemas. This adaptability simplifies data acquisition and management, catering to the dynamic nature of time series data.

  • High Query Performance with Time Range: With indexed devices, sensors, and time dimensions, TsFile accelerates query performance, enabling fast filtering and retrieval of time series data based on specific time ranges.

  • Seamless Integration: Beyond IoTDB, TsFile integrates with existing big data frameworks such as Apache Spark, Apache Flink, and Apache Hadoop. This ensures compatibility and interoperability across diverse data processing environments and streamlines data analysis.

The same TsFile can be deployed and synchronized across embedded devices, edge servers, and cloud nodes without packing and unpacking, which can significantly reduce the cost of the Extract-Transform-Load (ETL) process.

What makes the TsFile time series file format unique?

When conceptualizing the structure of TsFile, there were several key considerations:

  • Efficient Compression: Recognizing the importance of space optimization, TsFile compresses data extensively to minimize storage requirements.

  • Device Packing: Multiple devices are packed together to reduce the number of files, streamlining data management.

  • Data Locality: Time series data expected to be queried together are kept close in physical locations to enhance query performance.

  • Disk Fragmentation: TsFile ensures data is packed with sizes aligned with file systems to avoid disk fragmentation.

  • Efficient Access: With millions of time series needing efficient access, TsFile is optimized for rapid data retrieval.

Columnar Storage and File Structure

TsFile adopts a columnar storage design, similar to other file formats, primarily to optimize the storage efficiency and query performance of time series data. This design aligns with the nature of time series data, which often involves large volumes of similar data types recorded over time. Within that design, TsFile organizes data into a structure of pages, chunks, chunk groups, blocks, and an index:

  • Page: The basic unit for storing time series data, sorted by time in ascending order with separate columns for timestamps and values.

  • Chunk: Comprising metadata headers and several pages, each chunk belongs to one time series, with variable sizes allowing for different compression and encoding methods.

  • Chunk Group: Multiple chunks within a chunk group belong to one or multiple series of a device written in the same period, facilitating efficient query processing.

  • Block: All chunk groups buffered in memory before being flushed to a TsFile form a block, allowing for efficient data locality in distributed file systems like HDFS.

  • Index: The file metadata at the end of TsFile contains a chunk-level index and file-level statistics for efficient data access.
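
To make the hierarchy above concrete, here is a minimal Python sketch. It is not TsFile's actual implementation or on-disk layout: the class names simply mirror the terms in the list, while `build_index`, `query`, and the in-memory dictionary are hypothetical names chosen for illustration, and blocks are left out. It shows how a chunk-level index that records start and end times lets a time-range query skip chunks that cannot match.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


@dataclass
class Page:
    """Basic storage unit: ascending timestamps and values in separate columns."""
    timestamps: list
    values: list


@dataclass
class Chunk:
    """Pages of a single time series (one sensor) written in one period."""
    sensor: str
    pages: list = field(default_factory=list)

    @property
    def start_time(self):
        return self.pages[0].timestamps[0]

    @property
    def end_time(self):
        return self.pages[-1].timestamps[-1]


@dataclass
class ChunkGroup:
    """Chunks of one device written in the same period."""
    device: str
    chunks: list = field(default_factory=list)


def build_index(groups):
    """Hypothetical chunk-level index: (device, sensor) -> [(start, end, chunk)]."""
    index = {}
    for group in groups:
        for chunk in group.chunks:
            key = (group.device, chunk.sensor)
            index.setdefault(key, []).append(
                (chunk.start_time, chunk.end_time, chunk))
    return index


def query(index, device, sensor, t_min, t_max):
    """Return (timestamp, value) pairs with t_min <= timestamp <= t_max."""
    result = []
    for start, end, chunk in index.get((device, sensor), []):
        if end < t_min or start > t_max:
            continue  # statistics rule this chunk out; its pages are never read
        for page in chunk.pages:
            lo = bisect_left(page.timestamps, t_min)
            hi = bisect_right(page.timestamps, t_max)
            result.extend(zip(page.timestamps[lo:hi], page.values[lo:hi]))
    return result


groups = [
    ChunkGroup("d1", [Chunk("temperature", [
        Page([1, 2, 3], [20.0, 20.5, 21.0]),
        Page([4, 5], [21.5, 22.0]),
    ])]),
    ChunkGroup("d1", [Chunk("temperature", [
        Page([10, 11], [25.0, 25.5]),
    ])]),
]
index = build_index(groups)
rows = query(index, "d1", "temperature", 3, 10)
print(rows)  # [(3, 21.0), (4, 21.5), (5, 22.0), (10, 25.0)]
```

Because timestamps inside a page are sorted, the sketch can use binary search within each page; a query for a range that falls between the two chunks, such as 6 to 9, returns an empty list without touching any page.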

The following diagram illustrates TsFile's innovative columnar storage design, showcasing the efficiency of its page, chunk, and block structure.

Figure: TsFile Architecture

This columnar layout allows for a better compression ratio due to the homogeneity of data in each column, faster queries by loading only the necessary columns into memory, and improved scalability by organizing data into manageable units for processing and retrieval.

Encoding and Compression Techniques

TsFile employs encoding and compression techniques chosen to optimize storage and access for time series data. It combines encodings such as run-length encoding (RLE) and bit-packing with general-purpose compressors such as Snappy, and it encodes the timestamp and value columns separately so that each can use the method that suits it best. Its own encoding algorithms are designed for the characteristics of time series data in IoT scenarios, focusing on regular time intervals and the correlation among series. Additionally, TsFile incorporates frequency domain encoding, utilizing quantization and bit-width reduction to efficiently store frequency domain data for reuse, ensuring space efficiency without compromising data accuracy.
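
As a rough illustration of why regular time intervals encode so well, the following Python sketch applies delta encoding followed by run-length encoding to a timestamp column, and fixed-width bit-packing to a small value column. These are textbook versions written for this article, not TsFile's actual encoders, and all function names are hypothetical.

```python
def delta_rle_encode(timestamps):
    """Encode timestamps as (first, [(delta, run_length), ...])."""
    if not timestamps:
        return None, []
    runs = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if runs and runs[-1][0] == delta:
            runs[-1][1] += 1  # same interval as before: extend the run
        else:
            runs.append([delta, 1])
    return timestamps[0], [tuple(run) for run in runs]


def delta_rle_decode(first, runs):
    """Inverse of delta_rle_encode."""
    if first is None:
        return []
    out = [first]
    for delta, count in runs:
        for _ in range(count):
            out.append(out[-1] + delta)
    return out


def bit_pack(values, width):
    """Pack non-negative ints, each fitting in `width` bits, into one int."""
    packed = 0
    for i, value in enumerate(values):
        packed |= value << (i * width)
    return packed


def bit_unpack(packed, width, count):
    """Inverse of bit_pack for `count` values."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


# Seven timestamps: a 1000 ms interval that later changes to 500 ms.
timestamps = [1000, 2000, 3000, 4000, 5000, 5500, 6000]
first, runs = delta_rle_encode(timestamps)
print(first, runs)  # 1000 [(1000, 4), (500, 2)]

# Small integer readings need only 3 bits each instead of a full machine word.
readings = [3, 7, 1, 5]
width = max(v.bit_length() for v in readings)
packed = bit_pack(readings, width)
```

The seven timestamps collapse to one starting value plus two (delta, run length) pairs, and the pairs would stay just as short for thousands of evenly spaced points. Values that vary irregularly gain nothing from this scheme, which is one reason to encode timestamp and value columns separately.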

The table below compares the three file formats (TsFile, Parquet, and ORC) along several dimensions.

Table: TsFile, Parquet and ORC in Comparison

TsFile's design facilitates efficient data encoding, compression, and access. It reflects industry needs and points toward efficient, scalable, and flexible data analytics platforms.

Target Users

TsFile caters to developers and organizations working with time series data across various domains, including IoT, smart control systems, financial analytics, and log analysis. Its focus on efficient data storage, fast access, and analysis makes it an ideal choice for both edge devices and cloud-based systems.

Milestone Achievements

TsFile is the underlying storage format for Apache IoTDB. In October 2023, the idea of splitting TsFile and IoTDB into separate projects was discussed in the Apache IoTDB community. Because of its excellent support for time series data, its self-parsing capability, and its various encoding and compression methods, TsFile could become the standard data file format for the IoT field rather than serving IoTDB alone. The proposal was endorsed by many committers in the IoTDB community and drew the attention of two members of the board of directors of the Apache Software Foundation (ASF).

Following ASF meritocratic principles, the initial PMC of TsFile was formed by 14 committers from the IoTDB community. Shortly afterwards, more talented and devoted committers from Timecho, Tsinghua University, BONC, Huawei, eBay, Yonyou and other organizations were invited to collaborate and develop TsFile.

On 15th November 2023, Apache TsFile was promoted directly to an Apache Top-Level Project (TLP) at the ASF board meeting.

Since its inception, TsFile has evolved rapidly, culminating in the release of version 1.0.0 on 20th February 2024. This release marked a significant milestone, featuring support for multiple data types, encoding algorithms, compression algorithms, and various write and query patterns.

Conclusion and Roadmap

TsFile’s efficient storage and compression rely on encoding algorithms tailored for time series data, with the aim of outperforming general-purpose columnar formats in both storage reduction and query speed. Its journey from a foundational component of Apache IoTDB to a standalone project reflects its impact and its commitment to advancing time series data management through efficiency, flexibility, and integration capabilities.

Looking ahead, TsFile's roadmap as an independent project includes its own SDK and documentation, multi-language support, integration of additional encoding and compression methods, and the development of more tools for visualization, parsing, and repair.

As the collaboration continues within the Apache community, TsFile is poised to redefine the landscape of time series data analytics and drive forward data-driven innovation. Developers, analysts, and organizations are welcome to join the vibrant community, explore TsFile’s capabilities, and contribute to its ongoing innovation.

References

  1. Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin A McGrail, Peng Wang, et al. 2020. Apache IoTDB: time-series database for internet of things. Proceedings of the VLDB Endowment, Vol. 13, 12 (2020), 2901-2904.

  2. Apache IoTDB homepage: https://iotdb.apache.org/

  3. Apache TsFile homepage: https://tsfile.apache.org/

  4. TsFile: A Standard Format for IoT Time Series Data, https://thenewstack.io/tsfile-a-standard-format-for-iot-time-series-data/