TileDB introduces a novel on-disk format for storing multi-dimensional arrays. Unlike other popular systems (e.g., HDF5), which are optimized mostly for dense arrays, TileDB is optimized for both dense and sparse arrays and exposes a single, unified array API for both. In addition, the TileDB format allows for efficient data ingestion and updates.
TileDB is a library exposing a C API, which makes it easy to integrate with popular higher-level programming languages (e.g., R, Python, MATLAB, Java) and data science tools (e.g., NumPy, Pandas, Spark).
TileDB is thread- and process-safe, allowing users to build powerful parallel computational engines on top of the TileDB array storage, using either multithreading or multiprocessing (e.g., with OpenMP or MPI). In addition, TileDB supports asynchronous writes and reads, enabling users to overlap IO with CPU-intensive operations and thereby boost performance.
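To illustrate what overlapping IO with CPU-intensive work means, here is a minimal sketch in plain Python. It does not use the TileDB API: read_tile and process are hypothetical stand-ins for an IO-bound tile read and a CPU-bound computation, and the tile size of 4 cells is an arbitrary assumption. The idea shown is only the pipelining pattern, where the read of the next tile is already in flight while the current tile is being processed.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def read_tile(tile_id):
    """Hypothetical stand-in for an IO-bound read of one tile (4 cells)."""
    time.sleep(0.01)  # simulate IO latency
    return list(range(tile_id * 4, tile_id * 4 + 4))


def process(cells):
    """Hypothetical stand-in for CPU-bound work on the cells of one tile."""
    return sum(c * c for c in cells)


def pipelined_sum(num_tiles):
    """Process tile i while the read of tile i + 1 runs in the background."""
    if num_tiles == 0:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_tile, 0)  # start the first read
        for i in range(num_tiles):
            cells = pending.result()  # wait for the current tile
            if i + 1 < num_tiles:
                # Kick off the next read before computing on this tile.
                pending = pool.submit(read_tile, i + 1)
            total += process(cells)
    return total


result = pipelined_sum(8)
print(result)
```

With a synchronous loop, every iteration would pay the full read latency plus the compute time; in the pipelined version the two are overlapped, so the slower of the two dominates each step.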
Beyond its efficient data format, TileDB is written in C/C++ and incorporates many low-level optimizations for achieving IO efficiency and a small main-memory footprint. The VLDB 2017 research paper evaluates TileDB against competing solutions for array storage operations and demonstrates its performance advantage.
TileDB can compress array data with a wide range of compressors, such as GZIP, BZIP2, LZ4, ZStandard, Blosc, double-delta, and run-length encoding. TileDB groups array elements (cells) into tiles, which are the atomic unit of compression and IO. Because each tile is compressed and read independently, a slice only needs to touch the tiles it overlaps, which enables fast slicing and dicing of arrays while still achieving high compression ratios. TileDB can be easily extended to support more compression mechanisms.
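The role of the tile as the unit of compression and IO can be sketched in a few lines of Python. This is not TileDB's on-disk format or API: it is a simplified 1-D illustration that assumes 32-bit integer cells, a hypothetical tile size of 4 cells, and zlib (the DEFLATE algorithm behind GZIP) as the compressor. The names write_tiles and read_slice are invented for this example.

```python
import struct
import zlib

TILE_SIZE = 4  # cells per tile; arbitrary value chosen for illustration


def write_tiles(cells):
    """Split a 1-D list of int32 cells into tiles and compress each tile on its own."""
    tiles = []
    for start in range(0, len(cells), TILE_SIZE):
        chunk = cells[start:start + TILE_SIZE]
        raw = struct.pack(f"<{len(chunk)}i", *chunk)
        tiles.append(zlib.compress(raw))
    return tiles


def read_slice(tiles, lo, hi):
    """Return (cells[lo:hi], tiles_touched), decompressing only overlapping tiles."""
    if lo >= hi or not tiles:
        return [], 0
    first = lo // TILE_SIZE
    last = min((hi - 1) // TILE_SIZE, len(tiles) - 1)
    out = []
    touched = 0
    for t in range(first, last + 1):
        raw = zlib.decompress(tiles[t])
        chunk = struct.unpack(f"<{len(raw) // 4}i", raw)
        base = t * TILE_SIZE  # global index of the tile's first cell
        out.extend(v for k, v in enumerate(chunk) if lo <= base + k < hi)
        touched += 1
    return out, touched


cells = list(range(100, 116))  # 16 cells, stored as 4 tiles
tiles = write_tiles(cells)
print(read_slice(tiles, 5, 10))  # -> ([105, 106, 107, 108, 109], 2)
```

The slice of cells 5 through 9 decompresses only two of the four tiles. Smaller tiles reduce the amount of unneeded data read per slice, while larger tiles give the compressor more context; the tile size is the knob that trades one against the other.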
TileDB is constantly being optimized for a wide range of storage backends beyond local filesystems, such as Hadoop File System (HDFS), S3 object storage, Google File System (GFS), and more. TileDB abstracts the array storage layer, offering users a unified global view of their arrays that is agnostic to the actual storage backend.