Intro notes on Apache Druid

Notes

Blurbs

  • "A modern cloud-native, stream-native, analytics database"
  • "next-gen open source alternative to analytical databases such as Vertica, Greenplum, and Exadata, and data warehouses such as Snowflake, BigQuery, and Redshift"
  • "Unified system for operational analytics"
  • "Apache Druid generally works well with any event-oriented, clickstream, timeseries, or telemetry data, especially streaming datasets from Apache Kafka."
  • "Druid provides exactly once consumption semantics from Apache Kafka and is commonly used as a sink for event-oriented Kafka topics."
  • Integrates batch and streaming: "Druid can natively stream data from message buses such as Kafka, Amazon Kinesis, and more, and batch load files from data lakes such as HDFS, Amazon S3, and more"
  • "Druid is a real-time columnar timeseries database on steroids that scales veryyyyy well."
  • Users: Netflix, Airbnb, Walmart
  • Addresses: timeseries, high cardinality, fast-query, streaming
  • "new type of database to power real-time analytic workloads for event-driven data"
  • advantages over traditional data warehouses: Low latency streaming ingest, and direct integration with messages buses such as Apache Kafka. Time-based partitioning, which enables performant time-based queries. Fast search and filter, for fast ad-hoc slice and dice. Minimal schema design, and native support for semi-structured and nested data.

Uses

  • Druid is often used to ingest and analyze semi-structured data such as JSON
  • Druid is optimized for data where a timestamp is present
  • Druid partitions data by time, and queries that include a time filter will be significantly faster than those that do not.
  • "Real Time Aggregations" is a primary reason why developers consider Druid over the competitors - https://stackshare.io/stackups/druid-vs-spark

Concepts:

  • "Your data is stored into segments. Segments are immutable. Once they are created, you cannot update it. (You can create a new version of a segment, but that implies re-indexing all the data for the period)". Segments are basically a group-by roll up. Can specify how frequently to create them (hourly, daily). A segment has: Timestamp, Dimension(s), Metric(s). Metrics are aggregated over the dimensions & time span. [tds-intro]

Limitations

  • No window function / rolling averages (must be done externally)
  • Not possible to join data (should be done upstream)
  • Data is read only. Best practice is to store IDs not attributes like names (e.g. customer-id not customer-name)

Visualization:

References

Introductions

Basic Links

Site by Zachariah Garner, 2019