- "A modern cloud-native, stream-native, analytics database"
- "next-gen open source alternative to analytical databases such as Vertica, Greenplum, and Exadata, and data warehouses such as Snowflake, BigQuery, and Redshift"
- "Unified system for operational analytics"
- "Apache Druid generally works well with any event-oriented, clickstream, timeseries, or telemetry data, especially streaming datasets from Apache Kafka."
- "Druid provides exactly once consumption semantics from Apache Kafka and is commonly used as a sink for event-oriented Kafka topics."
- Integrates batch and streaming: "Druid can natively stream data from message buses such as Kafka, Amazon Kinesis, and more, and batch load files from data lakes such as HDFS, Amazon S3, and more"
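Streaming ingestion is driven by a supervisor spec submitted to Druid. A rough sketch of a Kafka ingestion spec is below; the topic, datasource, column names, and broker address are hypothetical, and field names should be checked against the Druid Kafka ingestion docs for the version in use:

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "topic": "clicks",
      "consumerProperties": { "bootstrap.servers": "kafka:9092" }
    },
    "dataSchema": {
      "dataSource": "clicks",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["country", "page"] },
      "granularitySpec": {
        "segmentGranularity": "hour",
        "queryGranularity": "minute",
        "rollup": true
      }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```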
- "Druid is a real-time columnar timeseries database on steroids that scales veryyyyy well."
- Users: Netflix, Airbnb, Walmart
- Addresses: timeseries, high cardinality, fast-query, streaming
- "new type of database to power real-time analytic workloads for event-driven data"
- Advantages over traditional data warehouses:
  - Low-latency streaming ingest, with direct integration with message buses such as Apache Kafka
  - Time-based partitioning, which enables performant time-based queries
  - Fast search and filter, for fast ad-hoc slice and dice
  - Minimal schema design, with native support for semi-structured and nested data
- Druid is often used to ingest and analyze semi-structured data such as JSON
- Druid is optimized for data where a timestamp is present
- Druid partitions data by time, and queries that include a time filter will be significantly faster than those that do not.
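Why time filters help: segments are partitioned by time, so a time-bounded query can skip whole segments without reading them. Druid's pruning is internal; this is only a toy in-memory sketch of the idea, with hypothetical hourly "segments":

```python
from datetime import datetime, timedelta

# Toy model: segments are hourly buckets of rows, keyed by hour start.
segments = {
    datetime(2023, 1, 1, h): [{"ts": datetime(2023, 1, 1, h, 30), "clicks": h}]
    for h in range(24)
}

def query(start, end):
    """Scan only segments whose hour overlaps [start, end) -- time pruning."""
    scanned = 0
    total = 0
    for seg_start, rows in segments.items():
        if seg_start >= end or seg_start + timedelta(hours=1) <= start:
            continue  # segment pruned without reading any rows
        scanned += 1
        total += sum(r["clicks"] for r in rows if start <= r["ts"] < end)
    return scanned, total

# A 3-hour window touches 3 of the 24 segments instead of all of them.
scanned, total = query(datetime(2023, 1, 1, 6), datetime(2023, 1, 1, 9))
print(scanned, total)  # 3 21
```

A query without a time filter would have to scan every segment, which is why the notes above call time-filtered queries significantly faster.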
- "Real Time Aggregations" is a primary reason why developers consider Druid over the competitors - https://stackshare.io/stackups/druid-vs-spark
- "Your data is stored into segments. Segments are immutable. Once they are created, you cannot update it. (You can create a new version of a segment, but that implies re-indexing all the data for the period)" [tds-intro]
  - Segments are basically a group-by roll-up; you can specify how frequently to create them (hourly, daily)
  - A segment has: timestamp, dimension(s), metric(s); metrics are aggregated over the dimensions & time span
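The roll-up described above is essentially a group-by over the truncated timestamp plus the dimensions, with metrics aggregated per group. A minimal sketch, with hypothetical clickstream fields and hourly granularity:

```python
from collections import defaultdict
from datetime import datetime

# Raw events (hypothetical clickstream rows).
events = [
    {"ts": datetime(2023, 1, 1, 10, 5),  "country": "US", "clicks": 1},
    {"ts": datetime(2023, 1, 1, 10, 40), "country": "US", "clicks": 2},
    {"ts": datetime(2023, 1, 1, 10, 50), "country": "DE", "clicks": 1},
    {"ts": datetime(2023, 1, 1, 11, 10), "country": "US", "clicks": 5},
]

def rollup(rows, dims, metric):
    """Group by (hour-truncated timestamp, dimension values), summing the metric."""
    out = defaultdict(int)
    for r in rows:
        hour = r["ts"].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(r[d] for d in dims)
        out[key] += r[metric]
    return dict(out)

rolled = rollup(events, dims=["country"], metric="clicks")
# Four raw rows collapse to three rolled-up rows:
#   (10:00, US) -> 3, (10:00, DE) -> 1, (11:00, US) -> 5
```

Because segments store only these rolled-up rows, the original per-event detail inside an hour is gone, which is the trade-off behind the immutability note above.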
- No window functions / rolling averages (must be computed externally)
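Per the note above, a rolling average has to be computed client-side over the query result. A minimal sketch in plain Python, with a hypothetical hourly series standing in for a Druid timeseries result:

```python
def rolling_avg(values, window):
    """Trailing moving average over a list of metric values (e.g. a Druid
    timeseries result), computed client-side since the database won't."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        chunk = values[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

hourly_clicks = [10, 20, 30, 40]      # e.g. fetched from a Druid query
print(rolling_avg(hourly_clicks, 2))  # [10.0, 15.0, 25.0, 35.0]
```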
- Not possible to join data (should be done upstream)
- Data is read-only. Best practice is to store IDs rather than mutable attributes like names (e.g. customer-id, not customer-name), since a changed attribute would otherwise require re-indexing segments
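Per the two notes above, any join happens upstream of Druid: events are enriched against a dimension lookup before they reach Kafka/Druid. A minimal sketch, with hypothetical field and table names:

```python
# Hypothetical dimension table, maintained outside Druid.
customers = {"c-1": {"segment": "enterprise"}, "c-2": {"segment": "free"}}

def enrich(event, lookup):
    """Denormalize the event against the lookup before ingestion,
    since Druid won't join at query time (per these notes)."""
    attrs = lookup.get(event["customer_id"], {})
    return {**event, **attrs}

event = {"customer_id": "c-1", "clicks": 3}
print(enrich(event, customers))
# {'customer_id': 'c-1', 'clicks': 3, 'segment': 'enterprise'}
```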
- Apache Superset is a visualization / BI tool with strong Druid integration
- Druid + Superset - https://medium.com/swlh/druid-and-superset-for-real-time-monitoring-at-scale-ac5d6727fe8c
- tds-intro - https://towardsdatascience.com/introduction-to-druid-4bf285b92b5a - Useful starting point