Why Traditional Data Quality Tools Fall Short in Streaming Data Ecosystems
“Moving Beyond Great Expectations for Real-Time Data Quality Assurance”
In today’s data ecosystems, businesses increasingly rely on streaming data pipelines to enable near real-time decision-making, alerting, personalization, and analytics. Yet, while the volume and velocity of data have evolved, many teams continue to lean on traditional data quality (DQ) frameworks like Great Expectations and Deequ that were built for static, batch-based workflows.
This blog explores why such frameworks fall short for streaming data quality and how modern systems need fundamentally different capabilities to ensure trust in high-velocity, continuously flowing data streams.
Streaming vs. Batch: A Paradigm Shift in Data Quality Needs
| Characteristic | Batch Data | Streaming Data |
|---|---|---|
| Data Availability | Finite, known datasets | Infinite, continuous streams |
| Processing Style | Periodic or scheduled | Real-time / micro-batched |
| Validation Window | Entire dataset or partition | Sliding/tumbling time windows |
| Metrics and Checks | Snapshot-level, aggregate-based | Incremental, time-sensitive |
| Latency Expectations | Minutes to hours | Sub-second to a few seconds |
| Error Tolerance | Allows manual intervention | Must be automated and non-blocking |
Batch data quality focuses on completeness, uniqueness, accuracy, and schema validation over a finite dataset. In contrast, streaming pipelines require ongoing monitoring, low-latency checks, anomaly detection, and resilience to drift or out-of-order events.
Shortcomings of Great Expectations & Deequ in Streaming
1. Batch-First Architecture
Great Expectations and Deequ operate on static DataFrames or RDDs. They require data to be fully loaded into memory or processed in full partitions. Streaming pipelines, by nature, require processing unbounded data incrementally.
2. No Native Support for Windowed Computation
Streaming often involves checks over rolling windows (e.g., last 5 minutes, hourly aggregates). GE and Deequ do not support stateful aggregations or sliding window validations out of the box.
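For contrast, here is a minimal sketch of the kind of windowed check a streaming pipeline needs, written with Spark Structured Streaming. The topic name, message schema, and window sizes are illustrative assumptions, not a prescription:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("windowed-dq-sketch").getOrCreate()

# Assumed: an "orders" Kafka topic whose messages carry order_id, order_amount,
# and event_time once the JSON payload is parsed.
orders = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "orders")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", "order_id STRING, order_amount DOUBLE, event_time TIMESTAMP").alias("o"))
    .select("o.*")
)

# Null rate per 5-minute tumbling window -- the stateful, windowed aggregation
# that batch-first tools do not give you out of the box.
null_rates = (
    orders
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"))
    .agg(
        F.count("*").alias("rows"),
        F.sum(F.when(F.col("order_id").isNull(), 1).otherwise(0)).alias("null_ids"),
    )
    .withColumn("null_rate", F.col("null_ids") / F.col("rows"))
)

query = null_rates.writeStream.outputMode("update").format("console").start()
```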
3. Lack of Real-Time Alerting
These tools are designed to generate post-run reports, often in HTML or JSON, but do not support real-time event triggers. In streaming systems, alerting must be immediate to flag and act on critical issues (e.g., schema drift or null spikes).
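What immediate alerting can look like in practice: publish an alert event to a dedicated topic the moment a check fails, instead of waiting for a post-run report. This is only a sketch; the `dq-alerts` topic and the payload shape are assumptions:

```python
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def emit_alert(check: str, severity: str, message: str) -> None:
    """Publish a data quality alert as soon as a check fails."""
    alert = {"check": check, "severity": severity, "message": message}
    producer.produce("dq-alerts", value=json.dumps(alert).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks without blocking

# Example: a null spike detected by a streaming check
emit_alert("null_rate:order_id", "critical",
           "Null rate exceeded 1% in the last 5-minute window")
producer.flush()
```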
4. No Support for Streaming Engines
Neither GE nor Deequ provides tight integration with streaming engines such as:
- Apache Kafka
- Apache Flink
- Spark Structured Streaming
Any integration requires custom wrappers, complicating deployment and reducing maintainability.
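To make the cost of those wrappers concrete, here is roughly what it looks like to reuse Great Expectations inside Spark Structured Streaming's `foreachBatch` hook, assuming the legacy `ge.from_pandas` convenience API and the parsed `orders` stream from the earlier sketch. The pandas conversion alone shows the scalability compromise involved:

```python
import great_expectations as ge  # legacy convenience API; availability depends on the GE version

def validate_micro_batch(batch_df, batch_id):
    # Converting the micro-batch to pandas just to reuse a batch-first tool is
    # exactly the kind of glue code (and memory risk) described above.
    gdf = ge.from_pandas(batch_df.toPandas())
    result = gdf.expect_column_values_to_not_be_null("order_id")
    if not result.success:
        # A real pipeline would publish to an alert topic or pager, not print.
        print(f"[batch {batch_id}] null order_id values detected")

(orders.writeStream
    .foreachBatch(validate_micro_batch)
    .start())
```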
5. No Support for Anomaly Detection or Drift Detection
Streaming data distributions shift over time. A DQ system must detect distributional drift, seasonality, and unexpected patterns. GE and Deequ rely on static, predefined rules unless explicitly configured otherwise.
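To illustrate what "configured otherwise" can mean, here is a deliberately simple drift heuristic that flags a numeric field when the mean of its recent values moves too far from a fixed baseline. The class name, thresholds, and numbers are made up for illustration, and this is not how GE or Deequ work internally:

```python
import math
from collections import deque

class MeanDriftDetector:
    """Flags drift when the recent window mean moves more than k standard
    errors away from a fixed baseline -- a deliberately simple heuristic."""

    def __init__(self, baseline_mean, baseline_std, window_size=1000, k=3.0):
        self.baseline_mean = baseline_mean
        self.baseline_std = baseline_std
        self.window = deque(maxlen=window_size)
        self.k = k

    def observe(self, value) -> bool:
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        window_mean = sum(self.window) / len(self.window)
        # Standard error of the window mean under the baseline distribution.
        std_err = self.baseline_std / math.sqrt(len(self.window))
        return abs(window_mean - self.baseline_mean) > self.k * std_err

# Toy usage with a tiny window so the example actually fires
detector = MeanDriftDetector(baseline_mean=120.0, baseline_std=45.0, window_size=3)
for amount in [118.5, 122.0, 540.0]:
    if detector.observe(amount):
        print("drift suspected in order_amount")
```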
What Streaming Data Quality Actually Requires
To truly validate streaming data, you need tools and platforms that:
- Support incremental validation using windowed aggregations
- Allow low-latency validation and alerting (sub-second to a few seconds)
- Handle stateful checks such as row counts over time and freshness (a minimal sketch follows this list)
- Are fault-tolerant and scalable under load
- Integrate natively with stream processing engines
- Can detect data anomalies and drift automatically
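As a small example of the freshness requirement, here is a minimal stateful sketch that tracks late-arrival rates per five-minute window against an assumed two-minute SLA (the same SLA used in the sample report further down):

```python
from collections import defaultdict
from datetime import datetime, timezone

SLA_SECONDS = 120  # assumed 2-minute lateness SLA

# Late-arrival and total counts keyed by 5-minute window start.
late_counts = defaultdict(int)
total_counts = defaultdict(int)

def record_event(event_time_iso: str) -> None:
    """Update per-window counters as each event is processed."""
    event_time = datetime.fromisoformat(event_time_iso.replace("Z", "+00:00"))
    window_start = event_time.replace(minute=(event_time.minute // 5) * 5,
                                      second=0, microsecond=0)
    total_counts[window_start] += 1
    lag = (datetime.now(timezone.utc) - event_time).total_seconds()
    if lag > SLA_SECONDS:
        late_counts[window_start] += 1

def late_rate(window_start) -> float:
    """Fraction of events in the window that missed the SLA."""
    return late_counts[window_start] / max(total_counts[window_start], 1)
```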
Tools Better Suited for Streaming Data Quality
1. Apache Flink + Custom DQ Operators
Flink’s support for stateful stream processing and custom operators allows building tailored data quality checks (e.g., null checks, threshold alerts, lag tracking) within pipelines.
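A minimal PyFlink sketch of such an operator: a `ProcessFunction` that emits alert records for missing keys and threshold breaches. The in-memory source, field layout, and $10,000 threshold are assumptions to keep the example self-contained; a real job would read from a Kafka connector:

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import ProcessFunction

class OrderChecks(ProcessFunction):
    """Emits an alert string for each record that fails a basic DQ rule."""

    def process_element(self, value, ctx):
        order_id, order_amount = value
        if not order_id:
            yield f"ALERT: missing order_id (amount={order_amount})"
        elif order_amount > 10_000:
            yield f"ALERT: order_amount threshold breached: {order_amount}"

env = StreamExecutionEnvironment.get_execution_environment()

# A tiny in-memory source keeps the sketch runnable without infrastructure.
orders = env.from_collection([("o-1", 250.0), ("", 99.0), ("o-2", 12_500.0)])

alerts = orders.process(OrderChecks())
alerts.print()
env.execute("dq-operator-sketch")
```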
2. Kafka Streams with Embedded Validations
DQ logic can be embedded directly into Kafka Streams applications, enabling record-by-record validation with immediate rejection, transformation, or logging.
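Kafka Streams itself is a JVM library, so to keep the sketches in Python this example shows the same validate-and-route pattern with the confluent-kafka client: each record is checked as it arrives and routed to a validated or rejected topic. Topic names, group id, and the message schema are assumptions:

```python
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dq-validator",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders"])

while True:  # a real service would handle shutdown and commit offsets explicitly
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Per-record check: reject, repair, or log -- then route accordingly.
    if record.get("order_id") is None:
        producer.produce("orders.rejected", msg.value())
    else:
        producer.produce("orders.validated", json.dumps(record).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks
```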
3. Monte Carlo, Anomalo, and Acceldata
These tools offer commercial monitoring layers that detect:
- Sudden volume changes
- Freshness lags
- Schema or distributional drift
They work across batch and streaming pipelines, often integrating with metadata catalogs and data lineage tools.
4. dbt + Metrics Layer (for near-real-time)
If you are micro-batching rather than truly streaming, dbt Cloud combined with its semantic layer and alerting can approximate real-time checks by running on tighter schedules and tracking quality metrics over time.
Sample Streaming Data Quality Report (Simulated)
```json
{
  "stream": "kafka://orders",
  "window": "2025-08-05T15:00:00Z to 15:05:00Z",
  "checks": {
    "null_rate:order_id": "0.000%",
    "duplicate_rate:order_id": "0.003%",
    "out_of_range:order_amount": "1.1%",
    "late_arrivals (>2 min delay)": "3.6%",
    "schema_drift_detected": false
  },
  "status": "warnings",
  "alerts": [
    {
      "type": "threshold_breach",
      "field": "order_amount",
      "severity": "warning",
      "message": "1.1% of order_amounts exceeded the configured limit of $10,000"
    },
    {
      "type": "timeliness",
      "severity": "warning",
      "message": "3.6% of events arrived later than the 2-minute SLA"
    }
  ]
}
```
Final Thoughts
As data pipelines modernize, streaming-first architectures are becoming the norm. To keep up, data quality checks must evolve too. While Great Expectations and Deequ excel in the batch world, they fall short in delivering the speed, flexibility, and observability required for streaming use cases.
Organizations should invest in tools that support incremental validation, anomaly detection, and real-time alerting to maintain trust in their continuously flowing data.
Have you tried building data quality validations in Flink or Kafka Streams? Share your approach or tools in the comments below!