Data pipelines are tested against expected inputs, but real-world data contains edge cases that break hardcoded assumptions. Pipelines assume a particular shape, completeness, and set of relationships in their data; those assumptions hold in development and testing, then break in production when edge cases appear.
Null handling:

```sql
-- Assumes stock_count is never null
SELECT product_id, SUM(stock_count)
FROM inventory
GROUP BY product_id;
-- Breaks when the warehouse system sends null for items being counted
```
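The failure mode is concrete to demonstrate. A sketch against SQLite with made-up rows (the null-counting column is one defensive pattern, not the only option):

```python
import sqlite3

# Hypothetical inventory rows; None models the nulls a warehouse feed
# might send for items that are mid-count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (product_id INTEGER, stock_count INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [(1, 10), (1, None), (2, None)],
)

# SUM skips NULL rows: product 1 silently undercounts, product 2 comes back NULL.
totals = conn.execute(
    "SELECT product_id, SUM(stock_count) FROM inventory GROUP BY product_id"
).fetchall()
print(totals)  # [(1, 10), (2, None)]

# Counting the NULLs alongside the total lets the pipeline fail loudly
# instead of shipping a partial number.
checked = conn.execute(
    """SELECT product_id,
              SUM(stock_count) AS total,
              SUM(stock_count IS NULL) AS null_rows
       FROM inventory GROUP BY product_id"""
).fetchall()
print(checked)  # [(1, 10, 1), (2, None, 1)]
```

Note that the aggregate does not error: it quietly drops the null rows, which is exactly why this class of bug survives testing.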
Implicit joins:

```sql
-- Assumes every order has a customer
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- Breaks when guest checkout creates orders without customer_id:
-- those orders silently vanish from the result
```
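A minimal SQLite sketch with invented rows shows the drop, and one common mitigation (`LEFT JOIN`, which keeps unmatched orders rather than discarding them):

```python
import sqlite3

# Toy data: order 101 is a guest checkout with no customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (100, 1), (101, NULL);
""")

# The inner join silently drops the guest order.
inner = conn.execute(
    "SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id"
).fetchall()
print(inner)  # [(100, 'Ada')]

# A LEFT JOIN keeps it, surfacing NULL where the name is unknown,
# so downstream code can decide how to handle guests explicitly.
left = conn.execute(
    "SELECT o.id, c.name FROM orders o LEFT JOIN customers c ON o.customer_id = c.id"
).fetchall()
print(left)  # [(100, 'Ada'), (101, None)]
```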
Date boundaries:

```sql
-- Assumes all orders are recent
SELECT * FROM orders
WHERE order_date > CURRENT_DATE - 90;
-- Breaks when processing historical data or backfills
```

These edge cases don't surface in testing because test data is clean. Production data has:
- Orders from deleted customers
- Transactions with missing timestamps
- Records that violate expected cardinality (one customer, multiple "primary" addresses)
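The cardinality case is the quietest of the three: a duplicated "primary" row makes joins fan out, and aggregates inflate without any error. A sketch (SQLite, invented rows and table names):

```python
import sqlite3

# Toy data: customer 7 has two rows flagged "primary" -- a cardinality
# violation the schema was assumed to prevent.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE addresses (customer_id INTEGER, line1 TEXT, is_primary INTEGER);
    INSERT INTO orders VALUES (1, 7, 50.0);
    INSERT INTO addresses VALUES (7, '1 Main St', 1), (7, '2 Oak Ave', 1);
""")

# The join fans out: one $50 order matches both "primary" rows,
# so revenue silently doubles.
total = conn.execute("""
    SELECT SUM(o.amount) FROM orders o
    JOIN addresses a ON a.customer_id = o.customer_id AND a.is_primary = 1
""").fetchone()[0]
print(total)  # 100.0

# A pre-join check surfaces the violation before it corrupts the total.
dupes = conn.execute("""
    SELECT customer_id FROM addresses
    WHERE is_primary = 1
    GROUP BY customer_id HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [(7,)]
```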
The pipeline either fails hard (better) or silently produces wrong results (worse). These problems compound when orchestration tools keep running pipelines without validating whether the data meets current business standards.
The fix is a set of validation rules that check these assumptions before execution:
- Are all required joins satisfied?
- Is data within expected validity windows?
- Do records meet cardinality constraints?
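The three questions above can be sketched as pre-execution checks. This is a minimal illustration, not a fixed API: the check names, the 90-day window, and the table names (which follow the earlier examples) are all assumptions.

```python
import sqlite3

# Each check counts violating rows; zero means the assumption holds.
CHECKS = {
    # Are all required joins satisfied?
    "orphan_orders":
        "SELECT COUNT(*) FROM orders o LEFT JOIN customers c "
        "ON o.customer_id = c.id WHERE c.id IS NULL",
    # Is data within the expected validity window?
    "stale_orders":
        "SELECT COUNT(*) FROM orders WHERE order_date <= date('now', '-90 days')",
    # Do records meet cardinality constraints?
    "dup_primary_addresses":
        "SELECT COUNT(*) FROM (SELECT customer_id FROM addresses "
        "WHERE is_primary = 1 GROUP BY customer_id HAVING COUNT(*) > 1)",
}

def validate(conn: sqlite3.Connection) -> dict[str, int]:
    """Count violations per check; any nonzero result should halt the run."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}

# Demo: one orphaned order, everything else clean.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, order_date TEXT);
    CREATE TABLE addresses (customer_id INTEGER, is_primary INTEGER);
    INSERT INTO customers VALUES (1);
    INSERT INTO orders VALUES (10, 1, date('now')), (11, 2, date('now'));
    INSERT INTO addresses VALUES (1, 1);
""")
print(validate(conn))  # {'orphan_orders': 1, 'stale_orders': 0, 'dup_primary_addresses': 0}
```

Running the checks before the pipeline body turns the silent-corruption cases into hard failures, which the next paragraph argues is the better of the two outcomes.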
Business rules define when data is safe to use. Context-aware workflows validate these assumptions before execution, preventing static pipelines from breaking on unexpected data.