Data Pipelines
Elements of a Data Pipeline
At the most basic level, any data pipeline infrastructure must support four core elements:
Ingest
- Get raw data into the cloud and into your pipeline
- Extract data from its source and load it into the pipeline
- Involves copying data from source to target data store
Store
- Securely store raw and processed data in a cost-effective way
- Must support varying access patterns and retention requirements
Process
- Transform data so it becomes viable for analysis
- Data is processed iteratively to evaluate and improve results
- A linear pipeline is a simplified version of this iterative process
Analyze
- Discover details about the data to build visualizations or predictions
- Support ability to derive insights from processed data
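As a concrete illustration, the sketch below strings the four elements together in plain Python. The raw and processed directories stand in for the storage layer, and the file names, fields, and transformation logic are hypothetical.

```python
# Minimal sketch of the four pipeline elements (hypothetical paths and fields).
import csv
import json
from pathlib import Path
from statistics import mean

RAW_DIR = Path("raw")              # hypothetical landing area for ingested data
PROCESSED_DIR = Path("processed")  # hypothetical store for transformed data


def ingest(source_file: str) -> Path:
    """Ingest: copy raw data from its source into the pipeline's landing area."""
    RAW_DIR.mkdir(exist_ok=True)
    target = RAW_DIR / Path(source_file).name
    target.write_bytes(Path(source_file).read_bytes())  # copied as-is
    return target


def process(raw_file: Path) -> Path:
    """Process: transform raw CSV rows into a structure ready for analysis."""
    PROCESSED_DIR.mkdir(exist_ok=True)
    with raw_file.open(newline="") as f:
        rows = [
            {"order_id": r["order_id"], "amount": float(r["amount"])}
            for r in csv.DictReader(f)
        ]
    target = PROCESSED_DIR / (raw_file.stem + ".json")
    target.write_text(json.dumps(rows))
    return target


def analyze(processed_file: Path) -> dict:
    """Analyze: derive a simple insight (average order amount) from processed data."""
    rows = json.loads(processed_file.read_text())
    return {"orders": len(rows), "avg_amount": mean(r["amount"] for r in rows)}


if __name__ == "__main__":
    report = analyze(process(ingest("orders.csv")))  # orders.csv is hypothetical
    print(report)
```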
Pipeline Design Approach
Start with the end in mind: begin with the business problem and build the data pipeline that supports it. The characteristics of your data and the business problem determine:
- Elements of the pipeline
- Iterative process requirements
- Infrastructure architecture decisions
Ingestion Patterns
Section titled “Ingestion Patterns”Homogeneous Ingestion
- Objective: Move data from source to destination while keeping the same data format or storage engine type
- Process: Data ingested “as-is” without transformation
- Use Cases:
- Data migration (e.g., on-premises MySQL to Amazon RDS for MySQL)
- Populating landing areas where original copies are kept
- Raw text files stored without transformation
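A minimal sketch of homogeneous ingestion, assuming the boto3 SDK, AWS credentials, and a hypothetical landing bucket and source directory: each source file is uploaded unchanged, so the original copy is preserved.

```python
# Sketch of homogeneous ingestion: copy source files as-is into a landing area.
# Bucket, prefix, and directory names are hypothetical; assumes boto3 and AWS credentials.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
LANDING_BUCKET = "example-landing-bucket"  # hypothetical


def ingest_as_is(source_dir: str, prefix: str = "raw/") -> None:
    """Upload every file unchanged, preserving the original format."""
    for path in Path(source_dir).glob("*"):
        if path.is_file():
            # No parsing, no transformation: the original copy is kept.
            s3.upload_file(str(path), LANDING_BUCKET, prefix + path.name)


ingest_as_is("exports/")  # "exports/" is a hypothetical source directory
```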
Heterogeneous Ingestion Patterns
Section titled “Heterogeneous Ingestion Patterns”Extract, Transform, and Load (ETL)
- Process:
- Extract structured data
- Transform data into format matching destination
- Load data into structured storage for defined analytics
- Best For: Structured data destined for data warehouse
- Advantage: Stores data ready for analysis, saving analyst time
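A minimal ETL sketch in Python, with sqlite3 standing in for the structured destination (a real pipeline would typically target a data warehouse); the file name and schema are hypothetical. The key point is that data is shaped to the destination schema before it is loaded.

```python
# Sketch of ETL: extract structured rows, transform them to the target schema,
# then load into structured storage (sqlite3 stands in for a data warehouse).
import csv
import sqlite3


def extract(path: str) -> list[dict]:
    """Extract: read structured records from the source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: shape data to match the destination schema before loading."""
    return [
        (r["order_id"], r["customer_id"], round(float(r["amount"]), 2))
        for r in rows
        if r.get("amount")  # drop incomplete records up front
    ]


def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
    """Load: write transformed rows into structured storage."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer_id TEXT, amount REAL)"
    )
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()


load(transform(extract("orders.csv")))  # orders.csv is hypothetical
```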
Extract, Load, and Transform (ELT)
- Process:
- Extract unstructured or structured data
- Load data into storage destination in format close to raw form
- Transform data as needed for analytics scenarios
- Best For: Unstructured data destined for data lake
- Advantage: Flexibility to create new queries with access to more raw data
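A minimal ELT sketch under the same assumptions (sqlite3 as a stand-in, hypothetical file and field names): records are loaded close to their raw form first, and the transformation happens later, per analytics question, with full access to the raw payloads.

```python
# Sketch of ELT: load data close to its raw form first, transform later
# inside the destination as new analytics questions come up.
import json
import sqlite3

con = sqlite3.connect("lake.db")  # hypothetical stand-in for a data lake

# Load: keep each record as raw JSON, with no upfront schema decisions.
con.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
with open("events.jsonl") as f:  # events.jsonl is hypothetical
    con.executemany(
        "INSERT INTO raw_events VALUES (?)",
        [(line.strip(),) for line in f if line.strip()],
    )
con.commit()

# Transform (later, per analytics scenario): derive a view from the raw payloads.
clicks_per_page: dict[str, int] = {}
for (payload,) in con.execute("SELECT payload FROM raw_events"):
    event = json.loads(payload)
    if event.get("type") == "click":
        clicks_per_page[event["page"]] = clicks_per_page.get(event["page"], 0) + 1

print(clicks_per_page)
```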
Processing Patterns
Section titled “Processing Patterns”Batch Processing
- Characteristics:
- Computes results based on complete datasets
- Every command runs on entire batch of data
- Can be run on demand, on schedule, or based on events
- Use Cases:
- Daily and weekly reporting
- Deep analysis of large datasets
- Compute-intensive tasks during off-peak times
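A minimal batch-job sketch: the complete dataset is read before any result is produced, and the job would typically be triggered on a schedule or on demand. The file name and fields are hypothetical.

```python
# Sketch of batch processing: one job reads the complete dataset and
# produces a daily report. File name and fields are hypothetical.
import csv
from collections import defaultdict
from datetime import date


def daily_sales_report(path: str = "sales.csv") -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    with open(path, newline="") as f:
        # Every row in the batch is read before results are produced.
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["amount"])
    return dict(totals)


if __name__ == "__main__":
    # Typically triggered on a schedule (e.g., overnight) or on demand.
    print(f"Report for {date.today()}: {daily_sales_report()}")
```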
Streaming Processing
- Characteristics:
- Data stream is unbounded: a continuous, incremental sequence of small data packets
- Metrics or reports incrementally updated as new data arrives
- Processes series of events for real-time analytics
- Use Cases:
- Real-time analytics requiring immediate insights
- Continuous monitoring and alerting
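A minimal stream-processing sketch: a generator stands in for an unbounded stream (for example, a message queue or a managed streaming service), and metrics and alerts are updated incrementally as each event arrives. All names and thresholds are hypothetical, and the loop is cut short only so the example terminates.

```python
# Sketch of stream processing: metrics are updated incrementally per event,
# instead of being recomputed over a complete dataset.
import random
import time
from collections import defaultdict
from typing import Iterator


def event_stream() -> Iterator[dict]:
    """Stand-in for an unbounded stream (e.g., a queue or streaming service)."""
    while True:
        yield {"page": random.choice(["home", "cart", "checkout"]),
               "ms": random.randint(50, 500)}
        time.sleep(0.1)


views: dict[str, int] = defaultdict(int)
for i, event in enumerate(event_stream()):
    views[event["page"]] += 1  # incremental update as each event arrives
    if event["ms"] > 450:      # hypothetical latency threshold for alerting
        print(f"alert: slow response on {event['page']} ({event['ms']} ms)")
    if i == 50:                # a real stream would never end
        break

print(dict(views))
```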
Comparison: Batch vs Streaming
| Feature | Batch Processing | Streaming Processing |
| --- | --- | --- |
| Data Processing Cycles | Infrequently, typically during off-peak hours | Continuously |
| Compute Requirements | High computing power | Low computing power and reliable, low-latency network |
| Use Case Example | Sales transaction data analyzed overnight with morning reports | Product recommendations requiring immediate data analysis |
The choice between batch and streaming depends on business requirements for data freshness and the urgency of insights needed from the data.
A data pipeline integrates ingestion, storage, processing, and analysis layers. The pipeline design should consider data characteristics and business requirements to determine appropriate ingestion patterns (homogeneous vs heterogeneous) and processing patterns (batch vs streaming).