Data Pipelines
Elements of a Data Pipeline
At the most basic level, any data pipeline infrastructure must support four core elements:
Ingest
- Get raw data into the cloud and into your pipeline
- Extract data from its source and load it into the pipeline
- Involves copying data from source to target data store
Store
- Securely store raw and processed data in a cost-effective way
- Must support varying access patterns and retention requirements
Process
- Transform data so it becomes viable for analysis
- Data is processed iteratively to evaluate and improve results
- A linear pipeline is a simplified version of this iterative process
Analyze
- Discover details about the data to build visualizations or predictions
- Support ability to derive insights from processed data
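As a concrete illustration, the sketch below strings the four elements together in plain Python. The raw and processed directories stand in for the storage layer, and the file names, fields, and transformation logic are hypothetical.

```python
# Minimal sketch of the four pipeline elements (hypothetical paths and fields).
import csv
import json
from pathlib import Path
from statistics import mean

RAW_DIR = Path("raw")              # hypothetical landing area for ingested data
PROCESSED_DIR = Path("processed")  # hypothetical store for transformed data


def ingest(source_file: str) -> Path:
    """Ingest: copy raw data from its source into the pipeline's landing area."""
    RAW_DIR.mkdir(exist_ok=True)
    target = RAW_DIR / Path(source_file).name
    target.write_bytes(Path(source_file).read_bytes())  # copied as-is
    return target


def process(raw_file: Path) -> Path:
    """Process: transform raw CSV rows into a structure ready for analysis."""
    PROCESSED_DIR.mkdir(exist_ok=True)
    with raw_file.open(newline="") as f:
        rows = [
            {"order_id": r["order_id"], "amount": float(r["amount"])}
            for r in csv.DictReader(f)
        ]
    target = PROCESSED_DIR / (raw_file.stem + ".json")
    target.write_text(json.dumps(rows))
    return target


def analyze(processed_file: Path) -> dict:
    """Analyze: derive a simple insight (average order amount) from processed data."""
    rows = json.loads(processed_file.read_text())
    return {"orders": len(rows), "avg_amount": mean(r["amount"] for r in rows)}


if __name__ == "__main__":
    report = analyze(process(ingest("orders.csv")))  # orders.csv is hypothetical
    print(report)
```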
Pipeline Design Approach
Start with the end in mind: begin with the business problem and build the data pipeline that supports it. The characteristics of your data and the business problem determine:
- Elements of the pipeline
- Iterative process requirements
- Infrastructure architecture decisions
Ingestion Patterns
Section titled “Ingestion Patterns”Homogeneous Ingestion
- Objective: Move data from source to destination while keeping the same data format or storage engine type
- Process: Data ingested “as-is” without transformation
- Use Cases:
- Data migration (e.g., on-premises MySQL to Amazon RDS for MySQL)
- Populating landing areas where original copies are kept
- Raw text files stored without transformation
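A minimal sketch of homogeneous ingestion, assuming the boto3 SDK, AWS credentials, and a hypothetical landing bucket and source directory: each source file is uploaded unchanged, so the original copy is preserved.

```python
# Sketch of homogeneous ingestion: copy source files as-is into a landing area.
# Bucket, prefix, and directory names are hypothetical; assumes boto3 and AWS credentials.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
LANDING_BUCKET = "example-landing-bucket"  # hypothetical


def ingest_as_is(source_dir: str, prefix: str = "raw/") -> None:
    """Upload every file unchanged, preserving the original format."""
    for path in Path(source_dir).glob("*"):
        if path.is_file():
            # No parsing, no transformation: the original copy is kept.
            s3.upload_file(str(path), LANDING_BUCKET, prefix + path.name)


ingest_as_is("exports/")  # "exports/" is a hypothetical source directory
```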
Heterogeneous Ingestion Patterns
Section titled “Heterogeneous Ingestion Patterns”Extract, Transform, and Load (ETL)
- Process:
- Extract structured data
- Transform data into format matching destination
- Load data into structured storage for defined analytics
- Best For: Structured data destined for data warehouse
- Advantage: Stores data ready for analysis, saving analyst time
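A minimal ETL sketch in Python, with sqlite3 standing in for the structured destination (a real pipeline would typically target a data warehouse); the file name and schema are hypothetical. The key point is that data is shaped to the destination schema before it is loaded.

```python
# Sketch of ETL: extract structured rows, transform them to the target schema,
# then load into structured storage (sqlite3 stands in for a data warehouse).
import csv
import sqlite3


def extract(path: str) -> list[dict]:
    """Extract: read structured records from the source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: shape data to match the destination schema before loading."""
    return [
        (r["order_id"], r["customer_id"], round(float(r["amount"]), 2))
        for r in rows
        if r.get("amount")  # drop incomplete records up front
    ]


def load(rows: list[tuple], db_path: str = "warehouse.db") -> None:
    """Load: write transformed rows into structured storage."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer_id TEXT, amount REAL)"
    )
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()


load(transform(extract("orders.csv")))  # orders.csv is hypothetical
```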
Extract, Load, and Transform (ELT)
- Process:
- Extract unstructured or structured data
- Load data into storage destination in format close to raw form
- Transform data as needed for analytics scenarios
- Best For: Unstructured data destined for data lake
- Advantage: Flexibility to create new queries with access to more raw data
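A minimal ELT sketch under the same assumptions (sqlite3 as a stand-in, hypothetical file and field names): records are loaded close to their raw form first, and the transformation happens later, per analytics question, with full access to the raw payloads.

```python
# Sketch of ELT: load data close to its raw form first, transform later
# inside the destination as new analytics questions come up.
import json
import sqlite3

con = sqlite3.connect("lake.db")  # hypothetical stand-in for a data lake

# Load: keep each record as raw JSON, with no upfront schema decisions.
con.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
with open("events.jsonl") as f:  # events.jsonl is hypothetical
    con.executemany(
        "INSERT INTO raw_events VALUES (?)",
        [(line.strip(),) for line in f if line.strip()],
    )
con.commit()

# Transform (later, per analytics scenario): derive a view from the raw payloads.
clicks_per_page: dict[str, int] = {}
for (payload,) in con.execute("SELECT payload FROM raw_events"):
    event = json.loads(payload)
    if event.get("type") == "click":
        clicks_per_page[event["page"]] = clicks_per_page.get(event["page"], 0) + 1

print(clicks_per_page)
```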
Processing Patterns
Section titled “Processing Patterns”Batch Processing
- Characteristics:
- Computes results based on complete datasets
- Every command runs on entire batch of data
- Can be run on demand, on schedule, or based on events
- Use Cases:
- Daily and weekly reporting
- Deep analysis of large datasets
- Compute-intensive tasks during off-peak times
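A minimal batch-job sketch: the complete dataset is read before any result is produced, and the job would typically be triggered on a schedule or on demand. The file name and fields are hypothetical.

```python
# Sketch of batch processing: one job reads the complete dataset and
# produces a daily report. File name and fields are hypothetical.
import csv
from collections import defaultdict
from datetime import date


def daily_sales_report(path: str = "sales.csv") -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    with open(path, newline="") as f:
        # Every row in the batch is read before results are produced.
        for row in csv.DictReader(f):
            totals[row["region"]] += float(row["amount"])
    return dict(totals)


if __name__ == "__main__":
    # Typically triggered on a schedule (e.g., overnight) or on demand.
    print(f"Report for {date.today()}: {daily_sales_report()}")
```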
Streaming Processing
- Characteristics:
- Data stream is unbounded: a continuous, incremental sequence of small data packets
- Metrics or reports incrementally updated as new data arrives
- Processes series of events for real-time analytics
- Use Cases:
- Real-time analytics requiring immediate insights
- Continuous monitoring and alerting
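A minimal stream-processing sketch: a generator stands in for an unbounded stream (for example, a message queue or a managed streaming service), and metrics and alerts are updated incrementally as each event arrives. All names and thresholds are hypothetical, and the loop is cut short only so the example terminates.

```python
# Sketch of stream processing: metrics are updated incrementally per event,
# instead of being recomputed over a complete dataset.
import random
import time
from collections import defaultdict
from typing import Iterator


def event_stream() -> Iterator[dict]:
    """Stand-in for an unbounded stream (e.g., a queue or streaming service)."""
    while True:
        yield {"page": random.choice(["home", "cart", "checkout"]),
               "ms": random.randint(50, 500)}
        time.sleep(0.1)


views: dict[str, int] = defaultdict(int)
for i, event in enumerate(event_stream()):
    views[event["page"]] += 1  # incremental update as each event arrives
    if event["ms"] > 450:      # hypothetical latency threshold for alerting
        print(f"alert: slow response on {event['page']} ({event['ms']} ms)")
    if i == 50:                # a real stream would never end
        break

print(dict(views))
```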
Comparison: Batch vs Streaming
| Feature | Batch Processing | Streaming Processing |
| --- | --- | --- |
| Data Processing Cycles | Infrequently, typically during off-peak hours | Continuously |
| Compute Requirements | High computing power | Low computing power and reliable, low-latency network |
| Use Case Example | Sales transaction data analyzed overnight with morning reports | Product recommendations requiring immediate data analysis |
The choice between batch and streaming depends on business requirements for data freshness and the urgency of insights needed from the data.
A data pipeline integrates ingestion, storage, processing, and analysis layers. The pipeline design should consider data characteristics and business requirements to determine appropriate ingestion patterns (homogeneous vs heterogeneous) and processing patterns (batch vs streaming).