Pablo Rodriguez

Data Pipelines

At the most basic level, any data pipeline infrastructure must support four core elements: ingestion, storage, processing, and analysis (a minimal end-to-end sketch follows the list).

  • Ingest
    • Get raw data into the cloud and into your pipeline
    • Extract data from its source and load it into the pipeline
    • Involves copying data from the source to a target data store
  • Store
    • Securely store raw and processed data in a cost-effective way
    • Must support varying access patterns and retention requirements
  • Process
    • Transform data so it becomes viable for analysis
    • Data is processed iteratively to evaluate and improve results
    • A linear pipeline is a simplified version of this iterative process
  • Analyze
    • Discover details about the data to build visualizations or predictions
    • Support the ability to derive insights from processed data
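
To make the flow concrete, here is a minimal, linear sketch of the four elements. The function names (ingest, store, process, analyze) and the in-memory dictionary standing in for a durable object store are illustrative assumptions, not any particular service's API.

```python
# Minimal linear pipeline sketch: ingest -> store -> process -> analyze.
# All function names and the in-memory "object store" are illustrative
# placeholders, not a specific cloud service API.
import json

def ingest(source_records):
    """Copy raw records from the source into the pipeline as-is."""
    return list(source_records)

def store(raw_records, object_store, key):
    """Persist the raw copy so it can be reprocessed later."""
    object_store[key] = json.dumps(raw_records)
    return key

def process(object_store, key):
    """Transform raw records into a shape viable for analysis."""
    records = json.loads(object_store[key])
    return [
        {"product": r["product"], "amount": float(r["amount"])}
        for r in records
        if r.get("amount") is not None  # drop rows that cannot be analyzed
    ]

def analyze(clean_records):
    """Derive a simple insight: total sales per product."""
    totals = {}
    for r in clean_records:
        totals[r["product"]] = totals.get(r["product"], 0.0) + r["amount"]
    return totals

if __name__ == "__main__":
    source = [
        {"product": "widget", "amount": "19.99"},
        {"product": "gadget", "amount": "5.00"},
        {"product": "widget", "amount": None},  # filtered out during processing
    ]
    object_store = {}  # stand-in for durable, cost-effective raw storage
    key = store(ingest(source), object_store, "raw/sales.json")
    print(analyze(process(object_store, key)))  # {'widget': 19.99, 'gadget': 5.0}
```

In a real deployment each stage maps to its own layer (ingestion service, object storage, processing engine, analytics tool), but the hand-off pattern is the same.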

Start with the end in mind - begin with the business problem and build the data pipeline that supports it. The characteristics of your data and the business problem determine:

  • Elements of the pipeline
  • Iterative process requirements
  • Infrastructure architecture decisions

Two ingestion patterns follow from these decisions: homogeneous ingestion, which copies data as-is, and heterogeneous ingestion, which transforms it either before loading (ETL) or after (ELT).

Homogeneous ingestion (sketched below)

  • Objective: Move data from source to destination while keeping the same data format or storage engine type
  • Process: Data is ingested “as-is,” without transformation
  • Use Cases:
    • Data migration (e.g., an on-premises MySQL database to Amazon RDS MySQL)
    • Populating landing areas where original copies are kept
    • Raw text files stored without transformation
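
A minimal sketch of that as-is copy, assuming a hypothetical copy_as_is helper and two local SQLite files standing in for a source database and a landing area that share the same format:

```python
# Homogeneous ingestion sketch: rows are copied from source to target
# without transformation. Two local SQLite files stand in for any pair of
# stores that share the same format; table and column names are illustrative.
import sqlite3

def copy_as_is(source_db, target_db, table):
    """Read every row from the source table and insert it unchanged."""
    src = sqlite3.connect(source_db)
    dst = sqlite3.connect(target_db)
    rows = src.execute(f"SELECT id, payload FROM {table}").fetchall()
    dst.execute(f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER, payload TEXT)")
    dst.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    dst.commit()
    src.close()
    dst.close()

if __name__ == "__main__":
    # Seed a tiny source so the copy has something to move.
    src = sqlite3.connect("source.db")
    src.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, payload TEXT)")
    src.execute("INSERT INTO events VALUES (1, '{\"raw\": true}')")
    src.commit()
    src.close()
    copy_as_is("source.db", "landing.db", "events")  # landing area keeps the original form
```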

ETL: extract, transform, load (sketched below)

  • Process:
    1. Extract structured data
    2. Transform the data into a format matching the destination
    3. Load the data into structured storage for defined analytics
  • Best For: Structured data destined for a data warehouse
  • Advantage: Stores data ready for analysis, saving analyst time
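
A hedged ETL sketch under similar assumptions: transformation happens before loading, so only analysis-ready rows reach the destination. The CSV source, the SQLite file standing in for a data warehouse, and the column names are illustrative.

```python
# ETL sketch: extract -> transform -> load. Only transformed, analysis-ready
# rows reach the destination. The CSV source, the SQLite file standing in
# for a data warehouse, and all column names are illustrative.
import csv
import sqlite3

def extract(csv_path):
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Coerce types and drop rows that cannot be analyzed."""
    clean = []
    for row in rows:
        try:
            clean.append((row["order_id"], row["region"].strip().upper(),
                          float(row["total"])))
        except (KeyError, ValueError):
            continue  # transformation happens before loading, so bad rows never land
    return clean

def load(rows, warehouse_db):
    db = sqlite3.connect(warehouse_db)
    db.execute("CREATE TABLE IF NOT EXISTS orders "
               "(order_id TEXT, region TEXT, total REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    db.commit()
    db.close()

if __name__ == "__main__":
    with open("orders.csv", "w", newline="") as f:
        f.write("order_id,region,total\n"
                "1001,us-east,19.99\n"
                "1002,eu-west,not-a-number\n")  # rejected during transform
    load(transform(extract("orders.csv")), "warehouse.db")
```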

ELT: extract, load, transform (sketched below)

  • Process:
    1. Extract unstructured or structured data
    2. Load the data into the storage destination in a format close to its raw form
    3. Transform the data as needed for each analytics scenario
  • Best For: Unstructured data destined for a data lake
  • Advantage: Flexibility to create new queries with access to more of the raw data
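
An ELT sketch for contrast: raw events are landed close to their original form first, and transformation is deferred until a specific analytics question needs it. The local lake/ directory, the batch naming, and the function names are assumptions.

```python
# ELT sketch: extract -> load raw, close to its original form -> transform
# later, per analytics question. The local lake/ directory stands in for an
# object store; function names are assumptions.
import json
import pathlib

LAKE = pathlib.Path("lake/raw/clickstream")

def load_raw(events, batch_id):
    """Land the raw events untouched; transformation is deferred."""
    LAKE.mkdir(parents=True, exist_ok=True)
    path = LAKE / f"{batch_id}.json"
    path.write_text(json.dumps(events))
    return path

def transform_for_question(path):
    """One of many possible later transformations: page-view counts."""
    events = json.loads(path.read_text())
    counts = {}
    for e in events:
        if e.get("type") == "page_view":
            counts[e["page"]] = counts.get(e["page"], 0) + 1
    return counts

if __name__ == "__main__":
    raw = [
        {"type": "page_view", "page": "/home"},
        {"type": "click", "target": "buy"},
        {"type": "page_view", "page": "/home"},
    ]
    path = load_raw(raw, batch_id="2024-01-01T00")
    print(transform_for_question(path))  # {'/home': 2}
```

Because the raw copy is kept, a new analytics question only needs a new transformation over the same data, not a new ingestion job, which is the flexibility advantage noted above.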

The processing pattern is the other major decision: batch processing computes over complete datasets, while streaming processing works on data as it arrives.

Batch processing (sketched below)

  • Characteristics:
    • Computes results based on complete datasets
    • Every command runs on the entire batch of data
    • Can be run on demand, on a schedule, or based on events
  • Use Cases:
    • Daily and weekly reporting
    • Deep analysis of large datasets
    • Compute-intensive tasks during off-peak times
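
A batch sketch along the same lines: each run reads the complete dataset and writes a report, which makes it natural to trigger on demand, on a schedule, or from an event. The file paths and report shape are assumptions, and the scheduling itself is left outside the script.

```python
# Batch sketch: every run computes over the complete dataset. The input file
# and report format are illustrative; scheduling (cron, a managed scheduler,
# or an event trigger) is assumed to live outside this script.
import json
from collections import Counter
from datetime import date

def run_daily_report(transactions_path, report_path):
    with open(transactions_path) as f:
        transactions = json.load(f)  # the entire batch, not a stream
    revenue_by_product = Counter()
    for t in transactions:
        revenue_by_product[t["product"]] += t["amount"]
    report = {"date": str(date.today()), "revenue": dict(revenue_by_product)}
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2)  # ready for the morning report

if __name__ == "__main__":
    with open("transactions.json", "w") as f:
        json.dump([{"product": "widget", "amount": 19.99},
                   {"product": "widget", "amount": 5.00}], f)
    run_daily_report("transactions.json", "daily_report.json")
```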

Streaming processing (sketched below)

  • Characteristics:
    • The data stream is unbounded: a continuous, incremental sequence of small data packets
    • Metrics or reports are incrementally updated as new data arrives
    • Processes a series of events for real-time analytics
  • Use Cases:
    • Real-time analytics requiring immediate insights
    • Continuous monitoring and alerting
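
And a streaming sketch: the metric is updated incrementally as each small event arrives rather than recomputed over the whole dataset. The in-process event generator and the latency threshold are placeholders for a real stream source and alerting target.

```python
# Streaming sketch: state is updated incrementally per event instead of
# recomputing over a complete dataset. The in-process generator and the
# latency threshold are placeholders for a real stream source and alerting.
import random
import time

def event_stream(n=20):
    """Stand-in for an unbounded stream of small data packets."""
    for _ in range(n):
        yield {"latency_ms": random.uniform(5, 250)}
        time.sleep(0.01)  # events arrive continuously, not as one batch

def monitor(stream, threshold_ms=200):
    count, total = 0, 0.0
    for event in stream:
        count += 1
        total += event["latency_ms"]
        running_avg = total / count  # metric updated as each event arrives
        if event["latency_ms"] > threshold_ms:  # continuous alerting
            print(f"ALERT: latency {event['latency_ms']:.0f} ms "
                  f"(running average {running_avg:.0f} ms)")

if __name__ == "__main__":
    monitor(event_stream())
```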

| Feature | Batch Processing | Streaming Processing |
| --- | --- | --- |
| Data processing cycles | Infrequent, typically during off-peak hours | Continuous |
| Compute requirements | High computing power | Low computing power and a reliable, low-latency network |
| Use case example | Sales transaction data analyzed overnight with morning reports | Product recommendations requiring immediate data analysis |

The choice between batch and streaming depends on business requirements for data freshness and the urgency of insights needed from the data.

A data pipeline integrates ingestion, storage, processing, and analysis layers. The pipeline design should consider data characteristics and business requirements to determine appropriate ingestion patterns (homogeneous vs heterogeneous) and processing patterns (batch vs streaming).