
Storage Data Pipeline

A data lake consists of three fundamental elements:

  • Data Storage: Centralized repository for structured and unstructured data
  • Data Catalog: Metadata management and schema information
  • Security Access: Unified permissions and authorization

Modern data architecture removes the restrictions of separate data silos, making all organizational data available for analysis with the analytics tools each team prefers. This provides the best of both data lakes and purpose-built data stores. Data moves between the lake and those stores in three patterns: inside-out, outside-in, and around the perimeter.

Inside-Out Data Movement

  • Definition: A service (big data processing, ML) needs a portion of the data in the data lake
  • Process: Data is copied, moved, or filtered from the data lake to the service
  • Example: Clickstream data is collected in the data lake, and a portion is moved to a log analytics service for trend analysis (see the sketch below)
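
A minimal sketch of that clickstream example, assuming hypothetical bucket names and a hypothetical partition prefix: list the relevant objects in the data lake and server-side copy them into a bucket that a log analytics service ingests.

```python
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "example-data-lake"           # hypothetical data lake bucket
DEST_BUCKET = "example-log-analytics-input"   # hypothetical ingestion bucket
PREFIX = "clickstream/year=2024/month=06/"    # only this slice is needed

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Server-side copy: the objects never leave S3
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
        )
```
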
Outside-In Data Movement

  • Definition: An analytics solution requires data to be in the data lake for processing
  • Process: Data is copied, moved, or filtered from a database source into the data lake
  • Purpose: The data becomes part of a bigger dataset for comprehensive analysis (a sketch follows)
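
One way this might look, as a hedged sketch: pull a day of orders from a hypothetical relational source and land it in the lake as Parquet (assumes pandas, SQLAlchemy, pyarrow, and s3fs; all names are made up).

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical OLTP source
engine = create_engine("postgresql+psycopg2://user:pass@orders-db:5432/shop")

# Extract one day's worth of rows from the operational database
df = pd.read_sql(
    "SELECT * FROM orders WHERE order_date = DATE '2024-06-01'", engine
)

# Land them in the data lake in a columnar format, partitioned by date
df.to_parquet(
    "s3://example-data-lake/orders/order_date=2024-06-01/orders.parquet"
)
```
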
Around-the-Perimeter Data Movement

  • Definition: Service-to-service data movement without data lake involvement
  • Process: Direct copy, move, or query between services that offer direct integration (for example, loading a DynamoDB table straight into Amazon Redshift, sketched below)
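
One such direct integration is the Redshift COPY command reading a DynamoDB table. A hedged sketch, with a hypothetical cluster, table, and IAM role (assumes the redshift_connector package):

```python
import redshift_connector

conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    database="analytics",
    user="admin",
    password="...",
)
with conn.cursor() as cur:
    # Redshift reads the DynamoDB table directly; READRATIO caps how much
    # of the table's provisioned read capacity the COPY may consume
    cur.execute("""
        COPY user_events
        FROM 'dynamodb://UserEvents'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDynamoRole'
        READRATIO 50;
    """)
conn.commit()
```
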
Data Lake Components on AWS

  • Amazon S3: Primary data storage
  • AWS Lake Formation: Management layer for data lake configuration
  • AWS Glue: Data catalog and transformation services
  • Amazon Athena: SQL query engine for data lake access
  • IAM: Security and access permissions
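
Athena is what makes the lake queryable in place. A minimal sketch, assuming a hypothetical Glue database, table, and results bucket:

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=(
        "SELECT page, COUNT(*) AS hits "
        "FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "example_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
# Athena runs asynchronously; poll get_query_execution with this ID
print(response["QueryExecutionId"])
```
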
With AWS Lake Formation you can:

  • Configure and manage the data lake from a central console
  • Discover data sources and store data using Amazon S3
  • Create AWS Glue transformation jobs with help from the Data Catalog
  • Configure IAM security permissions for user and service access (see the sketch below)
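
Those steps can also be scripted. A hedged sketch of registering a lake location and granting table access through Lake Formation, with hypothetical ARNs and names:

```python
import boto3

lf = boto3.client("lakeformation")

# Register the S3 location so Lake Formation can manage access to it
lf.register_resource(
    ResourceArn="arn:aws:s3:::example-data-lake",
    UseServiceLinkedRole=True,
)

# Grant a hypothetical analyst role SELECT on a cataloged table
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Analyst"
    },
    Resource={"Table": {"DatabaseName": "example_lake_db", "Name": "clickstream"}},
    Permissions=["SELECT"],
)
```
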
Purpose-Built Data Stores

  • Amazon Aurora/RDS: Relational, normalized data storage
  • DynamoDB: NoSQL key-value data storage
  • Amazon EMR: Transform and aggregate big datasets in parallel
  • Amazon SageMaker: Train models for pattern recognition
  • OpenSearch Service: Indexed data for fast retrieval
  • Amazon Redshift: Data warehouse for historical data analysis

Matching data origin and shape to a storage solution:

| Data Origin | Data Shape | Storage Solution |
| --- | --- | --- |
| Business applications (OLTP) | Structured/semistructured | RDBMS or NoSQL |
| Real-time streaming | Stream data | Stream services |
| Analytics applications (OLAP) | Unstructured/raw | Data lake |
| Analytics applications (OLAP) | Processed/aggregated | Data warehouse |

Amazon S3

  • Lower-cost data lake storage
  • Less efficient for complex querying
  • Best for unstructured data with partitioning (see the key-layout sketch below)
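
Partitioning in S3 is just a key-naming convention, as in this minimal sketch with a hypothetical bucket and dataset:

```python
import boto3
from datetime import date

s3 = boto3.client("s3")

d = date(2024, 6, 1)
key = (
    "clickstream/"
    f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    "events-0001.json.gz"
)
# Hive-style year=/month=/day= prefixes let engines such as Athena skip
# partitions a query never touches (e.g. WHERE year = 2024 AND month = 6)
s3.upload_file("events-0001.json.gz", "example-data-lake", key)
```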

Amazon Redshift

  • Higher cost but efficient querying
  • Optimal for large datasets spanning long periods
  • Cluster-based data warehouse architecture

Redshift Spectrum

  • Reduces data warehouse costs
  • Queries data in S3 buckets without moving it
  • Best-of-both-worlds approach: warehouse performance for hot data, data lake economics for cold data (sketched below)
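
A hedged sketch of how that looks in practice: create an external schema over the Glue Data Catalog once, then join lake tables with local warehouse tables (endpoint, role, and table names are hypothetical; assumes redshift_connector):

```python
import redshift_connector

conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    database="analytics",
    user="admin",
    password="...",
)
with conn.cursor() as cur:
    # One-time setup: expose the data lake catalog as an external schema
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
        FROM DATA CATALOG DATABASE 'example_lake_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
    """)
    # Join raw S3 data with a local dimension table, no COPY required
    cur.execute("""
        SELECT d.region, COUNT(*) AS clicks
        FROM lake.clickstream c
        JOIN dim_users d ON d.user_id = c.user_id
        GROUP BY d.region;
    """)
    print(cur.fetchall())
```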

Amazon S3 Storage Classes:

  • S3 Standard: General purpose, unknown/changing access
  • S3 Intelligent-Tiering: Unknown access patterns with automated optimization
  • S3 Standard-IA / One Zone-IA: Infrequent access scenarios
  • S3 Glacier: Archive storage with varying retrieval times
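
Storage classes pair naturally with lifecycle rules. A minimal sketch, assuming a hypothetical bucket, prefix, and retention schedule:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-raw-clickstream",
            "Status": "Enabled",
            "Filter": {"Prefix": "clickstream/"},
            # Rarely queried raw data steps down to cheaper tiers over time
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    },
)
```
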
Data Warehouse vs. Data Lake

| Feature | Data Warehouse | Data Lake |
| --- | --- | --- |
| Data sources | Business applications and databases | IoT devices, websites, mobile apps, social media, business applications |
| Schema | Structured; schema on write | Unstructured; schema on read |
| Price | Higher-cost storage | Low-cost storage |
| Data quality | Curated, processed, and aggregated as the central source of truth | Raw (unprocessed) or transformed data |
| Analytics use cases | Batch reporting, BI, visualizations | Logs, data discovery, profiling |

  • Schema-on-Write: The data warehouse approach; the schema must be known before data is written
  • Schema-on-Read: The data lake approach; a schema is applied only when the data is read (illustrated below)
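
Schema-on-read can be illustrated with Athena: the JSON files already sit in S3, and a schema is declared over them only when someone wants to query. A hedged sketch with hypothetical names and locations:

```python
import boto3

athena = boto3.client("athena")

# The data was written with no schema; this DDL merely describes how to read it
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS example_lake_db.clickstream (
    user_id string,
    page    string,
    ts      timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-data-lake/clickstream/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```
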
Example scenarios and suitable stores:

  1. Individual customer transactions → Aurora, DynamoDB (OLTP requirements)
  2. Daily transaction totals → Amazon Redshift (OLAP, historical analysis)
  3. User activity logs → Amazon S3 or OpenSearch Service (fraud analysis)
  4. Application error logs → Amazon S3 (simplest architecture)
  • OLTP: Optimized for small, frequent record writes, 24/7 availability, and many concurrent users
  • OLAP: Optimized for analytical queries and columnar storage over historical data spanning years (contrasted below)
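
The contrast in miniature, with hypothetical table and column names: an OLTP statement touches one row and must return quickly; an OLAP statement scans and aggregates years of history.

```python
# OLTP: single-row write, latency-sensitive, runs constantly
OLTP_SQL = """
INSERT INTO transactions (txn_id, customer_id, amount, created_at)
VALUES (:txn_id, :customer_id, :amount, NOW());
"""

# OLAP: aggregation over years of data; a columnar store reads only
# the columns the query references
OLAP_SQL = """
SELECT DATE_TRUNC('month', created_at) AS month, SUM(amount) AS total
FROM transactions
WHERE created_at >= DATE '2020-01-01'
GROUP BY 1
ORDER BY 1;
"""
```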

When designing data storage, consider:

  • Data Origin: Where data comes from and its initial format
  • Data Shape: Structured vs unstructured characteristics
  • Cost Requirements: Budget constraints and cost optimization goals
  • Query Scope: Range and complexity of analytics queries
  • Retention Periods: How long data must be kept and access frequency

Modern data architecture centers on a data lake surrounded by purpose-built analytics and storage services. Design decisions should weigh data origin, shape, cost, query scope, and retention requirements while balancing schema-on-write and schema-on-read approaches.