
Storage Data Pipeline

A data lake consists of three fundamental elements:

  • Data Storage: Centralized repository for structured and unstructured data
  • Data Catalog: Metadata management and schema information
  • Security Access: Unified permissions and authorization

Modern data architecture removes the restrictions of separate data silos, making all organizational data available for analysis with the analytics tools each team prefers. This provides the best of both data lakes and purpose-built data stores. Data moves between the lake and those stores in three patterns: inside-out, outside-in, and around the perimeter.

Inside-Out Data Movement

  • Definition: A service (big data processing, ML) needs a portion of the data in the data lake
  • Process: Data is copied, moved, or filtered from the data lake to the service
  • Example: Clickstream data is collected in the data lake, and a portion is moved to a log analytics service for trend analysis (see the sketch below)
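
A minimal sketch of that clickstream example, assuming hypothetical bucket names and a hypothetical partition prefix: list the relevant objects in the data lake and server-side copy them into a bucket that a log analytics service ingests.

```python
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "example-data-lake"           # hypothetical data lake bucket
DEST_BUCKET = "example-log-analytics-input"   # hypothetical ingestion bucket
PREFIX = "clickstream/year=2024/month=06/"    # only this slice is needed

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Server-side copy: the objects never leave S3
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
        )
```
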
Outside-In Data Movement

  • Definition: An analytics solution requires data to be in the data lake for processing
  • Process: Data is copied, moved, or filtered from a database source into the data lake
  • Purpose: The data becomes part of a bigger dataset for comprehensive analysis (a sketch follows)
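
One way this might look, as a hedged sketch: pull a day of orders from a hypothetical relational source and land it in the lake as Parquet (assumes pandas, SQLAlchemy, pyarrow, and s3fs; all names are made up).

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical OLTP source
engine = create_engine("postgresql+psycopg2://user:pass@orders-db:5432/shop")

# Extract one day's worth of rows from the operational database
df = pd.read_sql(
    "SELECT * FROM orders WHERE order_date = DATE '2024-06-01'", engine
)

# Land them in the data lake in a columnar format, partitioned by date
df.to_parquet(
    "s3://example-data-lake/orders/order_date=2024-06-01/orders.parquet"
)
```
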
Around-the-Perimeter Data Movement

  • Definition: Service-to-service data movement without data lake involvement
  • Process: Direct copy, move, or query between services that offer direct integration (for example, loading a DynamoDB table straight into Amazon Redshift, sketched below)
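
One such direct integration is the Redshift COPY command reading a DynamoDB table. A hedged sketch, with a hypothetical cluster, table, and IAM role (assumes the redshift_connector package):

```python
import redshift_connector

conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    database="analytics",
    user="admin",
    password="...",
)
with conn.cursor() as cur:
    # Redshift reads the DynamoDB table directly; READRATIO caps how much
    # of the table's provisioned read capacity the COPY may consume
    cur.execute("""
        COPY user_events
        FROM 'dynamodb://UserEvents'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDynamoRole'
        READRATIO 50;
    """)
conn.commit()
```
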
Data Lake Components on AWS

  • Amazon S3: Primary data storage
  • AWS Lake Formation: Management layer for data lake configuration
  • AWS Glue: Data catalog and transformation services
  • Amazon Athena: SQL query engine for data lake access
  • IAM: Security and access permissions
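
Athena is what makes the lake queryable in place. A minimal sketch, assuming a hypothetical Glue database, table, and results bucket:

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=(
        "SELECT page, COUNT(*) AS hits "
        "FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "example_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
# Athena runs asynchronously; poll get_query_execution with this ID
print(response["QueryExecutionId"])
```
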
With AWS Lake Formation you can:

  • Configure and manage the data lake from a central console
  • Discover data sources and store data using Amazon S3
  • Create AWS Glue transformation jobs with help from the Data Catalog
  • Configure IAM security permissions for user and service access (see the sketch below)
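
Those steps can also be scripted. A hedged sketch of registering a lake location and granting table access through Lake Formation, with hypothetical ARNs and names:

```python
import boto3

lf = boto3.client("lakeformation")

# Register the S3 location so Lake Formation can manage access to it
lf.register_resource(
    ResourceArn="arn:aws:s3:::example-data-lake",
    UseServiceLinkedRole=True,
)

# Grant a hypothetical analyst role SELECT on a cataloged table
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Analyst"
    },
    Resource={"Table": {"DatabaseName": "example_lake_db", "Name": "clickstream"}},
    Permissions=["SELECT"],
)
```
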
Purpose-Built Data Stores

  • Amazon Aurora/RDS: Relational, normalized data storage
  • DynamoDB: NoSQL key-value data storage
  • Amazon EMR: Transform and aggregate big datasets in parallel
  • Amazon SageMaker: Train models for pattern recognition
  • OpenSearch Service: Indexed data for fast retrieval
  • Amazon Redshift: Data warehouse for historical data analysis

Matching data origin and shape to a storage solution:

| Data Origin | Data Shape | Storage Solution |
| --- | --- | --- |
| Business applications (OLTP) | Structured/semistructured | RDBMS or NoSQL |
| Real-time streaming | Stream data | Stream services |
| Analytics applications (OLAP) | Unstructured/raw | Data lake |
| Analytics applications (OLAP) | Processed/aggregated | Data warehouse |

Amazon S3

  • Lower-cost data lake storage
  • Less efficient for complex querying
  • Best for unstructured data with partitioning (see the key-layout sketch below)
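
Partitioning in S3 is just a key-naming convention, as in this minimal sketch with a hypothetical bucket and dataset:

```python
import boto3
from datetime import date

s3 = boto3.client("s3")

d = date(2024, 6, 1)
key = (
    "clickstream/"
    f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    "events-0001.json.gz"
)
# Hive-style year=/month=/day= prefixes let engines such as Athena skip
# partitions a query never touches (e.g. WHERE year = 2024 AND month = 6)
s3.upload_file("events-0001.json.gz", "example-data-lake", key)
```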

Amazon Redshift

  • Higher cost but efficient querying
  • Optimal for large datasets spanning long periods
  • Cluster-based data warehouse architecture

Redshift Spectrum

  • Reduces data warehouse costs
  • Queries data in S3 buckets without moving it
  • Best-of-both-worlds approach: warehouse performance for hot data, data lake economics for cold data (sketched below)
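
A hedged sketch of how that looks in practice: create an external schema over the Glue Data Catalog once, then join lake tables with local warehouse tables (endpoint, role, and table names are hypothetical; assumes redshift_connector):

```python
import redshift_connector

conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    database="analytics",
    user="admin",
    password="...",
)
with conn.cursor() as cur:
    # One-time setup: expose the data lake catalog as an external schema
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
        FROM DATA CATALOG DATABASE 'example_lake_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
    """)
    # Join raw S3 data with a local dimension table, no COPY required
    cur.execute("""
        SELECT d.region, COUNT(*) AS clicks
        FROM lake.clickstream c
        JOIN dim_users d ON d.user_id = c.user_id
        GROUP BY d.region;
    """)
    print(cur.fetchall())
```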

Amazon S3 Storage Classes:

  • S3 Standard: General purpose, unknown/changing access
  • S3 Intelligent-Tiering: Unknown access patterns with automated optimization
  • S3 Standard-IA / One Zone-IA: Infrequent access scenarios
  • S3 Glacier: Archive storage with varying retrieval times
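
Storage classes pair naturally with lifecycle rules. A minimal sketch, assuming a hypothetical bucket, prefix, and retention schedule:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-raw-clickstream",
            "Status": "Enabled",
            "Filter": {"Prefix": "clickstream/"},
            # Rarely queried raw data steps down to cheaper tiers over time
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    },
)
```
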
Data Warehouse vs. Data Lake

| Feature | Data Warehouse | Data Lake |
| --- | --- | --- |
| Data sources | Business applications and databases | IoT devices, websites, mobile apps, social media, business applications |
| Schema | Structured; schema on write | Unstructured; schema on read |
| Price | Higher-cost storage | Low-cost storage |
| Data quality | Curated, processed, and aggregated as the central source of truth | Raw (unprocessed) or transformed data |
| Analytics use cases | Batch reporting, BI, visualizations | Logs, data discovery, profiling |

  • Schema-on-Write: The data warehouse approach; the schema must be known before data is written
  • Schema-on-Read: The data lake approach; a schema is applied only when the data is read (illustrated below)
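
Schema-on-read can be illustrated with Athena: the JSON files already sit in S3, and a schema is declared over them only when someone wants to query. A hedged sketch with hypothetical names and locations:

```python
import boto3

athena = boto3.client("athena")

# The data was written with no schema; this DDL merely describes how to read it
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS example_lake_db.clickstream (
    user_id string,
    page    string,
    ts      timestamp
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-data-lake/clickstream/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```
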
Example scenarios and suitable stores:

  1. Individual customer transactions → Aurora, DynamoDB (OLTP requirements)
  2. Daily transaction totals → Amazon Redshift (OLAP, historical analysis)
  3. User activity logs → Amazon S3 or OpenSearch Service (fraud analysis)
  4. Application error logs → Amazon S3 (simplest architecture)
  • OLTP: Optimized for small, frequent record writes, 24/7 availability, and many concurrent users
  • OLAP: Optimized for analytical queries and columnar storage over historical data spanning years (contrasted below)
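
The contrast in miniature, with hypothetical table and column names: an OLTP statement touches one row and must return quickly; an OLAP statement scans and aggregates years of history.

```python
# OLTP: single-row write, latency-sensitive, runs constantly
OLTP_SQL = """
INSERT INTO transactions (txn_id, customer_id, amount, created_at)
VALUES (:txn_id, :customer_id, :amount, NOW());
"""

# OLAP: aggregation over years of data; a columnar store reads only
# the columns the query references
OLAP_SQL = """
SELECT DATE_TRUNC('month', created_at) AS month, SUM(amount) AS total
FROM transactions
WHERE created_at >= DATE '2020-01-01'
GROUP BY 1
ORDER BY 1;
"""
```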

When designing data storage, consider:

  • Data Origin: Where data comes from and its initial format
  • Data Shape: Structured vs unstructured characteristics
  • Cost Requirements: Budget constraints and cost optimization goals
  • Query Scope: Range and complexity of analytics queries
  • Retention Periods: How long data must be kept and access frequency

Modern data architecture centers on a data lake surrounded by purpose-built analytics and storage services. Design decisions should weigh data origin, shape, cost, query scope, and retention requirements while balancing schema-on-write and schema-on-read approaches.