A data lake consists of three fundamental elements:
Data Storage : Centralized repository for structured and unstructured data
Data Catalog : Metadata management and schema information
Secure Access : Unified permissions and authorization
Modern data architecture breaks down separate data silos, making all organizational data available for analysis with preferred analytics tools. This approach combines the best of both data lakes and purpose-built data stores. Data moves through this architecture in three patterns: inside-out, outside-in, and around the perimeter.
Inside-Out Data Movement
Definition : A service (big data processing, ML) needs a portion of the data in the data lake
Process : Data copied, moved, or filtered from data lake to service
Example : Clickstream data collected in data lake, portion moved to log analytics service for trend analysis
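A minimal boto3 sketch of this inside-out movement, assuming hypothetical bucket names and a date-partitioned clickstream prefix (a production pipeline would more likely use a Glue job or S3 Batch Operations):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names: the central data lake bucket and a bucket
# owned by the downstream log analytics service.
LAKE_BUCKET = "example-data-lake"
ANALYTICS_BUCKET = "example-log-analytics"
PREFIX = "clickstream/year=2024/month=06/"  # the portion to move

# List the clickstream objects in the partition of interest...
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=LAKE_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # ...and copy each one out of the lake to the service's bucket.
        s3.copy_object(
            Bucket=ANALYTICS_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": LAKE_BUCKET, "Key": obj["Key"]},
        )
```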
Outside-In Data Movement
Definition : An analytics solution requires data to be in the data lake for processing
Process : Data copied, moved, or filtered to data lake from database source
Purpose : Data becomes part of bigger dataset for comprehensive analysis
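A sketch of the outside-in direction using DynamoDB's native export to S3 (the table ARN and bucket are hypothetical, and the table must have point-in-time recovery enabled):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Export a business database table into the data lake bucket,
# where it joins the larger dataset for comprehensive analysis.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/Orders",
    S3Bucket="example-data-lake",
    S3Prefix="raw/orders/",
    ExportFormat="DYNAMODB_JSON",
)
print(response["ExportDescription"]["ExportStatus"])  # e.g. IN_PROGRESS
```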
Around-the-Perimeter Data Movement
Definition : Service-to-service data movement without data lake involvement
Process : Direct copy, move, or query between services with direct integration
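One around-the-perimeter example is Amazon Redshift's COPY directly from DynamoDB, which bypasses the data lake entirely. A sketch via the Redshift Data API, with hypothetical cluster, table, and role names:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# COPY pulls rows straight from a DynamoDB table into Redshift --
# service-to-service movement with no data lake in between.
redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="admin",
    Sql=(
        "COPY product_catalog "
        "FROM 'dynamodb://ProductCatalog' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
        "READRATIO 50;"
    ),
)
```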
Core AWS services for the data lake:
Amazon S3 : Primary data storage
AWS Lake Formation : Management component for data lake configuration
AWS Glue : Data catalog and transformation services
Amazon Athena : SQL query engine for data lake access (see the query sketch after this list)
IAM : Security access permissions
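A sketch of querying the lake through Athena with boto3; the database, table, and output location below are hypothetical:

```python
import boto3

athena = boto3.client("athena")

# Run standard SQL directly against files in S3; the Glue Data
# Catalog supplies the table's schema and location.
query = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page",
    QueryExecutionContext={"Database": "example_lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(query["QueryExecutionId"])
```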
AWS Lake Formation workflow:
Configure and manage the data lake from a central console
Discover data sources and store data in Amazon S3
Create AWS Glue transformation jobs with the aid of the Data Catalog
Configure IAM security permissions for user and service access
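A sketch of the permissions step using the Lake Formation API, with a hypothetical principal, database, and table:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant an analyst role read access to one cataloged table;
# Lake Formation enforces this across Athena, Glue, and Redshift Spectrum.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Analyst"},
    Resource={"Table": {"DatabaseName": "example_lake_db", "Name": "clickstream"}},
    Permissions=["SELECT"],
)
```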
Purpose-built services around the data lake:
Amazon Aurora/RDS : Relational, normalized data storage
DynamoDB : NoSQL key-value data storage
Amazon EMR : Transform/aggregate big datasets in parallel
Amazon SageMaker : Train models for pattern recognition
OpenSearch Service : Indexed data for fast retrieval
Amazon Redshift : Data warehouse for historical data analysis
| Data Origin | Data Shape | Storage Solution |
| --- | --- | --- |
| Business Applications (OLTP) | Structured/Semistructured | RDBMS or NoSQL |
| Real-time Streaming | Stream Data | Stream Services |
| Analytics Applications (OLAP) | Unstructured/Raw | Data Lake |
| Analytics Applications (OLAP) | Processed/Aggregated | Data Warehouse |
Storage cost and query trade-offs:
Amazon S3
Lower cost data lake storage
Less efficient for complex querying
Best for unstructured data with partitioning
Amazon Redshift
Higher cost but efficient querying
Optimal for large datasets spanning long periods
Cluster-based data warehouse architecture
Redshift Spectrum
Reduces data warehouse costs
Query S3 buckets without data movement
Combines low-cost S3 storage with Redshift's query efficiency
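A sketch of the Spectrum setup through the Redshift Data API: an external schema is mapped onto the Glue Data Catalog so queries read S3 in place (all names are hypothetical):

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Map an external schema onto the Glue Data Catalog; its tables are
# read directly from S3, so cold data never enters the cluster.
redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="admin",
    Sql=(
        "CREATE EXTERNAL SCHEMA IF NOT EXISTS lake "
        "FROM DATA CATALOG DATABASE 'example_lake_db' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';"
    ),
)

# Query the S3-backed table with no data movement into the warehouse.
redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="admin",
    Sql="SELECT COUNT(*) FROM lake.clickstream WHERE year = 2024;",
)
```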
Amazon S3 Storage Classes :
S3 Standard : General purpose, unknown/changing access
S3 Intelligent-Tiering : Unknown access patterns with automated optimization
S3 Standard-IA / One Zone-IA : Infrequent access scenarios
S3 Glacier : Archive storage with varying retrieval times
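These classes pair naturally with lifecycle rules that demote aging data automatically. A boto3 sketch, with a hypothetical bucket and transition schedule:

```python
import boto3

s3 = boto3.client("s3")

# Move raw lake data to cheaper tiers as access frequency drops.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```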
| Feature | Data Warehouse | Data Lake |
| --- | --- | --- |
| Data Sources | Business applications and databases | IoT devices, websites, mobile apps, social media, business applications |
| Schema | Structured (schema-on-write) | Unstructured (schema-on-read) |
| Price | Higher-cost storage | Low-cost storage |
| Data Quality | Curated, processed, aggregated as central truth | Raw (unprocessed) or transformed data |
| Analytics Use Case | Batch reporting, BI, visualizations | Logs, data discovery, profiling |
Schema-on-Write : Data warehouse approach requiring a known schema before data is written
Schema-on-Read : Data lake approach where the schema is applied only when data is read
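Schema-on-read in practice: the files already sit in S3, and a table definition is layered on top only at query time. A hedged Athena DDL sketch (database, columns, and locations are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# The CSV files were written to S3 with no schema; this DDL imposes
# one at read time without touching the underlying objects.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS example_lake_db.error_logs (
    ts      string,
    level   string,
    message string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-data-lake/logs/errors/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```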
Example scenario mappings:
Individual Customer Transactions → Aurora, DynamoDB (OLTP requirements)
Daily Transaction Totals → Amazon Redshift (OLAP, historical analysis)
User Activity Logs → Amazon S3 or OpenSearch Service (fraud analysis)
Application Error Logs → Amazon S3 (simplest architecture)
OLTP : Optimized for small record writes, 24/7 availability, concurrent users
OLAP : Optimized for analytical queries, columnar storage, historical data spanning years
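The difference in workload shape, sketched as two hypothetical SQL statements:

```python
# OLTP: many small, concurrent single-row writes and lookups.
oltp_query = "INSERT INTO orders (order_id, customer_id, total) VALUES (1001, 42, 59.90);"

# OLAP: few large analytical scans over years of history, which
# columnar storage serves by reading only the referenced columns.
olap_query = """
SELECT customer_id, DATE_TRUNC('month', order_date) AS month, SUM(total)
FROM orders
WHERE order_date >= DATE '2020-01-01'
GROUP BY customer_id, DATE_TRUNC('month', order_date);
"""
```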
When designing data storage, consider:
Data Origin : Where data comes from and its initial format
Data Shape : Structured vs unstructured characteristics
Cost Requirements : Budget constraints and cost optimization goals
Query Scope : Range and complexity of analytics queries
Retention Periods : How long data must be kept and access frequency
Modern data architecture centers on a data lake surrounded by purpose-built analytics and storage services. Design decisions should weigh data origin, shape, cost, query scope, and retention requirements while balancing schema-on-write and schema-on-read approaches.