Processing Real-Time Data
Streaming Data Fundamentals
Definition
Streaming data is emitted at high volume in a continuous, incremental manner with the goal of low-latency processing.
Gaming Example
An online gaming company collects streaming data about player-game interactions, feeds it into the gaming platform, analyzes data in real time, and offers incentives and dynamic experiences to engage players.
Industry Applications
Streaming is most beneficial where new, dynamic data is generated continually:
- Starting Simple: System logs, rolling min-max computations
- Advanced Processing: Sophisticated near real-time processing
- Business Use: Track public sentiment on brands/products through social media analysis
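The rolling min-max computation mentioned above can be sketched in plain Python. This is a minimal illustration, not tied to any AWS service; the function name `rolling_min_max` and the sample readings are assumptions made for the example.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def rolling_min_max(values: Iterable[float], window: int) -> Iterator[Tuple[float, float]]:
    """Yield (min, max) over the most recent `window` values after each new value."""
    if window <= 0:
        raise ValueError("window must be positive")
    min_q = deque()  # (index, value) pairs, values increasing from front to back
    max_q = deque()  # (index, value) pairs, values decreasing from front to back
    for i, v in enumerate(values):
        # Drop older candidates that can never be the minimum or maximum again.
        while min_q and min_q[-1][1] >= v:
            min_q.pop()
        min_q.append((i, v))
        while max_q and max_q[-1][1] <= v:
            max_q.pop()
        max_q.append((i, v))
        # Evict candidates that have slid out of the window.
        oldest = i - window + 1
        while min_q[0][0] < oldest:
            min_q.popleft()
        while max_q[0][0] < oldest:
            max_q.popleft()
        yield min_q[0][1], max_q[0][1]


readings = [5, 3, 8, 1, 9]
result = list(rolling_min_max(readings, window=3))
print(result)
```

Each value is processed once as it arrives, which is what makes this kind of computation a natural first streaming workload.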
Streaming Pipeline Architecture
Core Elements
- Sources: Multiple streaming data sources
- Producers: Collect data from sources, deliver to data stream service
- Streams: Mechanism to deliver messages/data records from producers to consumers
- Consumers: Read and process data from streaming storage
- Destinations: Multiple target destinations for processed data
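As a rough mental model, the five elements can be wired together in a few lines of plain Python. Everything here is hypothetical and in-memory: a list stands in for the source, `queue.Queue` for the stream, and another list for the destination. A real pipeline would replace each piece with a managed service.

```python
import json
import queue
from typing import Callable, Dict, Iterable, List


def producer(source: Iterable[Dict], stream: "queue.Queue[bytes]") -> int:
    """Collect events from a source and put serialized records on the stream."""
    count = 0
    for event in source:
        stream.put(json.dumps(event).encode("utf-8"))
        count += 1
    return count


def consumer(stream: "queue.Queue[bytes]",
             process: Callable[[Dict], Dict],
             destination: List[Dict]) -> None:
    """Read records from the stream, process each one, and write to the destination."""
    while True:
        try:
            raw = stream.get_nowait()
        except queue.Empty:
            break
        destination.append(process(json.loads(raw)))


# Source -> producer -> stream -> consumer -> destination
source = [{"player": "p1", "score": 10}, {"player": "p2", "score": 25}]
stream: "queue.Queue[bytes]" = queue.Queue()
destination: List[Dict] = []

sent = producer(source, stream)
consumer(stream, lambda e: {**e, "bonus": e["score"] >= 20}, destination)
print(sent, destination)
```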
AWS Streaming Services
Amazon Data Firehose
- Function: Captures and transforms streaming data in near real time
- Delivery: Delivers data to common storage and analytics destinations
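- Processing: Uses micro-batch cycles, buffering incoming records by size or time interval before each delivery, so data typically arrives within seconds to minutes rather than milliseconds
- Format Conversion: Can convert JSON input to Apache Parquet or Apache ORC before storing it in Amazon S3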
- Management: No coding required
Destinations
- Amazon S3, Amazon Redshift, Amazon OpenSearch Service
- Custom HTTP endpoints, supported third-party service providers
Micro Batching
Micro batching balances latency and throughput by waiting a short duration before each batch operation. It is quicker than batch processing but not as fast as true streaming.
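The trade-off can be sketched as a buffer that flushes when it is either full or old enough. This is a simplified illustration and not how Firehose is implemented internally; the class name `MicroBatcher`, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Any, Callable, List, Optional


class MicroBatcher:
    """Buffer records and flush them when a size or age limit is reached."""

    def __init__(self, flush: Callable[[List[Any]], None], max_records: int = 3,
                 max_wait_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_records = max_records
        self._max_wait = max_wait_seconds
        self._clock = clock
        self._buffer: List[Any] = []
        self._first_at: Optional[float] = None

    def add(self, record: Any) -> None:
        if not self._buffer:
            self._first_at = self._clock()  # age is measured from the first buffered record
        self._buffer.append(record)
        self._maybe_flush()

    def tick(self) -> None:
        """Call periodically so a time-based flush happens even without new records."""
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        if not self._buffer:
            return
        full = len(self._buffer) >= self._max_records
        expired = self._clock() - self._first_at >= self._max_wait
        if full or expired:
            batch, self._buffer = self._buffer, []
            self._flush(batch)


# A fake clock keeps the demo deterministic.
now = [0.0]
batches: List[List[str]] = []
batcher = MicroBatcher(batches.append, max_records=3, max_wait_seconds=1.0,
                       clock=lambda: now[0])
for record in ["a", "b", "c"]:
    batcher.add(record)      # third record triggers a size-based flush
batcher.add("d")
now[0] = 1.5
batcher.tick()               # "d" has waited long enough, so it is flushed by age
print(batches)
```

Larger limits raise throughput per write at the cost of latency; smaller limits do the opposite.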
Amazon Kinesis Data Streams
- Function: Ingests streaming log and event data
- Storage: Temporary storage for streaming data (up to 365 days)
- Delivery: Real-time delivery; records are typically available to consumers in under a second
- Replay: Multiple consumers can replay stream data simultaneously
- Libraries: Uses Kinesis Producer Library (KPL) and Kinesis Client Library (KCL)
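Retention and replay are easier to picture with a toy model: records stay in the stream until they age out, and reading does not remove them, so each consumer can start from its own position. The sketch below is an in-memory stand-in, not the Kinesis API; `RetainedStream` and its methods are hypothetical names.

```python
from typing import List, Tuple


class RetainedStream:
    """Append-only log that keeps records for a fixed retention period."""

    def __init__(self, retention_seconds: float) -> None:
        self._retention = retention_seconds
        self._records: List[Tuple[int, float, str]] = []  # (sequence, timestamp, data)
        self._next_seq = 0

    def put(self, data: str, now: float) -> int:
        seq = self._next_seq
        self._records.append((seq, now, data))
        self._next_seq += 1
        return seq

    def expire(self, now: float) -> None:
        """Drop records older than the retention period."""
        self._records = [r for r in self._records if now - r[1] < self._retention]

    def read_from(self, sequence: int) -> List[Tuple[int, str]]:
        """Return every retained record at or after `sequence` without removing it."""
        return [(seq, data) for seq, _, data in self._records if seq >= sequence]


DAY = 24 * 3600
stream = RetainedStream(retention_seconds=DAY)
stream.put("login", now=0)
stream.put("click", now=10)
stream.put("purchase", now=20)

dashboard = stream.read_from(0)   # one consumer reads from the start
alerts = stream.read_from(1)      # another starts later, independently
replay = stream.read_from(0)      # re-reading returns the same records

stream.expire(now=DAY + 5)        # "login" is now older than the retention period
after_expiry = stream.read_from(0)
print(dashboard, alerts, after_expiry)
```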
Producers and Consumers
Producers (write data):
- AWS SDK: Multiple programming languages supported
- Kinesis Agent: Tool for sending data to Kinesis Data Streams
- KPL: Enhances ingestion capabilities, simplifies retry/batching/optimization
Consumers (read/process data):
- KCL: Handles complex distributed processing tasks (load balancing, checkpointing, failure handling)
- Kinesis SDKs: Finer control for experienced developers
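To give a sense of the retry handling that the KPL takes off your hands, here is a minimal sketch of doing it manually with the SDK. It assumes a boto3-style Kinesis client whose `put_records` call returns per-record results in request order, with `ErrorCode` set on failed entries. The helper `put_with_retry`, the `device_id` field, the stream name, and the `FlakyClient` stub are all hypothetical, and chunking to the 500-record request limit is omitted.

```python
import json
import time
from typing import Any, Dict, List


def put_with_retry(client: Any, stream_name: str, events: List[Dict],
                   max_attempts: int = 3, backoff_seconds: float = 0.1) -> List[Dict]:
    """Send events with PutRecords, resending only entries that failed.

    Returns the request entries that still failed after the last attempt.
    """
    pending = [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e["device_id"])}
        for e in events
    ]
    for attempt in range(max_attempts):
        if not pending:
            break
        response = client.put_records(StreamName=stream_name, Records=pending)
        # Keep only the entries whose matching result carries an ErrorCode.
        pending = [entry for entry, result in zip(pending, response["Records"])
                   if "ErrorCode" in result]
        if pending and attempt < max_attempts - 1:
            time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    return pending


class FlakyClient:
    """Stand-in for a Kinesis client; throttles the first record on the first call."""

    def __init__(self) -> None:
        self.calls = 0

    def put_records(self, StreamName: str, Records: List[Dict]) -> Dict:
        self.calls += 1
        results = []
        for i, _ in enumerate(Records):
            if self.calls == 1 and i == 0:
                results.append({"ErrorCode": "ProvisionedThroughputExceededException",
                                "ErrorMessage": "throttled"})
            else:
                results.append({"SequenceNumber": str(i),
                                "ShardId": "shardId-000000000000"})
        failed = sum("ErrorCode" in r for r in results)
        return {"FailedRecordCount": failed, "Records": results}


client = FlakyClient()
still_failed = put_with_retry(client, "demo-stream",
                              [{"device_id": 1}, {"device_id": 2}],
                              backoff_seconds=0)
print(client.calls, still_failed)
```

With a real client, `FlakyClient()` would be replaced by a Kinesis client created through the AWS SDK.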
Amazon Managed Service for Apache Flink
- Function: Build end-to-end processing applications for streaming data
- Processing: Methods include aggregation, anomaly detection
- Insights: Gain business insights in real time (seconds/minutes vs days/weeks)
- Management: Serverless service with automatic scaling
- Languages: Process with SQL or Apache Flink
Use Cases
- Streaming ETL
- Continuous metric generation
- Responsive real-time analytics
- Interactive querying of data streams
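Continuous metric generation usually means aggregating events over time windows. The sketch below shows the idea of a tumbling (fixed, non-overlapping) window count in plain Python. It does not use the Apache Flink API, which also handles concerns such as event time, watermarks, and state that are ignored here; the function name and the sample events are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(events: Iterable[Tuple[float, str]],
                           window_seconds: int) -> Dict[Tuple[int, str], int]:
    """Count events per key within fixed, non-overlapping time windows.

    Each event is (timestamp_seconds, key); the result maps
    (window_start_seconds, key) to the number of events in that window.
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


clicks = [(0.5, "latte"), (12.0, "latte"), (59.9, "mocha"),
          (60.0, "latte"), (75.2, "latte")]
per_minute = tumbling_window_counts(clicks, window_seconds=60)
print(per_minute)
```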
Amazon Kinesis Video Streams
- Function: Securely streams video from connected devices to AWS
- Scalability: Automatically provisions infrastructure for millions of devices
- Storage: Durably stores, encrypts, and indexes video data
- Sources: Smartphones, security cameras, webcams, car cameras, drones
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- Function: Fully managed Apache Kafka service
- Benefits: Reduces operational overhead
- Compatibility: Deploy with existing Apache Kafka tools
- Integration: Works with existing AWS integrations
When to Choose MSK
- Want open source solutions to reduce licensing costs
- Already familiar with Apache Kafka
- Need longer, configurable data retention
- Willing to manage more underlying infrastructure
Use Case Examples
Café Clickstream Analysis
Requirements: Evaluate when users browse or click the café menu (subsecond delivery is not needed)
Architecture:
- Static website (Amazon S3) returns café menu page
- Browser sends clickstream data to API Gateway
- Firehose ingests clickstream data
- Data stored in S3 bucket for analysis/visualization
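The hand-off to Firehose in this architecture could look roughly like the sketch below. It assumes a boto3-style Firehose client with a `put_record_batch` call, and it appends a newline to each record because Firehose concatenates records as delivered, which would otherwise leave the S3 objects without line breaks. The helper names, the delivery stream name, the event fields, and the `StubFirehose` class are hypothetical.

```python
import json
from typing import Any, Dict, List


def to_firehose_records(events: List[Dict]) -> List[Dict[str, bytes]]:
    """Serialize each click event as one JSON line."""
    return [{"Data": (json.dumps(e, sort_keys=True) + "\n").encode("utf-8")}
            for e in events]


def send_clicks(firehose_client: Any, stream_name: str, events: List[Dict]) -> int:
    """Send a batch of click events and return how many records failed."""
    response = firehose_client.put_record_batch(
        DeliveryStreamName=stream_name,
        Records=to_firehose_records(events),
    )
    return response["FailedPutCount"]


class StubFirehose:
    """Stand-in client so the sketch runs without AWS credentials."""

    def __init__(self) -> None:
        self.batches: List[Any] = []

    def put_record_batch(self, DeliveryStreamName: str, Records: List[Dict]) -> Dict:
        self.batches.append((DeliveryStreamName, Records))
        return {"FailedPutCount": 0,
                "RequestResponses": [{"RecordId": str(i)} for i, _ in enumerate(Records)]}


clicks = [{"page": "/menu", "item": "latte"}, {"page": "/menu", "item": "mocha"}]
stub = StubFirehose()
failed_count = send_clicks(stub, "cafe-clickstream", clicks)
first_record = to_firehose_records(clicks)[0]["Data"]
print(failed_count, first_record)
```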
Medical Device Monitoring
Requirements: Monitor patients, verify that devices are functioning accurately, and notify medical professionals in near real time
Architecture:
- Kinesis Data Streams ingests IoT sensor data
- Amazon Managed Service for Apache Flink performs anomaly detection
- AWS Lambda function initiated if anomaly detected
- Lambda sends mobile notification to medical professional
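The detect-then-notify flow can be illustrated with a simple rolling z-score check. This is one possible anomaly rule chosen for the example, not the algorithm that Amazon Managed Service for Apache Flink or the original architecture uses. The class name, thresholds, sample heart-rate readings, and the `notify` callback (standing in for the Lambda notification) are all assumptions.

```python
import statistics
from collections import deque
from typing import Callable, Deque, List


class AnomalyDetector:
    """Flag readings that deviate strongly from the recent rolling window."""

    def __init__(self, window: int = 20, threshold: float = 3.0,
                 min_samples: int = 5) -> None:
        self._history: Deque[float] = deque(maxlen=window)
        self._threshold = threshold
        self._min_samples = min_samples

    def observe(self, value: float) -> bool:
        is_anomaly = False
        if len(self._history) >= self._min_samples:
            mean = statistics.mean(self._history)
            stdev = statistics.pstdev(self._history)
            # z-score: distance from the recent mean in standard deviations
            if stdev > 0 and abs(value - mean) / stdev > self._threshold:
                is_anomaly = True
        if not is_anomaly:
            self._history.append(value)  # keep anomalies out of the baseline
        return is_anomaly


def monitor(readings: List[float], notify: Callable[[float], None]) -> None:
    """Run each reading through the detector and notify on anomalies."""
    detector = AnomalyDetector()
    for reading in readings:
        if detector.observe(reading):
            notify(reading)


alerts: List[float] = []
monitor([72, 71, 73, 72, 74, 73, 140, 72], notify=alerts.append)
print(alerts)
```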
Service Comparison
| Feature | Firehose | Kinesis Data Streams | Amazon MSK |
|---|---|---|---|
| Complexity | Simplest option | Requires code customization | Most complex setup |
| Integration | Plug-and-play with AWS services | Requires KCL/KPL development | Custom process integration |
| Data Retention | Max 24 hours undelivered data | 24 hours to 365 days | Longer, configurable |
| Use When | Feed to approved destinations | Need destination control or custom consumption | Open source preference, Kafka familiarity |
Selection Criteria
Choose a streaming service based on:
- Business Use Case: Required latency and processing complexity
- Engineering Effort: Development and operational overhead tolerance
- Retention Requirements: How long data needs to be stored/replayed
- Integration Needs: AWS service integration vs custom solutions
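As a rough aid, the comparison above can be encoded as a small decision helper. This is a deliberately simplified heuristic built only from the criteria in this section; the function name, parameters, and the order in which the rules are checked are assumptions, and a real decision would weigh cost, team skills, and latency targets as well.

```python
def suggest_service(custom_consumers: bool, kafka_preferred: bool,
                    retention_hours: int) -> str:
    """Suggest a streaming service from three coarse requirements."""
    max_kinesis_retention_hours = 365 * 24
    if kafka_preferred or retention_hours > max_kinesis_retention_hours:
        # Open source preference, Kafka familiarity, or longer configurable retention
        return "Amazon MSK"
    if custom_consumers or retention_hours > 24:
        # Custom consumption, destination control, or replay beyond a day
        return "Kinesis Data Streams"
    # Simple delivery to supported destinations
    return "Amazon Data Firehose"


print(suggest_service(custom_consumers=False, kafka_preferred=False, retention_hours=24))
print(suggest_service(custom_consumers=True, kafka_preferred=False, retention_hours=24))
print(suggest_service(custom_consumers=False, kafka_preferred=True, retention_hours=24))
```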
The Amazon Kinesis family provides fully managed streaming services with Firehose for simple delivery, Kinesis Data Streams for low-latency processing, and additional services for video streaming and analytics. Amazon MSK offers open-source Kafka compatibility for organizations with existing Kafka expertise.