Processing Real-Time Data

Streaming Data Fundamentals

Definition

Streaming data is emitted at high volume in a continuous, incremental manner with the goal of low-latency processing.

Gaming Example

An online gaming company collects streaming data about player-game interactions, feeds it into the gaming platform, analyzes data in real time, and offers incentives and dynamic experiences to engage players.

Industry Applications

Streaming is most beneficial in industries where new, dynamic data is generated continually:

Starting Simple: System logs, rolling min-max computations
Advanced Processing: Sophisticated near real-time processing
Business Use: Track public sentiment on brands/products through social media analysis
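The rolling min-max computation mentioned above can be sketched in a few lines. This is only an illustration in plain Python; the class name RollingMinMax and the fixed count-based window are assumptions made for the example, not part of any AWS service.

```python
from collections import deque


class RollingMinMax:
    """Track the min and max of the last `window` values in a stream."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self._count = 0        # index of the next incoming value
        self._mins = deque()   # (index, value) pairs, values increasing
        self._maxs = deque()   # (index, value) pairs, values decreasing

    def update(self, value: float) -> tuple:
        """Add one value and return (min, max) over the current window."""
        i = self._count
        self._count += 1

        # Drop candidates that can never be the min or max again.
        while self._mins and self._mins[-1][1] >= value:
            self._mins.pop()
        while self._maxs and self._maxs[-1][1] <= value:
            self._maxs.pop()
        self._mins.append((i, value))
        self._maxs.append((i, value))

        # Evict entries that have slid out of the window.
        oldest = i - self.window + 1
        while self._mins[0][0] < oldest:
            self._mins.popleft()
        while self._maxs[0][0] < oldest:
            self._maxs.popleft()

        return self._mins[0][1], self._maxs[0][1]
```

Each incoming value is handled incrementally, which is the property that distinguishes stream processing from recomputing over a stored dataset.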

Streaming Pipeline Architecture

Core Elements

Sources: Multiple streaming data sources
Producers: Collect data from sources, deliver to data stream service
Streams: Mechanism to deliver messages/data records from producers to consumers
Consumers: Read and process data from streaming storage
Destinations: Multiple target destinations for processed data
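To make the roles concrete, here is a minimal in-memory sketch of the five elements. It assumes a standard-library queue stands in for the stream and a Python list stands in for the destination; the function names producer and consumer are illustrative only, and a real pipeline would use a managed streaming service instead.

```python
import json
import queue


def producer(source_events, stream):
    """Collect events from a source and deliver them to the stream."""
    for event in source_events:
        # Records travel through the stream as serialized bytes.
        stream.put(json.dumps(event).encode("utf-8"))


def consumer(stream, destination):
    """Read records from the stream, process them, and write to a destination."""
    while True:
        try:
            record = stream.get_nowait()
        except queue.Empty:
            break  # nothing left to read in this sketch
        event = json.loads(record)
        event["processed"] = True  # placeholder for real processing
        destination.append(event)
```

The producer and consumer never call each other directly. The stream decouples them, which is what lets each side scale or fail independently.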

AWS Streaming Services

Amazon Data Firehose

Function: Captures and transforms streaming data in near real time
Delivery: Delivers data to common storage and analytics destinations
Processing: Buffers incoming records into micro batches by size or time interval before delivery (typically seconds to minutes)
Format Conversion: Can convert JSON to Apache Parquet or Apache ORC before storing in Amazon S3
Management: No coding required

Destinations

Amazon S3, Amazon Redshift, Amazon OpenSearch Service
Custom HTTP endpoints, supported third-party service providers

Micro Batching

Micro batching balances latency and throughput by waiting a short duration to collect records before processing them as a batch. It is quicker than traditional batch processing but not as fast as true streaming.
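The trade-off can be illustrated with a small buffer that flushes when either a record count or a time limit is reached. This is a sketch of the general technique, not Firehose's implementation; the class name MicroBatcher, its parameters, and the default limits are assumptions chosen for the example.

```python
import time


class MicroBatcher:
    """Buffer records and flush them as a batch by size or elapsed time."""

    def __init__(self, flush, max_records=500, max_wait_seconds=1.0,
                 clock=time.monotonic):
        self._flush = flush                  # callable that receives a list of records
        self.max_records = max_records
        self.max_wait_seconds = max_wait_seconds
        self._clock = clock                  # injectable so tests can control time
        self._buffer = []
        self._started = 0.0

    def add(self, record):
        """Add a record; flush if the batch is full or has waited long enough."""
        if not self._buffer:
            self._started = self._clock()    # the wait starts with the first record
        self._buffer.append(record)
        full = len(self._buffer) >= self.max_records
        waited = self._clock() - self._started >= self.max_wait_seconds
        if full or waited:
            self.flush()

    def flush(self):
        """Send any buffered records downstream as one batch."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)
```

Note that this sketch only checks the time limit when a new record arrives. A production batcher would also flush on a timer so that a quiet stream does not hold records indefinitely.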

Amazon Kinesis Data Streams

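Function: Ingests streaming log and event data
Storage: Temporary storage for streaming data (24 hours by default, extendable up to 365 days)
Delivery: Real-time delivery; records are typically available to consumers within about one second of being written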
Replay: Multiple consumers can replay stream data simultaneously
Libraries: Uses Kinesis Producer Library (KPL) and Kinesis Client Library (KCL)
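Retention is configured in hours through the Kinesis API. The sketch below assumes a boto3 Kinesis client is passed in (for example boto3.client("kinesis")); the helper name extend_retention and the stream name used when calling it are hypothetical.

```python
MIN_RETENTION_HOURS = 24          # default retention for a stream
MAX_RETENTION_HOURS = 365 * 24    # 8760 hours, the documented maximum


def extend_retention(kinesis_client, stream_name, days):
    """Raise a stream's retention period to `days` days."""
    hours = days * 24
    if not MIN_RETENTION_HOURS <= hours <= MAX_RETENTION_HOURS:
        raise ValueError("retention must be between 1 and 365 days")
    # The API only accepts a value larger than the stream's current retention.
    return kinesis_client.increase_stream_retention_period(
        StreamName=stream_name,
        RetentionPeriodHours=hours,
    )
```

Longer retention is what makes replay useful: a consumer added later, or one recovering from a failure, can reread records that are still within the retention window.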

Producers and Consumers

Producers (write data):

AWS SDK: Multiple programming languages supported
Kinesis Agent: Tool for sending data to Kinesis Data Streams
KPL: Enhances ingestion capabilities, simplifies retry/batching/optimization

Consumers (read/process data):

KCL: Handles complex distributed processing tasks (load balancing, checkpointing, failure handling)
Kinesis SDKs: Finer control for experienced developers
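As a rough illustration of the SDK path, the following sketch writes one record and reads one batch from a shard. It assumes a boto3 Kinesis client is passed in (for example boto3.client("kinesis")); the helper names put_event and read_shard_once are hypothetical, and the sketch deliberately omits the retries, batching, and checkpointing that the KPL and KCL provide.

```python
import json


def put_event(kinesis_client, stream_name, event, partition_key):
    """Producer side: write one JSON event to the stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,   # determines which shard receives the record
    )


def read_shard_once(kinesis_client, stream_name, shard_id, limit=100):
    """Consumer side: read one batch from the oldest available record in a shard."""
    iterator = kinesis_client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",   # start at the oldest retained record
    )["ShardIterator"]
    response = kinesis_client.get_records(ShardIterator=iterator, Limit=limit)
    return [json.loads(record["Data"]) for record in response["Records"]]
```

A long-running consumer would keep calling get_records with the NextShardIterator from each response and track its own position per shard. That bookkeeping is the work the KCL takes over.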

Amazon Managed Service for Apache Flink

Function: Build end-to-end processing applications for streaming data
Processing: Methods include aggregation, anomaly detection
Insights: Gain business insights in real time (seconds/minutes vs days/weeks)
Management: Serverless service with automatic scaling
Languages: Process with SQL or the Apache Flink APIs (Java, Scala, or Python)

Use Cases

Streaming ETL
Continuous metric generation
Responsive real-time analytics
Interactive querying of data streams
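Continuous metric generation usually means aggregating events over time windows. The sketch below shows a tumbling-window count in plain Python to illustrate the idea. It is not Flink code; the function name tumbling_window_counts and the (timestamp, key) event shape are assumptions for the example.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns a dict mapping each window start time to a Counter of keys.
    """
    windows = defaultdict(Counter)
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return dict(windows)
```

A stream processor such as Flink applies the same grouping continuously and emits each window's result as the window closes, rather than waiting for the full dataset.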

Amazon Kinesis Video Streams

Function: Securely streams video from connected devices to AWS
Scalability: Automatically provisions infrastructure for millions of devices
Storage: Durably stores, encrypts, and indexes video data
Sources: Smartphones, security cameras, webcams, car cameras, drones

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Function: Fully managed Apache Kafka service
Benefits: Reduces operational overhead
Compatibility: Deploy with existing Apache Kafka tools
Integration: Works with existing AWS integrations

When to Choose MSK

Want open source solutions to reduce licensing costs
Already familiar with Apache Kafka
Need longer, configurable data retention
Willing to manage more underlying infrastructure

Use Case Examples

Café Clickstream Analysis

Requirements: Evaluate when users browse and click items on the café menu; subsecond delivery is not required

Architecture:

Static website (Amazon S3) returns café menu page
Browser sends clickstream data to API Gateway
Firehose ingests clickstream data
Data stored in S3 bucket for analysis/visualization
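The ingestion step in this architecture could look roughly like the sketch below, for example inside the backend that API Gateway invokes. It assumes a boto3 Firehose client is passed in (for example boto3.client("firehose")); the helper name send_click and the shape of the click event are hypothetical.

```python
import json


def send_click(firehose_client, delivery_stream_name, click):
    """Send one clickstream event to a Firehose delivery stream."""
    # A trailing newline keeps records separable once Firehose
    # concatenates them into a single object in S3.
    payload = json.dumps(click) + "\n"
    return firehose_client.put_record(
        DeliveryStreamName=delivery_stream_name,
        Record={"Data": payload.encode("utf-8")},
    )
```

No consumer code is needed in this design because Firehose handles buffering and delivery to the S3 bucket, which fits the requirement that subsecond delivery is not needed.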

Medical Device Monitoring

Requirements: Monitor patients, verify that devices are functioning accurately, and notify medical professionals in near real time

Architecture:

Kinesis Data Streams ingests IoT sensor data
Amazon Managed Service for Apache Flink performs anomaly detection
AWS Lambda function initiated if anomaly detected
Lambda sends mobile notification to medical professional
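The detect-then-notify flow can be sketched with a simple rolling z-score check. This is only an illustration in plain Python: the class name ZScoreDetector, the window and threshold values, and the notify callback (standing in for the Lambda notification step) are all assumptions, and a real deployment would implement detection in a Flink application with clinically validated logic.

```python
import math
from collections import deque


class ZScoreDetector:
    """Flag a reading that deviates strongly from recent readings."""

    def __init__(self, window=30, threshold=3.0):
        self.threshold = threshold
        self._values = deque(maxlen=window)   # most recent readings only

    def is_anomaly(self, value):
        """Return True if `value` is far from the rolling mean, then record it."""
        anomalous = False
        if len(self._values) >= 2:
            mean = sum(self._values) / len(self._values)
            variance = sum((v - mean) ** 2 for v in self._values) / len(self._values)
            std = math.sqrt(variance)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        self._values.append(value)
        return anomalous


def monitor(readings, detector, notify):
    """Check each reading and call `notify` for every anomaly found."""
    for value in readings:
        if detector.is_anomaly(value):
            notify(value)
```

Keeping detection and notification separate mirrors the architecture above, where the stream processor decides what is anomalous and a separate function handles delivery to the medical professional.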

Service Comparison

Feature	Firehose	Kinesis Data Streams	Amazon MSK
Complexity	Simplest option	Requires code customization	Most complex setup
Integration	Plug-and-play with AWS services	Requires KCL/KPL development	Custom process integration
Data Retention	Max 24 hours undelivered data	24 hours to 365 days	Longer, configurable
Use When	Feed to approved destinations	Need destination control or custom consumption	Open source preference, Kafka familiarity

Selection Criteria

Choose a streaming service based on:

Business Use Case: Required latency and processing complexity
Engineering Effort: Development and operational overhead tolerance
Retention Requirements: How long data needs to be stored/replayed
Integration Needs: AWS service integration vs custom solutions

The Amazon Kinesis family provides fully managed streaming services: Firehose for simple delivery, Kinesis Data Streams for low-latency processing, and additional services for video streaming and analytics. Amazon MSK offers open-source Apache Kafka compatibility for organizations with existing Kafka expertise.