Pablo Rodriguez

Processing Realtime Data

Streaming data is emitted at high volume in a continuous, incremental manner with the goal of low-latency processing.

For example, an online gaming company collects streaming data about player-game interactions, feeds it into its gaming platform, analyzes the data in real time, and offers incentives and dynamic experiences to keep players engaged.

Stream processing is most beneficial wherever new, dynamic data is generated continually:

  • Starting Simple: System logs, rolling min-max computations
  • Advanced Processing: Sophisticated near real-time processing
  • Business Use: Track public sentiment on brands/products through social media analysis

Stream processing pipeline:

  1. Sources: Multiple streaming data sources
  2. Producers: Collect data from sources, deliver to data stream service
  3. Streams: Mechanism to deliver messages/data records from producers to consumers
  4. Consumers: Read and process data from streaming storage
  5. Destinations: Multiple target destinations for processed data
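
The five stages above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not an AWS API: the sample readings, the function names, and the use of a `deque` as the "stream" are all assumptions made for the example. The consumer computes the rolling min-max mentioned as a simple starting point.

```python
from __future__ import annotations

from collections import deque
from typing import Iterable, Iterator


def source() -> Iterator[float]:
    """Source: a stand-in for a device or log emitting readings (made-up values)."""
    yield from [21.0, 23.5, 19.0, 25.0, 22.0]


def rolling_min_max(values: Iterable[float], window: int) -> Iterator[tuple[float, float]]:
    """Consumer: emit (min, max) over the last `window` values seen so far."""
    recent: deque[float] = deque(maxlen=window)
    for value in values:
        recent.append(value)
        yield min(recent), max(recent)


def run_pipeline(window: int = 3) -> list[tuple[float, float]]:
    stream: deque[float] = deque()   # Stream: buffer between producer and consumer
    for reading in source():         # Producer: collect from the source, write to the stream
        stream.append(reading)
    # Destination: here just a list; in practice a data store or dashboard
    return list(rolling_min_max(stream, window))
```

In a real pipeline the producer and consumer run concurrently and the stream is a managed service; the sketch only shows how the roles hand data to one another.
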

Amazon Data Firehose

  • Function: Captures and transforms streaming data in near real time
  • Delivery: Delivers data to common storage and analytics destinations
  • Processing: Uses micro batch cycles (milliseconds to several seconds)
  • Format Conversion: Converts JSON to Parquet before storing in Amazon S3
  • Management: No coding required

Destinations:

  • Amazon S3, Amazon Redshift, Amazon OpenSearch Service
  • Custom HTTP endpoints, supported third-party service providers

Micro-batching balances latency and throughput by waiting a short time to collect records before each batch operation. It is quicker than traditional batch processing but not as fast as true streaming.
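
The latency/throughput trade-off can be made concrete with a small buffer that flushes when it is either full or too old. The sketch below is an assumption-laden illustration of micro-batching in general, not how Firehose is implemented; `MicroBatcher`, its parameters, and the injectable `clock` are hypothetical names chosen for the example.

```python
from __future__ import annotations

import time
from typing import Callable


class MicroBatcher:
    """Buffer records and flush when the batch is full or has waited too long."""

    def __init__(self, flush: Callable[[list[bytes]], None], max_records: int = 3,
                 max_age_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self._flush = flush
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: list[bytes] = []
        self._opened_at = 0.0

    def add(self, record: bytes) -> None:
        if not self._buffer:
            self._opened_at = self._clock()  # batch age starts at its first record
        self._buffer.append(record)
        self._maybe_flush()

    def tick(self) -> None:
        """Call periodically so an idle, partly filled batch still gets delivered."""
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        if not self._buffer:
            return
        full = len(self._buffer) >= self._max_records
        stale = self._clock() - self._opened_at >= self._max_age
        if full or stale:
            batch, self._buffer = self._buffer, []
            self._flush(batch)
```

A larger `max_records` or `max_age_seconds` favors throughput (fewer, bigger writes); smaller values favor latency.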

Amazon Kinesis Data Streams

  • Function: Ingests stream log and event data
  • Storage: Temporary storage for streaming data (24 hours by default, extendable up to 365 days)
  • Delivery: Real-time delivery, with records typically available to consumers in under a second
  • Replay: Multiple consumers can replay stream data simultaneously
  • Libraries: Uses Kinesis Producer Library (KPL) and Kinesis Client Library (KCL)

Producers (write data):

  1. AWS SDK: Multiple programming languages supported
  2. Kinesis Agent: Tool for sending data to Kinesis Data Streams
  3. KPL: Enhances ingestion capabilities, simplifies retry/batching/optimization
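
To show the kind of batching and retry work a producer has to do, here is a minimal sketch written against a boto3-style Kinesis client. It is not the KPL: `put_with_retry` and its parameters are hypothetical, and `client` is assumed to be any object exposing `put_records(StreamName=..., Records=[...])` and returning per-record results in request order, as the boto3 Kinesis client does.

```python
from __future__ import annotations

from typing import Any, Iterable

MAX_BATCH = 500  # PutRecords accepts at most 500 records per request


def put_with_retry(client: Any, stream_name: str,
                   records: Iterable[tuple[str, bytes]],
                   max_attempts: int = 3) -> int:
    """Send (partition_key, data) pairs in batches; return how many were never accepted."""
    pending = [{"PartitionKey": key, "Data": data} for key, data in records]
    failed_for_good = 0
    for start in range(0, len(pending), MAX_BATCH):
        batch = pending[start:start + MAX_BATCH]
        for _ in range(max_attempts):
            response = client.put_records(StreamName=stream_name, Records=batch)
            # Failed entries carry an ErrorCode; keep only those for the next attempt.
            batch = [record for record, result in zip(batch, response["Records"])
                     if "ErrorCode" in result]
            if not batch:
                break
            # A production producer would sleep with backoff here before retrying.
        failed_for_good += len(batch)
    return failed_for_good
```

With boto3 installed and credentials configured, `boto3.client("kinesis")` could be passed as `client`; the stream name is whatever stream you have created.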

Consumers (read/process data):

  4. KCL: Handles complex distributed processing tasks (load balancing, checkpointing, failure handling)
  5. Kinesis SDKs: Finer control for experienced developers
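
Checkpointing is the piece of consumer logic that lets a restarted worker resume where it left off, and lets several consumers read the same retained data independently. The sketch below is a simplified in-memory model under stated assumptions: the stream is a plain Python list, the checkpoint store is a dict, and `CheckpointingConsumer` is a hypothetical class, not the KCL API.

```python
from __future__ import annotations

from typing import Callable


class CheckpointingConsumer:
    """Reads an append-only list of records and remembers how far it got."""

    def __init__(self, name: str, checkpoints: dict[str, int]):
        self.name = name
        self._checkpoints = checkpoints  # a shared, durable store in a real system

    def poll(self, stream: list[str], handle: Callable[[str], None]) -> int:
        """Process every record after the last checkpoint; return the count processed."""
        start = self._checkpoints.get(self.name, 0)
        processed = 0
        for offset in range(start, len(stream)):
            handle(stream[offset])
            # Checkpoint only after the record is handled, so a crash leads to a
            # re-read (at-least-once processing) rather than a skipped record.
            self._checkpoints[self.name] = offset + 1
            processed += 1
        return processed
```

Because each consumer name has its own checkpoint, a second consumer created later starts from the beginning of the retained data without affecting the first.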

Amazon Managed Service for Apache Flink

  • Function: Build end-to-end processing applications for streaming data
  • Processing: Methods include aggregation, anomaly detection
  • Insights: Gain business insights in real time (seconds/minutes vs days/weeks)
  • Management: Serverless service with automatic scaling
  • Languages: Process with SQL or Apache Flink

Use cases:

  • Streaming ETL
  • Continuous metric generation
  • Responsive real-time analytics
  • Interactive querying of data streams
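
Continuous metric generation usually comes down to windowed aggregation. As a rough illustration in plain Python (not Flink code, and ignoring late or out-of-order events), the hypothetical function below counts events per fixed, non-overlapping time window; the event tuples and key names are made up for the example.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable


def tumbling_window_counts(events: Iterable[tuple[float, str]],
                           window_seconds: float) -> dict[tuple[int, str], int]:
    """Count events per (window index, key); windows are fixed-size and non-overlapping."""
    counts: dict[tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        window = int(timestamp // window_seconds)  # which window this event falls into
        counts[(window, key)] += 1
    return dict(counts)
```

For instance, with 10-second windows an event at 9.9 seconds lands in window 0 and an event at 10.0 seconds lands in window 1.
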

Amazon Kinesis Video Streams

  • Function: Securely streams video from connected devices to AWS
  • Scalability: Automatically provisions infrastructure for millions of devices
  • Storage: Durably stores, encrypts, and indexes video data
  • Sources: Smartphones, security cameras, webcams, car cameras, drones

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

  • Function: Fully managed Apache Kafka service
  • Benefits: Reduces operational overhead
  • Compatibility: Deploy with existing Apache Kafka tools
  • Integration: Works with existing AWS integrations

When to choose Amazon MSK:

  • Want open source solutions to reduce licensing costs
  • Already familiar with Apache Kafka
  • Need longer, configurable data retention
  • Willing to manage more underlying infrastructure

Example: clickstream analysis for a café website

Requirements: Evaluate when users browse and click items on the café menu (subsecond delivery is not needed)

Architecture:

  1. Static website (Amazon S3) returns café menu page
  2. Browser sends clickstream data to API Gateway
  3. Firehose ingests clickstream data
  4. Data stored in S3 bucket for analysis/visualization
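
Step 3 of this architecture might look roughly like the following. This is a sketch under assumptions: the event fields are invented, `send_click` is a hypothetical helper, and `firehose` is any object with a boto3-style `put_record(DeliveryStreamName=..., Record={"Data": ...})` method. Each event is written as one newline-terminated JSON line so the objects that land in S3 can be split back into records.

```python
from __future__ import annotations

import json
from typing import Any


def to_firehose_record(event: dict[str, Any]) -> dict[str, bytes]:
    """Serialize one click event as a compact, newline-terminated JSON line."""
    line = json.dumps(event, separators=(",", ":"), sort_keys=True) + "\n"
    return {"Data": line.encode("utf-8")}


def send_click(firehose: Any, delivery_stream: str, event: dict[str, Any]) -> None:
    """Hand one event to Firehose via a boto3-style client."""
    firehose.put_record(DeliveryStreamName=delivery_stream,
                        Record=to_firehose_record(event))
```

With boto3 available, `boto3.client("firehose")` could serve as `firehose`; the delivery stream name depends on your own setup.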

Example: patient monitoring with IoT sensors

Requirements: Monitor patients, verify that devices are functioning accurately, and notify medical professionals in near real time

Architecture:

  1. Kinesis Data Streams ingests IoT sensor data
  2. Amazon Managed Service for Apache Flink performs anomaly detection
  3. AWS Lambda function initiated if anomaly detected
  4. Lambda sends mobile notification to medical professional
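
Steps 2 to 4 can be approximated with a rolling-statistics check that calls a notification hook. This is only one simple way to detect anomalies and is not taken from the source architecture: the `AnomalyDetector` class, its thresholds, and the `notify` callback (standing in for the Lambda invocation) are all assumptions for illustration.

```python
from __future__ import annotations

from collections import deque
from statistics import mean, pstdev
from typing import Callable


class AnomalyDetector:
    """Flag a reading that is far from the rolling mean of recent normal readings."""

    def __init__(self, notify: Callable[[float], None], window: int = 5,
                 threshold: float = 3.0, min_sigma: float = 1.0):
        self._notify = notify
        self._window = window
        self._recent: deque[float] = deque(maxlen=window)
        self._threshold = threshold
        self._min_sigma = min_sigma

    def observe(self, reading: float) -> bool:
        is_anomaly = False
        if len(self._recent) == self._window:  # wait until a full baseline exists
            mu = mean(self._recent)
            # Floor sigma so a nearly flat baseline does not flag tiny fluctuations.
            sigma = max(pstdev(self._recent), self._min_sigma)
            is_anomaly = abs(reading - mu) > self._threshold * sigma
        if is_anomaly:
            self._notify(reading)         # stands in for invoking the Lambda function
        else:
            self._recent.append(reading)  # only normal readings update the baseline
        return is_anomaly
```

Fed heart-rate readings of 70, 72, 71, 69, 70 and then 140, the detector stays quiet for the first five values and calls `notify` on the sixth.
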

| Feature | Firehose | Kinesis Data Streams | Amazon MSK |
|---|---|---|---|
| Complexity | Simplest option | Requires code customization | Most complex setup |
| Integration | Plug-and-play with AWS services | Requires KCL/KPL development | Custom process integration |
| Data Retention | Max 24 hours undelivered data | 24 hours to 365 days | Longer, configurable |
| Use When | Feed to approved destinations | Need destination control or custom consumption | Open source preference, Kafka familiarity |

Choose a streaming service based on:

  • Business Use Case: Required latency and processing complexity
  • Engineering Effort: Development and operational overhead tolerance
  • Retention Requirements: How long data needs to be stored/replayed
  • Integration Needs: AWS service integration vs custom solutions

The Amazon Kinesis family provides fully managed streaming services with Firehose for simple delivery, Kinesis Data Streams for low-latency processing, and additional services for video streaming and analytics. Amazon MSK offers open-source Kafka compatibility for organizations with existing Kafka expertise.