Processing Batch Data
Batch Processing Concepts
Definition and Characteristics
Batch processing involves periodically completing high-volume, repetitive data jobs, and is typically used for daily and weekly reporting. Compute-intensive tasks are processed in batches, often during off-peak times.
Banking Example
A banking system receives financial transactions throughout the day but collects and processes all transactions at the end of each day to provide reports to stakeholders and third-party systems.
Benefits
- Efficiency: Makes repetitive tasks more efficient to run
- Cost-Effectiveness: Uses processing power more cost-effectively
- Resource Optimization: Processes compute-intensive tasks during off-peak times when resources are more available
When to Consider Batch Processing
Primary Use Cases
- Reporting Purposes: Generate scheduled reports and analytics
- Large Datasets: Handle datasets too large for real-time processing
- Aggregation Focus: When the use case focuses on aggregating or transforming data rather than real-time analysis
Medical Research Example
Analyzing large amounts of data is common in research, where real-time analysis is usually not needed. Applications include:
- Computational chemistry
- Clinical modeling
- Molecular dynamics
- Genomic sequencing testing and analysis
AWS Glue Overview
Core Function
AWS Glue is a data integration service that automates and performs ETL tasks as part of ingesting data into batch or streaming pipelines.
Multi-Purpose Capabilities
- Data Integration: Read and write data from multiple systems and databases
- Service Integration: Works with Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon RDS, and Amazon DocumentDB
- Pipeline Support: Simplifies both batch and streaming ingestion
AWS Glue Components
AWS Glue Data Catalog
- Function: Data discovery and organization
- Storage: Stores metadata about datasets including schema information
- Scope: Table definitions and physical locations, business-relevant attributes, change tracking
- Note: Does not store actual datasets, only metadata
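Because the catalog holds only metadata, a table's schema and physical location can be looked up directly. A minimal sketch using boto3, assuming a hypothetical database player_logs_db and table daily_transactions have already been registered in the Data Catalog:

```python
import boto3

# The Data Catalog stores metadata only; this looks up a table's S3 location
# and column schema. Database and table names are hypothetical.
glue = boto3.client("glue", region_name="us-east-1")

response = glue.get_table(DatabaseName="player_logs_db", Name="daily_transactions")
table = response["Table"]

print("Location:", table["StorageDescriptor"]["Location"])
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```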
ETL Jobs
- Function: Data transformation jobs
- Interface: AWS Glue Studio provides a visual interface for authoring, running, and monitoring jobs
- No-Code Option: Create and edit jobs without coding
- Streaming Support: AWS Glue streaming ETL jobs process streaming data continuously, making it available for analysis sooner
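Jobs authored in AWS Glue Studio can also be started and monitored programmatically. A minimal sketch with boto3, assuming a hypothetical job named aggregate-player-logs already exists:

```python
import boto3

glue = boto3.client("glue")

# Start a job that was authored visually in AWS Glue Studio (job name is hypothetical).
run = glue.start_job_run(JobName="aggregate-player-logs")

# Check the run's state (for example RUNNING, SUCCEEDED, or FAILED).
status = glue.get_job_run(JobName="aggregate-player-logs", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])
```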
AWS Glue DataBrew
- Function: No-code service to prepare and clean data
- Interface: Visual, interactive, point-and-click interface
- Transformations: 250+ prebuilt transformations for data preparation
- Capabilities: Filter anomalies, convert to standard formats, correct invalid values
AWS Glue Data Quality
- Function: Data quality assessment
- Features: Compute statistics, recommend quality rules, monitor data, send alerts
- Benefit: Identify missing, stale, or bad data before business impact
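A minimal sketch of registering a small data quality ruleset with boto3; the rule expressions, thresholds, and the database and table names are illustrative assumptions, not taken from a real workload:

```python
import boto3

glue = boto3.client("glue")

# A small ruleset with completeness and shape checks on a catalog table.
# Names, columns, and thresholds below are hypothetical.
ruleset = """
Rules = [
    IsComplete "player_id",
    Completeness "score" > 0.95,
    ColumnCount > 5
]
"""

glue.create_data_quality_ruleset(
    Name="player-logs-quality",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "player_logs_db", "TableName": "raw_player_logs"},
)
```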
AWS Glue Workflow
Data Integration Process
- Crawlers: Run on data stores, derive the schema, and populate the Data Catalog (a crawler sketch follows this list)
- Schema Creation: Structured and semistructured data receive a schema for efficient access
- Data Quality: Automatically compute statistics and provide quality alerts
- Script Generation: Use visual AWS Glue Studio interface to author jobs
- Notebook Support: Interactive ETL job authoring with Jupyter-based notebooks
- DataBrew: Visual data preparation without coding
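A minimal sketch of the crawler step with boto3; the crawler name, IAM role, database, S3 path, and schedule are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix, infers the schema, and writes
# table definitions into the Data Catalog. All names and paths are hypothetical.
glue.create_crawler(
    Name="player-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="player_logs_db",
    Targets={"S3Targets": [{"Path": "s3://example-game-logs/raw/"}]},
    Schedule="cron(0 */6 * * ? *)",  # every 6 hours, matching the batch cadence
)

glue.start_crawler(Name="player-logs-crawler")
```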
Gaming Use Case Example
A gaming company produces gigabytes of daily user play data (a job sketch for this pipeline follows the list):
- Data Collection: Game server pushes data to S3 bucket every 6 hours
- Schema Discovery: AWS Glue crawlers run on player logs, provide data catalog
- ETL Processing: AWS Glue job aggregates log data per player into 1-minute intervals every 6 hours
- Data Access: Transformed data available in aggregated S3 bucket for multiple analytics applications
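A sketch of what the aggregation job's script might look like, assuming the crawler has already cataloged the raw logs and that each record carries hypothetical player_id, event_time, and score columns; the database, table, and bucket names are also assumptions:

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw player logs through the Data Catalog table built by the crawler.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="player_logs_db", table_name="raw_player_logs"
)

# Aggregate events per player into 1-minute windows.
df = raw.toDF()
aggregated = df.groupBy(
    "player_id", F.window(F.to_timestamp("event_time"), "1 minute")
).agg(F.count("*").alias("events"), F.sum("score").alias("score"))

# Write the result to the aggregated S3 bucket as Parquet for analytics applications.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(aggregated, glue_context, "aggregated"),
    connection_type="s3",
    connection_options={"path": "s3://example-game-logs/aggregated/"},
    format="parquet",
)

job.commit()
```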
Data Transformation Capabilities
Format Conversion
A common transformation is converting .csv files to the Apache Parquet format (a conversion sketch follows the list below).
- CSV Limitations: The most common tabular format, but inefficient for large amounts of data (>15 GB)
- Parquet Benefits:
  - Stores data in columnar fashion
  - Optimized for storage
  - Suitable for parallel processing
  - Speeds up analytics workloads and saves storage costs over time
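The conversion itself can be illustrated locally with pandas (file and column names are hypothetical, and pyarrow is assumed to be installed); in a Glue job the equivalent step is writing the output DynamicFrame with the Parquet format, as in the gaming sketch above.

```python
import pandas as pd

# Read a CSV file and rewrite it as columnar Parquet.
# File and column names are hypothetical.
df = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])
df.to_parquet("transactions.parquet", index=False)

# Parquet readers can then load only the columns a query needs.
amounts = pd.read_parquet("transactions.parquet", columns=["account_id", "amount"])
print(amounts.head())
```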
Advanced Transformations
PII Detection and Removal: AWS Glue and DataBrew support detecting and removing personally identifiable information
- Process: Scan data → Detect PII entities (passport numbers, SSN) → Remediate data
- Options: Mask data or store detection results for further inspection
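AWS Glue and DataBrew provide this as built-in transforms; purely as a conceptual illustration of the scan, detect, and remediate flow, the sketch below masks SSN-shaped values with plain pandas and a regular expression (the column names and sample data are made up and this is not the Glue API):

```python
import re

import pandas as pd

# Conceptual illustration of scan -> detect -> mask; not the built-in
# AWS Glue / DataBrew PII transforms.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_ssn(value: str) -> str:
    """Replace SSN-shaped substrings with a fixed mask."""
    return SSN_PATTERN.sub("***-**-****", value)

df = pd.DataFrame({"note": ["Customer SSN 123-45-6789 on file", "No PII here"]})

# Record where PII was detected, then remediate by masking in place.
df["pii_detected"] = df["note"].str.contains(SSN_PATTERN)
df["note"] = df["note"].apply(mask_ssn)
print(df)
```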
Best Practice Guidelines
Use AWS Glue When
- Analytics use case doesn’t require real-time aggregation or transformation
- Need schema identification and data cataloging capabilities
- Require data preparation and cleaning functionality
- Want ETL job authoring with visual interface
- Need data quality assessment and monitoring
Batch processing makes high-volume, repetitive tasks more efficient to run. For batch data processing workflows, AWS Glue provides schema identification, data cataloging, data preparation and cleaning, ETL job authoring, and data quality assessment.