Processing Batch Data
Batch Processing Concepts
Definition and Characteristics
Batch processing involves periodically completing high-volume, repetitive data jobs, and is typically used for daily and weekly reporting. Compute-intensive tasks are processed in batches, often during off-peak times.
Banking Example
A banking system receives financial transactions throughout the day but collects and processes all transactions at the end of each day to provide reports to stakeholders and third-party systems.
Benefits
- Efficiency: Makes repetitive tasks more efficient to run
- Cost-Effectiveness: Uses processing power more cost-effectively
- Resource Optimization: Processes compute-intensive tasks during off-peak times when resources are more available
When to Consider Batch Processing
Primary Use Cases
- Reporting Purposes: Generate scheduled reports and analytics
- Large Datasets: Handle datasets too large for real-time processing
- Aggregation Focus: When the use case focuses on aggregating or transforming data rather than real-time analysis
Medical Research Example
Analyzing large amounts of data is common in research, where real-time analysis is usually not needed. Applications include:
- Computational chemistry
- Clinical modeling
- Molecular dynamics
- Genomic sequencing testing and analysis
AWS Glue Overview
Core Function
AWS Glue is a data integration service that automates and performs ETL tasks as part of ingesting data into batch or streaming pipelines.
Multi-Purpose Capabilities
- Data Integration: Read and write data from multiple systems and databases
- Service Integration: Works with Amazon S3, Amazon DynamoDB, Amazon Redshift, Amazon RDS, and Amazon DocumentDB
- Pipeline Support: Simplifies both batch and streaming ingestion
AWS Glue Components
AWS Glue Data Catalog
- Function: Data discovery and organization
- Storage: Stores metadata about datasets including schema information
- Scope: Table definitions and physical locations, business-relevant attributes, change tracking
- Note: Does not store actual datasets, only metadata
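Because the catalog holds only metadata, a table's schema and physical location can be looked up directly. A minimal sketch using boto3, assuming a hypothetical database player_logs_db and table daily_transactions have already been registered in the Data Catalog:

```python
import boto3

# The Data Catalog stores metadata only; this looks up a table's S3 location
# and column schema. Database and table names are hypothetical.
glue = boto3.client("glue", region_name="us-east-1")

response = glue.get_table(DatabaseName="player_logs_db", Name="daily_transactions")
table = response["Table"]

print("Location:", table["StorageDescriptor"]["Location"])
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```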
ETL Jobs
- Function: Data transformation jobs
- Interface: AWS Glue Studio provides a visual interface for authoring, running, and monitoring jobs
- No-Code Option: Create and edit jobs without coding
- Streaming Support: AWS Glue streaming ETL jobs process streaming data continuously, making it available for analysis sooner
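Jobs authored in AWS Glue Studio can also be started and monitored programmatically. A minimal sketch with boto3, assuming a hypothetical job named aggregate-player-logs already exists:

```python
import boto3

glue = boto3.client("glue")

# Start a job that was authored visually in AWS Glue Studio (job name is hypothetical).
run = glue.start_job_run(JobName="aggregate-player-logs")

# Check the run's state (for example RUNNING, SUCCEEDED, or FAILED).
status = glue.get_job_run(JobName="aggregate-player-logs", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])
```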
AWS Glue DataBrew
- Function: No-code service to prepare and clean data
- Interface: Visual, interactive, point-and-click interface
- Transformations: 250+ prebuilt transformations for data preparation
- Capabilities: Filter anomalies, convert to standard formats, correct invalid values
AWS Glue Data Quality
- Function: Data quality assessment
- Features: Compute statistics, recommend quality rules, monitor data, send alerts
- Benefit: Identify missing, stale, or bad data before business impact
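A minimal sketch of registering a small data quality ruleset with boto3; the rule expressions, thresholds, and the database and table names are illustrative assumptions, not taken from a real workload:

```python
import boto3

glue = boto3.client("glue")

# A small ruleset with completeness and shape checks on a catalog table.
# Names, columns, and thresholds below are hypothetical.
ruleset = """
Rules = [
    IsComplete "player_id",
    Completeness "score" > 0.95,
    ColumnCount > 5
]
"""

glue.create_data_quality_ruleset(
    Name="player-logs-quality",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "player_logs_db", "TableName": "raw_player_logs"},
)
```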
AWS Glue Workflow
Data Integration Process
- Crawlers: Run on data stores, derive the schema, and populate the Data Catalog (a crawler sketch follows this list)
- Schema Creation: Structured and semistructured data receive a schema for efficient access
- Data Quality: Automatically compute statistics and provide quality alerts
- Script Generation: Use visual AWS Glue Studio interface to author jobs
- Notebook Support: Interactive ETL job authoring with Jupyter-based notebooks
- DataBrew: Visual data preparation without coding
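A minimal sketch of the crawler step with boto3; the crawler name, IAM role, database, S3 path, and schedule are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix, infers the schema, and writes
# table definitions into the Data Catalog. All names and paths are hypothetical.
glue.create_crawler(
    Name="player-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="player_logs_db",
    Targets={"S3Targets": [{"Path": "s3://example-game-logs/raw/"}]},
    Schedule="cron(0 */6 * * ? *)",  # every 6 hours, matching the batch cadence
)

glue.start_crawler(Name="player-logs-crawler")
```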
Gaming Use Case Example
A gaming company produces gigabytes of daily user play data (a job sketch for this pipeline follows the list):
- Data Collection: Game server pushes data to S3 bucket every 6 hours
- Schema Discovery: AWS Glue crawlers run on player logs, provide data catalog
- ETL Processing: AWS Glue job aggregates log data per player into 1-minute intervals every 6 hours
- Data Access: Transformed data available in aggregated S3 bucket for multiple analytics applications
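A sketch of what the aggregation job's script might look like, assuming the crawler has already cataloged the raw logs and that each record carries hypothetical player_id, event_time, and score columns; the database, table, and bucket names are also assumptions:

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw player logs through the Data Catalog table built by the crawler.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="player_logs_db", table_name="raw_player_logs"
)

# Aggregate events per player into 1-minute windows.
df = raw.toDF()
aggregated = df.groupBy(
    "player_id", F.window(F.to_timestamp("event_time"), "1 minute")
).agg(F.count("*").alias("events"), F.sum("score").alias("score"))

# Write the result to the aggregated S3 bucket as Parquet for analytics applications.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(aggregated, glue_context, "aggregated"),
    connection_type="s3",
    connection_options={"path": "s3://example-game-logs/aggregated/"},
    format="parquet",
)

job.commit()
```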
Data Transformation Capabilities
Format Conversion
A common transformation is converting .csv files to the Apache Parquet format (a conversion sketch follows the list below).
- CSV Limitations: The most common tabular format, but inefficient for large amounts of data (>15 GB)
- Parquet Benefits:
  - Stores data in columnar fashion
  - Optimized for storage
  - Suitable for parallel processing
  - Speeds up analytics workloads and saves storage costs over time
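The conversion itself can be illustrated locally with pandas (file and column names are hypothetical, and pyarrow is assumed to be installed); in a Glue job the equivalent step is writing the output DynamicFrame with the Parquet format, as in the gaming sketch above.

```python
import pandas as pd

# Read a CSV file and rewrite it as columnar Parquet.
# File and column names are hypothetical.
df = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])
df.to_parquet("transactions.parquet", index=False)

# Parquet readers can then load only the columns a query needs.
amounts = pd.read_parquet("transactions.parquet", columns=["account_id", "amount"])
print(amounts.head())
```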
Advanced Transformations
PII Detection and Removal: AWS Glue and DataBrew support detecting and removing personally identifiable information
- Process: Scan data → Detect PII entities (passport numbers, SSN) → Remediate data
- Options: Mask data or store detection results for further inspection
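AWS Glue and DataBrew provide this as built-in transforms; purely as a conceptual illustration of the scan, detect, and remediate flow, the sketch below masks SSN-shaped values with plain pandas and a regular expression (the column names and sample data are made up and this is not the Glue API):

```python
import re

import pandas as pd

# Conceptual illustration of scan -> detect -> mask; not the built-in
# AWS Glue / DataBrew PII transforms.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_ssn(value: str) -> str:
    """Replace SSN-shaped substrings with a fixed mask."""
    return SSN_PATTERN.sub("***-**-****", value)

df = pd.DataFrame({"note": ["Customer SSN 123-45-6789 on file", "No PII here"]})

# Record where PII was detected, then remediate by masking in place.
df["pii_detected"] = df["note"].str.contains(SSN_PATTERN)
df["note"] = df["note"].apply(mask_ssn)
print(df)
```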
Best Practice Guidelines
Use AWS Glue When
- Analytics use case doesn’t require real-time aggregation or transformation
- Need schema identification and data cataloging capabilities
- Require data preparation and cleaning functionality
- Want ETL job authoring with visual interface
- Need data quality assessment and monitoring
Batch processing makes high-volume, repetitive tasks more efficient to run. For batch data processing workflows, AWS Glue provides schema identification, data cataloging, data preparation and cleaning, ETL job authoring, and data quality assessment.