Pablo Rodriguez

Processing Batch Data

Batch processing runs high-volume, repetitive data jobs on a schedule and is typically used for daily or weekly reporting. Compute-intensive tasks are grouped into batches and processed during off-peak times.

For example, a banking system receives financial transactions throughout the day but collects and processes them all at the end of each day to produce reports for stakeholders and third-party systems.
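The end-of-day step above can be sketched in plain Python: collect the day's transactions, then aggregate them in a single batch pass. Account IDs and amounts are illustrative, not from any real system.

```python
from collections import defaultdict

# Hypothetical transactions collected throughout the day:
# (account_id, amount) pairs.
transactions = [
    ("acct-1", 120.00),
    ("acct-2", -40.50),
    ("acct-1", 15.25),
]

def end_of_day_report(txns):
    """Aggregate the day's transactions per account in one batch pass."""
    totals = defaultdict(float)
    for account, amount in txns:
        totals[account] += amount
    return dict(totals)

print(end_of_day_report(transactions))
# {'acct-1': 135.25, 'acct-2': -40.5}
```

The point is the shape of the workload: nothing is processed as it arrives; everything is deferred to one scheduled pass.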

Benefits of batch processing:

  • Efficiency: Makes repetitive tasks more efficient to run
  • Cost-Effectiveness: Uses processing power more cost-effectively
  • Resource Optimization: Processes compute-intensive tasks during off-peak times when resources are more available

When to use batch processing:

  • Reporting Purposes: Generate scheduled reports and analytics
  • Large Datasets: Handle datasets too large for real-time processing
  • Aggregation Focus: When the use case centers on aggregating or transforming data rather than real-time analysis

Analyzing large amounts of data is common in research, where real-time results are rarely needed. Applications include:

  • Computational chemistry
  • Clinical modeling
  • Molecular dynamics
  • Genomic sequencing testing and analysis

AWS Glue is a data integration service that automates ETL (extract, transform, load) tasks when ingesting data into batch or streaming pipelines.

  • Data Integration: Reads and writes data across multiple systems and databases
  • Service Integration: Works with Amazon S3, DynamoDB, Amazon Redshift, Amazon RDS, and Amazon DocumentDB
  • Pipeline Support: Simplifies both batch and streaming ingestion

AWS Glue Data Catalog

  • Function: Data discovery and organization
  • Storage: Stores metadata about datasets, including schema information
  • Scope: Table definitions and physical locations, business-relevant attributes, change tracking
  • Note: Does not store the actual datasets, only their metadata

AWS Glue ETL Jobs and AWS Glue Studio

  • Function: Data transformation jobs
  • Interface: AWS Glue Studio provides a visual interface for authoring, running, and monitoring jobs
  • No-Code Option: Create and edit jobs without writing code
  • Streaming Support: AWS Glue streaming ETL speeds up stream data availability

AWS Glue DataBrew

  • Function: No-code service to prepare and clean data
  • Interface: Visual, interactive, point-and-click interface
  • Transformations: More than 250 prebuilt transformations for data preparation
  • Capabilities: Filter anomalies, convert data to standard formats, correct invalid values

AWS Glue Data Quality

  • Function: Data quality assessment
  • Features: Computes statistics, recommends quality rules, monitors data, and sends alerts
  • Benefit: Identify missing, stale, or bad data before it affects the business
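AWS Glue Data Quality rules are written in its Data Quality Definition Language (DQDL). A minimal ruleset might look like the following sketch (the column names are illustrative, not from the source):

```
Rules = [
    IsComplete "player_id",
    ColumnValues "score" >= 0,
    Uniqueness "event_id" > 0.99
]
```

Rules like these are what the service evaluates to flag missing, stale, or bad data.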
A typical AWS Glue workflow:

  1. Crawlers: Run on data stores, derive the schema, and populate the Data Catalog
  2. Schema Creation: Structured and semistructured data receive a schema for efficient access
  3. Data Quality: Automatically compute statistics and provide quality alerts
  4. Script Generation: Use the visual AWS Glue Studio interface to author jobs
  5. Notebook Support: Author ETL jobs interactively with Jupyter-based notebooks
  6. DataBrew: Prepare data visually without coding
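The schema-derivation idea in step 1 can be illustrated with a toy sketch: sample some records and record the type seen for each field. This is only a stand-in for what a crawler does (the real crawler handles many formats, partitions, and type coercions); the field names are hypothetical.

```python
def infer_schema(records):
    """Map each field name to the type name seen in the sampled records,
    a minimal imitation of crawler-style schema derivation."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

sample = [
    {"player_id": "p-1", "score": 150, "ts": "2024-01-01T00:00:05"},
    {"player_id": "p-2", "score": 90,  "ts": "2024-01-01T00:00:09"},
]
print(infer_schema(sample))
# {'player_id': 'str', 'score': 'int', 'ts': 'str'}
```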

Example: A gaming company produces gigabytes of user play data daily:

  1. Data Collection: The game server pushes data to an S3 bucket every 6 hours
  2. Schema Discovery: AWS Glue crawlers run on the player logs and populate the Data Catalog
  3. ETL Processing: Every 6 hours, an AWS Glue job aggregates the log data per player into 1-minute intervals
  4. Data Access: Transformed data is available in an aggregated S3 bucket for multiple analytics applications
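The per-player, 1-minute aggregation in step 3 can be sketched in plain Python. The real job would run as a Glue ETL script over S3 data; the event tuples and point values here are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical player log events: (player_id, ISO timestamp, points).
events = [
    ("p-1", "2024-05-01T10:00:12", 5),
    ("p-1", "2024-05-01T10:00:47", 3),
    ("p-1", "2024-05-01T10:01:02", 7),
    ("p-2", "2024-05-01T10:00:30", 2),
]

def aggregate_per_minute(evts):
    """Sum points per (player, 1-minute window), truncating each
    timestamp to the start of its minute."""
    buckets = defaultdict(int)
    for player, ts, points in evts:
        minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        buckets[(player, minute.isoformat())] += points
    return dict(buckets)

print(aggregate_per_minute(events))
```

Because the job only runs every 6 hours, it can afford to scan and re-bucket the full window in one pass, which is exactly the batch trade-off described above.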

Common transformation: Convert .csv to Apache Parquet format

  • CSV Limitations: The most common tabular format, but inefficient for large amounts of data (for example, more than 15 GB)
  • Parquet Benefits:
    • Stores data in columnar fashion
    • Optimized for storage
    • Suitable for parallel processing
    • Speeds up analytics workloads and saves storage costs over time
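The "columnar" idea behind Parquet can be shown with a small stdlib-only sketch that pivots row-oriented CSV text into column-oriented lists. This only illustrates the layout; real Parquet additionally applies typed encoding, compression, and file metadata, and you would normally use a library such as pyarrow or a Glue job for the actual conversion.

```python
import csv
import io

# Row-oriented CSV text (illustrative data).
csv_text = "player,score\np-1,150\np-2,90\n"

def to_columnar(text):
    """Pivot row-oriented CSV into a column-oriented dict of lists,
    the storage layout idea behind Parquet."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {field: [row[field] for row in rows] for field in rows[0]}

print(to_columnar(csv_text))
# {'player': ['p-1', 'p-2'], 'score': ['150', '90']}
```

Storing each column contiguously is what lets analytics engines read only the columns a query needs and compress each column effectively.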

PII Detection and Removal: AWS Glue and DataBrew support detecting and removing personally identifiable information

  • Process: Scan data → Detect PII entities (passport numbers, SSN) → Remediate data
  • Options: Mask data or store detection results for further inspection
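The scan-detect-mask flow above can be imitated with a toy regex-based masker for SSN-like values. This is a stand-in for illustration only, not the managed PII detection in AWS Glue or DataBrew, which recognizes many entity types without hand-written patterns.

```python
import re

# Matches SSN-shaped strings such as 123-45-6789 (illustrative pattern).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_pii(text):
    """Replace detected SSN-like values with a fixed mask,
    mirroring the 'mask data' remediation option."""
    return SSN_PATTERN.sub("***-**-****", text)

print(mask_pii("Customer SSN: 123-45-6789"))
# Customer SSN: ***-**-****
```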
Choose AWS Glue when you:

  • Have an analytics use case that doesn't require real-time aggregation or transformation
  • Need schema identification and data cataloging capabilities
  • Require data preparation and cleaning functionality
  • Want ETL job authoring with a visual interface
  • Need data quality assessment and monitoring

Batch processing makes high-volume, repetitive tasks more efficient. AWS Glue provides comprehensive functionality including schema identification, data cataloging, preparation, cleaning, ETL authoring, and quality assessment for batch data processing workflows.