Pablo Rodriguez

Data Characteristics

Data characteristics drive infrastructure decisions and include five key dimensions:

Value

  • Definition: How processed data can provide insight into a business problem
  • Key Questions: What insights can be gained from the data?
  • Focus: Ensuring maximum value from collected data and business value in outputs
  • Value depends on veracity: without good data, you could make bad business decisions

Veracity

  • Definition: How to protect and strengthen the integrity of data
  • Key Questions: How accurate, precise, and trusted is the data?
  • Importance: Foundation of analysis for decision-making
  • Consider data sources and how integrity can be protected through the pipeline

Volume

  • Definition: How much data you need to process
  • Key Questions:
    • How long do you need to keep the data?
    • What are the access patterns?
  • Impact: Affects infrastructure for getting data into the pipeline, processing, and storage
  • Storage choices depend on access and processing frequency

Velocity

  • Definition: How quickly data enters and moves through your pipeline
  • Key Questions:
    • How frequently is data generated?
    • How quickly does the data need to be acted on?
  • Combined Impact: Volume and velocity together drive expected throughput and scaling requirements

Variety

  • Definition: How many data sources and data types you work with
  • Key Questions:
    • What is the format and type of the data?
    • What sources does the data come from?
  • Impact: Different data types require specific processing and analysis approaches
  • Combining datasets can enrich analysis but complicate processing
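
As a rough illustration of the last point about variety, the sketch below combines two sources that arrive in different formats. The sample data, field names, and the helper functions `load_orders`, `load_customers`, and `enrich` are all hypothetical. Each format needs its own parsing and type conversion before the datasets can be joined, which is where the extra processing complexity comes from.

```python
import csv
import io
import json

# Hypothetical sample inputs: related records arriving in two formats.
ORDERS_CSV = "order_id,customer_id,amount\n1,c1,19.99\n2,c2,5.00\n"
CUSTOMERS_JSON = '[{"id": "c1", "region": "EU"}, {"id": "c2", "region": "US"}]'


def load_orders(text):
    """Parse CSV rows and convert the amount column from text to float."""
    rows = csv.DictReader(io.StringIO(text))
    return [{**row, "amount": float(row["amount"])} for row in rows]


def load_customers(text):
    """Parse a JSON array and index customers by their id."""
    return {c["id"]: c for c in json.loads(text)}


def enrich(orders, customers):
    """Attach the customer's region to each order (None if the customer is unknown)."""
    return [
        {**o, "region": customers.get(o["customer_id"], {}).get("region")}
        for o in orders
    ]


enriched = enrich(load_orders(ORDERS_CSV), load_customers(CUSTOMERS_JSON))
```

The enriched orders now carry a region that neither source could provide alone, at the cost of one parser and one normalization step per format.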

The five characteristics work together in a circular relationship, and each one affects infrastructure design decisions:

  • End User Focus: Have the end user in mind as you design infrastructure
  • Data Duration: Understand retention requirements and access frequency to balance cost and benefits
  • Throughput Requirements: Volume and velocity together determine scaling needs
  • Processing Complexity: High volume with high velocity requires a different architecture than high volume with low velocity
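
The throughput and data duration points above can be made concrete with some back-of-the-envelope arithmetic. This is a minimal sketch: the workload numbers are invented for illustration, and the function names are hypothetical. It assumes a steady event rate and that every event is kept for the full retention period.

```python
SECONDS_PER_DAY = 86_400


def throughput_bytes_per_second(events_per_second, avg_event_bytes):
    """Velocity (events/s) times average event size gives sustained throughput."""
    return events_per_second * avg_event_bytes


def retained_bytes(bytes_per_second, retention_days):
    """Total volume held if every event is kept for the full retention period."""
    return bytes_per_second * SECONDS_PER_DAY * retention_days


# Hypothetical workload: 2,000 events per second, 1 KiB each, kept for 30 days.
rate = throughput_bytes_per_second(2_000, 1_024)
stored = retained_bytes(rate, 30)

print(f"{rate / 1024**2:.2f} MiB/s sustained")
print(f"{stored / 1024**4:.2f} TiB retained")
```

The first figure informs how ingestion and processing need to scale. The second informs storage cost, which is where retention and access frequency come into the trade-off.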

Three-Pronged Strategy for Data Infrastructure

Modernize

  • Move to cloud-based infrastructure and purpose-built services
  • Reduce undifferentiated heavy lifting
  • Increase agility and reduce operational effort

Unify

  • Create a single source of truth for data
  • Make data available across the organization
  • Combine the best elements of data lakes and purpose-built data stores

Innovate

  • Apply artificial intelligence and machine learning (AI/ML)
  • Find new insights in data
  • Reimagine old processes and create new experiences

A modern data architecture provides a centralized location to access data and run analytics and AI/ML applications by integrating:

  • Data Lake: Centralized repository for structured and unstructured data
  • Data Warehouse: Optimized storage for structured analytics
  • Purpose-Built Data Stores: Specialized databases for specific use cases
  • Unified Governance: Consistent access, permissions, and authorization
  • Seamless Data Movement: Efficient data transfer between stores

The goal is to store data centrally while enabling unified governance and seamless movement, removing the restrictions imposed by separate data silos. This approach gives the organization access to all of its data, which supports better and more agile decision-making.
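
One way to picture unified governance is a single authorization check that applies regardless of which store holds a dataset. The sketch below is a toy model, not how any particular cloud service implements it. The `POLICY` and `CATALOG` tables, the role names, and the dataset names are all hypothetical.

```python
# Hypothetical central policy: role -> datasets that role may read.
POLICY = {
    "analyst": {"sales_lake", "sales_warehouse"},
    "ml_engineer": {"sales_lake", "feature_store"},
}

# Hypothetical catalog: dataset -> kind of store that holds it.
CATALOG = {
    "sales_lake": "data lake",
    "sales_warehouse": "data warehouse",
    "feature_store": "purpose-built store",
}


def can_read(role, dataset):
    """One authorization check, regardless of which store holds the dataset."""
    return dataset in CATALOG and dataset in POLICY.get(role, set())


def readable_datasets(role):
    """List everything a role can reach across all stores, sorted by name."""
    return sorted(d for d in CATALOG if can_read(role, d))
```

Because permissions live in one place, adding a new store means adding catalog entries rather than a new, separate permission system.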

Data characteristics, including value, veracity, volume, velocity, and variety, must be considered together when making infrastructure design decisions. The modern data architecture integrates multiple storage types while maintaining centralized access and governance.