Data Characteristics
Five Vs of Data
Data characteristics drive infrastructure decisions and include five key dimensions:
Value
- Definition: How processed data can provide insight into a business problem
- Key Questions: What insights can be gained from the data?
- Focus: Ensuring maximum value from collected data and business value in outputs
- Value depends on veracity: without trustworthy data, you could make bad business decisions
Veracity
- Definition: How to protect and strengthen the integrity of data
- Key Questions: How accurate, precise, and trusted is the data?
- Importance: Foundation of analysis for decision-making
- Consider data sources and how integrity can be protected through the pipeline
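One way to protect integrity as data moves through a pipeline is to record a checksum at the source and recompute it at later stages. The following is a minimal sketch, assuming records travel as bytes; the function names `fingerprint` and `verify` and the sample record are illustrative, not part of any particular service.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the payload's exact contents."""
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, expected_digest: str) -> bool:
    """Check that the payload still matches the digest recorded at the source."""
    return fingerprint(payload) == expected_digest


# Record the digest when the data is first ingested (sample record is made up).
original = b"order_id=1001,amount=25.00"
digest_at_source = fingerprint(original)

# Later stages recompute the digest to detect any change to the payload.
print(verify(original, digest_at_source))                        # True
print(verify(b"order_id=1001,amount=250.00", digest_at_source))  # False
```

A checksum only detects changes in transit or at rest; it says nothing about whether the source data was accurate to begin with, so it complements source validation rather than replacing it.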
Volume
- Definition: How much data you need to process
- Key Questions:
- How long do you need to keep the data?
- What are the access patterns?
- Impact: Affects infrastructure for getting data into pipeline, processing, and storage
- Storage choices depend on access and processing frequency
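The link between access patterns, retention, and storage choice can be sketched as a simple rule. The tier names and thresholds below are hypothetical placeholders, not values from any provider; in practice they would come from real pricing and latency requirements.

```python
def choose_storage_tier(reads_per_month: float, retention_days: int) -> str:
    """Pick an illustrative storage tier from access frequency and retention.

    All thresholds are assumptions chosen for the example.
    """
    if reads_per_month >= 30:   # read roughly daily or more often
        return "hot"
    if reads_per_month >= 1:    # read occasionally
        return "warm"
    # Rarely read: data kept longer than a year goes to archive, the rest to cold.
    return "archive" if retention_days > 365 else "cold"


print(choose_storage_tier(reads_per_month=100, retention_days=30))    # hot
print(choose_storage_tier(reads_per_month=0.1, retention_days=2555))  # archive
```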
Velocity
- Definition: How quickly data enters and moves through your pipeline
- Key Questions:
- How frequently is data generated?
- How quickly does the data need to be acted on?
- Combined Impact: Volume and velocity together drive expected throughput and scaling requirements
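As a rough illustration of how volume and velocity combine, expected throughput is the event rate multiplied by the average event size. The sketch below assumes made-up numbers, a hypothetical per-worker capacity, and a 30% headroom factor; none of these come from the source.

```python
import math


def required_throughput_mb_s(events_per_second: float, avg_event_bytes: float) -> float:
    """Throughput in MB/s = event rate (velocity) x average event size (volume)."""
    return events_per_second * avg_event_bytes / 1_000_000


def workers_needed(throughput_mb_s: float, per_worker_mb_s: float, headroom: float = 0.3) -> int:
    """Number of workers needed to carry the load plus a safety margin."""
    return math.ceil(throughput_mb_s * (1 + headroom) / per_worker_mb_s)


throughput = required_throughput_mb_s(events_per_second=5_000, avg_event_bytes=2_000)
print(throughput)                       # 10.0
print(workers_needed(throughput, 4.0))  # 4
```

Doubling either the event rate or the event size doubles the required throughput, which is why the two characteristics have to be estimated together.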
Variety
- Definition: How many data sources and data types you work with
- Key Questions:
- What is the format and type of the data?
- What sources does the data come from?
- Impact: Different data types require specific processing and analysis approaches
- Combining datasets can enrich analysis but complicate processing
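The trade-off in combining datasets can be seen in a small sketch: each format needs its own parsing and type handling before the sources can be joined. The sample data, field names, and the `load_orders` and `enrich` helpers are all hypothetical.

```python
import csv
import io
import json

# Two illustrative sources in different formats, inlined to keep the sketch self-contained.
orders_csv = "customer_id,amount\n1,25.00\n2,40.50\n"
customers_json = '[{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}]'


def load_orders(text: str) -> list[dict]:
    """Parse CSV rows and convert string fields to typed values."""
    return [
        {"customer_id": int(row["customer_id"]), "amount": float(row["amount"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def enrich(orders: list[dict], customers: list[dict]) -> list[dict]:
    """Join each order with its customer's region; unknown customers get None."""
    region_by_id = {c["id"]: c["region"] for c in customers}
    return [{**o, "region": region_by_id.get(o["customer_id"])} for o in orders]


enriched = enrich(load_orders(orders_csv), json.loads(customers_json))
print(enriched)
```

The enrichment step adds a region to every order, but it also introduces decisions the single-source pipeline never faced, such as how to treat orders whose customer is missing.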
Decision-Making Considerations
The five characteristics work together in a circular relationship, and each one affects infrastructure design decisions:
- End User Focus: Keep the end user in mind as you design infrastructure
- Data Duration: Understand retention requirements and access frequency to balance cost and benefits
- Throughput Requirements: Volume and velocity together determine scaling needs
- Processing Complexity: High volume with high velocity requires different architecture than high volume with low velocity
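To make the last point concrete, the sketch below maps a volume and latency requirement to a broad processing style. The function name, the category labels, and every threshold are assumptions chosen for illustration, not prescribed values.

```python
def suggest_processing_style(gb_per_day: float, max_latency_seconds: float) -> str:
    """Suggest a processing style from daily volume and how fast data must be acted on.

    Thresholds (100 GB/day, 5 s, 15 min) are illustrative assumptions.
    """
    high_volume = gb_per_day >= 100
    if max_latency_seconds <= 5:
        # High velocity: data must be acted on almost immediately.
        return "streaming with horizontal scaling" if high_volume else "streaming"
    if max_latency_seconds <= 900:
        return "micro-batch"
    # Low velocity: results can wait, so scheduled jobs are usually enough.
    return "distributed batch" if high_volume else "scheduled batch"


print(suggest_processing_style(gb_per_day=500, max_latency_seconds=1))       # streaming with horizontal scaling
print(suggest_processing_style(gb_per_day=500, max_latency_seconds=86_400))  # distributed batch
```

Both calls use the same daily volume, yet the required latency alone moves the suggestion from a streaming design to a batch design.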
Three-Pronged Strategy for Data Infrastructure
Modernize
- Move to cloud-based infrastructure and purpose-built services
- Reduce undifferentiated heavy lifting
- Increase agility and reduce operational effort
Unify
- Create a single source of truth for data
- Make data available across the organization
- Combine best elements of data lakes and purpose-built data stores
Innovate
- Apply artificial intelligence and machine learning (AI/ML)
- Find new insights in data
- Reimagine old processes and create new experiences
Modern Data Architecture Solution
A modern data architecture provides a centralized location to access data and run analytics and AI/ML applications by integrating:
- Data Lake: Centralized repository for structured and unstructured data
- Data Warehouse: Optimized storage for structured analytics
- Purpose-Built Data Stores: Specialized databases for specific use cases
- Unified Governance: Consistent access, permissions, and authorization
- Seamless Data Movement: Efficient data transfer between stores
The goal is to store data centrally while enabling unified governance and seamless movement, removing the restrictions imposed by separate data silos. This approach gives the organization access to all of its data, which supports faster and better-informed decisions.
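The idea of unified governance across several stores can be sketched as one catalog that records where each dataset lives and answers every access check. The `GovernedCatalog` class, the dataset names, and the role names below are hypothetical; a real deployment would rely on a managed catalog and permission service rather than in-memory dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class GovernedCatalog:
    """One place to register datasets and check access, whichever store holds them."""
    locations: dict[str, str] = field(default_factory=dict)    # dataset -> store
    grants: dict[str, set[str]] = field(default_factory=dict)  # role -> readable datasets

    def register(self, dataset: str, store: str) -> None:
        self.locations[dataset] = store

    def grant(self, role: str, dataset: str) -> None:
        self.grants.setdefault(role, set()).add(dataset)

    def resolve(self, role: str, dataset: str) -> str:
        """Return the store holding the dataset, or raise if the role lacks access."""
        if dataset not in self.grants.get(role, set()):
            raise PermissionError(f"{role} may not read {dataset}")
        return self.locations[dataset]


catalog = GovernedCatalog()
catalog.register("clickstream_raw", "data_lake")
catalog.register("sales_summary", "data_warehouse")
catalog.grant("analyst", "sales_summary")

print(catalog.resolve("analyst", "sales_summary"))  # data_warehouse
```

Because every lookup goes through `resolve`, the same permission rule applies whether the data sits in the lake, the warehouse, or a purpose-built store.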
Data characteristics including value, veracity, volume, velocity, and variety must be considered together when making infrastructure design decisions. The modern data architecture integrates multiple storage types while maintaining centralized access and governance.