
Data Transformation

Transform and restructure datasets within the sandbox environment with comprehensive operations including filtering, aggregation, pivoting, merging, data cleaning, and format conversion for analysis preparation.

🔄 Data Processing Pipeline

Data transformation provides powerful ETL (Extract, Transform, Load) capabilities with data cleaning, restructuring, aggregation, and format conversion to prepare datasets for analysis and reporting.

Overview

The Data Transformation tool processes and restructures datasets within the sandbox environment. Operations can be chained into a single request — filter rows, aggregate groups, pivot between wide and long layouts, merge datasets, clean data, and convert output formats — so raw files can be prepared for analysis without leaving the sandbox.

Key Features

  • Data Filtering and Selection - Filter rows and select columns based on criteria
  • Aggregation Operations - Group by operations with statistical aggregations
  • Data Pivoting - Reshape data between wide and long formats
  • Merging and Joining - Combine multiple datasets with various join types
  • Data Cleaning - Handle missing values, duplicates, and data quality issues

Methods

dataTransformation

Transform and restructure datasets comprehensively.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| inputPath | String | Yes | Path to input dataset file |
| outputPath | String | Yes | Path for transformed output file |
| operations | Array | Yes | List of transformation operations to perform |
| outputFormat | String | No | Output format: 'csv', 'xlsx', 'json' (default: 'csv') |
| preserveOriginal | Boolean | No | Keep original file unchanged (default: true) |
| validationRules | Array | No | Data validation rules to apply |
| chunkSize | Number | No | Process large files in chunks (default: 10000) |
Example request:

{
  "inputPath": "/sandbox/data/raw_sales.csv",
  "outputPath": "/sandbox/data/processed_sales.csv",
  "operations": [
    {
      "type": "filter",
      "column": "date",
      "operator": ">=",
      "value": "2023-01-01"
    },
    {
      "type": "aggregate",
      "groupBy": ["region", "product"],
      "aggregations": {
        "sales": "sum",
        "quantity": "mean"
      }
    }
  ],
  "outputFormat": "csv"
}

Output:

  • success (Boolean) - Transformation operation success status
  • inputPath (String) - Path to original dataset
  • outputPath (String) - Path to transformed dataset
  • operationsApplied (Array) - List of successfully applied operations
  • recordsProcessed (Number) - Number of input records processed
  • recordsOutput (Number) - Number of output records generated
  • transformationTime (Number) - Processing duration in milliseconds
  • dataQualityReport (Object) - Data quality assessment results
  • validationResults (Object) - Validation rule check results
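Before submitting a request, it can help to check the payload against the parameter table above. The sketch below is an assumption-level pre-flight validator — the required/optional split mirrors the table, but `validate_params` is a hypothetical helper, not part of the platform's API.

```python
# Hypothetical pre-flight check for a dataTransformation payload, mirroring
# the parameter table above. The dataTransformation call itself belongs to
# the platform and is not defined here.
REQUIRED = {"inputPath": str, "outputPath": str, "operations": list}
OPTIONAL = {"outputFormat": str, "preserveOriginal": bool,
            "validationRules": list, "chunkSize": int}

def validate_params(params):
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    for name, expected in REQUIRED.items():
        if name not in params:
            problems.append(f"missing required parameter: {name}")
        elif not isinstance(params[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name, expected in OPTIONAL.items():
        if name in params and not isinstance(params[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for extra in set(params) - set(REQUIRED) - set(OPTIONAL):
        problems.append(f"unknown parameter: {extra}")
    return problems
```

Running the validator on a payload and refusing to submit unless it returns an empty list catches schema mistakes before they surface as transformation errors.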

Data Filtering and Selection

Row Filtering Operations
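A filter operation follows the shape shown in the request example above: `type`, `column`, `operator`, and `value`. The snippet below is only a local illustration of those semantics on in-memory rows — it is not the tool's implementation, and beyond the `>=` seen in the example, the exact operator set the tool accepts is an assumption.

```python
import operator

# Comparison operators assumed for illustration; only ">=" appears in the
# documented request example.
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "==": operator.eq, "!=": operator.ne}

def apply_filter(rows, op):
    """Illustrative local evaluation of a filter operation on a list of dicts."""
    compare = OPS[op["operator"]]
    return [r for r in rows if compare(r[op["column"]], op["value"])]

rows = [
    {"date": "2022-12-30", "sales": 120},
    {"date": "2023-01-05", "sales": 340},
    {"date": "2023-02-11", "sales": 210},
]
recent = apply_filter(rows, {"type": "filter", "column": "date",
                             "operator": ">=", "value": "2023-01-01"})
# ISO-formatted dates compare correctly as strings, so only 2023 rows survive.
```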

Data Aggregation and Grouping

Group By and Statistical Aggregations
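The aggregate operation in the request example carries `groupBy` columns and an `aggregations` map of column-to-statistic. The sketch below reproduces that behavior locally for `sum` and `mean` to show what the operation computes; it is an illustration of the semantics, not the tool's code.

```python
from collections import defaultdict

def apply_aggregate(rows, op):
    """Illustrative local aggregate: group rows by the groupBy columns, then
    reduce each aggregation column with sum or mean."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[c] for c in op["groupBy"])].append(r)
    out = []
    for key, members in groups.items():
        row = dict(zip(op["groupBy"], key))
        for col, how in op["aggregations"].items():
            values = [m[col] for m in members]
            row[col] = sum(values) if how == "sum" else sum(values) / len(values)
        out.append(row)
    return out

sales = [
    {"region": "EU", "sales": 100, "quantity": 2},
    {"region": "EU", "sales": 300, "quantity": 4},
    {"region": "US", "sales": 50,  "quantity": 1},
]
summary = apply_aggregate(sales, {"type": "aggregate", "groupBy": ["region"],
                                  "aggregations": {"sales": "sum", "quantity": "mean"}})
```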

Data Merging and Joining

Dataset Combination Operations
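Joins combine two datasets on shared key columns, with the join type deciding what happens to unmatched rows. The exact shape of a merge operation is not shown in the request example above, so the snippet below only illustrates inner versus left join semantics on in-memory rows; a hypothetical operation might look like `{"type": "merge", "rightPath": ..., "on": ..., "how": ...}`.

```python
def apply_merge(left, right, on, how="inner"):
    """Illustrative inner/left join of two lists of dicts on a single key."""
    index = {}
    for r in right:
        index.setdefault(r[on], []).append(r)
    out = []
    for l in left:
        matches = index.get(l[on], [])
        if matches:
            out.extend({**l, **m} for m in matches)
        elif how == "left":
            out.append(dict(l))  # unmatched left rows survive a left join
    return out

orders  = [{"region": "EU", "sales": 400}, {"region": "APAC", "sales": 90}]
targets = [{"region": "EU", "target": 500}]
inner = apply_merge(orders, targets, on="region", how="inner")
left  = apply_merge(orders, targets, on="region", how="left")
# inner drops the unmatched APAC row; left keeps it without a target column.
```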

Data Cleaning and Quality

Missing Data Handling
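Typical cleaning steps are filling missing values with a default and removing duplicate rows. The helpers below are a local sketch of those two steps — the names `fill_missing` and `drop_duplicates` are hypothetical and do not correspond to documented operation types.

```python
def fill_missing(rows, column, default):
    """Replace None (or absent) values in one column with a default."""
    return [{**r, column: r.get(column) if r.get(column) is not None else default}
            for r in rows]

def drop_duplicates(rows, key_columns):
    """Keep only the first row seen for each key-column combination."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r.get(c) for c in key_columns)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

raw = [
    {"id": 1, "sales": None},
    {"id": 1, "sales": None},   # duplicate of the row above
    {"id": 2, "sales": 75},
]
cleaned = drop_duplicates(fill_missing(raw, "sales", 0), ["id"])
```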

Error Handling

Common Transformation Issues

| Error Type | Cause | Resolution |
| --- | --- | --- |
| Schema Mismatch | Column names/types don't match | Verify column names and data types |
| Memory Error | Dataset too large for processing | Use chunk processing or optimize operations |
| Join Key Missing | Join columns not found | Check column names and data availability |
| Data Type Error | Invalid data type conversions | Validate data types before transformation |
| Operation Conflict | Incompatible operation sequence | Reorder operations logically |

Robust Data Transformation

def robust_data_transformation(input_path, operations_list, output_path, fallback_strategies=None):
    """Perform data transformation with comprehensive error handling."""
    
    if not fallback_strategies:
        fallback_strategies = [
            {"chunk_size": 10000, "preserve_original": True},
            {"chunk_size": 5000, "preserve_original": True, "simple_operations": True},
            {"chunk_size": 1000, "preserve_original": True, "minimal_operations": True}
        ]
    
    for i, strategy in enumerate(fallback_strategies):
        try:
            print(f"🔄 Attempting transformation strategy {i+1}")
            
            # Adjust operations based on strategy
            adjusted_operations = operations_list
            if strategy.get("simple_operations"):
                # Simplify complex operations
                adjusted_operations = [op for op in operations_list if op.get("type") in ["filter", "select", "aggregate"]]
            elif strategy.get("minimal_operations"):
                # Use only basic operations
                adjusted_operations = [op for op in operations_list if op.get("type") in ["filter", "select"]]
            
            transformation_params = {
                "inputPath": input_path,
                "outputPath": output_path,
                "operations": adjusted_operations,
                "chunkSize": strategy.get("chunk_size", 10000),
                "preserveOriginal": strategy.get("preserve_original", True)
            }
            
            result = dataTransformation(transformation_params)
            
            if result['success']:
                print(f"✅ Transformation successful with strategy {i+1}")
                return {
                    "success": True,
                    "strategy_used": i+1,
                    "strategy_details": strategy,
                    "results": result
                }
            else:
                print(f"⚠️ Strategy {i+1} failed: {result.get('error')}")
        
        except Exception as e:
            print(f"💥 Strategy {i+1} exception: {str(e)}")
    
    return {
        "success": False,
        "error": "All transformation strategies failed",
        "strategies_attempted": len(fallback_strategies)
    }

# Usage with error handling
transformation_operations = [
    {"type": "filter", "column": "date", "operator": ">=", "value": "2023-01-01"},
    {"type": "aggregate", "groupBy": ["region"], "aggregations": {"sales": "sum"}},
    {"type": "calculate", "newColumn": "sales_per_customer", "expression": "sales / customer_count"}
]

robust_result = robust_data_transformation(
    "/sandbox/data/large_dataset.csv",
    transformation_operations,
    "/sandbox/data/transformed_output.csv"
)

Next Steps: Use with Statistical Analysis for analysis of transformed data, or XLSX Analysis for Excel-specific transformations.