Data Transformation

Transform and restructure datasets within the sandbox environment with operations including filtering, aggregation, pivoting, merging, data cleaning, and format conversion for analysis preparation.
🔄 Data Processing Pipeline
Data transformation provides powerful ETL (Extract, Transform, Load) capabilities with data cleaning, restructuring, aggregation, and format conversion to prepare datasets for analysis and reporting.
Overview
The Data Transformation tool processes and restructures datasets inside the sandbox environment. It chains filtering, aggregation, pivoting, merging, cleaning, and format-conversion operations into a single call, so datasets can be prepared for analysis without leaving the sandbox.
Key Features
- Data Filtering and Selection - Filter rows and select columns based on criteria
- Aggregation Operations - Group by operations with statistical aggregations
- Data Pivoting - Reshape data between wide and long formats
- Merging and Joining - Combine multiple datasets with various join types
- Data Cleaning - Handle missing values, duplicates, and data quality issues
Methods
dataTransformation
Applies an ordered list of transformation operations to an input dataset and writes the result to the specified output path.
| Parameter | Type | Required | Description |
|---|---|---|---|
| inputPath | String | Yes | Path to input dataset file |
| outputPath | String | Yes | Path for transformed output file |
| operations | Array | Yes | List of transformation operations to perform |
| outputFormat | String | No | Output format: 'csv', 'xlsx', 'json' (default: 'csv') |
| preserveOriginal | Boolean | No | Keep original file unchanged (default: true) |
| validationRules | Array | No | Data validation rules to apply |
| chunkSize | Number | No | Process large files in chunks (default: 10000) |
Example:

```json
{
  "inputPath": "/sandbox/data/raw_sales.csv",
  "outputPath": "/sandbox/data/processed_sales.csv",
  "operations": [
    {
      "type": "filter",
      "column": "date",
      "operator": ">=",
      "value": "2023-01-01"
    },
    {
      "type": "aggregate",
      "groupBy": ["region", "product"],
      "aggregations": {
        "sales": "sum",
        "quantity": "mean"
      }
    }
  ],
  "outputFormat": "csv"
}
```

Output:
- success (Boolean) - Transformation operation success status
- inputPath (String) - Path to original dataset
- outputPath (String) - Path to transformed dataset
- operationsApplied (Array) - List of successfully applied operations
- recordsProcessed (Number) - Number of input records processed
- recordsOutput (Number) - Number of output records generated
- transformationTime (Number) - Processing duration in milliseconds
- dataQualityReport (Object) - Data quality assessment results
- validationResults (Object) - Validation rule check results
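A successful call returns a summary object along these lines. The values below are illustrative, and the nested shapes of dataQualityReport and validationResults are assumptions, since only their top-level meaning is documented above:

```json
{
  "success": true,
  "inputPath": "/sandbox/data/raw_sales.csv",
  "outputPath": "/sandbox/data/processed_sales.csv",
  "operationsApplied": ["filter", "aggregate"],
  "recordsProcessed": 120000,
  "recordsOutput": 48,
  "transformationTime": 2150,
  "dataQualityReport": { "missingValues": 0, "duplicateRows": 0 },
  "validationResults": { "rulesChecked": 2, "rulesFailed": 0 }
}
```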
Data Filtering and Selection
Row Filtering Operations
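Filtering keeps only rows that match a condition, and `select` trims the result to the columns you need (the `select` operation type also appears in the error-handling example further below). A minimal sketch following the request format above; the `in` operator and the `columns` field are assumptions for illustration:

```json
{
  "operations": [
    { "type": "filter", "column": "status", "operator": "==", "value": "active" },
    { "type": "filter", "column": "region", "operator": "in", "value": ["EMEA", "APAC"] },
    { "type": "select", "columns": ["order_id", "region", "sales", "date"] }
  ]
}
```

Operations apply in sequence, so each additional filter narrows the previous result.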
Data Aggregation and Grouping
Group By and Statistical Aggregations
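Aggregation groups rows by one or more key columns and computes a statistic per group, as in the request example above. Only `sum` and `mean` are shown in this document, so this sketch reuses those documented functions:

```json
{
  "operations": [
    {
      "type": "aggregate",
      "groupBy": ["region", "product"],
      "aggregations": {
        "sales": "sum",
        "quantity": "mean"
      }
    }
  ]
}
```

Each output row corresponds to one unique (region, product) combination, which is why aggregation typically reduces recordsOutput well below recordsProcessed.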
Data Merging and Joining
Dataset Combination Operations
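This document does not specify the shape of a merge operation, so the field names below (`rightPath`, `joinType`, `on`) are illustrative assumptions; the "Join Key Missing" entry in the error table further below implies that the join columns must exist in both datasets:

```json
{
  "operations": [
    {
      "type": "merge",
      "rightPath": "/sandbox/data/customers.csv",
      "joinType": "left",
      "on": ["customer_id"]
    }
  ]
}
```

A left join keeps every row from the primary dataset and attaches matching rows from the second file; unmatched keys produce missing values that the cleaning operations below can handle.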
Data Cleaning and Quality
Missing Data Handling
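No cleaning operation shape is specified above either, so every field in this sketch is an assumption about how duplicate removal and missing-value handling might be expressed:

```json
{
  "operations": [
    { "type": "clean", "dropDuplicates": true },
    { "type": "clean", "column": "sales", "missingStrategy": "fill", "fillValue": 0 }
  ]
}
```

The dataQualityReport returned by the tool should reflect the outcome of these steps.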
Error Handling
Common Transformation Issues
| Error Type | Cause | Resolution |
|---|---|---|
| Schema Mismatch | Column names/types don't match | Verify column names and data types |
| Memory Error | Dataset too large for processing | Use chunk processing or optimize operations |
| Join Key Missing | Join columns not found | Check column names and data availability |
| Data Type Error | Invalid data type conversions | Validate data types before transformation |
| Operation Conflict | Incompatible operation sequence | Reorder operations logically |
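Several of these errors can be caught before processing with the `validationRules` parameter. The parameter table above only states that it is an array, so the rule shape here is an assumption:

```json
{
  "validationRules": [
    { "column": "date", "rule": "not_null" },
    { "column": "sales", "rule": "type", "expected": "number" }
  ]
}
```

Results of these checks are reported in the validationResults field of the response.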
Robust Data Transformation
```python
def robust_data_transformation(input_path, operations_list, output_path, fallback_strategies=None):
    """Perform data transformation with comprehensive error handling."""
    if not fallback_strategies:
        fallback_strategies = [
            {"chunk_size": 10000, "preserve_original": True},
            {"chunk_size": 5000, "preserve_original": True, "simple_operations": True},
            {"chunk_size": 1000, "preserve_original": True, "minimal_operations": True}
        ]

    for i, strategy in enumerate(fallback_strategies):
        try:
            print(f"🔄 Attempting transformation strategy {i+1}")

            # Adjust operations based on strategy
            adjusted_operations = operations_list
            if strategy.get("simple_operations"):
                # Simplify complex operations
                adjusted_operations = [op for op in operations_list
                                       if op.get("type") in ["filter", "select", "aggregate"]]
            elif strategy.get("minimal_operations"):
                # Use only basic operations
                adjusted_operations = [op for op in operations_list
                                       if op.get("type") in ["filter", "select"]]

            transformation_params = {
                "inputPath": input_path,
                "outputPath": output_path,
                "operations": adjusted_operations,
                "chunkSize": strategy.get("chunk_size", 10000),
                "preserveOriginal": strategy.get("preserve_original", True)
            }

            result = dataTransformation(transformation_params)

            if result['success']:
                print(f"✅ Transformation successful with strategy {i+1}")
                return {
                    "success": True,
                    "strategy_used": i+1,
                    "strategy_details": strategy,
                    "results": result
                }
            else:
                print(f"⚠️ Strategy {i+1} failed: {result.get('error')}")

        except Exception as e:
            print(f"💥 Strategy {i+1} exception: {str(e)}")

    return {
        "success": False,
        "error": "All transformation strategies failed",
        "strategies_attempted": len(fallback_strategies)
    }

# Usage with error handling
transformation_operations = [
    {"type": "filter", "column": "date", "operator": ">=", "value": "2023-01-01"},
    {"type": "aggregate", "groupBy": ["region"], "aggregations": {"sales": "sum"}},
    {"type": "calculate", "newColumn": "sales_per_customer", "expression": "sales / customer_count"}
]

robust_result = robust_data_transformation(
    "/sandbox/data/large_dataset.csv",
    transformation_operations,
    "/sandbox/data/transformed_output.csv"
)
```

Related Tools
- Statistical Analysis - Perform statistical analysis on transformed datasets
- XLSX Analysis - Analyze Excel files before and after transformation
- Read File - Read datasets for transformation processing
Next Steps: Use with Statistical Analysis for analysis of transformed data, or XLSX Analysis for Excel-specific transformations.