Data Science · Advanced · 510 tokens
Data Pipeline Debugging
Debug and fix data pipeline issues
Tags: data-engineering, debugging, data-pipeline, pandas, python
Prompt Template
You are a data engineer debugging a data pipeline. Help me diagnose and fix issues.
**Pipeline:** {pipeline_description}
**Issue:** {issue_description}
**Error Logs:** {error_logs}
Debug systematically:
**1. Isolate the Problem:**
- Which stage is failing? (ingestion / transform / load)
- Is it a data issue or a code issue?
- Is it reproducible or intermittent?
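The isolation step above can be sketched in code: wrap each stage with a row-count log so the stage where data disappears (or an exception fires) is immediately visible. The stage names and functions are hypothetical placeholders for your own pipeline steps.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_with_checkpoints(data, stages):
    """Run each (name, function) stage in order, logging row counts
    before and after so the failing or data-dropping stage stands out."""
    for name, fn in stages:
        before = len(data)
        data = fn(data)
        log.info("%s: %d -> %d rows", name, before, len(data))
    return data
```

For example, `run_with_checkpoints(raw_df, [("clean", clean_fn), ("enrich", enrich_fn)])` would show exactly which stage shrinks or breaks the data.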
**2. Data Quality Checks:**
```python
import pandas as pd
# Check 1: Data completeness
print(f"Rows: {len(df)}, Expected: {expected_rows}")
print(f"Missing: {df.isnull().sum().sum()}")
# Check 2: Data types
print(df.dtypes)
expected_types = {column_types}  # e.g. {"id": "int64", "value": "float64"}
# Compare dtype names as strings -- more robust than comparing dtype objects
assert df.dtypes.astype(str).to_dict() == expected_types
# Check 3: Value ranges
for col in numeric_columns:
    print(f"{col}: min={df[col].min()}, max={df[col].max()}")
# Check 4: Duplicates
print(f"Duplicates: {df.duplicated().sum()}")
# Check 5: Schema validation (legacy Great Expectations API shown;
# newer GE versions use a different entry point -- adjust to your version)
import great_expectations as ge
ge_df = ge.from_pandas(df)
ge_df.expect_column_values_to_not_be_null("id")  # "id" is a placeholder column
results = ge_df.validate()
```
**3. Common Issues & Fixes:**
**Issue: Schema mismatch**
Fix: Add schema validation at ingestion
```python
from pydantic import BaseModel
class DataSchema(BaseModel):
id: int
name: str
value: float
```
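Applied at ingestion, the schema check might look like the sketch below: validate each raw record, collecting failures instead of crashing the whole run. The record-validation helper and its error handling are an illustrative pattern, not a fixed API; `ValidationError` behaves this way in both pydantic v1 and v2.

```python
from pydantic import BaseModel, ValidationError

class DataSchema(BaseModel):
    id: int
    name: str
    value: float

def validate_records(records):
    """Validate raw dicts against the schema; return (valid, rejected)
    so bad rows can be quarantined rather than failing the pipeline."""
    good, bad = [], []
    for row in records:
        try:
            good.append(DataSchema(**row))
        except ValidationError as exc:
            bad.append((row, str(exc)))
    return good, bad
```

Quarantining rejected rows preserves the evidence you need for the root-cause step later.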
**Issue: Memory errors**
Fix: Process in chunks
```python
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    process(chunk)
```
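A common wrinkle with chunking is combining partial results. One minimal sketch, assuming a numeric column and a simple sum aggregation (the file path and column name are placeholders):

```python
import pandas as pd

def chunked_sum(path, column, chunksize=10_000):
    """Aggregate a large CSV without loading it fully into memory:
    accumulate a partial result per chunk and combine at the end."""
    total = 0.0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += chunk[column].sum()
    return total
```

Aggregations like mean need both a running sum and a running count; only combine per-chunk results for operations that decompose cleanly.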
**Issue: Slow transformations**
Fix: Use vectorized operations, parallelize
```python
import dask.dataframe as dd
ddf = dd.read_csv('data.csv')
result = ddf.map_partitions(transform_func).compute()
```
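The "vectorized operations" half of the fix deserves its own illustration: row-wise `apply` invokes Python once per row, while column arithmetic runs in compiled code. The `price`/`qty` columns below are hypothetical example data.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Slow: row-wise apply calls a Python lambda for every row
slow = df.apply(lambda r: r["price"] * r["qty"], axis=1)

# Fast: vectorized column arithmetic, same result
fast = df["price"] * df["qty"]

assert (slow == fast).all()
```

On realistic data sizes the vectorized form is typically orders of magnitude faster; reach for Dask only after vectorizing, since parallelizing a row-wise `apply` mostly parallelizes the overhead.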
**4. Monitoring & Alerts:**
- Data volume checks
- Data freshness checks
- Schema drift detection
- Quality metric thresholds
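The first two checks above can be sketched as small predicate functions that an alerting job calls after each run. The tolerance and freshness-SLA values are illustrative defaults, not recommendations.

```python
import datetime as dt

def volume_ok(row_count, expected, tolerance=0.2):
    """True if row count is within +/- tolerance of the expected volume."""
    return abs(row_count - expected) <= tolerance * expected

def freshness_ok(last_updated, max_age_hours=24):
    """True if the newest record is within the freshness SLA."""
    age = dt.datetime.utcnow() - last_updated
    return age <= dt.timedelta(hours=max_age_hours)
```

Wiring these into the scheduler (alert when either returns `False`) catches silent failures like an upstream feed that starts delivering half its usual rows.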
**5. Root Cause:**
- What: {what_broke}
- Why: {root_cause}
- Fix: {solution}
- Prevention: {prevention}
Provide: Diagnosis + fix + monitoring + prevention.
Variables to Replace
{pipeline_description}, {issue_description}, {error_logs}, {expected_rows}, {column_types}, {numeric_columns}, {what_broke}, {root_cause}, {solution}, {prevention}
Pro Tips
Add data quality checks at every stage. Use Great Expectations for schema validation.