
Data Pipeline Debugging

Debug and fix data pipeline issues

Tags: data-engineering, debugging, data-pipeline, pandas, python

Prompt Template

You are a data engineer debugging a data pipeline. Help me diagnose and fix issues.

**Pipeline:** {pipeline_description}
**Issue:** {issue_description}
**Error Logs:** {error_logs}

Debug systematically:

**1. Isolate the Problem:**
- Which stage is failing (ingestion, transform, or load)? Bisect with stage checkpoints, as sketched below
- Is it a data issue or a code issue?
- Is it reproducible or intermittent?
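
A minimal checkpointing sketch for stage bisection, assuming hypothetical `ingest`, `transform`, and `load` functions; persisting each stage's output lets you replay a single stage in isolation:

```python
import pandas as pd

# Persist each stage's output so a failure can be isolated and replayed
df_raw = ingest()                              # hypothetical ingestion step
df_raw.to_parquet('checkpoint_raw.parquet')

df_clean = transform(df_raw)                   # hypothetical transform step
df_clean.to_parquet('checkpoint_clean.parquet')

load(df_clean)                                 # hypothetical load step

# To debug: reload the last good checkpoint, rerun only the failing stage
df_raw = pd.read_parquet('checkpoint_raw.parquet')
df_clean = transform(df_raw)
```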

**2. Data Quality Checks:**
```python
import pandas as pd

# Check 1: Data completeness
print(f"Rows: {len(df)}, Expected: {expected_rows}")
print(f"Missing: {df.isnull().sum().sum()}")

# Check 2: Data types (compare as strings; df.dtypes holds dtype objects)
print(df.dtypes)
expected_types = {column_types}
assert df.dtypes.astype(str).to_dict() == expected_types, "dtype mismatch"

# Check 3: Value ranges
for col in numeric_columns:
    print(f"{col}: min={df[col].min()}, max={df[col].max()}")

# Check 4: Duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Check 5: Schema validation with Great Expectations
# (classic API shown; newer releases use `import great_expectations as gx`
#  and `gx.get_context()` instead)
import great_expectations as ge
ge_df = ge.from_pandas(df)
print(ge_df.expect_column_values_to_not_be_null("id"))
```
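
These checks print results; in a scheduled pipeline you usually want them to fail loudly instead. A minimal sketch of turning them into hard gates (the `require` helper is hypothetical):

```python
def require(condition: bool, message: str) -> None:
    """Raise with a clear message so the scheduler marks the run as failed."""
    if not condition:
        raise ValueError(f"Data quality check failed: {message}")

require(len(df) > 0, "dataframe is empty")
require(df.isnull().sum().sum() == 0, "unexpected nulls present")
require(df.duplicated().sum() == 0, "duplicate rows found")
```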

**3. Common Issues & Fixes:**

**Issue: Schema mismatch**
Fix: Add schema validation at ingestion
```python
from pydantic import BaseModel
class DataSchema(BaseModel):
    id: int
    name: str
    value: float
```
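
To enforce the schema, validate each record as it is ingested. A sketch using `DataSchema` from above; `raw_records` is a hypothetical iterable of dicts:

```python
from pydantic import ValidationError

valid, rejected = [], []
for record in raw_records:  # hypothetical source iterable of dicts
    try:
        valid.append(DataSchema(**record))
    except ValidationError as exc:
        rejected.append((record, str(exc)))  # quarantine rather than crash
```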

**Issue: Memory errors**
Fix: Process in chunks
```python
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    process(chunk)  # per-chunk transform keeps peak memory bounded
```
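
One common gotcha with chunking: aggregates must be combined across chunks, not computed per chunk. A sketch, assuming a hypothetical numeric column `value`:

```python
import pandas as pd

# Accumulate partial aggregates per chunk, combine at the end
total = 0.0
row_count = 0
for chunk in pd.read_csv('large_file.csv', chunksize=10_000):
    total += chunk['value'].sum()  # 'value' is a hypothetical column
    row_count += len(chunk)
mean_value = total / row_count
```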

**Issue: Slow transformations**
Fix: Use vectorized operations, parallelize
```python
import dask.dataframe as dd

ddf = dd.read_csv('data.csv')
# transform_func is applied once per partition, in parallel
result = ddf.map_partitions(transform_func).compute()
```
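
The vectorization half of the fix is often enough on its own: replace row-wise `.apply` with column-wise operations. An illustration with hypothetical `price` and `qty` columns:

```python
import numpy as np

# Slow: Python-level function call for every row
df['total'] = df.apply(lambda row: row['price'] * row['qty'], axis=1)

# Fast: a single vectorized operation over whole columns
df['total'] = df['price'] * df['qty']

# Vectorized conditional logic via numpy
df['tier'] = np.where(df['total'] > 1000, 'high', 'low')
```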

**4. Monitoring & Alerts:** (a minimal volume/freshness sketch follows the list)
- Data volume checks
- Data freshness checks
- Schema drift detection
- Quality metric thresholds
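
A minimal sketch of the first two checks, assuming a hypothetical tz-aware `loaded_at` timestamp column and an alerting hook `send_alert`:

```python
import pandas as pd

def check_volume(df: pd.DataFrame, min_rows: int) -> None:
    if len(df) < min_rows:
        send_alert(f"Volume check failed: {len(df)} rows < {min_rows}")

def check_freshness(df: pd.DataFrame, max_age_hours: float) -> None:
    # assumes 'loaded_at' holds tz-aware UTC timestamps
    age = pd.Timestamp.now(tz="UTC") - df['loaded_at'].max()
    if age > pd.Timedelta(hours=max_age_hours):
        send_alert(f"Freshness check failed: newest row is {age} old")
```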

**5. Root Cause:**
- What: {what_broke}
- Why: {root_cause}
- Fix: {solution}
- Prevention: {prevention}

Provide: Diagnosis + fix + monitoring + prevention.

Variables to Replace

{pipeline_description}
{issue_description}
{error_logs}
{expected_rows}
{column_types}
{numeric_columns}
{what_broke}
{root_cause}
{solution}
{prevention}

Pro Tips

Add data quality checks at every stage. Use Great Expectations for schema validation.
