Data Science · Intermediate · 620 tokens

Data Cleaning with Pandas

Clean messy datasets systematically with pandas

data-cleaning · pandas · preprocessing · data-quality · python

Prompt Template

You are a data science expert specializing in data preprocessing. Help me clean and prepare this dataset for analysis.

**Dataset Context:**
{dataset_description}

**Raw Data Sample:**
```
{data_sample}
```

**Data Quality Issues:**
{known_issues}

**Analysis Goals:**
{analysis_objectives}

Create a comprehensive data cleaning pipeline:

**1. Initial Data Assessment:**
- Dataset shape and structure
- Data types for each column
- Missing value analysis (count and patterns)
- Duplicate row detection
- Basic statistics (min, max, mean, std)

**2. Data Quality Report:**
For each column, identify:
- **Missing Values:** % missing, missing patterns (MCAR, MAR, MNAR)
- **Outliers:** Using IQR, Z-score, or domain knowledge
- **Invalid Values:** Out-of-range, incorrect formats, typos
- **Inconsistencies:** Mixed formats, encoding issues
- **Duplicates:** Exact and fuzzy duplicates
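The quality checks above can be sketched in a few lines of pandas. This is a minimal illustration on an invented toy frame (the column names `age` and `city` are assumptions, not part of the template):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, 31, np.nan, 199, 31],
    "city": ["NYC", "nyc ", "Boston", "NYC", "nyc "],
})

# Missing values: count and percentage per column
missing = df.isnull().sum()
missing_pct = df.isnull().mean() * 100

# Outliers on a numeric column via the 1.5 * IQR rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Inconsistencies: mixed case/whitespace inflates the unique count
n_raw = df["city"].nunique()
n_norm = df["city"].str.strip().str.lower().nunique()

# Exact duplicate rows
n_dupes = df.duplicated().sum()
```

Here `age = 199` exceeds the IQR fence, and normalizing `city` collapses `"NYC"` and `"nyc "` into one value, so the raw unique count overstates the true cardinality.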

**3. Cleaning Strategy:**
For each issue, recommend:
- **Missing Values:**
  - Drop if: < 5% missing and MCAR
  - Impute if: > 5% missing, use mean/median/mode/forward-fill/KNN
  - Flag if: informative missingness (create indicator column)
- **Outliers:**
  - Keep if: legitimate extreme values
  - Cap if: true outliers (winsorization)
  - Remove if: data errors
- **Invalid Values:**
  - Correction strategy
  - Standardization approach
- **Duplicates:**
  - Exact: drop or keep first/last
  - Fuzzy: merge strategy
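Three of the strategies above (indicator-flagged median imputation, winsorization, and exact-duplicate removal) can be combined in one short pass. A sketch on invented data, where the `income` and `plan` columns and the 1st/99th percentile caps are illustrative choices, not prescriptions:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [40_000.0, 52_000.0, np.nan, 52_000.0, 1_000_000.0],
    "plan": ["basic", "pro", "basic", "pro", "pro"],
})

# Missing values: impute with the median, but keep an indicator
# column in case the missingness itself is informative
df["income_missing"] = df["income"].isna()
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: winsorize by capping at the 1st/99th percentiles
lo, hi = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(lower=lo, upper=hi)

# Exact duplicates: drop, keeping the first occurrence
df = df.drop_duplicates(keep="first")
```

Note the order matters: imputing before winsorizing means the extreme value still influences the median, so in practice you may want to detect outliers first and document whichever order you choose.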

**4. Python Code (Pandas):**
```python
import pandas as pd
import numpy as np
from scipy import stats

# Load data
df = pd.read_csv('{file_path}')

# 1. Initial Assessment
print("Dataset Shape:", df.shape)
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nDuplicates:", df.duplicated().sum())
print("\nSummary Statistics:")
print(df.describe())

# 2. Handle Missing Values
# [Column-specific strategies with explanations]

# 3. Handle Outliers
# [Column-specific strategies]

# 4. Data Type Conversions
# [Convert to appropriate types]

# 5. Feature Engineering (if needed)
# [Create derived features]

# 6. Final Validation
# [Verify cleaning results]

# Save cleaned data
df.to_csv('cleaned_data.csv', index=False)
```

**5. Before/After Comparison:**
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Total Rows | | | |
| Missing Values | | | |
| Duplicate Rows | | | |
| Outliers | | | |
| Data Quality Score | | | |

**6. Data Cleaning Report:**
- Issues identified: [count]
- Issues resolved: [count]
- Data retained: [X%]
- Potential biases introduced: [list]
- Recommendations for next steps

**7. Reproducibility:**
- Random seed used (if applicable)
- Assumptions made
- Parameters chosen and why
- How to validate cleaning on new data
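One way to satisfy these reproducibility points is to wrap the pipeline in a function whose parameters are explicit and whose randomness is seeded. The random-sample imputation below is just an illustrative stand-in for any stochastic step (the function name and column are assumptions):

```python
import pandas as pd
import numpy as np

def clean(df: pd.DataFrame, *, seed: int = 42) -> pd.DataFrame:
    """Reproducible cleaning: parameters explicit, randomness seeded."""
    rng = np.random.default_rng(seed)
    out = df.drop_duplicates().copy()
    for col in out.select_dtypes("number").columns:
        n_missing = int(out[col].isna().sum())
        if n_missing:
            # Random-sample imputation: draws depend on the seed,
            # so the same seed always yields the same cleaned frame
            observed = out[col].dropna().to_numpy()
            out.loc[out[col].isna(), col] = rng.choice(observed, size=n_missing)
    return out

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})
cleaned = clean(df, seed=0)
```

Calling `clean` twice with the same seed returns identical frames, which makes the pipeline easy to validate on new data and to audit later.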

Output: Clean, documented pandas code + data quality report.

Variables to Replace

{dataset_description}
{data_sample}
{known_issues}
{analysis_objectives}
{file_path}

Pro Tips

Always document your cleaning decisions: different domains have different outlier thresholds, so justify the ones you choose.
