Data science · Intermediate · 620 tokens
Data Cleaning with Pandas
Clean messy datasets systematically with pandas
data-cleaning · pandas · preprocessing · data-quality · python
Prompt Template
You are a data science expert specializing in data preprocessing. Help me clean and prepare this dataset for analysis.
**Dataset Context:**
{dataset_description}
**Raw Data Sample:**
```
{data_sample}
```
**Data Quality Issues:**
{known_issues}
**Analysis Goals:**
{analysis_objectives}
Create a comprehensive data cleaning pipeline:
**1. Initial Data Assessment:**
- Dataset shape and structure
- Data types for each column
- Missing value analysis (count and patterns)
- Duplicate row detection
- Basic statistics (min, max, mean, std)
**2. Data Quality Report:**
For each column, identify (a code sketch follows this list):
- **Missing Values:** % missing, missing patterns (MCAR, MAR, MNAR)
- **Outliers:** Using IQR, Z-score, or domain knowledge
- **Invalid Values:** Out-of-range, incorrect formats, typos
- **Inconsistencies:** Mixed formats, encoding issues
- **Duplicates:** Exact and fuzzy duplicates
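For illustration, a minimal sketch of how such a per-column report could be assembled with pandas; the `quality_report` helper and the 1.5 × IQR threshold are assumptions, not part of the template:

```python
import pandas as pd
import numpy as np

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Assemble a per-column data quality summary (illustrative only)."""
    rows = []
    for col in df.columns:
        s = df[col]
        outliers = np.nan
        if pd.api.types.is_numeric_dtype(s):
            # IQR rule: flag values beyond 1.5 * IQR from the quartiles
            q1, q3 = s.quantile([0.25, 0.75])
            iqr = q3 - q1
            outliers = int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())
        rows.append({
            "column": col,
            "dtype": s.dtype,
            "missing_pct": round(s.isna().mean() * 100, 2),
            "n_unique": s.nunique(dropna=True),
            "iqr_outliers": outliers,
        })
    return pd.DataFrame(rows)

# Usage: print(quality_report(df))
```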
**3. Cleaning Strategy:**
For each issue, recommend (see the sketch after this list):
- **Missing Values:**
- Drop if: < 5% missing and MCAR
- Impute if: > 5% missing, use mean/median/mode/forward-fill/KNN
- Flag if: informative missingness (create indicator column)
- **Outliers:**
- Keep if: legitimate extreme values
- Cap if: true outliers (winsorization)
- Remove if: data errors
- **Invalid Values:**
- Correction strategy
- Standardization approach
- **Duplicates:**
- Exact: drop or keep first/last
- Fuzzy: merge strategy
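As referenced above, a minimal sketch of these strategies in pandas, assuming a frame `df` with hypothetical columns `age` (numeric) and `category` (categorical); adapt the columns and thresholds to the actual dataset:

```python
import pandas as pd

# Missing values: flag informative missingness, then median-impute the numeric column
df["age_missing"] = df["age"].isna().astype(int)   # indicator column
df["age"] = df["age"].fillna(df["age"].median())

# Missing values: mode-impute the categorical column
df["category"] = df["category"].fillna(df["category"].mode().iloc[0])

# Outliers: winsorize (cap) at the 1st and 99th percentiles
low, high = df["age"].quantile([0.01, 0.99])
df["age"] = df["age"].clip(lower=low, upper=high)

# Duplicates: drop exact duplicates, keeping the first occurrence
df = df.drop_duplicates(keep="first")
```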
**4. Python Code (Pandas):**
```python
import pandas as pd
import numpy as np
from scipy import stats
# Load data
df = pd.read_csv('{file_path}')
# 1. Initial Assessment
print("Dataset Shape:", df.shape)
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nDuplicates:", df.duplicated().sum())
# 2. Handle Missing Values
# [Column-specific strategies with explanations]
# 3. Handle Outliers
# [Column-specific strategies]
# 4. Data Type Conversions
# [Convert to appropriate types]
# 5. Feature Engineering (if needed)
# [Create derived features]
# 6. Final Validation
# [Verify cleaning results]
# Save cleaned data
df.to_csv('cleaned_data.csv', index=False)
```
**5. Before/After Comparison:**
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Total Rows | | | |
| Missing Values | | | |
| Duplicate Rows | | | |
| Outliers | | | |
| Data Quality Score | | | |
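One way to populate this table is to compute the same metrics on the raw and cleaned frames and place them side by side; a rough sketch, where `raw_df`, `cleaned_df`, and the `summarize` helper are assumptions:

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Capture the metrics used in the before/after table."""
    numeric = df.select_dtypes("number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum().sum()
    return {
        "total_rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "iqr_outliers": int(outliers),
    }

before = summarize(raw_df)      # raw_df: the dataset before cleaning
after = summarize(cleaned_df)   # cleaned_df: the dataset after the pipeline
print(pd.DataFrame({"Before": before, "After": after}))
```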
**6. Data Cleaning Report:**
- Issues identified: [count]
- Issues resolved: [count]
- Data retained: [X%]
- Potential biases introduced: [list]
- Recommendations for next steps
**7. Reproducibility:**
- Random seed used (if applicable)
- Assumptions made
- Parameters chosen and why
- How to validate cleaning on new data
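A lightweight validation hook along these lines can be rerun whenever the pipeline is applied to new data; the specific checks (no nulls, no duplicates, an `age` range) are assumptions to adapt per dataset:

```python
import pandas as pd
import numpy as np

np.random.seed(42)  # fix the seed first if any stochastic step (e.g. sampling) is used

def validate_cleaned(df: pd.DataFrame) -> None:
    """Sanity checks to rerun on every new batch passed through the pipeline."""
    assert not df.duplicated().any(), "duplicate rows remain"
    assert df.isna().sum().sum() == 0, "missing values remain"
    # Example domain-specific range check (hypothetical column and bounds)
    if "age" in df.columns:
        assert df["age"].between(0, 120).all(), "age outside plausible range"

validate_cleaned(df)
```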
Output: Clean, documented pandas code + data quality report.
Variables to Replace
{dataset_description}, {data_sample}, {known_issues}, {analysis_objectives}, {file_path}
Pro Tips
Always document your cleaning decisions. Different domains have different outlier thresholds.