data science · advanced · 1560 tokens

Feature Engineering for ML

Create powerful features to improve ML model performance

feature-engineering · machine-learning · pandas · sklearn · ml · preprocessing

Prompt Template

You are a machine learning engineer specializing in feature engineering. Create powerful features for this ML problem.

**Problem Type:** {problem_type}
**Target Variable:** {target}
**Raw Features:** {features}
**Domain:** {domain}

**Current Model Performance:** {baseline_performance}

Engineer features to improve model performance:

**1. Feature Engineering Strategy:**

**Domain-Specific Features:**
Based on domain = "{domain}", create:
- [Domain-specific feature 1]: Rationale
- [Domain-specific feature 2]: Rationale
- [Domain-specific feature 3]: Rationale

**2. Feature Creation Techniques:**

**A. Numerical Features:**
- **Transformations:**
  - Log/sqrt/power: For skewed distributions
  - Normalization: rescale to [0, 1]; standardization: center to mean = 0, std = 1
  - Binning: continuous → categorical (equal-width, equal-frequency, or custom edges; see the sketch after this list)

- **Interactions:**
  - Multiplicative: feature1 * feature2 (e.g., price * quantity = total)
  - Ratios: feature1 / feature2 (e.g., profit / revenue = margin)
  - Differences: feature1 - feature2 (e.g., actual - budget = variance)

- **Aggregations:**
  - Rolling statistics: mean, std, min, max over time window
  - Group statistics: mean/std by category
  - Percentiles: Where does this value rank?
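
A minimal pandas sketch of the binning and percentile-rank ideas above, which the section 3 implementation does not cover (column names are illustrative):

```python
import pandas as pd

# Equal-width bins: same bin width, possibly uneven counts per bin
df['price_bin'] = pd.cut(df['price'], bins=5, labels=False)

# Equal-frequency bins: roughly the same number of rows per bin
df['price_quartile'] = pd.qcut(df['price'], q=4, labels=False, duplicates='drop')

# Percentile rank within a group: where does this row sit among its peers?
df['price_pct_rank'] = df.groupby('category')['price'].rank(pct=True)
```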

**B. Categorical Features:**
- **Encoding:**
  - One-hot: For low cardinality (< 10 categories)
  - Target encoding: Mean target per category (watch for leakage!)
  - Frequency encoding: Count of each category
  - Binary encoding: For high cardinality (see the sketch after this list)
  - Embeddings: For very high cardinality (learned from neural net)

- **Feature Combinations:**
  - Concatenation: city + state = "Boston_MA"
  - Interaction: category1 × category2 one-hot encoded
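
One-hot, target, and frequency encoding appear in the section 3 implementation; binary encoding does not, so here is a minimal hand-rolled sketch (the `category_encoders` package also ships a ready-made `BinaryEncoder` if a dependency is acceptable):

```python
import numpy as np
import pandas as pd

def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Binary-encode a categorical column: ceil(log2(k)) columns for k
    categories, versus k columns for one-hot. Assumes at least one non-null."""
    codes = series.astype('category').cat.codes.to_numpy()  # NaN becomes -1
    n_bits = max(int(np.ceil(np.log2(codes.max() + 1))), 1)
    # NOTE: code -1 (NaN) encodes as all-ones here; impute first if that matters
    return pd.DataFrame(
        {f'{prefix}_bit{b}': (codes >> b) & 1 for b in range(n_bits)},
        index=series.index,
    )

df = df.join(binary_encode(df['city'], prefix='city'))
```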

**C. Temporal Features:**
- **From Datetime:**
  - Time components: year, month, day, hour, minute, day_of_week
  - Cyclical encoding: sin/cos transformation for circular features
  - Is_weekend, is_holiday, is_month_end
  - Time since event: days_since_last_purchase
  - Time until event: days_until_deadline

- **Lag Features:**
  - Value at t-1, t-2, ..., t-n
  - Rolling mean/std over past n periods
  - Change from previous: value(t) - value(t-1)
  - Percent change: (value(t) - value(t-1)) / value(t-1)
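
The last two items map directly onto pandas shortcuts; a minimal per-entity sketch (assumes rows are sorted by date within each `user_id`):

```python
# Change from previous observation: value(t) - value(t-1)
df['sales_diff_1'] = df.groupby('user_id')['sales'].diff(1)

# Percent change: (value(t) - value(t-1)) / value(t-1)
df['sales_pct_change_1'] = df.groupby('user_id')['sales'].pct_change(1)
```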

**D. Text Features:**
- **Basic:**
  - Length: character count, word count
  - Sentiment: Positive/negative/neutral score
  - Keyword presence: Boolean for important terms

- **Advanced:**
  - TF-IDF: Term frequency-inverse document frequency
  - N-grams: Bigrams, trigrams
  - Embeddings: Word2Vec, BERT embeddings
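
TF-IDF and the basic counts appear in the section 3 implementation; for embeddings, a hedged sketch using the `sentence-transformers` package (the model name is an illustrative default, not a recommendation):

```python
# pip install sentence-transformers
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
texts = df['description'].fillna('').tolist()
embeddings = model.encode(texts)  # shape: (n_rows, 384) for this model

emb_df = pd.DataFrame(embeddings, index=df.index).add_prefix('desc_emb_')
df = df.join(emb_df)
```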

**E. Image Features:**
- **Handcrafted:**
  - Color histograms
  - Edge detection features
  - Texture features

- **Learned:**
  - CNN embeddings (transfer learning)
  - Pre-trained model features (ResNet, VGG)
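
A minimal transfer-learning sketch with torchvision; the model choice is illustrative, and `images` stands in for a batch of PIL images (an assumption, not part of the template's data):

```python
# pip install torch torchvision
import torch
import torchvision

weights = torchvision.models.ResNet18_Weights.DEFAULT
model = torchvision.models.resnet18(weights=weights)
model.fc = torch.nn.Identity()  # drop the classifier head; keep 512-d embeddings
model.eval()

preprocess = weights.transforms()  # resize/crop/normalize pipeline the weights expect

with torch.no_grad():
    batch = torch.stack([preprocess(img) for img in images])  # `images`: assumed list of PIL images
    embeddings = model(batch)  # shape: (batch_size, 512)
```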

**3. Python Implementation:**

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import mutual_info_classif, SelectKBest

# Load data
df = pd.read_csv('{file_path}')

# ============================================================================
# NUMERICAL FEATURES
# ============================================================================

# 1. Transformations
df['log_price'] = np.log1p(df['price'])  # log(1+x) to handle zeros
df['sqrt_area'] = np.sqrt(df['area'])

# 2. Interactions
df['price_per_sqft'] = df['price'] / df['area']
df['total_revenue'] = df['price'] * df['quantity']

# 3. Aggregations
# Group statistics
df['category_price_mean'] = df.groupby('category')['price'].transform('mean')
df['price_deviation'] = df['price'] - df['category_price_mean']

# Rolling statistics (for time series; assumes rows are sorted chronologically per user)
df['rolling_mean_7d'] = df.groupby('user_id')['sales'].transform(
    lambda x: x.rolling(window=7, min_periods=1).mean()
)

# ============================================================================
# CATEGORICAL FEATURES
# ============================================================================

# 1. One-hot encoding (low cardinality)
df = pd.get_dummies(df, columns=['category'], prefix='cat')

# 2. Target encoding — the simple version below computes each city's mean from
#    the full dataset, which leaks the target; an out-of-fold variant follows
city_means = df.groupby('city')['{target}'].mean()
df['city_target_encoded'] = df['city'].map(city_means)
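
# Out-of-fold target encoding (added sketch): each row is encoded using
# statistics computed without that row's fold, which avoids target leakage
from sklearn.model_selection import KFold

df['city_target_encoded_oof'] = np.nan
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(df):
    fold_means = df.iloc[train_idx].groupby('city')['{target}'].mean()
    df.loc[df.index[val_idx], 'city_target_encoded_oof'] = (
        df.iloc[val_idx]['city'].map(fold_means)
    )
# Cities unseen in a fold's training part get NaN; fall back to the global mean
df['city_target_encoded_oof'] = df['city_target_encoded_oof'].fillna(df['{target}'].mean())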

# 3. Frequency encoding
df['city_frequency'] = df.groupby('city')['city'].transform('count')

# ============================================================================
# TEMPORAL FEATURES
# ============================================================================

df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)

# Cyclical encoding (for periodic features like month, hour)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Lag features (assumes rows are sorted by date within each user)
df['sales_lag_1'] = df.groupby('user_id')['sales'].shift(1)
df['sales_lag_7'] = df.groupby('user_id')['sales'].shift(7)

# Time since event
df['days_since_last_purchase'] = (df['date'] - df.groupby('user_id')['date'].shift(1)).dt.days

# ============================================================================
# TEXT FEATURES
# ============================================================================

# Length features
df['description_length'] = df['description'].str.len()
df['description_word_count'] = df['description'].str.split().str.len()

# Keyword presence
df['has_premium_keyword'] = df['description'].str.contains('premium|luxury|exclusive', case=False, na=False).astype(int)

# TF-IDF (for more advanced text analysis)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=50, stop_words='english')
tfidf_features = tfidf.fit_transform(df['description'].fillna(''))
tfidf_df = pd.DataFrame(
    tfidf_features.toarray(),
    columns=[f'tfidf_{term}' for term in tfidf.get_feature_names_out()],
    index=df.index,
)
df = pd.concat([df, tfidf_df], axis=1)

# ============================================================================
# FEATURE SCALING
# ============================================================================

# Standardize numerical features
# (in production, fit the scaler on the training split only to avoid leakage)
scaler = StandardScaler()
numerical_features = ['price', 'area', 'log_price', 'sqrt_area']
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# ============================================================================
# FEATURE SELECTION
# ============================================================================

# Select top K features based on mutual information
# (mutual_info_classif suits classification; use mutual_info_regression for
#  regression. The selector needs numeric, non-null inputs.)
X = df.drop(columns=['{target}']).select_dtypes(include=np.number).fillna(0)
y = df['{target}']

selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()

print("Top features selected:")
for i, feature in enumerate(selected_features):
    print(f"{i+1}. {feature}")

# ============================================================================
# SAVE ENGINEERED FEATURES
# ============================================================================

df.to_csv('features_engineered.csv', index=False)
```

**4. Feature Quality Checks:**

**Check for:**
- **Leakage:** Features that encode information unavailable at prediction time (future values, or the target itself)
- **High Cardinality:** Categorical features with too many unique values
- **Multicollinearity:** Features highly correlated with each other (VIF > 10)
- **Zero Variance:** Features with no variation
- **High Missing Rate:** Features missing > 50% of values

**Validation:**
```python
# Check correlation
corr_matrix = df.corr()
high_corr = np.where(np.abs(corr_matrix) > 0.95)
high_corr_pairs = [(corr_matrix.index[x], corr_matrix.columns[y])
                   for x, y in zip(*high_corr) if x != y and x < y]
```
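
Sketches for the remaining checks; the VIF check assumes `statsmodels` is installed and runs on numeric, non-null columns:

```python
import numpy as np
import pandas as pd

numeric = df.select_dtypes(include=np.number)

# Zero variance: constant columns carry no signal
zero_var = numeric.columns[numeric.nunique() <= 1].tolist()

# High missing rate: flag features missing more than 50% of values
high_missing = df.columns[df.isna().mean() > 0.5].tolist()

# Multicollinearity via variance inflation factor (VIF > 10 is a common flag)
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_vif = numeric.dropna()
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif[vif > 10])
```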

**5. Feature Importance Analysis:**

After training a model:
```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 20
plt.figure(figsize=(10, 8))
plt.barh(feature_importance['feature'][:20], feature_importance['importance'][:20])
plt.xlabel('Importance')
plt.title('Top 20 Feature Importances')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
```

**6. Expected Impact:**

**New Features Created:** [count]
**Features Selected:** [count]
**Expected Performance Gain:** [estimate based on feature importance]

**Next Steps:**
1. Validate features on holdout set
2. Check for leakage using temporal validation (see the sketch below)
3. Iterate based on model performance
4. Document feature definitions for production
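
For step 2, a minimal temporal-validation sketch with scikit-learn, reusing `model`, `X`, and `y` from the earlier sections (assumes rows are in chronological order):

```python
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Each fold trains strictly on the past and validates on the future; a large
# gap between this score and a shuffled CV score often signals leakage
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv, scoring='accuracy')  # pick a metric matching {problem_type}
print(f"Temporal CV: {scores.mean():.3f} +/- {scores.std():.3f}")
```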

Provide: Complete feature engineering code + validation + documentation.

Variables to Replace

{problem_type}
{target}
{features}
{domain}
{baseline_performance}
{file_path}

Pro Tips

Domain knowledge is key to good feature engineering. Always validate features for leakage before using them in production.
