data science · advanced · 1560 tokens
Feature Engineering for ML
Create powerful features to improve ML model performance
feature-engineering · machine-learning · pandas · sklearn · ml · preprocessing
Prompt Template
You are a machine learning engineer specializing in feature engineering. Create powerful features for this ML problem.
**Problem Type:** {problem_type}
**Target Variable:** {target}
**Raw Features:** {features}
**Domain:** {domain}
**Current Model Performance:** {baseline_performance}
Engineer features to improve model performance:
**1. Feature Engineering Strategy:**
**Domain-Specific Features:**
Based on domain = "{domain}", create:
- [Domain-specific feature 1]: Rationale
- [Domain-specific feature 2]: Rationale
- [Domain-specific feature 3]: Rationale
**2. Feature Creation Techniques:**
**A. Numerical Features:**
- **Transformations:**
- Log/sqrt/power: For skewed distributions
- Standardization (mean 0, std 1) or min-max normalization (scale to [0, 1])
- Binning: Continuous → categorical (equal-width, equal-frequency, custom)
- **Interactions:**
- Multiplicative: feature1 * feature2 (e.g., price * quantity = total)
- Ratios: feature1 / feature2 (e.g., profit / revenue = margin)
- Differences: feature1 - feature2 (e.g., actual - budget = variance)
- **Aggregations:**
- Rolling statistics: mean, std, min, max over time window
- Group statistics: mean/std by category
- Percentiles: Where does this value rank? (binning and percentile ranks are sketched after this list)
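Binning and percentile ranks do not appear in the implementation below, so here is a minimal pandas sketch (the `price` and `category` columns are illustrative stand-ins):
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'price': np.random.lognormal(3, 1, 1000),
                   'category': np.random.choice(list('ABC'), 1000)})

# Equal-width bins: each bin spans the same price range
df['price_bin_width'] = pd.cut(df['price'], bins=5, labels=False)

# Equal-frequency bins: each bin holds roughly the same number of rows
df['price_bin_freq'] = pd.qcut(df['price'], q=5, labels=False, duplicates='drop')

# Percentile rank overall and within each category
df['price_pct_rank'] = df['price'].rank(pct=True)
df['price_pct_rank_in_cat'] = df.groupby('category')['price'].rank(pct=True)
```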
**B. Categorical Features:**
- **Encoding:**
- One-hot: For low cardinality (< 10 categories)
- Target encoding: Mean target per category (watch for leakage! see the out-of-fold sketch after this list)
- Frequency encoding: Count of each category
- Binary encoding: For high cardinality
- Embeddings: For very high cardinality (learned from neural net)
- **Feature Combinations:**
- Concatenation: city + state = "Boston_MA"
- Interaction: category1 × category2 one-hot encoded
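Target encoding computed on the full dataset leaks the label into the feature; a common safeguard is out-of-fold encoding, where each row is encoded with means computed on the other folds. A minimal sketch (`city` and `label` are hypothetical column names; for binary encoding of high-cardinality columns, the separate `category_encoders` package offers a `BinaryEncoder`):
```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, col, target, n_splits=5):
    """Encode each row with target means computed on the *other* folds."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(fold_means).values
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(df[target].mean())

# Usage (hypothetical columns): df['city_te'] = target_encode_oof(df, 'city', 'label')
```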
**C. Temporal Features:**
- **From Datetime:**
- Time components: year, month, day, hour, minute, day_of_week
- Cyclical encoding: sin/cos transformation for circular features
- Is_weekend, is_holiday, is_month_end
- Time since event: days_since_last_purchase
- Time until event: days_until_deadline
- **Lag Features:**
- Value at t-1, t-2, ..., t-n
- Rolling mean/std over past n periods
- Change from previous: value(t) - value(t-1)
- Percent change: (value(t) - value(t-1)) / value(t-1)
**D. Text Features:**
- **Basic:**
- Length: character count, word count
- Sentiment: Positive/negative/neutral score (see the sketch after this list)
- Keyword presence: Boolean for important terms
- **Advanced:**
- TF-IDF: Term frequency-inverse document frequency
- N-grams: Bigrams, trigrams
- Embeddings: Word2Vec, BERT embeddings
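Length, keyword, and TF-IDF features appear in the implementation below; for the sentiment score, one lightweight option (an assumption, not the only choice) is NLTK's VADER analyzer:
```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

# Compound score in [-1, 1], from most negative to most positive
df['description_sentiment'] = df['description'].fillna('').map(
    lambda text: sia.polarity_scores(text)['compound']
)
```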
**E. Image Features:**
- **Handcrafted:**
- Color histograms
- Edge detection features
- Texture features
- **Learned:**
- CNN embeddings (transfer learning)
- Pre-trained model features (ResNet, VGG)
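A minimal transfer-learning sketch with torchvision (assumes torch and torchvision are installed; `image.jpg` is a placeholder path). Dropping the final classification layer turns ResNet-18 into a 512-dimensional feature extractor:
```python
import torch
from torchvision import models
from PIL import Image

# Pre-trained ResNet-18 with the classifier head removed
weights = models.ResNet18_Weights.DEFAULT
backbone = torch.nn.Sequential(*list(models.resnet18(weights=weights).children())[:-1])
backbone.eval()

preprocess = weights.transforms()  # the resize/crop/normalize pipeline the weights expect

img = Image.open('image.jpg').convert('RGB')  # placeholder path
with torch.no_grad():
    embedding = backbone(preprocess(img).unsqueeze(0)).flatten(1)  # shape (1, 512)
```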
**3. Python Implementation:**
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import mutual_info_classif, SelectKBest
# Load data
df = pd.read_csv('{file_path}')
# ============================================================================
# NUMERICAL FEATURES
# ============================================================================
# 1. Transformations
df['log_price'] = np.log1p(df['price']) # log(1+x) to handle zeros
df['sqrt_area'] = np.sqrt(df['area'])
# 2. Interactions
df['price_per_sqft'] = df['price'] / df['area']
df['total_revenue'] = df['price'] * df['quantity']
# 3. Aggregations
# Group statistics
df['price_vs_category_mean'] = df.groupby('category')['price'].transform('mean')
df['price_deviation'] = df['price'] - df['price_vs_category_mean']
# Rolling statistics (for time series; assumes rows sorted by date within each user)
df['rolling_mean_7d'] = df.groupby('user_id')['sales'].transform(
    lambda x: x.rolling(window=7, min_periods=1).mean()
)
# ============================================================================
# CATEGORICAL FEATURES
# ============================================================================
# 1. One-hot encoding (low cardinality)
df = pd.get_dummies(df, columns=['category'], prefix='cat')
# 2. Target encoding (watch for overfitting! in practice, compute means on training folds only)
city_means = df.groupby('city')['{target}'].mean()
df['city_target_encoded'] = df['city'].map(city_means)
# 3. Frequency encoding
df['city_frequency'] = df.groupby('city')['city'].transform('count')
# ============================================================================
# TEMPORAL FEATURES
# ============================================================================
df['date'] = pd.to_datetime(df['date'])
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
# Cyclical encoding (for periodic features like month, hour)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
# Lag features (assumes rows are sorted by date within each user)
df['sales_lag_1'] = df.groupby('user_id')['sales'].shift(1)
df['sales_lag_7'] = df.groupby('user_id')['sales'].shift(7)
# Time since event
df['days_since_last_purchase'] = (df['date'] - df.groupby('user_id')['date'].shift(1)).dt.days
# ============================================================================
# TEXT FEATURES
# ============================================================================
# Length features
df['description_length'] = df['description'].str.len()
df['description_word_count'] = df['description'].str.split().str.len()
# Keyword presence
df['has_premium_keyword'] = df['description'].str.contains('premium|luxury|exclusive', case=False, na=False).astype(int)
# TF-IDF (for more advanced text analysis)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=50, stop_words='english')
tfidf_features = tfidf.fit_transform(df['description'].fillna(''))
tfidf_df = pd.DataFrame(
    tfidf_features.toarray(),
    columns=[f'tfidf_{term}' for term in tfidf.get_feature_names_out()],
    index=df.index,
)
df = pd.concat([df, tfidf_df], axis=1)
# ============================================================================
# FEATURE SCALING
# ============================================================================
# Standardize numerical features
scaler = StandardScaler()
numerical_features = ['price', 'area', 'log_price', 'sqrt_area']
df[numerical_features] = scaler.fit_transform(df[numerical_features])
# ============================================================================
# FEATURE SELECTION
# ============================================================================
# Select top K features based on mutual information
# (mutual_info_classif needs a fully numeric, non-null matrix)
X = df.drop(columns=['{target}']).select_dtypes(include=['number', 'bool']).fillna(0)
y = df['{target}']
selector = SelectKBest(mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist()
print("Top features selected:")
for i, feature in enumerate(selected_features):
    print(f"{i+1}. {feature}")
# ============================================================================
# SAVE ENGINEERED FEATURES
# ============================================================================
df.to_csv('features_engineered.csv', index=False)
```
**4. Feature Quality Checks:**
**Check for:**
- **Leakage:** Features that include information from the future
- **High Cardinality:** Categorical features with too many unique values
- **Multicollinearity:** Features highly correlated with each other (VIF > 10; see the VIF sketch after the validation snippet)
- **Zero Variance:** Features with no variation
- **High Missing Rate:** Features missing > 50% of values
**Validation:**
```python
# Check correlation among numeric features
corr_matrix = df.corr(numeric_only=True)
high_corr = np.where(np.abs(corr_matrix) > 0.95)
high_corr_pairs = [(corr_matrix.index[x], corr_matrix.columns[y])
                   for x, y in zip(*high_corr) if x < y]
```
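The correlation check only catches pairwise redundancy; variance inflation factors flag multivariate collinearity as well. A sketch, assuming statsmodels is installed and `X` is the numeric feature matrix from the selection step:
```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(X)  # VIF is computed against a model with an intercept
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif[vif > 10].drop('const', errors='ignore'))  # candidates to drop or combine
```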
**5. Feature Importance Analysis:**
After training a model:
```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
# Plot top 20
plt.figure(figsize=(10, 8))
plt.barh(feature_importance['feature'][:20], feature_importance['importance'][:20])
plt.xlabel('Importance')
plt.title('Top 20 Feature Importances')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
```
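Impurity-based importances can overstate high-cardinality or continuous features; permutation importance on held-out data is a more model-agnostic cross-check (here `X_val`/`y_val` are an assumed holdout split):
```python
from sklearn.inspection import permutation_importance

# Shuffle each feature on the holdout set and measure the score drop
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
perm_importance = pd.Series(result.importances_mean, index=X_val.columns)
print(perm_importance.sort_values(ascending=False).head(20))
```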
**6. Expected Impact:**
**New Features Created:** [count]
**Features Selected:** [count]
**Expected Performance Gain:** [estimate based on feature importance]
**Next Steps:**
1. Validate features on holdout set
2. Check for leakage using temporal validation (see the TimeSeriesSplit sketch below)
3. Iterate based on model performance
4. Document feature definitions for production
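For step 2, a temporal split keeps all training rows strictly before the validation rows, which exposes leaky features such as a rolling mean computed over the full series. A sketch with scikit-learn, assuming the rows of `X` are sorted by date and `model` is the classifier from above:
```python
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score

tscv = TimeSeriesSplit(n_splits=5)  # each fold trains on the past, validates on the future
for train_idx, val_idx in tscv.split(X):
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    score = accuracy_score(y.iloc[val_idx], model.predict(X.iloc[val_idx]))
    print(f"fold accuracy: {score:.3f}")  # a big gap vs. shuffled CV suggests leakage
```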
Provide: Complete feature engineering code + validation + documentation.
Variables to Replace
{problem_type} · {target} · {features} · {domain} · {baseline_performance} · {file_path}
Pro Tips
Domain knowledge is key to good feature engineering. Always validate features for leakage before using in production.