Feature engineering & feature selection

Posted by Hao Do on August 28, 2023

Feature engineering and feature selection are essential steps in preparing data for machine learning and improving the performance of predictive models. Let’s delve into each of these concepts:

Feature Engineering: Feature engineering involves creating new features or transforming existing ones to make them more suitable for machine learning algorithms. The goal is to extract relevant information from the raw data and represent it in a way that enhances the model’s ability to learn patterns and make accurate predictions. Feature engineering can include tasks like the following; short code sketches for several of them appear after the list:

  1. Feature Creation: Generating new features based on domain knowledge or combining existing features. For example, in natural language processing, creating features like word counts or TF-IDF values can be helpful.

  2. Encoding Categorical Variables: Converting categorical variables into numerical format that can be used by machine learning algorithms, such as one-hot encoding or label encoding.

  3. Scaling and Normalization: Scaling features to the same range (e.g., using Min-Max scaling or standardization) to prevent certain features from dominating others.

  4. Binning: Grouping continuous variables into discrete bins to capture non-linear relationships.

  5. Feature Extraction: Reducing the dimensionality of data through techniques like principal component analysis (PCA) or singular value decomposition (SVD).

  6. Time-Series Transformations: Creating lag features or rolling statistics for time-series data to capture temporal patterns.

  7. Text and Image Processing: Extracting features from text data using techniques like tokenization, stemming, and converting text into numerical vectors (word embeddings). Similarly, extracting features from images using convolutional neural networks (CNNs).
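
To make items 2–4 concrete, here is a minimal sketch using pandas and scikit-learn. The DataFrame and its column names ("color", "price", "age") are invented purely for illustration:

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

# Toy data: one categorical column and two continuous columns.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10.0, 250.0, 40.0, 3.0],
    "age": [23, 45, 31, 62],
})

# Encoding: one-hot encode the categorical variable.
df = pd.get_dummies(df, columns=["color"])

# Scaling: bring the continuous features into the same [0, 1] range.
scaler = MinMaxScaler()
df[["price", "age"]] = scaler.fit_transform(df[["price", "age"]])

# Binning: group a continuous variable into three quantile-based bins.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
df["price_bin"] = binner.fit_transform(df[["price"]]).ravel()

print(df)
```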
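
For item 5, PCA from scikit-learn is a common way to compress many correlated features into a few components; the random matrix below simply stands in for a real feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in feature matrix: 100 samples with 10 original features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project onto the 2 directions of highest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```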
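
For item 6, lag and rolling-window features are one-liners in pandas; the daily "sales" series below is synthetic:

```python
import pandas as pd

# Synthetic daily series used only to demonstrate the transformations.
ts = pd.DataFrame(
    {"sales": [12, 15, 14, 18, 21, 19, 25]},
    index=pd.date_range("2023-01-01", periods=7, freq="D"),
)

ts["sales_lag_1"] = ts["sales"].shift(1)                        # yesterday's value
ts["sales_roll_mean_3"] = ts["sales"].rolling(window=3).mean()  # 3-day average

print(ts)
```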
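
And for the text side of items 1 and 7, scikit-learn’s TfidfVectorizer handles tokenization and TF-IDF weighting in one step; the two-document corpus is a toy example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for real documents.
corpus = [
    "feature engineering improves models",
    "feature selection removes redundant features",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())
print(X.toarray())
```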

Feature Selection: Feature selection involves selecting a subset of the most relevant and informative features from the original set of features. The goal is to improve model performance, reduce overfitting, and speed up training times. Reducing the number of irrelevant or redundant features can also enhance model interpretability. Feature selection methods can be categorized into three main types, each illustrated with a short sketch after the list:

  1. Filter Methods: These methods assess the relevance of features independently of the chosen learning algorithm. Common techniques include correlation analysis, chi-squared test, and mutual information.

  2. Wrapper Methods: These methods involve training the machine learning model with different subsets of features and evaluating their performance. Techniques like forward selection, backward elimination, and recursive feature elimination (RFE) fall into this category.

  3. Embedded Methods: These methods perform feature selection during the training process of the machine learning algorithm itself. Techniques like Lasso (L1 regularization) and tree-based feature importance fall into this category.
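
A minimal filter-method sketch with scikit-learn, scoring features against the target with mutual information; the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic classification data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Keep the 3 features with the highest mutual information with y.
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (200, 3)
print(selector.get_support())  # boolean mask over the original features
```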
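
A wrapper-method sketch using recursive feature elimination (RFE) around a logistic regression, on the same kind of synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# RFE repeatedly fits the model and drops the weakest feature.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # True for the features that survived
print(rfe.ranking_)  # 1 = selected; larger numbers were eliminated earlier
```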
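
An embedded-method sketch: Lasso zeroes out coefficients of unhelpful features, and a random forest reports importances, both as by-products of training; again on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, random_state=0)

# L1 regularization drives irrelevant coefficients exactly to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)

# Tree ensembles expose per-feature importances after fitting.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)
```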

Choosing between feature engineering and feature selection (or a combination of both) depends on the nature of the data, the problem at hand, and the machine learning algorithm being used. It’s often an iterative process involving experimentation, evaluation of model performance, and refinement of feature engineering and selection strategies.

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

The end.