Machine Learning Algorithm — Logistic Regression
Linear regression places no constraint on its predicted value (e.g., it can predict a negative house price), so we turn to logistic regression whenever the output is categorical.
- we take a linear combination (weighted sum) of the input features
- we apply the sigmoid function to the result to obtain a number between 0 and 1
- this number represents the probability of the input being classified as “Yes”
- instead of RMSE, the cross-entropy loss function is used to evaluate the results (sketched in the code below)
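A minimal NumPy sketch of those steps, using made-up weights and a single input, followed by the cross-entropy loss for one example:

```python
import numpy as np

# Hypothetical learned weights, bias, and one feature vector (illustrative values only)
w = np.array([0.8, -1.2])   # coefficients
b = 0.5                     # intercept
x = np.array([2.0, 1.0])    # input features

# Step 1: linear combination (weighted sum) of the input features
z = np.dot(w, x) + b

# Step 2: sigmoid squashes the result into (0, 1)
p = 1.0 / (1.0 + np.exp(-z))   # predicted probability of "Yes"

# Step 3: cross-entropy loss against the true label y in {0, 1}
y = 1
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(f"probability={p:.3f}, cross-entropy loss={loss:.3f}")
```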
Logistic regression can classify using both continuous and discrete predictors. For example, it can predict obesity from an individual's weight (continuous) as well as from a genotype category (discrete). The y-axis in logistic regression always shows probability, so its value is limited to the range 0 to 1. To work with a linear model, this probability scale is transformed to the log(odds) scale using the logit function, logit(p) = log(p / (1 − p)).
With this transformation of the y-axis, we get a straight (best-fitting) line instead of a squiggly S-shaped curve, i.e., a linear equation with a slope and a y-intercept. The logistic regression equation is the same as the linear regression equation, except that it models log(odds) rather than the raw output, and the coefficients are estimated on that scale.
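A small sketch of that transformation, with a hypothetical intercept and slope, showing that the model is a straight line on the log(odds) scale and that the sigmoid maps it back to a probability:

```python
import numpy as np

# logit: maps a probability in (0, 1) to log(odds) on the whole real line
def logit(p):
    return np.log(p / (1 - p))

# sigmoid (inverse logit): maps log(odds) back to a probability
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative coefficients: on the log(odds) scale the model is a straight line
b0, b1 = -4.0, 0.05            # hypothetical intercept and slope (e.g. per kg of weight)
weight = 90
log_odds = b0 + b1 * weight    # linear equation, just like linear regression
prob = sigmoid(log_odds)       # transform back to a probability

print(f"log(odds)={log_odds:.2f}, probability={prob:.3f}")
print(logit(prob))             # recovers the same log(odds) value
```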
To find the best-fitting line, we use maximum likelihood estimation (MLE) instead of the least-squares method. MLE asks: what parameter values maximize the likelihood of observing the data we actually observed? It is a general technique for estimating the parameters of a distribution. For example, if a population is known to follow a normal distribution but the mean and variance are unknown, MLE can estimate them from a limited sample of the population. It does this by finding the particular values of the parameters (mean and variance) under which the model is most likely to have generated the observed data.
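A quick illustration of that normal-distribution example, assuming a synthetic sample and using the known closed-form MLE for a normal distribution (the sample mean, and the sample variance with denominator n):

```python
import numpy as np

# A limited synthetic sample from an (assumed) normal population
rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=10, size=200)   # hypothetical heights

# For a normal distribution, the maximum-likelihood estimates have closed forms
mu_hat = sample.mean()                             # MLE of the mean
var_hat = ((sample - mu_hat) ** 2).mean()          # MLE of the variance (denominator n)

print(f"MLE mean={mu_hat:.2f}, MLE variance={var_hat:.2f}")
```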
Training, Validation and Test Datasets
- Training set — used to train the model, i.e., compute the loss and adjust the model’s weights using an optimization technique.
- Validation set — used to evaluate the model during training, tune model hyperparameters (optimization technique, regularization etc.), and pick the best version of the model. Picking a good validation set is essential for training models that generalize well.
- Test set — used to compare different models or approaches and report the model’s final accuracy. For many datasets, test sets are provided separately. The test set should reflect, as closely as feasible, the kind of data the model will encounter in the real world.
A 60–20–20 split of the dataset is typical. If the dataset has a date column, split on the sorted date column instead of in random order, so the validation and test sets come from later dates than the training set (a helper for this is sketched below).
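One way to code such a split, assuming a pandas DataFrame and a hypothetical split_60_20_20 helper:

```python
import pandas as pd

def split_60_20_20(df, date_col=None):
    """Split a DataFrame into 60/20/20 train/validation/test sets.

    If date_col is given, the split is chronological (oldest rows go to
    train, newest to test) instead of random.
    """
    df = df.sort_values(date_col) if date_col else df.sample(frac=1, random_state=42)
    n = len(df)
    train_end, val_end = int(0.6 * n), int(0.8 * n)
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Usage on a toy frame with a date column
df = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=10), "x": range(10)})
train_df, val_df, test_df = split_60_20_20(df, date_col="date")
```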
Save processed data to disk in Parquet format; it is a fast format for saving and loading data frames. Save the model along with its parameters, imputers, scalers, encoders, etc. using the joblib or pickle library.
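A minimal sketch with a toy DataFrame and a fitted scaler and model; the file names are placeholders, and Parquet support assumes pyarrow or fastparquet is installed:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame; a real project would use the processed training data
df = pd.DataFrame({"weight": [60.0, 95.0, 72.0, 110.0], "obese": [0, 1, 0, 1]})
df.to_parquet("processed.parquet")            # requires pyarrow or fastparquet
df_back = pd.read_parquet("processed.parquet")

scaler = StandardScaler().fit(df[["weight"]])
model = LogisticRegression().fit(scaler.transform(df[["weight"]]), df["obese"])

# Save the model together with its preprocessing objects in one artifact
joblib.dump({"model": model, "scaler": scaler}, "logistic_model.joblib")
artifacts = joblib.load("logistic_model.joblib")
```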
Model Evaluation Metrics
Confusion Matrix : Describes the performance of a classification model on a test dataset for which the true values are known.
Precision : Of all the positive predictions, what percentage is truly positive.
Recall : Of all the actual positives, what percentage is predicted positive. Also called True Positive Rate (TPR) or Sensitivity.
Accuracy : The fraction of all predictions the model gets right, counting both True Positives and True Negatives.
F1-Score : The harmonic mean of precision and recall; it cannot be high unless both precision and recall are high. When a model’s F1 score is high, you know the model is doing well all around. It is especially useful when the dataset is imbalanced.
Whether precision matters more than recall, or vice versa, depends entirely on the situation.
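All of these metrics are available in scikit-learn; a small sketch with made-up labels and predictions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and model predictions on a test set
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))                # [[TN, FP], [FN, TP]]
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("accuracy: ", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("f1:       ", f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```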
ROC-Curve: It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0. Put another way, it plots the false alarm rate versus the hit rate. Ultimately, we’re concerned with the area under the ROC curve, or AUROC. In practice that metric ranges from 0.50 (no better than random guessing) to 1.00 (perfect separation), and values above 0.80 indicate that the model does a good job of discriminating between the two categories that make up our target variable.
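A sketch of computing the ROC curve points and the AUROC with scikit-learn, using made-up probabilities:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # one (FPR, TPR) point per threshold
auroc = roc_auc_score(y_true, y_prob)              # area under the ROC curve
print(f"AUROC = {auroc:.2f}")
```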
K-Fold Cross Validation: When evaluating a model, we often want to assess how well it predicts the target variable on different subsets of the data. One technique for doing this is k-fold cross-validation, which partitions the data into k equally sized segments (called ‘folds’). One fold is held out for validation while the other k−1 folds are used to train the model, which then predicts the target variable on the held-out fold. This process is repeated k times, and the performance on each held-out fold is tracked using a metric such as accuracy.
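A sketch using scikit-learn's cross_val_score on synthetic data; the dataset and the 5-fold choice are purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 5-fold CV: each fold is held out once while the other 4 train the model
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(scores, scores.mean())
```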
Model Improvements
Hyperparameter Tuning : Hyperparameters are model-specific properties that are ‘fixed’ before you even train and test your model on data. Use scikit-learn’s GridSearchCV to tune the hyperparameters and pick the best combination to proceed with. You can use k-fold cross-validation to determine the optimal hyperparameters.
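A sketch of grid search over a couple of logistic regression hyperparameters; the grid values here are assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Candidate hyperparameter values to try (adjust to your problem)
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}

# GridSearchCV runs k-fold CV (cv=5 here) for every combination in the grid
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```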
Other improvements: feature scaling, optimizing for F1-score, and checking for class imbalance (you can pass class_weight='balanced' to weight the classes inversely to their frequencies).
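A sketch combining feature scaling with class_weight='balanced' in a pipeline, on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data (about 90% negative, 10% positive), for illustration only
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Feature scaling plus class_weight="balanced" to counter the class imbalance
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(class_weight="balanced", max_iter=1000))
clf.fit(X, y)
```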