Traffic Prediction with Machine Learning – Random Forest

In this project, we aimed to predict traffic metrics (specifically Clicks) using historical search data from Google Search Console. We built two versions of a Random Forest Regressor: an initial baseline, and a refined version informed by the baseline's results.

Initial Attempt:

The first version of the model followed these steps:

  1. Data Preprocessing:
    • The dataset traffic.csv was loaded and the Date column was converted to a numeric format representing days since the first date in the dataset. This transformation made the time-based data suitable for machine learning models.
    • Missing values in the dataset, particularly in the CTR, Position, Impressions, and Clicks columns, were handled by replacing NaN values with the column mean.
    • Feature engineering added new columns for day_of_week, month, and week_of_year to capture cyclical patterns in traffic.
  2. Feature Selection:
    • The model used four features: Impressions, day_of_week, month, and week_of_year. CTR and Position were excluded because they were not considered essential for the model.
  3. Modeling:
    • A Random Forest Regressor was used to predict Clicks, after scaling the features using StandardScaler.
    • We evaluated the model using two key metrics: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The results were as follows:
      • MAE: 105.38
      • RMSE: 149.04
    Despite these high error values, the model provided a baseline for further improvements.
  4. Visualization:
    • A plot was generated comparing the Actual vs Predicted Clicks, which visually demonstrated the discrepancy between predicted and actual values.
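
The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the project's actual code: the column names come from the description above, while the helper names prepare and fit_and_score, the chronological 80/20 split, and the forest settings (n_estimators, random_state) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Feature set named in the write-up.
FEATURES = ["Impressions", "day_of_week", "month", "week_of_year"]


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Preprocessing and feature engineering (hypothetical helper)."""
    df = df.copy()
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.sort_values("Date").reset_index(drop=True)

    # Replace missing values with the column mean.
    for col in ["CTR", "Position", "Impressions", "Clicks"]:
        if col in df.columns:
            df[col] = df[col].fillna(df[col].mean())

    # Calendar features intended to capture cyclical patterns.
    df["day_of_week"] = df["Date"].dt.dayofweek
    df["month"] = df["Date"].dt.month
    df["week_of_year"] = df["Date"].dt.isocalendar().week.astype(int)

    # Numeric date: days elapsed since the first date in the dataset.
    df["days_since_start"] = (df["Date"] - df["Date"].min()).dt.days
    return df


def fit_and_score(df: pd.DataFrame, test_fraction: float = 0.2):
    """Fit a scaled Random Forest and return (model, MAE, RMSE)."""
    # Assumption: hold out the most recent rows rather than a random sample.
    split = int(len(df) * (1 - test_fraction))
    train, test = df.iloc[:split], df.iloc[split:]

    model = make_pipeline(
        StandardScaler(),
        RandomForestRegressor(n_estimators=100, random_state=0),
    )
    model.fit(train[FEATURES], train["Clicks"])

    pred = model.predict(test[FEATURES])
    mae = mean_absolute_error(test["Clicks"], pred)
    rmse = float(np.sqrt(mean_squared_error(test["Clicks"], pred)))
    return model, mae, rmse
```

With the real data, this would be called as fit_and_score(prepare(pd.read_csv("traffic.csv"))). Wrapping the scaler and the forest in one pipeline means the scaler is fitted on the training rows only.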

Improved Model:

Following the initial results, we decided to refine the model. Here’s how the adjustments were made:

  1. Data Adjustments:
    • After reviewing the initial performance, we dropped the CTR and Position columns altogether, as they had little impact or were redundant with other features (CTR, for instance, is derived from Clicks and Impressions).
  2. Feature Engineering:
    • We continued using Impressions, day_of_week, month, and week_of_year, with additional attention given to ensuring data quality. Missing values in all relevant columns were handled by replacing them with the mean of the column.
  3. Model Refinements:
    • The Random Forest Regressor was retained, but the model was now trained on cleaner data restricted to the most relevant features. This adjustment improved the model’s accuracy and may also have reduced overfitting.
  4. Evaluation:
    • After refining the model, we achieved much better results:
      • Mean Absolute Error (MAE): 53.97
      • Root Mean Squared Error (RMSE): 74.02

Compared with the first attempt, both error metrics were roughly halved (MAE from 105.38 to 53.97, RMSE from 149.04 to 74.02), so the refined model predicts Clicks considerably more accurately.
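
To quantify the gain, the relative reduction in each metric can be computed directly from the figures reported above. The helper name relative_reduction is introduced here purely for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error metric shrank between two runs."""
    return (before - after) / before


# Figures reported for the initial and improved models.
mae_drop = relative_reduction(105.38, 53.97)
rmse_drop = relative_reduction(149.04, 74.02)

print(f"MAE reduced by {mae_drop:.1%}")    # MAE reduced by 48.8%
print(f"RMSE reduced by {rmse_drop:.1%}")  # RMSE reduced by 50.3%
```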

This iterative approach refined the model and highlighted the areas for future improvement outlined below.

Next Steps:

To improve the model, we will handle outliers with Z-scores or IQR, add seasonal features like holidays, and introduce lag features to capture temporal dependencies.
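
As a rough illustration of these planned steps, the sketch below shows one possible way to implement them with pandas. None of this is implemented in the project yet. The function names, the choice to clip outliers rather than drop them, the lags of 1 and 7 days, the 7-day rolling mean, and the caller-supplied holiday list are all assumptions.

```python
import pandas as pd


def clip_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def add_lag_features(df: pd.DataFrame, column: str = "Clicks",
                     lags=(1, 7)) -> pd.DataFrame:
    """Add lagged copies of `column` plus a 7-day rolling mean of past values.

    Assumes one row per day, already sorted by date.
    """
    df = df.copy()
    new_cols = []
    for lag in lags:
        name = f"{column}_lag_{lag}"
        df[name] = df[column].shift(lag)
        new_cols.append(name)

    # shift(1) keeps the current day's value out of its own feature.
    roll = f"{column}_roll_mean_7"
    df[roll] = df[column].shift(1).rolling(7).mean()
    new_cols.append(roll)

    # The earliest rows have no history, so they are dropped.
    return df.dropna(subset=new_cols).reset_index(drop=True)


def add_holiday_flag(df: pd.DataFrame, holidays) -> pd.DataFrame:
    """Mark rows whose Date appears in a caller-supplied list of holidays."""
    df = df.copy()
    dates = pd.to_datetime(df["Date"]).dt.normalize()
    df["is_holiday"] = dates.isin(pd.to_datetime(list(holidays))).astype(int)
    return df
```

Lag features are built only from earlier days, so they remain available at prediction time as long as the previous days' Clicks are known.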

We will also tune hyperparameters with grid or random search and explore alternative models such as Gradient Boosting or XGBoost, which may provide better results. Treating the problem as a time series forecasting task, with additional historical traffic data, could further improve the predictions.
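
A random search over Random Forest hyperparameters could look like the sketch below, which uses scikit-learn's RandomizedSearchCV. The search space, the number of iterations, and the use of TimeSeriesSplit (so that each validation fold comes after its training fold) are illustrative assumptions rather than settings taken from the project, and tune_random_forest is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit


def tune_random_forest(X, y, n_iter: int = 10, n_splits: int = 3,
                       random_state: int = 0) -> RandomizedSearchCV:
    """Random search over a small, assumed hyperparameter space."""
    param_distributions = {
        "n_estimators": [100, 200, 400],
        "max_depth": [None, 5, 10, 20],
        "min_samples_leaf": [1, 2, 5, 10],
        "max_features": [1.0, "sqrt"],
    }
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=random_state),
        param_distributions=param_distributions,
        n_iter=n_iter,
        # scikit-learn maximizes scores, so MAE is negated.
        scoring="neg_mean_absolute_error",
        # Rows are assumed to be in chronological order.
        cv=TimeSeriesSplit(n_splits=n_splits),
        random_state=random_state,
    )
    search.fit(X, y)
    return search
```

After fitting, search.best_params_ holds the selected settings and -search.best_score_ is the corresponding cross-validated MAE. Replacing RandomizedSearchCV with GridSearchCV (and param_distributions with param_grid) would give an exhaustive grid search instead.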