Model name: Area Based Risk Model
Goal: This model predicts the hourly probability of a collision occurring on a road segment, as well as the probability of a vehicle experiencing a collision while traveling through that segment.
Base model: XGBoost Predictive Model
Model type: Supervised Machine Learning Model
Model version: 1
Developed by: Geotab Safety team
Primary intended uses:
Out-of-scope uses: The model does not support forecasting future area risk. Uses other than the primary use case are also out of scope and not recommended.
Targeted users/User groups: Geotab is the primary user. The model supplies area-based risk insights that Geotab's safety products use to better predict the collision risk of customers' vehicles and drivers.
This section outlines the key aspects of the data used to develop and evaluate the model. We first describe the training and testing data, and then detail the data pipeline and preprocessing steps used to prepare the data for modeling. Lastly, we discuss the privacy considerations and protections implemented to ensure responsible handling of sensitive data.
The dataset used for training and testing includes several features designed to predict area-based collision risk. These features include:
To optimize the model, feature selection is performed on these features to identify the key ones and remove those that are unhelpful or redundant. The selection combines three analyses: feature clustering, feature importance, and pairwise feature correlation. High-importance features are prioritized, while highly correlated or otherwise redundant features are removed. The process is iterative: each candidate feature set is tested in the model, and selection continues until performance stops improving and the final feature list is determined.
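The importance and correlation parts of this procedure can be sketched as a greedy filter. This is a minimal illustration under stated assumptions, not the team's pipeline: the feature names, importance scores, and the 0.9 correlation threshold are all hypothetical, and the clustering analysis and the iterative retraining loop are left out.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to measure
    return cov / sqrt(var_x * var_y)


def select_features(columns, importance, max_abs_corr=0.9):
    """Visit features from most to least important and keep a feature only if
    it is not too strongly correlated with any feature already kept."""
    kept = []
    for name in sorted(columns, key=lambda f: importance[f], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= max_abs_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: names and values are invented for illustration only.
columns = {
    "traffic_volume": [10, 20, 30, 40, 50],
    "traffic_volume_scaled": [1.0, 2.0, 3.0, 4.0, 5.0],  # redundant copy
    "harsh_braking_rate": [0.3, 0.1, 0.4, 0.1, 0.5],
}
importance = {
    "traffic_volume": 0.5,
    "traffic_volume_scaled": 0.3,
    "harsh_braking_rate": 0.2,
}

selected = select_features(columns, importance)
print(selected)
```

Visiting features in importance order means that, within a correlated group, the most important member survives and the others are dropped.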
The data pipeline includes the following key preprocessing steps:
In this section, we highlight the ethical challenges we faced during model development, including bias and fairness considerations, and present our solutions to them. We also state the assumptions and constraints of our model, including any limitations in the data or the model's scope that could affect its performance. Understanding the model's strengths and limitations is crucial for stakeholders to use the model responsibly and interpret its results correctly.
After all features are gathered, the optimal model parameters are determined by K-fold cross-validation. The primary evaluation metric is the area under the Receiver Operating Characteristic curve (ROC-AUC), which summarizes how well the model ranks positive cases above negative ones across all classification thresholds.
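As a rough illustration of this evaluation loop, the sketch below computes ROC-AUC from its ranking definition and averages it over K validation folds to compare candidate settings. Everything in it is an assumption made for the example: the helper names, the toy data, and the `make_fit` stand-in, which ignores its training data and simply combines two columns with a fixed weight. In the actual model, `fit` would train an XGBoost model with the candidate parameters.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive example is scored
    above a random negative one (ties count as half a win)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def k_fold_indices(n_samples, k):
    """Yield (train, valid) index lists; each sample is validated exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, valid in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


def cross_validated_auc(X, y, fit, k=5):
    """Mean validation ROC-AUC over k folds. `fit(X_train, y_train)` must
    return a function that maps one feature row to a score."""
    aucs = []
    for train, valid in k_fold_indices(len(y), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores = [predict(X[i]) for i in valid]
        aucs.append(roc_auc([y[i] for i in valid], scores))
    return sum(aucs) / len(aucs)


def make_fit(weight):
    """Hypothetical stand-in for model training with one tunable parameter."""
    def fit(X_train, y_train):
        return lambda row: row[0] + weight * row[1]
    return fit


# Hypothetical toy data: two feature columns and binary collision labels.
X = [[0.2, 0.1], [0.6, 0.2], [0.4, 0.9], [0.5, 0.8],
     [0.7, 0.1], [0.3, 0.3], [0.1, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1, 0, 0, 1, 1]

cv_scores = {w: cross_validated_auc(X, y, make_fit(w), k=2) for w in (0.0, 1.0)}
best_weight = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_weight)
```

Each fold must contain both classes for ROC-AUC to be defined. Collisions are rare events, so a real pipeline would likely need stratified folds to guarantee this.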
Additionally, the public collision dataset from Ohio is used to validate the model performance: