Model name: Predictive Collision Risk (PCR) Safety Model for vehicles, drivers, and fleets.
Goal: Predict the probability of a crash and the number of crashes for a specific vehicle, driver, or fleet, based on historical driving behavior and collisions.
Base model: XGBoost Predictive Model
Model type: Supervised Machine Learning Model
Model version: 2.0
Developed by: Geotab Safety team
Primary intended uses:
Out-of-scope uses: Any intended use other than the primary intended uses.
Targeted users/User groups: The results of the PCR safety model are available to all vehicles/drivers and fleet managers (who have the appropriate clearance level to access the vehicles/vehicle groups) on the Safety page in MyGeotab or Drive App, with the following exceptions (both are very rare):
This section outlines the key aspects of the data used to develop and evaluate the model. We first describe the training and testing data, and then detail the data pipeline and preprocessing steps used to prepare the data for modeling. Lastly, we discuss the privacy considerations and protections implemented to ensure responsible handling of sensitive data.
The model is trained and tested on driving and collision data from 2022 to 2024, split into training and testing sets. The training and testing data contain the following features:
To generate the training and testing dataset for vehicle predictions, we use distance-based aggregation: at the end of each day, features such as harsh events, speeding events, and area risk are aggregated for each vehicle from trip data covering a specific historical distance. To improve efficiency and model performance, we down-sample the negative class (vehicles with no collisions). We highlight two aspects of preprocessing:
The driver PCR model is identical to the one for vehicles, which ensures that when a vehicle is associated with only one driver, the driver and the vehicle have the same predicted collision probability. For driver-specific features, daily vehicle-level trip data is associated with driver information and then aggregated at the driver level on a per-day basis. The driver's vehicle class is the vehicle class most frequently recorded for that driver.
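As an illustration of the driver-level aggregation described above, the following is a minimal sketch, not the production pipeline. The field names (driver, day, vehicle_class, harsh_events, distance_km) and the sample rows are hypothetical, and the tie-breaking rule for equally frequent vehicle classes is an assumption, since the text does not specify one.

```python
from collections import Counter, defaultdict

# Hypothetical daily vehicle-level rows, already tagged with a driver.
# Field names and values are illustrative assumptions only.
rows = [
    {"driver": "d1", "day": "2024-01-01", "vehicle_class": "van", "harsh_events": 2, "distance_km": 40.0},
    {"driver": "d1", "day": "2024-01-01", "vehicle_class": "truck", "harsh_events": 1, "distance_km": 10.0},
    {"driver": "d1", "day": "2024-01-02", "vehicle_class": "van", "harsh_events": 0, "distance_km": 55.0},
    {"driver": "d2", "day": "2024-01-01", "vehicle_class": "truck", "harsh_events": 3, "distance_km": 80.0},
]


def aggregate_by_driver_day(rows):
    """Sum vehicle-level features into one record per (driver, day)."""
    totals = defaultdict(lambda: {"harsh_events": 0, "distance_km": 0.0})
    for r in rows:
        key = (r["driver"], r["day"])
        totals[key]["harsh_events"] += r["harsh_events"]
        totals[key]["distance_km"] += r["distance_km"]
    return dict(totals)


def driver_vehicle_class(rows):
    """Pick the most frequently recorded vehicle class for each driver.

    Assumption: ties go to the class seen first in the input.
    """
    counts = defaultdict(Counter)
    for r in rows:
        counts[r["driver"]][r["vehicle_class"]] += 1
    return {d: c.most_common(1)[0][0] for d, c in counts.items()}


daily = aggregate_by_driver_day(rows)
classes = driver_vehicle_class(rows)
```

In this sketch, driver d1 drove two vehicles on 2024-01-01, so that day's record sums the features of both, and d1's vehicle class resolves to "van" because it appears in two of the three rows.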
The fleet collision predictions are derived from a weighted average of vehicle collision predictions, with greater weight given to vehicles that have traveled a larger distance recently.
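The weighting idea can be sketched as follows. The exact weighting scheme is not stated here, so this example assumes, purely for illustration, that each vehicle's weight is proportional to its recent distance traveled. The function name, the argument names, and the fallback to a plain average when no distance is recorded are all hypothetical.

```python
def fleet_collision_probability(vehicle_probs, recent_distance):
    """Distance-weighted average of per-vehicle collision probabilities.

    vehicle_probs:   {vehicle_id: predicted collision probability}
    recent_distance: {vehicle_id: distance traveled recently}
    Assumption: weight is proportional to recent distance.
    """
    if not vehicle_probs:
        raise ValueError("fleet has no vehicles")
    total = sum(recent_distance[v] for v in vehicle_probs)
    if total == 0:
        # Assumed fallback: unweighted mean when no vehicle has moved.
        return sum(vehicle_probs.values()) / len(vehicle_probs)
    weighted = sum(vehicle_probs[v] * recent_distance[v] for v in vehicle_probs)
    return weighted / total


probs = {"v1": 0.10, "v2": 0.30}
distance = {"v1": 300.0, "v2": 100.0}
fleet_prob = fleet_collision_probability(probs, distance)
```

Under these assumed inputs, v1 carries three times the weight of v2, so the fleet value lands close to 0.15 rather than the unweighted mean of 0.20.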
In this section, we highlight the ethical challenges we faced during model development, including bias and fairness considerations, and present how we addressed them. We also state the assumptions and constraints of the model, including any limitations in the data or in the model's scope that could affect its performance. Understanding the model's strengths and limitations is crucial for stakeholders to use the model responsibly and interpret its results correctly.
Here's how we address the identified risks:
Model performance is assessed using a test dataset derived from historical driving and collision records. For each vehicle and prediction day, the evaluation determines if a collision occurred within a defined time window following the prediction date. The actual collision occurrence (binary outcome) and the model's predicted probability of collision are then used to compute the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), which serves as the primary performance metric.
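The evaluation described above can be sketched in two steps: label each prediction by whether a collision fell inside the window, then compute AUC-ROC from labels and predicted probabilities. This is a minimal pure-Python illustration, not the team's evaluation code. The window length (window_days), the function names, and the sample values are hypothetical, and the AUC is computed with the standard pairwise definition (the fraction of positive/negative pairs that the model ranks correctly, counting ties as half).

```python
from datetime import date, timedelta


def collided_within_window(prediction_day, collision_days, window_days):
    """Return 1 if any collision falls in (prediction_day, prediction_day + window]."""
    end = prediction_day + timedelta(days=window_days)
    return int(any(prediction_day < c <= end for c in collision_days))


def roc_auc(labels, scores):
    """AUC-ROC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical example: one vehicle, a 30-day window (assumed value).
label = collided_within_window(date(2024, 3, 1), [date(2024, 3, 20)], window_days=30)

# Hypothetical binary outcomes and predicted probabilities.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based computation or an established library routine would be the practical choice on a full test set.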