Model Overview

Model name: Predictive Collision Risk (PCR) Safety Model for vehicles, drivers, and fleets.

Goal: Predict the probability of a crash and the number of crashes for a specific vehicle, driver, or fleet, based on their historical behaviors and collisions.

Base model: XGBoost Predictive Model

Model type: Supervised Machine Learning Model

Model version: 2.0

Developed by: Geotab Safety team

Intended Use

Primary intended uses:

  • Users can apply the model's predictive analytics to gain insight into collision risk and driving behavior, helping them anticipate and prevent collisions for their vehicles, drivers, and fleets.

Out-of-scope uses: Any intended use other than the primary intended uses.

Targeted users/User groups: The results of the PCR safety model are available for all vehicles and drivers, and are visible to drivers and to fleet managers (who have the appropriate clearance level to access the vehicles/vehicle groups) on the Safety page in MyGeotab or the Drive App, with the following exceptions (both are very rare):

  • The device has been terminated.
  • The driver is no longer an active user in the database.

Data

This section outlines the key aspects of the data used to develop and evaluate the model. We first describe the training and testing data, and then detail the data pipeline and preprocessing steps used to prepare the data for modeling. Lastly, we discuss the privacy considerations and protections implemented to ensure responsible handling of sensitive data.

Description of training and testing data

The model is trained and tested using a dataset of driving and collision data from 2022 to 2024, which was split into training and testing sets. The training and testing data contain the following features:

  • Harsh Event: There are 3 types of harsh events: harsh acceleration (rapidly speeding up), harsh braking (slamming on brakes), and harsh cornering (taking turns too fast). These are detected using GPS signals, and the counts are collected during a predefined time period along with the total distance traveled over the same period.
  • Speeding Event: Instances where vehicles or drivers exceed the speed limit on designated road segments. These are detected using either raw GPS data or interpolated and snapped GPS data, and include details such as speeding distance, duration, and severity.
  • Area Risk: A summary of how risky a vehicle's or driver's past trips were, based on the road segments traveled. This feature was developed because different areas can have different risk factors that contribute to collisions, such as severe congestion, complicated vehicle types, merging roads, or even extreme weather.
  • Vehicle Class: A categorical variable which classifies vehicles based on their vehicle types (e.g., truck, passenger, etc.) and their weight classes.
  • Collisions (dependent variables): AI-detected collision events, scored using a combination of a heuristic rule-based model and a Deeptab model. A threshold is applied to the predicted collision scores so that only events we are highly confident are actual collisions are used.
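
The thresholding step described for the collision labels can be illustrated with a minimal sketch. The threshold value, the `DetectedEvent` type, and the function name are all hypothetical; this document does not state the actual cut-off or data structures.

```python
from dataclasses import dataclass

# Hypothetical cut-off; the actual threshold is not stated in this document.
SCORE_THRESHOLD = 0.9


@dataclass
class DetectedEvent:
    vehicle_id: str
    collision_score: float  # score produced by the upstream detection models


def high_confidence_collisions(events, threshold=SCORE_THRESHOLD):
    """Keep only events whose collision score meets or exceeds the threshold."""
    return [e for e in events if e.collision_score >= threshold]


events = [
    DetectedEvent("v1", 0.97),
    DetectedEvent("v2", 0.55),
    DetectedEvent("v3", 0.91),
]
kept = high_confidence_collisions(events)
print([e.vehicle_id for e in kept])
```

A stricter threshold trades fewer false collision labels for fewer positive training examples overall.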

Data pipeline and preprocessing

To generate the training and testing dataset for vehicle predictions, distance-based aggregation is employed. At the end of each day, features such as harsh events, speeding events, and area risk are aggregated for each vehicle from trip data covering a specific historical distance. To improve efficiency and model performance, we down-sampled the negative class (vehicles with no collisions). We highlight two aspects of preprocessing:

  • Harsh and Speeding Events Normalization: The raw counts of harsh and speeding events are divided by the total distance traveled within the same period. This normalization is essential for fair comparisons across vehicles. Without it, raw counts would be misleading: a vehicle with minimal travel (e.g., 1 kilometer last month) and a vehicle with extensive travel (e.g., 100 kilometers last month) would appear equally risky if both recorded the same number of harsh events, even though the first vehicle's event rate per kilometer is 100 times higher.
  • VIN Validity: To address situations where a GO device is moved between vehicles with different Vehicle Identification Numbers (VINs), only driving data associated with the device's most recent VIN is used. This strategy maximizes data coverage and maintains data accuracy. For example, if a device was recently moved to a vehicle with VIN A (say, two days ago), replacing its previous association with a vehicle with VIN B, then only data from VIN A is considered for model training and testing.
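
The two preprocessing steps above can be sketched as follows. This is an illustration under assumptions, not the production pipeline: the record layout (`date`, `vin`, `harsh_events`, `distance_km`) and both function names are hypothetical, and the handling of zero distance is a choice made for this example.

```python
def events_per_km(event_count, distance_km):
    """Return events per kilometer, or None when no distance was traveled."""
    if distance_km <= 0:
        return None
    return event_count / distance_km


def latest_vin_records(records):
    """Keep only the records whose VIN matches the device's most recent record."""
    if not records:
        return []
    latest = max(records, key=lambda r: r["date"])  # ISO dates sort correctly as text
    return [r for r in records if r["vin"] == latest["vin"]]


# Hypothetical daily records for one device that moved from VIN "B" to VIN "A".
records = [
    {"date": "2024-05-01", "vin": "B", "harsh_events": 4, "distance_km": 80.0},
    {"date": "2024-05-09", "vin": "A", "harsh_events": 2, "distance_km": 20.0},
    {"date": "2024-05-10", "vin": "A", "harsh_events": 1, "distance_km": 10.0},
]

kept = latest_vin_records(records)
rate = events_per_km(
    sum(r["harsh_events"] for r in kept),
    sum(r["distance_km"] for r in kept),
)
print(len(kept), rate)
```

With the same five events, `events_per_km(5, 1.0)` and `events_per_km(5, 100.0)` differ by a factor of 100, which is the distinction that raw counts hide.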

The driver PCR model is identical to the vehicle model, which ensures that when a vehicle is associated with only one driver, the driver and the vehicle have the same predicted collision probability. For driver-specific features, daily vehicle-level trip data is associated with driver information and then aggregated at the driver level on a per-day basis. The driver's vehicle class is the vehicle class most frequently recorded for that driver.
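
Selecting the most frequently recorded vehicle class for a driver amounts to taking a mode. The sketch below is one way to do that; the function name and the class labels are hypothetical, and the tie-breaking rule (first class seen wins) is an assumption, since the document does not specify one.

```python
from collections import Counter


def driver_vehicle_class(recorded_classes):
    """Return the most frequently recorded class, or None if there are no records.

    Ties are resolved in favor of the class that appears first in the input,
    which is an assumption made for this example.
    """
    if not recorded_classes:
        return None
    return Counter(recorded_classes).most_common(1)[0][0]


# Hypothetical per-day vehicle classes for one driver.
history = ["truck", "passenger", "truck", "truck", "passenger"]
print(driver_vehicle_class(history))
```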

The fleet collision predictions are derived from a weighted average of vehicle collision predictions, with greater weight given to vehicles that have traveled a larger distance recently.
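
A distance-weighted average of this kind can be sketched as below. The document says only that vehicles with more recent distance receive greater weight; using the recent distance itself as the weight is an assumption, as are the function name and the input layout.

```python
def fleet_collision_probability(vehicles):
    """Distance-weighted average of vehicle collision probabilities.

    `vehicles` is a list of (predicted_probability, recent_distance_km) pairs.
    Returns None when the fleet has no recent distance to weight by.
    """
    total_distance = sum(distance for _, distance in vehicles)
    if total_distance <= 0:
        return None
    weighted_sum = sum(prob * distance for prob, distance in vehicles)
    return weighted_sum / total_distance


# Hypothetical fleet: the idle vehicle (0 km) does not influence the result.
fleet = [(0.02, 1000.0), (0.10, 250.0), (0.05, 0.0)]
fleet_probability = fleet_collision_probability(fleet)
print(fleet_probability)
```

Here the fleet value is approximately 0.036, closer to the high-mileage vehicle's 0.02 than an unweighted mean would be.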

Data privacy

  • All driver and company identifiers are anonymized to prevent identification. To identify potential privacy risks, all our projects go through a Privacy Risk Assessment (PRA) and an AI Risk Assessment (AIRA) as required.

Ethical Considerations, Assumptions, Constraints

In this section, we highlight ethical challenges encountered during model development, including bias and fairness considerations, and present our approaches to addressing them. We also state the assumptions and constraints of the model, including limitations in the data or in the model's scope that could affect its performance. Understanding these strengths and limitations is crucial for stakeholders to use the model responsibly and interpret its results correctly.

Risks in training

  • Data Representation Bias: The training data, model development, and resulting risk predictions are based solely on data acquired from Geotab's commercial customer base.
  • Inherited Bias from Area Risk Model: The aggregated area risk feature leverages the output of an area risk model which estimates the level of risk for each road segment. Any biases present in this upstream model could potentially propagate into the PCR Safety model. For instance, areas with sparse data or limited visitation might be inaccurately assessed.
  • Inherited Bias from Collision Detection Model: The collision detection algorithm's outputs, which serve as the ground truth for collisions, may introduce inherent bias. If certain collision types are systematically under-detected, the PCR Safety model's performance for those specific collision types might be compromised.

Data bias handling

Here's how we address the identified risks:

  • Mitigation of bias from data representation: Geotab's broad and diverse customer base, spanning numerous vehicle types and operational patterns, aids in generalizing insights and reducing bias. We continuously explore and integrate new features to enable the model to discern nuanced relationships between risk and driving behavior for each distinct vehicle category.
  • Mitigation of bias carried over from the area risk model: Potential bias from the area risk model is mitigated by analyzing data aggregated over extended periods instead of granular daily or hourly views. Furthermore, the area risk assessments are validated against publicly available records to ensure comparability and minimize inaccuracies.
  • Mitigation of bias carried over from the collision detection model: The collision detection model is validated against both internal and external data sources. These sources include claims data from various customers and publicly accessible collision datasets, such as those published by government entities (for example, the State of Ohio).

Model assumptions and constraints

  • Assumptions: It is assumed that a vehicle, driver, or fleet's future risk of collision is correlated with its past driving behavior and exposure to contextual risk.
  • Constraints: Although the PCR model is predictive, it captures correlations between past behavior and collisions, not necessarily causal relationships.

Evaluation Metrics

Model performance is assessed using a test dataset derived from historical driving and collision records. For each vehicle and prediction day, the evaluation determines if a collision occurred within a defined time window following the prediction date. The actual collision occurrence (binary outcome) and the model's predicted probability of collision are then used to compute the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), which serves as the primary performance metric.
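
The evaluation described above can be illustrated with a small, dependency-free sketch. The window length, the function names, and the sample data are hypothetical; the AUC is computed from its pairwise definition (the probability that a randomly chosen positive example scores higher than a randomly chosen negative one), which is suitable for illustration but not for large datasets.

```python
from datetime import date, timedelta


def collision_in_window(prediction_date, collision_dates, window_days):
    """Return 1 if any collision falls within the window after the prediction date."""
    window_end = prediction_date + timedelta(days=window_days)
    return int(any(prediction_date < d <= window_end for d in collision_dates))


def auc_roc(labels, scores):
    """Pairwise AUC-ROC: share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return correct / (len(positives) * len(negatives))


# Hypothetical 30-day window and toy predictions for four vehicle-days.
label = collision_in_window(date(2024, 1, 1), [date(2024, 1, 20)], window_days=30)
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = auc_roc(labels, scores)
print(label, auc)
```

In this toy data, three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.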