Model Overview

Model name: Area Based Risk Model

Goal: This model predicts the hourly probability of a collision occurring on a road segment, as well as the probability of a vehicle experiencing a collision while traveling through that segment.

Base model: XGBoost Predictive Model

Model type: Supervised Machine Learning Model

Model version: 1

Developed by: Geotab Safety team

Intended Use

Primary intended uses:

  • Predict the collision risk of road segments at specific historical times, which helps assess the collision risk for Geotab customers' assets more accurately.

Out-of-scope uses: The model does not support forecasting future area risk. Uses other than the primary use case are also out of scope and not recommended.

Targeted users/User groups: This model is primarily designed for Geotab to improve its safety products for its customers by providing insights on area-based risk to better predict the collision risk of vehicles and drivers.

Data

This section outlines the key aspects of the data used to develop and evaluate the model. We first describe the training and testing data, and then detail the data pipeline and preprocessing steps used to prepare the data for modeling. Lastly, we discuss the privacy considerations and protections implemented to ensure responsible handling of sensitive data.

Description of training and testing data

The dataset used for training and testing includes several features designed to predict area-based collision risk. These features include:

  • Basic information about road segments: Provides fundamental information about road segments being analyzed, such as length, road type, and speed limit.
  • Harsh events: There are three types of harsh events: harsh acceleration (rapidly speeding up), harsh braking (slamming on the brakes), and harsh cornering (taking turns too fast). A harsh event is identified when the measured acceleration exceeds the specified threshold.
  • Traffic conditions: Aggregated traffic information for each road segment.
  • Weather conditions: Area weather conditions including precipitation, temperature, and solar angles.
  • Intersection information: Provides the features of an intersection, including foundational information and the traffic conditions of roads it connects to.
  • Collisions (dependent variable): AI-detected collision transactions, which use a combination of a heuristic rule-based model and a Deeptab model to generate collision scores for events. A threshold is applied to the predicted collision scores so that only events we are highly confident are actual collisions are used.
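
The thresholding of collision scores described above can be sketched as follows. This is a minimal illustration, not the production logic: the threshold value, the field names (segment_id, timestamp, score), and the function name are hypothetical, since this document does not specify them.

```python
from datetime import datetime

# Hypothetical cutoff; the actual threshold is not stated in this document.
SCORE_THRESHOLD = 0.9


def positive_labels(events, threshold=SCORE_THRESHOLD):
    """Return the set of (segment_id, hour) pairs with a high-confidence collision.

    `events` is an iterable of dicts with hypothetical keys:
    "segment_id", "timestamp" (datetime), and "score" (collision score).
    """
    labels = set()
    for event in events:
        if event["score"] >= threshold:
            # Truncate the timestamp to the start of its hour.
            hour = event["timestamp"].replace(minute=0, second=0, microsecond=0)
            labels.add((event["segment_id"], hour))
    return labels
```

Every (road segment, hour) pair that is not in the returned set would be treated as a non-collision instance under this sketch.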

To optimize the model, feature selection is performed on these features to identify key features and remove unhelpful or redundant ones. Feature selection involves a combined analysis of feature clustering, feature importance, and pairwise feature correlation. Features with high importance are prioritized, and highly correlated or redundant features are removed. This process is iterative: candidate feature sets are tested in the model and refined until the final feature list is determined.
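
One way to combine feature importance with pairwise correlation is a greedy filter: visit features from most to least important and keep a feature only if it is not highly correlated with any feature already kept. The sketch below illustrates that idea only. It omits the clustering step, and the 0.9 correlation cutoff, the function names, and the input layout are assumptions rather than the team's actual procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def select_features(columns, importance, max_abs_corr=0.9):
    """Greedily keep important features that are not redundant with kept ones.

    columns: dict mapping feature name -> list of values
    importance: dict mapping feature name -> importance score
    max_abs_corr: hypothetical cutoff for "highly correlated"
    """
    ranked = sorted(columns, key=lambda name: importance[name], reverse=True)
    kept = []
    for name in ranked:
        redundant = any(
            abs(pearson(columns[name], columns[other])) >= max_abs_corr
            for other in kept
        )
        if not redundant:
            kept.append(name)
    return kept
```

In an iterative workflow, the returned list would be used to retrain the model, and the importance scores and cutoff would be revisited based on the resulting performance.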

Data pipeline and preprocessing

The data pipeline includes the following key preprocessing steps:

  • Because area-based risk is assessed for each road segment, features are mapped to the corresponding road segment (edge level) for each hour.
  • To improve training efficiency and the model's predictive power, the training data is downsampled to reduce the number of instances representing road segments that did not experience a collision.
  • For predicting individual vehicle collision probability, the data undergoes resampling: it is first transformed from edge-level to vehicle-level, and then downsampled.
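
The downsampling of non-collision instances can be sketched as below. This is a simplified illustration under stated assumptions: the "collision" field name, the ratio of negatives kept per positive, and the fixed seed are hypothetical, and the document does not state the sampling ratio actually used.

```python
import random


def downsample_negatives(rows, negatives_per_positive=10, seed=0):
    """Keep every collision row and a random subset of non-collision rows.

    rows: list of dicts with a hypothetical "collision" key (1 or 0)
    negatives_per_positive: hypothetical target ratio of negatives to positives
    """
    positives = [row for row in rows if row["collision"] == 1]
    negatives = [row for row in rows if row["collision"] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n_keep = min(len(negatives), negatives_per_positive * len(positives))
    return positives + rng.sample(negatives, n_keep)
```

For example, 5 collision rows and 1,000 non-collision rows would be reduced to 5 collision rows and 50 non-collision rows with the default ratio.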

Data privacy

  • Identifiable features such as harsh events are properly aggregated to mitigate privacy risk.

Ethical Considerations, Assumptions, Constraints

In this section, we highlight the ethical challenges we faced during model development, including bias and fairness considerations, and describe how we addressed them. We also state the assumptions and constraints of the model, including limitations in the data or the model's scope that could affect its performance. Understanding these strengths and limitations is crucial for stakeholders to use the model responsibly and interpret its results correctly.

Risks in training

  • Data imbalance: The collision instances (positive labels) are rare compared to the non-collision instances (negative labels).
  • Bias carried over from the collision detection model: The outcome of our proprietary collision detection algorithm is used as the ground truth for collisions. As a result, any inherent bias in the collision detection algorithm may be carried over into this model.

Data bias handling

  • Mitigation of data imbalance: Negative instances are downsampled to create a more balanced dataset for training and testing.
  • Mitigation of bias carried over from the collision detection model: The area-based risk model is validated against both internal and external data sources, including claims data provided by various customers and public collision datasets published by governments, such as the State of Ohio.

Model assumptions and constraints

  • Assumptions: It is assumed that the outcome of our proprietary collision detection algorithm is the ground truth for collisions.
  • Constraints: The model only uses the existence of collisions as the dependent variable instead of their nuanced classifications.

Evaluation Metrics

After gathering all features, the optimal model parameters are determined by K-fold cross-validation. The primary evaluation metric is the area under the receiver operating characteristic curve (ROC-AUC), which measures how well the model ranks collision instances above non-collision instances across all classification thresholds.

  • The estimated hourly probability of a collision occurring on a road segment reaches 81.1% ROC-AUC.
  • The estimated hourly probability of a vehicle experiencing a collision while traveling through a road segment reaches 76.1% ROC-AUC.
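
For readers unfamiliar with the metric, ROC-AUC equals the probability that a randomly chosen collision instance receives a higher predicted score than a randomly chosen non-collision instance, with ties counted as half. The sketch below computes it directly from that definition. It is an illustrative pairwise implementation, not the evaluation code used for the results above, and it is quadratic in the number of instances, so a library implementation would be preferable on real data.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: list of 1 (collision) or 0 (no collision)
    scores: list of predicted probabilities, same length as labels
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # ties count as half
    return correct / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a reported 81.1% ROC-AUC means that about 81% of such pairs are ordered correctly.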

Additionally, the public collision dataset from Ohio is used to validate the model performance:

  • The estimated hourly probability of a collision occurring on a road segment reaches 78.2% ROC-AUC.
  • The estimated hourly probability of a vehicle experiencing a collision while traveling through a road segment reaches 75.1% ROC-AUC.