Model Overview

Model name: EV Range Estimation Model

Goal: To evaluate which Electric Vehicles (EVs) could replace an Internal Combustion Engine (ICE) vehicle for a given duty cycle.

Base model: Linear Regression

Model type: Supervised

Model version: 1.0 in production

Developed by: Geotab Sustainability team

Intended Use

Primary intended uses:

  • A key principle in identifying suitable EV replacements lies in establishing a common metric between ICE vehicles and EVs: energy utilization. This model is designed to estimate the energy consumption of a specific EV model over a given trip or duty cycle. This analysis helps evaluate the feasibility of fleet electrification.

Out-of-scope uses: The model is specifically designed and trained using data from light-duty vehicles, such as passenger cars and delivery vans. It is not suitable for heavy-duty vehicles.

Target users/User groups: Fleet managers.

Data

This section outlines the key aspects of the data used to develop and evaluate the model. We first describe the training and testing data, and then detail the data pipeline and preprocessing steps used to prepare the data for modeling. Lastly, we discuss the privacy considerations and protections implemented to ensure responsible handling of sensitive data.

Description of training and testing data

The EV energy estimation model works by using linear regression on historical trip data to predict the energy consumption of EVs (in kWh) based on key trip characteristics and environmental factors. It takes specific inputs such as total trip distance, total trip duration, average trip speed, and average outside air temperature.

  • The total distance traveled during a trip is determined using GPS data. However, if the GPS data conflicts with the odometer or speed trace data, the odometer or speed trace data is used instead. (See the next section for details.)
  • The total duration of a trip is calculated by subtracting the trip's start time from its end time.
  • Average trip speed is calculated by dividing the distance traveled by the time spent driving, excluding idle time.
  • The average outside air temperature is determined using a combination of data from multiple vehicles in the same geographic area and the weather conditions during the trip.
  • The actual energy consumed during a trip is calculated from data recorded by the installed GO device.
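
The duration and average-speed definitions above can be sketched as follows. This is a minimal illustration, not the production pipeline: the function name `trip_features`, its parameters, and the output keys are hypothetical, and the idle time is assumed to be already known for the trip.

```python
from datetime import datetime


def trip_features(start, end, distance_km, idle_seconds):
    """Derive trip duration and average driving speed for one trip.

    All names here are illustrative assumptions, not pipeline identifiers.
    """
    # Total duration: end time minus start time.
    duration_s = (end - start).total_seconds()
    # Average speed uses driving time only, so idle time is excluded.
    driving_s = duration_s - idle_seconds
    if driving_s <= 0:
        raise ValueError("trip has no driving time")
    return {
        "duration_h": duration_s / 3600.0,
        "avg_speed_kmh": distance_km / (driving_s / 3600.0),
    }


features = trip_features(
    start=datetime(2024, 1, 1, 8, 0, 0),
    end=datetime(2024, 1, 1, 9, 0, 0),
    distance_km=30.0,
    idle_seconds=900,
)
```

In this example the trip lasts one hour with 15 minutes of idling, so the 30 km are averaged over 45 minutes of driving rather than over the full hour.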

The total energy consumption is assumed to be a weighted sum of these components: energy to overcome friction, energy to overcome drag, baseline auxiliary power (lights, seat heating, etc.), and energy to power HVAC. Different inputs are used to estimate each part. The linear regression model approximates how much each component contributes to the total energy consumption.
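
To make the weighted-sum idea concrete, here is a hedged sketch of how the four components could be turned into regression features. The exact feature definitions are not given in this document, so the choices below are assumptions based on basic physics: friction energy scaling with distance, drag energy scaling with distance times average speed squared, auxiliary energy scaling with duration, and HVAC energy scaling with duration times the gap between outside temperature and a hypothetical comfort temperature. The weights are illustrative placeholders, not fitted values.

```python
def component_features(distance_km, duration_h, avg_speed_kmh, temp_c,
                       comfort_temp_c=20.0):
    """Build one feature per energy component (assumed forms, see lead-in)."""
    return [
        distance_km,                                # friction ~ distance
        distance_km * avg_speed_kmh ** 2,           # drag ~ distance * speed^2
        duration_h,                                 # baseline auxiliary ~ time
        duration_h * abs(temp_c - comfort_temp_c),  # HVAC ~ time * temp gap
    ]


def predict_energy_kwh(weights, features):
    """Total energy as a weighted sum of the component features."""
    return sum(w * f for w, f in zip(weights, features))


# Placeholder weights for illustration only; real values come from the fit.
example_weights = [0.10, 1.0e-5, 0.5, 0.05]
x = component_features(distance_km=30.0, duration_h=1.0,
                       avg_speed_kmh=40.0, temp_c=0.0)
energy = predict_energy_kwh(example_weights, x)
```

Because the prediction is linear in the weights, each fitted weight can be read as the contribution of its component per unit of the corresponding feature.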

The data is split into a 75% training set and a 25% testing set through random selection.
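
A random 75/25 split followed by a least-squares fit might look like the sketch below. It uses only the Python standard library and synthetic trips generated from known weights; the helper names, the two-feature setup, and the fixed seed are assumptions made for illustration, and the production model may well use a library implementation instead.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Randomly split rows into (train, test); 25% is held out by default."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic trips: (distance_km, duration_h, energy_kwh) from known weights.
true_weights = [0.18, 0.40]
trips = []
for i in range(40):
    distance_km = 5.0 + i
    duration_h = 0.2 + 0.03 * ((i * 7) % 11)
    energy_kwh = true_weights[0] * distance_km + true_weights[1] * duration_h
    trips.append((distance_km, duration_h, energy_kwh))

train, test = train_test_split(trips)
weights = fit_least_squares([[d, t] for d, t, _ in train],
                            [e for _, _, e in train])
```

Since the synthetic energy values are noise-free, the fitted weights should match the generating weights up to floating-point error; real trip data would not fit this cleanly.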

Data pipeline and preprocessing

The pipeline for the input data is broken up into five stages, each capturing specific data points which are used in the Energy Prediction Model:

  • Stage 1 - Ignition Event Logging: The initial stage captures ignition events for any device that is identified to be installed in an EV.
  • Stage 2 - Engine Status: Engine data is collected and joined to the specific ignition event captured in Stage 1. Status signals for the EVs include battery signals and odometer data.
  • Stage 3 - GPS Summary: GPS data is joined to the ignition events from Stage 1 and then used to determine trip information for the EVs. This data is processed to determine the trip's start and end location, distance travelled, and idling duration. The distance is calculated using both straight-line and speed-integral methods, which are then compared to odometer readings for error correction. The idling duration of a trip is the total time during which the vehicle's speed was below 2 km/h. Periods where the speed only briefly drops below 2 km/h are excluded.
  • Stage 4 - Temperature: Since not all devices report ambient temperature, we use a combination of data from multiple vehicles in the same geographic area and temperature records from internal tables (computed by a separate pipeline) to determine the temperatures at the start and end of a trip.
  • Stage 5 - Push Events: Data from stages 1 through 4 are combined to generate the events that are inserted into the final destination table. This stage includes additional logic to accurately determine trip distance. While odometer data provides a reference, its 1 km logging interval limits its precision. The logic prioritizes GPS-derived distance, maintaining consistency with MyGeotab. Speed-derived distance is used as a secondary source when GPS data is unreliable.
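
The idling rule from Stage 3 can be illustrated with a short sketch. The input format (time-ordered `(timestamp_s, speed_kmh)` samples) and the `min_idle_s` cutoff are assumptions: this document does not state how long a "brief" drop below 2 km/h is, so the 10-second default is purely hypothetical.

```python
def idle_duration_s(samples, speed_threshold_kmh=2.0, min_idle_s=10.0):
    """Sum the time spent below the speed threshold, skipping brief drops.

    samples: time-ordered list of (timestamp_s, speed_kmh) tuples.
    min_idle_s: hypothetical cutoff; shorter slow runs are not counted.
    """
    total = 0.0
    run_start = None
    for t, speed in samples:
        if speed < speed_threshold_kmh:
            if run_start is None:
                run_start = t  # a slow run begins here
        elif run_start is not None:
            run = t - run_start  # the slow run ends at this sample
            if run >= min_idle_s:
                total += run
            run_start = None
    # Close a slow run that lasts until the final sample.
    if run_start is not None:
        run = samples[-1][0] - run_start
        if run >= min_idle_s:
            total += run
    return total


samples = [(0, 30), (10, 1), (15, 25), (20, 0), (50, 0),
           (80, 20), (90, 0), (120, 0)]
idle_s = idle_duration_s(samples)
```

In the sample trace, the 5-second dip starting at t=10 is ignored as a brief drop, while the two longer stops are counted.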

Selection criteria for distance are as follows:

  • GPS distance is selected if the value is within 2 km of odometer distance.
  • Speed distance is selected if the value is within 2 km of odometer distance and GPS distance is invalid.
    • If neither distance is within 2 km of the odometer distance, and the average trip speed (computed from either the GPS or the speed distance) is below 160 km/h, then the corresponding GPS or speed distance is selected.
    • Speed-based distance is used if the average trip speed is below the acceptable threshold.
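
One possible reading of these selection criteria is sketched below. The order of the fallback checks, the use of `None` to mark an invalid or missing distance, and all function and parameter names are assumptions made for illustration; the actual Stage 5 logic may differ in detail.

```python
def select_distance_km(gps_km, speed_km, odometer_km, driving_hours,
                       tolerance_km=2.0, max_avg_speed_kmh=160.0):
    """Pick a trip distance and report its source ("gps", "speed", "none").

    None marks an invalid or missing distance (an assumption of this sketch).
    """
    def near_odometer(d):
        return (d is not None and odometer_km is not None
                and abs(d - odometer_km) <= tolerance_km)

    def plausible(d):
        # Implied average speed must stay below the threshold.
        return (d is not None and driving_hours > 0
                and d / driving_hours < max_avg_speed_kmh)

    # 1. GPS distance is preferred when it agrees with the odometer.
    if near_odometer(gps_km):
        return gps_km, "gps"
    # 2. Otherwise, speed distance when it agrees with the odometer.
    if near_odometer(speed_km):
        return speed_km, "speed"
    # 3. Neither agrees: fall back on the average-speed sanity check.
    if plausible(gps_km):
        return gps_km, "gps"
    if plausible(speed_km):
        return speed_km, "speed"
    return None, "none"


agrees = select_distance_km(30.4, 29.0, 30.0, driving_hours=1.0)
gps_off = select_distance_km(45.0, 31.0, 30.0, driving_hours=1.0)
fallback = select_distance_km(500.0, 40.0, 30.0, driving_hours=1.0)
```

Returning the source label alongside the value makes it easy to audit how often each fallback is used.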

Data privacy

  • No Personally Identifiable Information (PII) is used during training, evaluation, or in production.

Ethical Considerations, Assumptions, Constraints

In this section, we highlight some ethical challenges encountered during model development, including bias and fairness considerations. We also state the model's assumptions and constraints, including any limitations in the data or the model's scope that could affect its performance. Understanding these strengths and limitations is crucial for stakeholders to use the model responsibly and interpret its results correctly.

Risks in training

  • Geographic Bias: The training data has been historically skewed towards Europe and the US due to a higher volume of EV data from those regions. This could introduce geographic biases in the model's predictions. After regionalization, a process for combining data from different regions for training will be necessary.
  • Make-models: Different make-models use different amounts of energy. For example, a BMW i7's energy consumption should ideally not affect the energy estimate for a Kia EV9.

Data bias handling

  • Vehicle Grouping: To handle any potential "make-model" bias, vehicles are grouped by make-model-generation for training, assuming similar energy consumption within a generation. This greatly improves the estimation accuracy for each make-model-generation.
  • Outlier Handling: The distance selection logic in the EVIgnitionEvents pipeline (Stage 5) helps clean out erroneous distance data by comparing GPS-based, speed-based, and odometer readings.
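
The grouping step can be pictured as partitioning trips by a make-model-generation key, so that each group gets its own fit. The sketch below is illustrative only: the dictionary field names and the generation labels are hypothetical, and the energy values are made-up placeholders.

```python
from collections import defaultdict


def group_trips(trips):
    """Group trip records by (make, model, generation)."""
    groups = defaultdict(list)
    for trip in trips:
        key = (trip["make"], trip["model"], trip["generation"])
        groups[key].append(trip)
    return dict(groups)


# Placeholder records; field names and values are assumptions.
trips = [
    {"make": "BMW", "model": "i7", "generation": "gen1", "energy_kwh": 9.1},
    {"make": "Kia", "model": "EV9", "generation": "gen1", "energy_kwh": 7.4},
    {"make": "BMW", "model": "i7", "generation": "gen1", "energy_kwh": 8.7},
]
groups = group_trips(trips)
```

A separate regression would then be trained on each group, so trips from one make-model-generation do not influence the weights of another.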

Model assumptions and constraints

Assumptions:

  • Vehicle mass remains constant throughout the trip or duty cycle (no change in cargo payload).
  • The EV is not charged during a trip; the ignition must be off before plugging in.

Constraints:

  • This version of the model uses average speed to estimate total drag, rather than incorporating more accurate drag calculations based on instantaneous speed changes.
  • The model accounts for baseline auxiliary power (like lights and radio) based on total trip duration, which is a simplification.
  • While some versions were tested with elevation data, the production version has limitations due to the granularity and availability of elevation data.
  • The v1 model doesn't fully capture the energy inefficiencies of frequent acceleration and deceleration. Future versions aim to incorporate this using second-by-second GPS data.
  • The model does not consider different road conditions. For example, it does not differentiate between wet, icy, or dry road surfaces.

Evaluation Metrics

Mean Absolute Error (MAE) and R-squared (R²) are used as evaluation metrics to assess the performance of the model on different make-models (e.g. Renault Zoe):

  • Mean Absolute Error (MAE) measures the average magnitude of the errors between the predicted energy consumption and the actual energy consumption. It provides a sense of the average absolute difference between the model's predictions and the observed values. A lower MAE value indicates a more accurate model.
  • R-squared (R²) indicates the goodness of fit of the model to the data. It represents the proportion of the variance in the dependent variable (energy consumption) that is predictable from the independent variables (such as total trip distance, average outside air temperature, etc.). The closer the R² value is to 1, the better the model fits the data.
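
Both metrics follow directly from their definitions. The sketch below computes them with the standard library; the function names and the sample values are illustrative and are not model results.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up energy values in kWh, for illustration only.
actual_kwh = [2.0, 4.0, 6.0, 8.0]
predicted_kwh = [2.5, 3.5, 6.5, 7.5]
mae = mean_absolute_error(actual_kwh, predicted_kwh)
r2 = r_squared(actual_kwh, predicted_kwh)
```

MAE is expressed in the same unit as the target (kWh here), which makes it easy to interpret, while R² is unitless.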

Performance is also assessed across different trip lengths (short, medium, long) to give a more balanced and nuanced view of accuracy.
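
A per-bucket breakdown could be computed as in the sketch below. The bucket boundaries (10 km and 50 km) are hypothetical, since this document does not define what counts as a short, medium, or long trip, and the records are made-up placeholders.

```python
from collections import defaultdict


def trip_length_bucket(distance_km, short_max_km=10.0, medium_max_km=50.0):
    """Assign a trip to a length bucket (boundaries are assumptions)."""
    if distance_km < short_max_km:
        return "short"
    if distance_km < medium_max_km:
        return "medium"
    return "long"


def mae_by_bucket(records):
    """Compute MAE per bucket from (distance_km, actual, predicted) tuples."""
    errors = defaultdict(list)
    for distance_km, actual, predicted in records:
        errors[trip_length_bucket(distance_km)].append(abs(actual - predicted))
    return {bucket: sum(e) / len(e) for bucket, e in errors.items()}


# Placeholder records: (distance_km, actual_kwh, predicted_kwh).
records = [
    (5.0, 1.0, 1.2),
    (8.0, 1.5, 1.4),
    (30.0, 6.0, 6.5),
    (120.0, 25.0, 23.0),
]
bucket_mae = mae_by_bucket(records)
```

Reporting the error per bucket helps reveal cases where a model looks accurate overall but performs poorly on, for example, very short trips.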