Recall Versus Precision In Machine Learning

In machine learning, recall is a performance metric that measures the fraction of truly positive examples that a model correctly predicts as positive; the denominator includes the positives the model missed (false negatives). It differs from precision, which is the fraction of examples predicted as positive that actually belong to the positive class. Recall is also known as the true positive rate (TPR), sensitivity (SEN), probability of detection, and hit rate.

Equation: Recall = true positives / (true positives + false negatives)

An example outside the realm of industry is sometimes illustrative. Say you are traveling for the holidays and need to bring gifts to specific family members while leaving behind other gifts intended for other recipients. For the purposes of this example:

  • True Positives (TP) = correctly remembered presents
  • True Negatives (TN) = correctly forgotten presents
  • False Negatives (FN) = incorrectly forgotten presents (forgot to bring the right presents)
  • False Positives (FP) = incorrectly remembered presents (brought wrong presents)

In this case, recall differs from precision in that it is measured against the gifts you needed to bring rather than against everything you packed. If you bring only three of the five gifts needed in the destination city, for example, then the formula would be:

Recall = TP / (TP + FN) = 3 / (3 + 2) = 60%
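
The same arithmetic can be sketched in a few lines of Python. The gift names below are hypothetical, and the single wrong gift ("socks") is an added assumption so that precision can be shown alongside recall; the original example only fixes three correct gifts out of five.

```python
# Hypothetical gift names, used only to illustrate the packing example.
needed = {"scarf", "book", "puzzle", "mug", "candle"}  # gifts that should come along
packed = {"scarf", "book", "puzzle", "socks"}          # gifts actually in the suitcase

true_positives = needed & packed    # correct gifts that were packed
false_negatives = needed - packed   # correct gifts left at home
false_positives = packed - needed   # wrong gifts that were packed (assumed)

recall = len(true_positives) / (len(true_positives) + len(false_negatives))
precision = len(true_positives) / (len(true_positives) + len(false_positives))

print(f"recall = {recall:.0%}")        # 3 / (3 + 2)
print(f"precision = {precision:.0%}")  # 3 / (3 + 1)
```

Note that packing every gift in the house would push recall to 100% while dragging precision down, which is why the two metrics are usually read together.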

In that sense, recall penalizes not bringing all of the correct gifts for your family while precision penalizes bringing the wrong gifts. Would your family maximize precision, even if that means getting their gift at a later date? Or would your family maximize recall to each have a gift to open, even if it is the wrong one? Below is a breakdown of what different scenarios might look like, from recall of zero to 100%.

When To Use Recall To Evaluate Performance

Whether to use recall to evaluate and monitor performance of a machine learning model or large language model application depends on the circumstances.

When to use recall in modern generative AI systems

When a large language model system is built to perform a classification task, recall is a useful metric to use in tandem with precision and F1 score (the harmonic mean of precision and recall). Often with LLM apps, only accuracy is used, but it can be a misleading metric when you need to optimize for specific business goals (e.g. minimizing false negatives in predicting data breaches) or when there is class imbalance in the data.
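
To see why accuracy alone can mislead under class imbalance, here is a minimal sketch with made-up labels: 95 negatives, 5 positives, and a classifier that always predicts the negative class. The helper name `confusion_counts` and the data are illustrative assumptions, not part of any particular library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up, heavily imbalanced data: only 5 of 100 examples are positive.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a trivial classifier that never predicts the positive class

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy = {accuracy:.2f}")  # looks strong
print(f"recall   = {recall:.2f}")    # every positive was missed
```

Accuracy rewards the classifier for the many easy negatives, while recall exposes that it never finds a single positive.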

Example of recall used in LLM task evaluation to reduce hallucinations

As always, it pays to be clear on the goals in using any given metric and ask pointed questions about when and why it’s being used.

When is recall preferred to other classification metrics?

As the equation above makes clear, recall has false negatives in the denominator, so it makes sense to use when false negatives are costly and you want to minimize them. A false negative is a type II error: the model predicts that a given condition does not exist when it does. A simple way to remember when to use recall is reCALL for insurance company CALL centers. If you build a model to predict fraudulent insurance claims, it is often more cost effective for the company to be conservative and over-predict on potential fraudulent claims. This is because if a claim is flagged as fraud, it will undergo further investigation as opposed to being paid out directly by the insurance company.

Paying out a false claim by someone committing insurance fraud is the worst-case scenario for the company because it is a direct loss and can encourage additional fraud. In this case, you would want to minimize these false negatives while maximizing the true positives (fraudulent claims that are correctly flagged). Note that false positives (FPs, or legitimate claims flagged as fraud) aren’t good either because they decrease customer satisfaction and increase customer churn. However, a false positive can be cleared up by assigning an experienced claims specialist who can gather additional evidence like security videos and a police report without inconveniencing the customer to the point that they switch providers.
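
One common way to be "conservative" in this sense is to lower the score threshold at which a claim is flagged. The sketch below uses invented fraud scores and labels (1 = fraudulent) and a hypothetical helper, `recall_and_precision`, to show how recall rises and precision falls as the threshold drops; it is an illustration under those assumptions, not a description of any insurer's actual process.

```python
def recall_and_precision(y_true, scores, threshold):
    """Flag claims whose score is at or above the threshold, then score the flags."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for t, f in zip(y_true, flagged) if t == 1 and f)
    fn = sum(1 for t, f in zip(y_true, flagged) if t == 1 and not f)
    fp = sum(1 for t, f in zip(y_true, flagged) if t == 0 and f)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Invented data: 1 = fraudulent claim, 0 = legitimate claim.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1, 0.55, 0.8]  # model's fraud scores

for threshold in (0.7, 0.5, 0.3):
    recall, precision = recall_and_precision(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  precision={precision:.2f}")
```

At the lowest threshold every fraudulent claim is caught, at the price of more legitimate claims being sent for review, which is the trade the paragraph above describes.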

What are your key performance indicators (KPIs)?

KPIs are related to, but distinct from, evaluation metrics. KPIs reflect business goals. While metrics like precision and recall give you insights into how the model is doing, KPIs give you insights into how models are impacting business results.

Based on KPIs, a hospital might opt for a metric like recall, which rewards minimizing false negatives, over a metric like precision, which rewards minimizing false positives, because it is better to catch every true diagnosis and run additional tests than to let someone with a serious condition leave thinking they are fine.

What are the costs to consider?

When you put a machine learning model into production there are various costs to consider. It’s important to know:

  1. Who pays for a negative?
  2. What is the cost of a positive?

For example, let’s say you have a click-through rate model to predict whether someone will click on an email to register for an event from your company and eventually be converted to a lead. The cost of sending an email is low and the reward of a true positive is very high, so it doesn’t matter much if you have some false positives or false negatives. This means precision would be a fine metric to use to evaluate performance.

On the other hand, if your model predicts which event registrants are high-value leads – and therefore should have additional support and resources allocated to them from the team – your cost considerations change. If you are allocating time and resources from your team to pursue a lead and the registrant has no interest in buying, then you are wasting money. At the same time, you need to be sure any registrant who is a potential buyer is given the best experience possible. For this type of cost-benefit relationship, precision is not as important as recall.

When Does Recall Fail?

To recap, recall should be used when you care about the number of positives the model misses. If you care about the positive events that were predicted but didn’t happen (false positives), however, then recall will not give you the performance insights you need, because false positives do not appear anywhere in its formula. If minimizing false positives is the priority, you should not rely on recall. In recommendation systems, for example, false negatives are less of a concern than false positives.

If you are monitoring a computer vision model or evaluating an LLM application focused on a task like diversity or user feedback, you may want to rely on metrics better suited to that particular task (e.g. the ROUGE metric for diversity). For a regression model, you might be more likely to use mean squared error (MSE), root mean squared error (RMSE) or mean absolute error (MAE) – while metrics like precision, recall, and F1 would be more common for a classification model.

The F-score is a good option because it combines precision and recall into a single number; the F1 score is their harmonic mean. The general F-beta score has common variants F0.5, F1, and F2, which differ in the relative weight given to precision versus recall: F0.5 weights precision more heavily, F1 weights the two equally, and F2 weights recall more heavily. So if precision or recall alone doesn’t take into account the necessary considerations when evaluating your model – and if you are dealing with a class imbalance – consider the F-score.
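
As a rough illustration of how the weighting works, the sketch below implements the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), in plain Python. The function name `f_beta` and the input values (precision 0.75, recall 0.60) are assumptions chosen for the example.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # defined as 0 when both precision and recall are 0
    return (1 + beta ** 2) * precision * recall / denominator

# Illustrative values only: a model with higher precision than recall.
precision, recall = 0.75, 0.60

for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(precision, recall, beta):.3f}")
```

Because precision is the stronger of the two numbers here, the score falls as beta grows and more weight shifts onto recall.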