Confronting The Boogeyman: How SolasAI’s Testing Software Works For Good

As the market for artificial intelligence continues to expand for both businesses and consumers, much of the conversation has centered on ChatGPT, chatbots, and other consumer-facing generative AI products. And recently, the conversation hasn’t been great.

Pieces like The New York Times’ recent conversation with Geoffrey Hinton and the Associated Press’ coverage of scientists warning about the dangers of AI seem to paint a one-sided — and largely negative — picture of the development of AI solutions.

While these concerns may be well-founded as more consumer AI innovations come to market, they create a false “boogeyman” that causes trepidation even as more businesses explore ways to incorporate AI into their practices and models. These articles and viewpoints also fail to recognize the trailblazing innovations that got us here, many of which are transforming business practices in ways that positively affect people every day.

We’re speaking from experience.

SolasAI’s testing software is built on 45 years of combined experience in measuring and mitigating discrimination, and we’re resolving disparities and biases in models that can make the difference between entire communities being included or passed over for resources like loans and healthcare plans — resources that can help families thrive or even save lives. AI isn’t scary; what’s scary is the thousands of people discriminated against by model bias that goes unidentified without proper testing software. That’s what keeps us up at night, and we’re here to help.

So, what’s under the hood of the SolasAI testing software that’s going to save lives?

First, let’s take a step back. In a typical governance process, models are tiered according to the level of legal, reputational, and regulatory risk that they face. Models in the top tier — including clinical outreach, underwriting, and pricing models — are, among other things, ones that have a known directional impact on consumers. This means that some people scored by the model receive a favorable outcome while others receive less favorable outcomes or are not given anything at all. These types of models can be contrasted with ones that do things like segment populations for differential, but non-rankable, treatments.

Take a model that sorts credit card customers into behavioral profiles as an example. This model splits people with high credit card usage into two groups: “travel aficionados” and “gourmands.” The travel aficionados receive travel-related offers based on their behavior, while the gourmands receive restaurant offers. It’s generally hard to rank whether a travel offer is favorable compared to a restaurant discount, so these models are generally assigned a lower risk tier and tend not to be tested for disparities.

Our testing capabilities focus on the first group of models: those with rankable outcomes. These can be binary “Yes/No” outcomes; ranked categorical outcomes like “High/Medium/Low”; or continuous outcomes, such as the output of a pricing model.

Although the options are metric-dependent, SolasAI’s Disparity & Bias Testing Library allows the user to specify whether a higher or lower value of the outcome variable is favorable in the modeling context. For example, a lower value is more favorable for a life insurance model predicting mortality, while a higher value is favorable for a credit line increase model predicting future card spending. Metrics available for binary outcomes include the adverse impact ratio, the relative false positive rate ratio, and several others.
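
To make the binary case concrete, here is a minimal sketch of how an adverse impact ratio might be computed, with a flag for whether a higher value is favorable. The function name, signature, and toy data are illustrative only — this is not the actual SolasAI API.

```python
# Illustrative sketch only -- not the SolasAI API. Computes an adverse
# impact ratio (AIR) for a binary outcome: the favorable-outcome rate for
# the protected group divided by the rate for the reference group.
import numpy as np

def adverse_impact_ratio(outcomes, group, protected, reference, higher_is_favorable=True):
    outcomes = np.asarray(outcomes)
    group = np.asarray(group)
    # If a lower value is favorable (e.g., predicted mortality), flip the outcome.
    favorable = outcomes if higher_is_favorable else 1 - outcomes
    prot_rate = favorable[group == protected].mean()
    ref_rate = favorable[group == reference].mean()
    return prot_rate / ref_rate

# Toy data: 1 = approved, 0 = denied
outcomes = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
group = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
print(adverse_impact_ratio(outcomes, group, protected="A", reference="B"))  # 0.75
```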

Categorical outcomes can be tested with the adverse impact ratio, which is useful for situations where customers are assigned to ordinally ranked categories like risk tiers. Continuous outcomes can be tested with the standardized mean difference, the residual standardized mean difference, or a quantile-based adverse impact ratio. Which test is appropriate typically depends on the model’s usage and regulatory expectations; however, regulators tend to focus on metrics like the adverse impact ratio or standardized mean difference rather than the confusion-matrix-based methods emphasized in academic articles.
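
For the continuous case, the standardized mean difference expresses the gap between group means in pooled-standard-deviation units. The sketch below is illustrative, not SolasAI’s implementation; sign conventions and pooling formulas vary by practitioner, and the premium figures are made up.

```python
# Illustrative sketch only -- not the SolasAI API. Standardized mean
# difference (SMD) for a continuous outcome such as a price or score.
import numpy as np

def standardized_mean_difference(outcomes, group, protected, reference):
    outcomes = np.asarray(outcomes, dtype=float)
    group = np.asarray(group)
    prot = outcomes[group == protected]
    ref = outcomes[group == reference]
    # Pool the two sample variances, then express the mean gap in SD units.
    pooled_sd = np.sqrt((prot.var(ddof=1) + ref.var(ddof=1)) / 2)
    return (prot.mean() - ref.mean()) / pooled_sd

# Toy example: annual premiums quoted to two groups
premiums = [520, 540, 610, 480, 500, 470, 455, 490, 465, 475]
group = ["A"] * 5 + ["B"] * 5
print(standardized_mean_difference(premiums, group, protected="A", reference="B"))
```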

Once the model has been assigned a risk tiering and a rankable outcome has been determined, one must define the groups of people to test for disparities — these are often referred to as protected groups or protected classes. Our customers make these decisions primarily as a function of regulatory requirements and data availability.

While many anti-discrimination statutes name numerous classes of people against whom a company cannot discriminate, oftentimes the company cannot access the demographic data required to test whether there are issues. For example, many local and state laws prohibit discrimination based on veteran status, but it is exceedingly rare for a company to have information about customers’ veteran status, so this group is not typically tested for disparities.

When AI models are at heightened risk of discrimination, the effects are felt by real people in their everyday lives. (Photo: Christopher Burns)

Most commonly, testing is done for discrimination by age, sex, and race or ethnicity. Even if a company does not collect this information about its customers, there are standard ways to estimate it, like using the U.S. Census Bureau’s gender first name table to estimate a person’s sex. There is also a methodology known as Bayesian Improved Surname Geocoding (BISG) that is used to estimate the probability that a person is Black, Hispanic, Asian, or non-Hispanic white. This method, while somewhat controversial, is commonly used and accepted by regulators.
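
The core idea of BISG can be sketched as a Bayes’-rule update: combine the surname-based and geography-based race/ethnicity distributions, adjust for the baseline population shares, and renormalize. The function and numbers below are purely illustrative; production implementations rely on Census surname and block-group tables and handle many edge cases this sketch ignores.

```python
# Illustrative sketch only -- not SolasAI's or any regulator's implementation.
def bisg_probabilities(p_race_given_surname, p_race_given_geo, p_race_baseline):
    """All inputs are dicts keyed by race/ethnicity category.
    Returns posterior probabilities that sum to one."""
    # Treat surname and geography as conditionally independent given race,
    # a standard simplification in the BISG literature, then renormalize.
    unnormalized = {
        r: p_race_given_surname[r] * p_race_given_geo[r] / p_race_baseline[r]
        for r in p_race_given_surname
    }
    total = sum(unnormalized.values())
    return {r: v / total for r, v in unnormalized.items()}

# Hypothetical surname-table, block-group, and national baseline distributions
surname = {"white": 0.60, "black": 0.25, "hispanic": 0.10, "asian": 0.05}
geo     = {"white": 0.30, "black": 0.50, "hispanic": 0.15, "asian": 0.05}
base    = {"white": 0.60, "black": 0.13, "hispanic": 0.19, "asian": 0.06}
print(bisg_probabilities(surname, geo, base))
```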

In many situations, it is impossible to test for race/ethnicity discrimination without using a proxy estimate because that data is rarely collected. SolasAI is, to our knowledge, the only software capable of performing disparity testing on estimated probabilities that a person is a member of a particular class. It is therefore well-suited for regulatory settings in a way that no other software can match.
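
One common way to handle estimated membership is to weight each person’s outcome by their membership probabilities rather than assigning them to a single group, as in the sketch below. This is a generic illustration of the idea, not necessarily the exact approach SolasAI uses, and the probabilities are made up.

```python
# Illustrative sketch only. Probability-weighted adverse impact ratio when
# group membership is estimated (e.g., BISG probabilities) rather than observed.
import numpy as np

def weighted_adverse_impact_ratio(favorable, p_protected, p_reference):
    """favorable: 1/0 outcomes; p_protected / p_reference: each person's
    estimated probability of belonging to the protected / reference group."""
    favorable = np.asarray(favorable, dtype=float)
    p_prot = np.asarray(p_protected, dtype=float)
    p_ref = np.asarray(p_reference, dtype=float)
    # Probability-weighted favorable-outcome rate for each group
    prot_rate = np.sum(p_prot * favorable) / np.sum(p_prot)
    ref_rate = np.sum(p_ref * favorable) / np.sum(p_ref)
    return prot_rate / ref_rate

favorable = [1, 0, 1, 1, 0, 1]
p_black   = [0.8, 0.7, 0.2, 0.1, 0.9, 0.3]
p_white   = [0.1, 0.2, 0.7, 0.8, 0.05, 0.6]
print(weighted_adverse_impact_ratio(favorable, p_black, p_white))
```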

Along with model tiering, an important part of any model governance program is setting what are known as “practical significance” standards. These are the guidelines that identify whether a potential disparity found in a model rises to a level worthy of further review. In employment settings, if the adverse impact ratio is greater than or equal to 0.80, the disparity is not considered “practically significant.” Barring any other reasons for concern, a model that cleared this threshold would not be subjected to further scrutiny. It is worth noting that different industries and different metrics require different practical significance thresholds.
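
Applied to the adverse impact ratio, a practical significance check is simply a threshold comparison, as in this toy illustration. The 0.80 value is the employment-law four-fifths convention; other industries and metrics use different thresholds, and the function name is hypothetical.

```python
# Illustrative sketch only. Flag a disparity for further review when the
# adverse impact ratio falls below the practical significance threshold.
def is_practically_significant(air, threshold=0.80):
    return air < threshold

print(is_practically_significant(0.85))  # False: clears the 0.80 threshold
print(is_practically_significant(0.72))  # True: warrants further review
```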

What we have seen in healthcare is that companies are adopting stricter standards than are used in employment law, which matches typical practices in lending as well. SolasAI allows users to specify practical significance thresholds for all metrics. This encourages standardized practices across the organization while still allowing for justified exceptions to policies.

In addition to practical significance thresholds, it is also common practice to incorporate statistical significance standards. Per U.S. Supreme Court precedent, a two-tailed 95 percent test is almost always used. SolasAI uses statistical significance tests that are commonly applied to the various metrics. When a metric does not have a closed-form statistical test, SolasAI uses a bootstrap approach to estimate statistical significance.
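
A bootstrap test of this kind can be sketched as follows: resample the data with replacement many times, recompute the metric on each resample, and check whether the resulting two-sided 95 percent interval excludes the parity value (1.0 for a ratio metric such as the adverse impact ratio). This is a generic illustration of the technique, not SolasAI’s implementation, and the data and metric function are toy placeholders.

```python
# Illustrative sketch only. Bootstrap confidence interval for a disparity
# metric that lacks a closed-form statistical test.
import numpy as np

def bootstrap_confidence_interval(metric_fn, outcomes, group, n_boot=2000,
                                  alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes)
    group = np.asarray(group)
    n = len(outcomes)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        estimates.append(metric_fn(outcomes[idx], group[idx]))
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Toy adverse impact ratio; parity corresponds to a value of 1.0, so a
# two-sided 95 percent interval that excludes 1.0 suggests significance.
air = lambda o, g: o[g == "A"].mean() / o[g == "B"].mean()
outcomes = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 20)
group = np.array((["A"] * 5 + ["B"] * 5) * 20)
print(bootstrap_confidence_interval(air, outcomes, group))
```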

Our metrics and methodologies have been tested exhaustively across industries and areas including employment law, fair lending law, fair housing law and litigation, life insurance, and health insurance. In time, we expect many of the same standards and practices we have used in these fields will also become best practices within the insurance industry.

AI doesn’t have to be scary. The framework is already there to help real people while expanding businesses in the process. The Disparity & Bias Testing Library is proof of concept, and SolasAI is here to be the solution, not add to the noise.