We recently examined how data scientists can better understand and work with attorneys. In this iteration, we flip the interaction — so that attorneys can better understand and communicate with data scientists.
We should know, after all. Our team members have worked as lawyers and data scientists. Plus, we partner with both audiences often. We’ve seen first-hand how attorneys and data scientists — when they do interact — often seem to talk past each other in spite of their collective skills at developing and deploying accurate, effective, ethical and incredibly valuable machine learning systems.
Data science is about the objective function: a metric the data scientist is trying to maximize or minimize. A classic example is a machine learning system that labels whether an image contains a cat or a dog. The only goal is to maximize the objective function: in this case, the accuracy with which the system labels cats and dogs.
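To make the idea concrete, here is a minimal sketch of accuracy as an objective function for the cat/dog labeler. The labels and predictions are invented for illustration:

```python
# Hypothetical example: accuracy as an objective function.
# These labels and predictions are made up for illustration.
true_labels = ["cat", "dog", "cat", "cat", "dog"]
predicted   = ["cat", "dog", "dog", "cat", "dog"]

def accuracy(truth, preds):
    """Fraction of labels the system got right -- the metric being maximized."""
    correct = sum(t == p for t, p in zip(truth, preds))
    return correct / len(truth)

print(accuracy(true_labels, predicted))  # 4 of 5 correct -> 0.8
```

Everything the data scientist does downstream, from choosing an algorithm to tuning it, is in service of pushing this one number higher.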
The data scientist needs two ingredients to build this function. First, they need data. Lots and lots of data. All machine learning algorithms — from the simplest approaches that have been around for decades to the latest cutting-edge AI approaches — require vast amounts of data to “learn.” Second, they need an algorithm, a set of steps that a computer follows to solve a problem. In its most basic form, think of an algorithm as a recipe — a list of instructions followed to create, for example, banana bread.
In the legal space, scores of algorithms exist to choose from: Logistic regression, XGBoost and deep neural networks are algorithms that lawyers working with data scientists regularly encounter. Once the data scientist chooses an algorithm, they can fine-tune numerous levers to minimize or maximize the objective function, making the algorithm as accurate as possible.
While labeling photos of cats and dogs is a staple of data science research, a more likely place for a lawyer to encounter a system developed by a data scientist is a credit model. Such a model may try to predict the likelihood of whether a borrower will repay a loan. The objective function is accuracy: minimizing incorrect predictions and maximizing the correct ones.
In reaching this goal, the data scientist leverages various data sources. They likely use structured data from the applicant's credit application and their credit file from the credit bureaus. Increasingly, data scientists are gaining access to (and in some cases, using) things like where an applicant went to school, what they studied, web browser history and even social media connections and interactions (broadly termed "alternative data"). Metadata, such as the type of device used to fill out an application, the presence of the applicant's name in their email address, and the use of proper punctuation, may also be considered.
The data scientist's primary goal is to improve the objective function, so they will leverage any and all data sources they can get their hands on that may help accurately predict whether an applicant will repay their loan. Once the data scientist gathers this data, they will select their algorithm and repeatedly tune it until finding a configuration that makes the most accurate predictions possible in the real world.
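A toy sketch of what "tuning a lever" looks like in practice, assuming a hypothetical credit model that outputs a repayment score: the data scientist sweeps the decision threshold and keeps whichever setting maximizes accuracy. The scores and outcomes below are invented:

```python
# Hedged sketch: tuning one "lever" (the decision threshold) of a toy
# credit model to maximize accuracy. All numbers are invented.
scores = [0.2, 0.4, 0.55, 0.7, 0.9, 0.35, 0.8]   # model's repayment scores
repaid = [0,   0,   1,    1,   1,   0,    1]      # 1 = loan was repaid

def accuracy_at(threshold):
    """Accuracy if we approve every applicant scoring at or above threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, repaid)) / len(repaid)

# Sweep the lever and keep the best-scoring configuration.
best = max((accuracy_at(t / 100), t / 100) for t in range(0, 100, 5))
print(best)
```

Real tuning involves many levers at once (hyperparameters, feature sets, algorithms), but the loop is conceptually the same: try configurations, keep the one that scores best on the objective function.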
It's also important to note that machine learning algorithms are so complex that data scientists often do not actually know how the algorithms are working. The models are virtual black boxes. But this doesn't bother the data scientist, because these black boxes yield more accurate predictions.
All of this is driven by a relentless focus on the data scientist’s objective function: accuracy.
The lawyer has a different objective function: minimizing things like litigation risk, regulatory risk and reputational harm. Attorneys may also offer business advice, an objective function requiring a more nuanced, qualitative review of the data and algorithm. That review often causes friction with data scientists and their laser focus on their own objective function.
Ask a data scientist why they included a particular variable in their model and the automatic answer will invariably be something along the lines of, “Because it improves accuracy.”
It's not uncommon for a model's variables to be largely composed of cryptic names that make it difficult to decipher what any given variable actually measures. Ask a data scientist what one of these variables means or measures, and you may get a blank stare. Why? Because to the data scientist, what the variable actually measures is irrelevant. The only thing that matters is the accuracy improvement the variable provides.
As a lawyer working with a data scientist, you need to know what each variable means, how it is measured and how it is obtained — not just whether it provides a lift to accuracy. You also need to know whether any given variable encapsulates a prohibited basis, whether it might be tainted with protected class data in some way, or if it might be so correlated with protected class status that it rises to the level of a proxy.
While the relationship will never be entirely frictionless, a few nuances can improve how attorneys and data scientists work together. Systems can be built to develop models that meet both parties' objective functions. At the heart of these efforts is recognizing that both objectives — accuracy and fairness — are important and certainly worth pursuing.
Telling a data scientist you understand their objective function — and that you have one of your own — can go a long way toward smoothing future conversations when issues arise. Indeed, neither objective function is right or wrong; both are legitimate and worth optimizing.
Ideally, these conversations can even spur innovations and the development of new tools. For example, data scientists can turn fairness into an equation and use it as an objective function in their models. Determining how to do this will require close collaboration and communication between you and your data scientists.
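One common way fairness is "turned into an equation" — among many, and the right formulation is exactly what lawyer and data scientist should decide together — is demographic parity: the gap in approval rates between two groups. The decisions below are invented:

```python
# One possible fairness equation: demographic parity difference,
# the gap in approval rates between two groups. Data is invented.
def approval_rate(decisions):
    """Fraction of applicants approved (1 = approved)."""
    return sum(decisions) / len(decisions)

group_a = [1, 1, 0, 1, 1, 0]
group_b = [1, 0, 0, 1, 0, 0]

parity_gap = approval_rate(group_a) - approval_rate(group_b)
print(parity_gap)  # 4/6 - 2/6, a gap of about 0.33
```

Once fairness is a number, it can sit alongside accuracy in the data scientist's tuning loop rather than being raised only after the model is built.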
Variables like race, ethnicity, gender and age should not be used in most models. However, there are exceptions and nuances to every rule. For example, while credit models should avoid these variables, appropriate uses for each of them do exist in healthcare models. Other variables have more subtle and nuanced impacts on fairness: a variable for which university someone attended, for instance, may need to be removed or replaced if the overall model starts to yield unacceptable levels of unfairness.
All of this requires significant oversight and leadership from you, as the lawyer, including providing clear guidelines and frameworks at the outset and developing clear and complete processes governing data decisions. This will significantly streamline the entire modeling process and facilitate meaningful dialogue between attorneys and data scientists.
As noted earlier, the data selection and model tuning decisions made by data scientists are based on the objective function of accuracy. However, these decisions can also be tested against the functions, such as fairness and compliance, that are crucial to a lawyer. So, collaborate with your data scientists to create processes to test models and tunings — not just for accuracy but also for fairness. When the data scientists start utilizing a new data source, encourage them to keep fairness in mind as they look for tweaks and improvements.
“You can’t improve (or manage) what you don’t measure.”
While your work as an attorney requires qualitative analysis and review, some of that review can be quantified. Fairness can be measured, and it just so happens that your data scientists are exceptional at measurement and quantification. Work with them to build and implement tools that help you conduct fairness audits of the models they develop.
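As one example of a quantified audit check — sketched with invented numbers — U.S. fair-lending and employment practice often looks at the adverse impact ratio and the "four-fifths rule," under which a protected group's selection rate below 80% of the reference group's rate draws scrutiny. Whether this particular metric fits your models is a judgment call for you and your data scientists:

```python
# Hedged sketch of one fairness-audit metric: the adverse impact ratio,
# checked against the four-fifths (80%) rule. All rates are invented.
def adverse_impact_ratio(rate_protected, rate_reference):
    """Ratio of the protected group's approval rate to the reference group's."""
    return rate_protected / rate_reference

ratio = adverse_impact_ratio(0.48, 0.72)   # hypothetical approval rates
passes_four_fifths = ratio >= 0.8
print(round(ratio, 2), passes_four_fifths)
```

A dashboard of metrics like this, recomputed on every model version, is the kind of tool a data science team can build quickly once the lawyer specifies what to measure.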
Building tools and metrics like these will help your data scientists incorporate and internalize your objective function, providing clear visibility into how fair and compliant the models they develop are over time. They’ll start to notice issues that impact fairness and develop best practices to maximize fairness in addition to accuracy. Additionally, you may discover ideas for new approaches to minimize unfairness while continuing to maximize accuracy.
There is also the opportunity for synergy here: Tools built to monitor and enhance fairness may also be useful for gaining insights for improving accuracy. For example, explainable AI (xAI) is one set of tools that data scientists use to illuminate the black box of machine learning models. Data scientists typically use these tools to investigate which variables (or collections of variables) are important for the accuracy of their models. But there is no reason these same tools can’t also be used to investigate which variables are leading to unfairness.
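One xAI technique that works this way is permutation importance: shuffle one variable and see how much the objective metric drops. The sketch below, built on an invented toy model, probes accuracy; the same probe could just as easily be run against a fairness metric to see which variables drive unfairness:

```python
# Illustrative permutation-importance probe on a toy model (invented data).
# A variable whose shuffling barely moves the metric contributes little.
import random

def model(income, device_type):
    # Toy "model": approves on income alone and ignores device_type.
    return 1 if income > 50 else 0

data = [(60, 0, 1), (40, 1, 0), (70, 1, 1), (30, 0, 0)]  # (income, device, repaid)

def accuracy(rows):
    return sum(model(i, d) == y for i, d, y in rows) / len(rows)

random.seed(0)
base = accuracy(data)

# Shuffle device_type across rows and re-measure.
shuffled_device = [d for _, d, _ in data]
random.shuffle(shuffled_device)
permuted = [(i, sd, y) for (i, _, y), sd in zip(data, shuffled_device)]

drop = base - accuracy(permuted)
print(drop)  # 0.0 -- device_type carries no importance for this toy model
```

Because the toy model ignores `device_type`, shuffling it changes nothing; on a real model, a large drop would mark that variable as important, for accuracy or for unfairness, depending on which metric you probe.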
Furthermore, it may be possible to gain a deeper understanding of which sets of variables result in models that meet both of your objective functions — models that are both fair and accurate.
Businesses, governments, researchers and society as a whole need models that are ethical while providing powerful insights at the same time. That is when real value is unlocked.
Lawyers and data scientists must work together effectively and efficiently, understanding and appreciating one another — that’s how we achieve this.
Learning about and openly communicating the importance of objective functions is crucial to working together and to finding innovative ways to optimize both.