Observing the Effects of Automating the Judicial System with Behavioral Equivalence
Joseph A. Blass*
Building on decades of work in Artificial Intelligence (AI), legal scholars have begun to consider whether components of the judicial system could be replaced by computers. Much of the existing scholarship in AI and Law has focused on whether such automated systems could reproduce the reasoning and outcomes produced by the current system. This scholarly framing captures many aspects of judicial processes, but overlooks how automated judicial decision-making likely would change how participants in the legal system interact with it, and how societal interests outside that system which care about its processes would be affected by those changes.
This Article demonstrates how scholarship on legal automation comes to leave out perspectives external to the process of judicial decision-making. It analyzes the problem using behavioral equivalence, a computer science concept that assesses systems’ behaviors according to the observations of specific monitors of those systems. It also introduces a framework to examine the various observers of the judicial process and the tradeoffs they may perceive when legal systems are automated. This framework will help scholars and policymakers more effectively anticipate the consequences of automating components of the judicial system.
The possibility that components of the judicial system might be replaced by computers has recently moved from the realm of science fiction into the halls of academia. And why not? Research in Artificial Intelligence (AI) and Law has steadily produced results for decades, advances in data-driven Machine Learning systems are being used for legal analysis and prediction, and computers are generating court filings and assessing defendants’ risk if released. When legal scholars ponder what other judicial decisions might be made by computers, they are not engaging in speculative fiction, but only imagining today’s and tomorrow’s technology.
Much of the work in both AI and legal scholarship argues—or at least, implicitly assumes—that if an AI system generates the same legal decision (or argument, or analysis) as does the current system, then that AI system might reasonably replace the humans who currently perform that judicial reasoning and decision-making. This conclusion assumes that the legal outcomes—and their justifications—being simulated are the only relevant outputs of the legal process to be replaced. Under this view, if the subsystem that generates a legal outcome—usually, a judge—can be replaced without changing that outcome or the reasoning that justifies it, then the overall system has not been significantly affected by the replacement.
This Article demonstrates why that assumption is wrong, using methods from computer science that analyze the effects of replacing components of systems. While other scholars have discussed what components of the judicial system might be replaced by computers (and whether they should be), this Article is the first to approach the issue by examining what it means for one system to behave identically to or differently from another. The analysis focuses attention on the outputs of the judicial system beyond the decision made in a case. A difference-detecting method will empower scholars and policymakers to discern unintended consequences of making changes to the judicial system, particularly where they might otherwise assume the effects of the changes will be limited to the specific components that are altered.
When computer scientists consider whether to replace one system with another, they ask whether the two systems are behaviorally equivalent, i.e., whether their behavior is interchangeable. Behavioral equivalence is not an objective standard but a subjective one: whether two systems are behaviorally equivalent depends on who is observing those systems, and only those things being observed are included in the equivalence analysis. For example, consider washing dishes by hand versus in the dishwasher: to the sleeping child who takes a clean glass from the cupboard the next day, the processes are behaviorally equivalent; to the parent doing the washing, they are clearly distinguishable. Furthermore, behavioral equivalence demonstrates that if truly everything about some process is observed, then no replacement to that process can be equivalent to it. The judicial system presents exactly such a case, where no change can be made that will not be visible to some stakeholder in society who might care about that change. This means that any claim that some replacement system is faithful to the original must be ignoring the ways in which it is not. Focusing only on the legal decisions that a system makes risks ignoring the downstream effects of replacing it with an automated “equivalent.”
But the impossibility of true behavioral equivalence does not mean policymakers should throw up their hands and stop trying to make positive changes. A behavioral equivalence analysis is useful even where a replacement system is intentionally different from the original, because a full accounting of the effects of a change allows decision-makers to properly understand and weigh its consequences. The key is to be sure to understand what is being changed, and for the changes to be made deliberately. Recognizing that whatever system might replace the current one cannot, by definition, be equivalent to it is an important first step toward determining the kind of system that is desirable, in light of the tradeoffs involved.
A behavioral equivalence analysis can help identify those tradeoffs. Although behavioral equivalence is not a crystal ball that perfectly predicts unintended consequences, it shows where and why those consequences are likely to appear, often from parties that interact with the changed system but are not the direct target of the changes. For example, attempts to reform the criminal justice system are common, but the changes made often result in such unintended consequences. The Sentencing Reform Act of 1984 was designed to reduce judicial discretion and eliminate disparities in sentencing, but it unexpectedly led most criminal defendants to take plea deals rather than go to trial and face sentences under the guidelines it created. Similarly, computational risk predictors are meant to use data-driven analyses to help judges more objectively grant defendants bail or probation rather than prison, but these predictors have been used improperly by judges to determine how punitive a prison sentence should be. Civil tort reform that limited noneconomic payouts in medical malpractice cases, intended to reduce the incidence of such cases and control insurance premiums, did reduce overall payments, but it also led to dramatically higher awards for economic damages, limiting the overall reduction of awards.
In all of these instances, a process driven by human judgment was replaced by one designed to reduce discretion and guesswork, but the change failed to account for the characteristics and perspectives of the people whose circumstances and decisions serve as inputs to those systems, leading to unintended consequences. Sentencing reform focused on judges and the disparities and discretion in their sentencing decisions, but did not consider how the prospect of a guaranteed sentence would affect prosecutors’ and defendants’ decision-making. The adoption of risk assessment software focused on defendants and sought to translate data about those defendants into a score usable by judges, but it also introduced new biases into judicial decision-making and insufficiently constrained how judges would use those scores. And although economic damages are more calculable than pain and suffering losses, tort reform to limit noneconomic damages in malpractice cases did not account for attorneys innovating new ways of calculating economic losses—calculations that juries would fail to closely examine. In each instance, the unintended consequences arose in a different part of the system than that which was targeted by the change.
This Article provides a framework to evaluate the consequences of replacing components of the legal system. It demonstrates that it is theoretically possible to engineer perfect substitutes for those components only if one limits where one looks for the effects of those changes. But if one instead takes a broader view of the legal system, any substitutions are guaranteed to result in observable changes to that system. Focusing on the different observers of the legal system will allow policymakers to anticipate the hidden effects of making intentional changes to the system. The Article describes the tradeoffs that might arise from replacing parts of the legal system, which fall into four categories: informational access tradeoffs, which concern the information that emanates from a legal process; reasoning tradeoffs, which concern the mechanisms by which a legal decision is made; outcome tradeoffs, which involve changing the actual results of legal processes; and process tradeoffs, which concern how participants can interact with the system. This framework can help policymakers anticipate unintended consequences and factor them into their decision-making by drawing attention to the interests that will perceive the consequences of any changes being made.
This Article proceeds in three parts. Part I introduces the scholarship in Law and AI on automating legal decision-making. It draws out the assumption in this scholarship that the primary factor to attend to when considering automating some part of the judicial system is whether legal reasoning and decision-making remain unchanged. It then introduces the concept of behavioral equivalence and explains why behavioral equivalence must be evaluated from the perspective of some observer. Finally, it explores the implications of observer-dependent evaluations by asking how to evaluate behavioral equivalence in a non-deterministic domain like a courtroom, why an all-seeing observer defeats behavioral equivalence, and how observers’ limitations enable malicious actors who understand the inner workings of a system to take advantage of it.
Part II considers several possible observers that might evaluate behavioral equivalence, beginning with the judicial system itself. It shows that with this observer, behavioral equivalence can be assessed using the concept of appealability. If a new substitute process differs in some way from the old one, and if the legal outcome using that new process can be appealed specifically because of that difference, then the judicial system distinguishes between the processes. But if the difference cannot be the basis of an appeal, the judicial system does not distinguish the processes: the difference introduced by the replacement is irrelevant to how the process operates within the larger legal system. Using the judicial system as an observer is congruent with the assumption that only legal outcomes matter in considering legal automation. This explains why AI and legal scholarship have thus far implicitly assumed behavioral equivalence is possible: it has largely focused on a single observer.
To illustrate why this single observer fails to capture all the changes that might matter, Part II brings in principles of procedural justice. Focusing on procedural justice demonstrates why only observing legal outcomes—as relying on the judicial system as the observer does—may be short-sighted. Doing so ignores the observations made by outside observers who are relevant to the judicial system’s role in the body politic but are irrelevant to the internal operations of the system itself. After examining the various societal interests that observe different parts of the legal system, Part II concludes that these observers are collectively equivalent to the all-seeing observer, and that they therefore defeat any hope of achieving behavioral equivalence when replacing components of the legal system.
Although true behavioral equivalence may be impossible in the eyes of society at large, an analysis grounded in specific observers can help evaluate the consequences of making changes to the legal system. Part III begins by describing the tradeoffs involved in making such changes. It then examines which of these tradeoffs are implicated by scholars’ proposals for legal automation but are overlooked by a focus only on legal outcomes and reasoning.
We begin with the AI & Law literature and behavioral equivalence.
As AI further permeates modern life, the third decade of the twenty-first century promises exciting changes, including to the judicial system. AI researchers have long studied computational modeling of legal reasoning, and legal scholars have begun contemplating which parts of the legal process might be replaced or improved by computer systems. This research largely focuses on recreating the outcomes and reasoning techniques of current legal processes, with the (at times, implicit) assumption that an automated system that works the same way as the current system could replace it.
But what does it mean for two systems to work the same way? Computer scientists asking this question examine the systems’ performance through the lens of behavioral equivalence. Behavioral equivalence allows one to turn away from the philosophical question of what something is and towards the grounded question of what it does. Two systems display equivalent behavior (meaning they are behaviorally equivalent) when they yield the same output given the same inputs. A behavioral equivalence analysis involves determining what counts as the outputs of those systems, and observing them. Insights from behavioral equivalence are not limited to computer science, and behavioral equivalence therefore provides a method by which to examine the effects of replacing legal subsystems—by computer algorithms, or new analog systems—and to reason about the tradeoffs created by making those replacements.
This Part begins by introducing scholarship in AI and law that contemplates replacing some aspect of the legal system with a computer. It then introduces behavioral equivalence, explains why equivalence is always relative to some observer, and explores the implications of the observer. Throughout, it illustrates how to apply these concepts to legal systems.
For over thirty years, computer scientists in the AI & Law research community have created computational models of legal reasoning. While these researchers have never argued that their systems ought to eventually replace human decision-makers, the existence of these models and other AI advances invite the public and members of the legal academy to imagine that they might one day be used in legal decision-making. Indeed, computer systems have recently become sufficiently advanced that legal scholars focused on AI have begun to imagine what sorts of legal reasoning are ripe for being replaced by AI. Work from both research communities reveals a common assumption that the decisions of a legal system, and the reasoning with which it arrives at those decisions, lie at the core of faithful models of the judicial system.
A survey of landmark research in AI & Law—including work on precedential reasoning by analogy, with rules, and using machine learning, and on models of legal argumentation—shows that AI & Law research generally focuses on modeling the reasoning in or outcomes of legal cases. Understanding what these systems cover is necessary to understand the blind spots in the literature and their implications for judicial automation.
The HYPO family of algorithms, developed by Professor Ashley and colleagues, uses a library of resolved cases to reason about new cases. These algorithms are tailored to particular domains and represent cases as collections of legal factors—legally salient concepts identified by researchers—along with the outcome of the case (for the cases in the library). A HYPO-style algorithm first retrieves the most similar case—the one sharing the most factors with the new case—from the library and proposes its outcome for the new case. It then responds by retrieving the most similar case with the opposite outcome and proposing that the differences in factors across the retrieved cases are salient. Finally, it responds to the counterargument. HYPO algorithms thus model at least three components of legal reasoning: reasoning from case precedents, legal argumentation, and deriving verdicts.
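The retrieval step described above can be sketched in a few lines of code. This is a simplified illustration, not the actual HYPO implementation; the case library and factor names below are hypothetical stand-ins for the trade-secret factors the HYPO literature uses.

```python
# Hypothetical sketch of HYPO-style retrieval: each case is a set of legal
# factors plus an outcome, and the most similar case is the one sharing the
# most factors with the new fact situation.

case_library = [
    {"factors": {"disclosed_secret", "no_agreement"}, "outcome": "defendant"},
    {"factors": {"security_measures", "unique_product"}, "outcome": "plaintiff"},
    {"factors": {"security_measures", "disclosed_secret"}, "outcome": "plaintiff"},
]

def most_similar(new_factors, library, outcome=None):
    """Retrieve the case sharing the most factors with the new case,
    optionally restricted to cases with a given outcome."""
    candidates = [c for c in library if outcome is None or c["outcome"] == outcome]
    return max(candidates, key=lambda c: len(c["factors"] & new_factors))

new_case = {"security_measures", "unique_product", "no_agreement"}
cited = most_similar(new_case, case_library)    # propose this case's outcome
counter = most_similar(new_case, case_library,  # respond with the most similar
                       outcome="defendant")     # case with the opposite outcome
```

The second call mirrors the algorithm's response step: having proposed the outcome of the most similar case, it retrieves the most similar case decided the other way to ground a counterargument.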
Others have used AI techniques to predict case outcomes from precedents using logical rules extracted from cases (sometimes also relying on factors). Professor Horty developed a model that weighs rules extracted from precedents to determine which rules should apply to a new case. This model captures not only what rules apply but why, to distinguish otherwise-applicable precedents. Professor Verheij uses case models: sets of cases in formal logic that are collectively logically consistent, different, and mutually compatible, and which together encode logical rules governing a body of law. These cases can be applied as rules or by analogy. Like HYPO, these approaches model both the verdict in a case and its justification.
AI & Law research has not only focused on modeling judges’ reasoning and decision-making. For example, researchers have studied how to build coherent stories that connect a case’s facts to the rules that resolve it. Instead of predicting outcomes, such approaches explain how outcomes are derived. Others have studied how to evaluate formal argument structures, in case-based reasoning systems and using logic. The focus is again on modeling reasoning, but of litigants, not judges. Research on argumentation examines not only the structure and form of arguments, but how to manage and resolve them. For example, Professor Prakken has described a formal model that captures litigants’ discourse, along with an adjudicator who manages the dialogue and tracks whether burdens are shifted or have been met. This model can capture the reasoning of parties and judges, but it also models the process itself.
Finally, several researchers have used Neural Networks to predict case outcomes from case facts. One such approach also identified intermediate factors that contributed to the outcome. Although this system generated both predictions and the facts underlying them, its reasoning is unlike that which humans use to solve similar cases; instead, its data-driven approach leverages human annotation and computer-derived measures of similarity.
Though researchers sometimes suggest their systems could help lawyers, pro se litigants, and adjudicators make and analyze arguments, none claim their systems can or should replace human decision-makers, or that their models capture all of the information within the legal system. A recurring theme in the literature is that modeling work can help illustrate and formalize how legal reasoning works, to help the legal world understand itself. But these and other advances in AI naturally invite legal scholars and the public to consider how AI systems might be given a role in legal decision-making. In fact, legal scholars have begun thinking along these lines; we now turn to them.
As with researchers in AI & Law, legal scholars who consider what parts of the legal system could be automated and what it would mean to do so tend to focus on automating the system’s internal processes. They also look to those same internal processes to discover the consequences of that automation, a view which this Article argues will fail to detect many such consequences. We examine several proposals in turn.
Several legal scholars have imagined that computer systems could independently make factual determinations. Professor Gowder argues that a machine learning system could eventually be used to determine whether, for example, something is a vehicle for the purposes of a “no vehicle in the park” rule. Professor Gowder argues that such a system could never entirely replace human judges because human judgment is necessary both to fill in gaps in—or to change—the law, and for the requirements of procedural justice. Professor Livermore argues that, far from leaving gaps for a human to fill, a Deep Learning system that classifies potential vehicles in parks might solve legal indeterminacy by allowing policymakers to write new kinds of laws, such that something is a vehicle under the law exactly if the algorithm says it is. Professor Livermore argues that such a system would eliminate the need for human participation in resolving certain kinds of disputes. And Professor Genesereth suggests that the technology in our pocket could collect and use data to inform us that we are violating the law (for example, our cell phones could tell us if we are speeding)—a use he characterizes as taking the role of a friendly police officer advising us on the law. While Professor Genesereth imagines a benign, helpful legal advisor, the same technology could be used as an unfriendly police officer that writes users a ticket.
Legal AI systems could do more than make factual determinations. Professor McGinnis and Steven Wasick argue for dynamic rules, rules and standards that change depending on empirical data, without human intervention. Professor Coglianese and David Lehr argue that administrative agencies could use Machine Learning like any other tool, not only to adjudicate disputes but in crafting regulations. And Professor Volokh has argued that an AI which passes the Turing test, such that it could write a persuasive judicial opinion indistinguishable from that of a judge, should be allowed to be a judge in actual cases.
These pieces paint a picture of an emerging focus in the legal academy regarding what legal processes could be automated: factual determinations, crafting rules, determining whether and how some rule applies, and composing judicial opinions. As with the research in AI & Law, the focus is largely on what occurs in the courtroom: the interior systems of the legal process. That list surely covers a great deal of the legal process, and perhaps an automated system that perfectly models these elements could replace the humans who currently do so, with minimal effect upon the overall system.
This Article argues that an exclusive focus on reproducing reasoning or outcomes leaves out important considerations. By attending only to perspectives internal to the judicial system—how judges and litigants reason and argue—it leaves out those external to the operations of that system, that is, the perspectives of interests who are not involved in court cases. Thus, a system that automates legal decision-making “works the same way” as the current judicial system only to the extent that legal decision-making is the only part of the judicial process. And as this Article will show, it is impossible to build a system that will “work the same way” as the current one in the eyes of everyone who cares about how the legal system works. To see why, we must examine what it means for two things to work the same way. Behavioral equivalence is a natural lens through which to examine this question because it is the tool with which computer scientists assess the consequences of replacing one program with another, and this Article contemplates replacing legal processes with computerized ones.
Before discussing behavioral equivalence and the law, note two basic assumptions that this Article adopts to limit the scope of its analysis. This Article explores what it means to substitute components of the legal system with computational systems that mimic their operation, and how to examine the consequences of doing so, but is not concerned with how to accomplish that substitution. It therefore does not address two questions crucial to doing so: “Is this actually technologically possible?” and “Would this be legal?”
The first assumption holds that it will one day be possible to perfectly mimic any given legal process’s computation of legal conclusions as an input-output system. That is, given some input to a legal subprocess, this Article assumes technology will one day exist that computes the same output as does the current system. For example, given some system that rules on whether evidence is admissible, this assumed system would generate the same answer and explanation as the judge who would otherwise rule on it. Whether this is actually true for any given legal process is a separate question orthogonal to this Article. This assumption is revisited in Part III.
Second, this Article assumes such a system could be legitimately authorized. The issues of authorization and democratic delegation of authority implicated by this discussion are not so obviously insurmountable that it is useless to even consider what it means to replace current systems with digital substitutes. This question of how an automated system could properly be authorized is a serious one to be dealt with more directly in future work.
We begin with these assumptions so as to be able to describe and discuss behavioral equivalence in the ideal case.
Court systems are not about to be replaced by computers. But the judicial system is not a single process. It is a set of interconnected processes that interact with and hand off to each other, and computers have already been used to replace some subprocesses. For example, automated recidivism predictors are used to predict whether criminal defendants should receive bail, supervised release, or a prison sentence—determinations that used to be made by humans. And as discussed above, scholars in both law and AI have described and developed systems that might one day replace aspects of human judicial decision-making. Because this Article examines the consequences of replacing elements of the judicial system with algorithms, it is natural to use tools from computer science that evaluate such consequences.
Behavioral equivalence is a concept developed in engineering research that has been especially studied within the Computer Science literature concerning programming languages (PL). (PL as a field refers to the study of principles underlying the programming languages with which users can write programs; the field, which is capitalized and abbreviated in this Article, is distinct from its objects of study.) Two systems are behaviorally equivalent if they behave in the same way, but what does it mean to behave the same way? To avoid the sometimes-tricky question of precisely defining behavior, PL researchers ask whether two systems are observed to behave the same way. This subtle change in formulation avoids defining behavior and instead asks whether two processes are indistinguishable, which focuses on the perception of that behavior by the system within which a process is embedded.
To illustrate the concept of behavioral equivalence, imagine two programs that perform addition and are given the equation “47 + 85.” The first one computes the way children are taught to in school, adding digits from right to left using memorized single-digit sums and carrying the one as needed. It first adds “7+5” (which it knows is “12”), and records a “2” in the rightmost position. It then adds “4 + 8 + 1” and records the result. The second program instead increases the left number by “1” and correspondingly decreases the right until the latter is “0,” at which point it returns the left number: “47 + 85” becomes “48 + 84,” then “49 + 83,” and so on until it has “132 + 0.” Both programs return the same answer—the first in three operations and the second in eighty-five. To a user who observes only the programs’ outputs, they are behaviorally equivalent. But crucially, these programs are not equivalent to the programmer who wonders why her computer runs more slowly when using the second program and who checks the programs’ memory usage. Thus when one asks whether a component of some larger system is behaviorally equivalent to a potential replacement, one is asking whether the old component can be replaced with the new without the relevant observer being able to tell the difference after the switch. Whether a process is equivalent therefore depends on the observer: to a novice user the two addition algorithms are behaviorally equivalent, but not to the expert attuned to her computer’s performance.
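The two addition programs described above can be sketched directly. This is a hypothetical illustration: both functions return the same sum, but an observer who counts operations can tell them apart.

```python
# Two programs that compute 47 + 85. To an observer of outputs alone they
# are behaviorally equivalent; to an observer of operation counts, not.

def add_columnwise(a, b):
    """Schoolbook addition: add digits right to left, carrying as needed."""
    ops = 0
    result, carry, place = 0, 0, 1
    while a > 0 or b > 0 or carry:
        digit = (a % 10) + (b % 10) + carry  # one single-digit addition
        ops += 1
        carry, digit = divmod(digit, 10)
        result += digit * place
        place *= 10
        a //= 10
        b //= 10
    return result, ops

def add_by_counting(a, b):
    """Increment the left number and decrement the right until it hits 0."""
    ops = 0
    while b > 0:
        a, b = a + 1, b - 1  # one transfer operation
        ops += 1
    return a, ops

sum1, ops1 = add_columnwise(47, 85)   # (132, 3)
sum2, ops2 = add_by_counting(47, 85)  # (132, 85)
assert sum1 == sum2                   # the novice user sees no difference
assert ops1 != ops2                   # the attentive programmer does
```

The two final assertions capture the Article's point: whether the programs are equivalent depends entirely on which of these two observations the observer makes.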
Behavioral equivalence provides a way to focus on what matters while ignoring what does not: it asks what something does, not what it is. Behavioral equivalence helps to abstract information, to strip away irrelevant details. For example, filing a lawsuit on the morning of the last day before the statute of limitations runs is behaviorally equivalent to filing it that afternoon. Morning and afternoon are different times, but for the purposes of filings, they are the same: What matters is not the filing time, but the date. Or consider the legal subprocess to determine whether evidence is hearsay. One can imagine a variety of alternative ways of making that determination, and any such replacement could be evaluated for its behavioral equivalence to the current system.
Currently, judges decide the admissibility of evidence. Judges can offload this work to a clerk, checking the clerk’s work and following the clerk’s reasoning, rather than completing the work themselves. The clerk working through the admissibility analysis and the judge thoroughly checking it is equivalent to the judge doing the work herself, assuming she would have come to the same conclusion. Alternatively, the judge might simply rubber-stamp the clerk’s work without examining the analysis, only glancing over it to ensure the order is complete and to see the bottom line she will declare. Critics might reasonably have reservations about endorsing such a practice, but the ruling would be as binding as if the judge had done the analysis herself.
We can also imagine a computer algorithm that does the admissibility analysis and outputs both a decision and some minimum explanation of it. The rules of evidence might be encoded as logical rules, information about the evidence translated into a logical format, and the information fed to the rule-based system. Or a dataset of real-world evidentiary rulings could be used to train a machine learning system to make those decisions. Or the rules of evidence could be used to generate a dataset of fake but realistic evidentiary rulings to train such a system. As in the current regime, these systems would take in information and output an admissibility decision.
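A minimal sketch of the rule-based variant might look like the following. The predicates and the exception-handling here are deliberately simplified hypotheticals for illustration, not an encoding of the actual Federal Rules of Evidence.

```python
# Hypothetical sketch of a rule-based admissibility system that outputs
# both a decision and a minimal explanation, as described above.

def is_hearsay(statement):
    """Toy version of the hearsay definition: an out-of-court statement
    offered to prove the truth of the matter asserted."""
    return statement["out_of_court"] and statement["offered_for_truth"]

def admissibility_ruling(statement):
    """Return a (decision, explanation) pair."""
    if not is_hearsay(statement):
        reason = ("not hearsay: not offered for its truth"
                  if not statement["offered_for_truth"]
                  else "not hearsay: not an out-of-court statement")
        return ("admit", reason)
    if statement.get("exception"):  # e.g., an excited utterance
        return ("admit", f"hearsay, but within exception: {statement['exception']}")
    return ("exclude", "hearsay with no applicable exception")

decision, explanation = admissibility_ruling(
    {"out_of_court": True, "offered_for_truth": True, "exception": None})
```

A machine learning variant would replace the hand-written rules with a model trained on past rulings, but its interface would be the same: evidence in, decision and explanation out.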
What would have to be true to conclude that one of those theorized computer systems is behaviorally equivalent to the judge or her clerk doing the analysis (assuming the judge signs off on the system’s output, as with the clerk’s)? One could examine whether the system perfectly mimics an individual judge: for admissibility questions where the judge’s ruling is known, does the system output the same ruling? More importantly, for new admissibility questions, could an observer tell whether a given decision came from the system or from the judge? Because individual judges may display some variability in their decision-making, an analyst might instead gather several decisions from both the judge and the computer to see whether they could match decisions to their author. Judgments vary not only within judges but across them, so yet another possibility is to ask whether the system’s judgments are consistent with a range of judges’ judgments. Thus the analyst could poll five judges plus the computer about whether evidence should be admitted, and see whether they could pick out the computer’s judgments.
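That last distinguishing experiment can be sketched as a simple indistinguishability check. All data below are hypothetical, and the test is deliberately crude: the system passes if each of its rulings matches at least one judge on the panel, so that outcomes alone cannot single it out.

```python
# Hypothetical sketch of the panel experiment described above: five judges'
# rulings (True = admit) on six evidence questions, plus a candidate system.

judge_rulings = [
    [True, True, False, True, False, True],
    [True, False, False, True, False, True],
    [True, True, False, True, True, True],
    [True, True, False, False, False, True],
    [True, True, False, True, False, False],
]
system_rulings = [True, True, False, True, False, True]

def within_judicial_range(system, judges):
    """On this crude test, the system is indistinguishable if, for every
    question, at least one judge ruled the way the system did."""
    return all(
        any(j[q] == system[q] for j in judges)
        for q in range(len(system))
    )
```

Here every system ruling falls within the judges' range, so an observer watching outcomes alone could not pick the computer out of the panel; a richer observer, of course, might still distinguish it on other grounds.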
While this Article contemplates replacing components of the legal system with computer systems, behavioral equivalence also helps to analyze the effects of swapping those components with any replacements, not just automated ones. Clerks’ duties provide a helpful illustration. For example, the cert pool system at the Supreme Court, where clerks prepare memoranda regarding certiorari petitions for many Justices at once, replaced a regime wherein each Justice’s clerks prepared memoranda for every certiorari petition, allowing clerks more time to attend to other duties. Every petition still gets a memo and every Justice still receives one, but the memoranda are no longer tailored to individual Justices. Though it frees up clerks’ time, the cert pool system has been credited with reducing the Court’s caseload. If the observer evaluating behavioral equivalence only checks whether Justices get memos, the cert pool system might be equivalent to the old regime; if it observes secondary effects, it might distinguish them. Similarly, before judges regularly had clerks, they drafted all their own opinions, but the practice of having clerks draft opinions for judges is now widely accepted.
If a clerk’s work is treated as behaviorally equivalent to a prior practice, it is because the work is seen as in fact coming from the judge. Indeed, the judge is supposed to check the clerk’s work sufficiently closely that it truly becomes the judge’s work. But would anyone be able to know whether a judge is checking the clerk’s work, or is rubber-stamping it without even reading it? If a judge goes into her chambers alone with a clerk’s draft opinion and emerges with the opinion entirely unchanged, is there a difference between the judge who read the opinion carefully and was so impressed that she didn’t change a word, and one who played solitaire and did not glance at the opinion? What if the clerk was particularly competent, and was so good at predicting a judge’s reasoning and style that over time the judge has come to find that she need not edit the clerk’s work, because any change would make the opinion less like the one the judge would have written herself? Every word is written by the clerk; the judge reads and agrees with them, changes not a one, and affixes her signature. It seems there is no meaningful difference between that situation and the one where the judge fixes errors in the clerk’s draft. But what if over time, the judge comes to conclude that there is no point to her reading the opinions, because she knows she will not want to change a word? Why should anyone care if the judge reads it, if the reading is simply a side effect that does not affect the decision-writing process?
The question is facetious—of course people would care if they learned a judge did not bother even to review opinions written by her clerk. They might also conclude the opinion lacked legitimacy since the clerk was not Senate-confirmed with Article III protections of salary and tenure (or appointed as a judge under state law). But the issue is not whether this would happen, or what the consequences would be. The question that nags is: how would anyone know?
This returns us to the key insight that behavioral equivalence is only definable relative to some observer. The judge will know what she did but no one else will, not even the clerk, if the judge assures the clerk he simply did a perfect job drafting the opinion. The case will proceed as though the judge herself wrote the opinion, because no relevant observers will be able to tell the difference between the legitimate process and the illegitimate one. The illegitimate process is behaviorally equivalent to the legitimate one because two processes are behaviorally equivalent precisely when the observer cannot distinguish between them. That the processes have different internal mechanisms or side effects does not defeat behavioral equivalence if the observer is blind to those factors. Furthermore, the observer must be defined within the outermost bounds of the larger system: if some observer outside the system can distinguish between the two processes but cannot transmit that information inside the system, then those two processes are still behaviorally equivalent for the purposes of that system. The judge may have a wise pet cockatiel in her chambers that knows whether the judge is working, but because the cockatiel has no way of communicating that information, its observations do not defeat behavioral equivalence.
In summary, asking “Do these two systems behave the same way?” can involve an unwieldy inquiry because it assumes a clear understanding of what constitutes behavior. Behavioral equivalence reframes the question as “Can these two systems be distinguished?”, which emphasizes the observer of the system rather than the system itself. And this focus on the observer yields further insights about evaluating whether two systems are equivalent.
That behavioral equivalence is only ever relative to an observer carries three implications: to the omniscient observer a process has no equivalent; two systems need not be strictly identical to be behaviorally equivalent to a non-omniscient observer; and a clever malefactor who knows how a system works can manipulate it undetected by the observer. We examine each proposition in turn.
An observer who can observe everything about some process—an omniscient observer—defeats behavioral equivalence, such that a process is only equivalent to itself. In our earlier arithmetic example, an observer who witnessed the steps the two addition algorithms went through would be able to distinguish them by their internal states. Similarly, an observer who peered into the judge’s chambers could distinguish between the judge reading the draft opinion and playing solitaire. A judge who knew she was being watched could hold up the draft opinion and pass her eyes over it to fool that witness into thinking she is working, but an omniscient observer—one who observes truly everything about a process—could read the firings of her synapses and distinguish her pretense from actual reading.
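The arithmetic example can be rendered as code. In this purely illustrative sketch, two addition procedures always agree on their outputs, so an observer limited to outputs cannot distinguish them; an observer who also sees each procedure’s trace of internal states can.

```python
def add_builtin(a, b):
    """Adds in one step; the trace records a single internal state."""
    return a + b, ["a + b"]

def add_by_counting(a, b):
    """Adds by incrementing a, b times; the trace records every state."""
    trace, total = [], a
    for _ in range(b):
        total += 1
        trace.append(total)
    return total, trace

out1, trace1 = add_builtin(3, 4)
out2, trace2 = add_by_counting(3, 4)
assert out1 == out2 == 7    # output-only observer: equivalent
assert trace1 != trace2     # state-observing ("omniscient") observer: distinct
```

The traces play the role of the synapse-reading observer: identical outputs, distinguishable internals.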
Just as a single omniscient observer defeats behavioral equivalence, so too does the union of all possible observers. Our discussion of behavioral equivalence focuses on the observer because it directs our attention to that which is being observed. If every part of some system’s behavior is observable, then all possible observers collectively observe everything, just as a single omniscient observer does. For example, a neuroscientist could strap the judge into a brain-scanner and, from the next room, determine whether—though not what—the judge is reading. Sitting inside the room is the clerk, who can see what paper the judge’s eyes are focusing on. Neither alone can tell if the judge is reading the clerk’s draft opinion, but their combined observations can. Together, they are equivalent to the omniscient observer who sees everything about some process.
If the observer is not omniscient, then behavioral equivalence demands less than strict identity, because two systems that differ only in ways the observer cannot detect are behaviorally equivalent. This implies that if any part of the legal system is nondeterministic, an algorithm could still be behaviorally equivalent to it so long as the observer sees them as operating the same way. For nondeterministic systems, behavioral equivalence does not demand the same outputs given the same inputs. Instead, it requires either that both systems compute an output by using the same rules to pass through the same set of possible intermediate states (though not necessarily the same actual states), or that the outputs of both systems fall along the same probability distribution. The legal system already tolerates the same legal processes leading to different outcomes, for example, when a judge makes increasingly harsh parole decisions as lunch approaches, then becomes more lenient again after eating. It also already tolerates observable differences across sets of outcomes, for example, when different judges at the same court have observably different patterns of rulings or are more or less favorable to plaintiffs. Different rulings within and across judges at the same court are still treated as equivalent within the legal system.
Thus, for those parts of the legal system that are nondeterministic, it is already acceptable for an observer not to be able to differentiate between different outputs from the same input. When this nondeterminism occurs inside the brains of humans, it occurs out of sight of the legal system and any observers of it. When it is visible (e.g., the assignment of judges to cases), different instantiations of a process are still meaningfully considered to be the same process. A case proceeding through a district court takes but one of many possible branching paths as variables in the court case are resolved: the case will get only one of several possible judges; only a dozen of many possible jurors; and only one of a variety of outcomes. Had it travelled a different path, nearly everything about how the case proceeded might be different. Nonetheless, no matter what path the case travelled, it has received the same legal process: any of these paths that do not involve a legal error are treated as equivalent to each other. This is true even when litigants try to control or account for the variables, for example through forum shopping, calibrating arguments to judges, and other tailoring of cases based on the variables in court. When litigants affect the process, they simply become additional sources of nondeterminism and complexity within that process. Unless the case is resolved in a way that is traceable to the peculiar idiosyncrasies of a judge (a judicial signature, of sorts), an observer will be unable to tell from the case’s outcome what process the litigants received.
Assessing behavioral equivalence in a nondeterministic system means asking whether an observer would see the system’s outputs as equally acceptable or probable under the old regime, not whether they are the same. The bar to achieve behavioral equivalence might thus be lower than one might assume, since it does not require perfect mimicry. Regardless, this Article assumes that even nondeterministic legal processes can be modeled, and the equivalence of such systems assessed.
One downside of behavioral equivalence being relative to an observer is that a bad actor who understands the internals of how a system works might take advantage of that knowledge and manipulate the system, out of sight of and undetected by the observer. For example, imagine a computer program that stores users’ private information (say, social security numbers) and only reveals such information when given some password. Unbeknownst to anyone but the programmer, the program stores this information on a computer’s hard drive such that one user’s ID is stored in memory location 1, then that user’s SSN is in memory location 2, then the next user’s ID is in memory location 3 and their SSN in location 4, etc. So long as the information is only accessed through the program, this is a perfectly serviceable way to store it. But if some malicious actor knows the program is written this way, he will know that if some user’s ID is stored at memory location N, he can find the corresponding SSN by reading the contents of memory location N+1, without going through the secure program and providing a password. Observers may also have blind spots (aspects of a system that should be monitored but are not), and those blind spots can be exploited. Exploiting unmonitored blind spots is like a bank robber drilling directly into a bank vault from the storefront next door, bypassing the security measures in the bank itself. And though a banker is likely to eventually notice an empty vault, other back-door exploits (like information security leaks) might go entirely unobserved and never be detected.
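The storage scheme and its exploit can be sketched directly, with a Python list standing in for raw memory (the user names, passwords, and SSNs are invented). The front door checks a password; the back door, available to anyone who knows the layout, simply reads location N+1.

```python
# Flat "memory": each user ID is immediately followed by that user's SSN.
memory = ["alice", "111-11-1111", "bob", "222-22-2222"]
passwords = {"alice": "hunter2", "bob": "swordfish"}

def get_ssn(user_id, password):
    """The front door: reveals an SSN only to a caller with the password."""
    if passwords.get(user_id) != password:
        raise PermissionError("wrong password")
    return memory[memory.index(user_id) + 1]

def exploit(user_id):
    """The back door: knowing the layout, find the ID at location N and
    read location N + 1 directly, bypassing the password check entirely."""
    n = memory.index(user_id)
    return memory[n + 1]

assert get_ssn("alice", "hunter2") == exploit("alice") == "111-11-1111"
```

To an observer who sees only which SSNs are ultimately produced, the two access paths are behaviorally equivalent, which is precisely why the exploit goes unnoticed.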
Thus, if an algorithmic replacement for some aspect of the legal system cannot be distinguished from the current system by the relevant observer, that replacement might still be exploited by someone who knows its internal operations, like a defendant blackmailing a judge’s trusted clerk to persuade the judge a case should be dismissed. Some such bad actor might get the system to consistently rule in her favor.
A related issue could arise if the new system were computationally faithful to the old system but more transparent about its operations. This might lead savvy participants to provide different inputs to make their preferred outcomes more likely than they previously could have. For example, a judge might find an expert witness exceedingly persuasive and always rule for the side that engages her, but if the judge publicly stated “I’ll never rule against this persuasive witness,” litigants might start hiring her in every possible circumstance. In such a case, the change alters the overall system not by changing its operation but by changing what inputs are given to it.
Calibrating inputs to the legal system to get one’s desired output already happens, of course, and does not stop at jurisdiction shopping. Litigants may tailor their arguments to judges based on what they think those judges will find persuasive. A defense strategy at trial, or whether a criminal defendant takes a plea deal, can depend on which judge was assigned to the case. And sadly, there is evidence to support the perception that litigants receive the justice they pay for. The issue is not that introducing a behaviorally equivalent algorithm to replace some part of the legal system will allow people to manipulate that system to their advantage; that is already happening. The issue is that the observer’s blind spots may introduce new ways to manipulate the system. Some blind spots miss backdoor exploits; others miss that the system leaks information that can be used to calibrate inputs and steer the system toward a desired outcome.
Having explained the basic concept of behavioral equivalence and the effects of its always being evaluated relative to some observer, this Article now returns to the consequences of replacing aspects of the judicial process with computer systems, by asking what the observer evaluating behavioral equivalence in the legal domain should be, and what it will observe.
Part I explained how to consider replacing part of a legal process with a different process, and that when two processes are behaviorally equivalent, it means some observer cannot distinguish between them. This Part explores who the relevant observer of the legal system might be, and what they might observe. Attending to observers is critical not only when one wants to ensure a system’s replacement leaves its overall function unchanged, but also when one seeks to change how a system works, because the observer and its observations define where changes will be detected. That means the observer defines what counts as a change: a change the observer cannot detect is perceived as no change having been made at all. Carefully specifying the observer is therefore essential to any proposal to alter the legal system, whether the goal is to preserve its functionality or to change it.
Several observers are considered, beginning with the judicial system itself, then moving to other observers in society at large. Each has its advantages and disadvantages, and each corresponds to a different benchmark for how faithful a computational replacement of the current legal system ought to be. Using the judicial system as the observer might make behavioral equivalence possible, but at the cost of threatening the system’s legitimacy by ignoring the ripple effects from changing the legal system that are felt outside the courthouse walls.
Research in AI & Law has focused on modeling reasoning, argumentation, and decision-making in the legal system; research in Law & AI has imagined automating fact-finding and other legal reasoning and analysis, and establishing computer-augmented legal standards. When scholars describe systems as reproducing such behaviors, behavioral equivalence instructs that they must mean displaying the same behavior in the eyes of some observer. So what observer sees the outputs of such systems and can evaluate any changes for behavioral equivalence? That is, is there an observer that observes enough about the legal system to determine whether systems designed to faithfully replicate current behavior do so, while others designed to effect change lead to the desired results?
The judicial system itself, including the judiciary and its processes, might serve as such an observer. The judicial system is the natural observer for the changes proposed by scholars in Section I.A because those changes concern processes interior to the judicial system itself: the rules applied in cases, the arguments made therein, the reasoning used to resolve them, and the final decisions governing them. Because these subprocesses are all internal to the legal process, the larger judicial system is not only well-positioned to observe and regulate them, but already has mechanisms to do so, most importantly the appeals process. These could be used to monitor changes made by automation. Using the judicial system as the observer would also demand relatively less fidelity of potential replacement components to current components. This Section describes what the judicial system’s observations would entail, addressing first the question of how it would make observations to determine behavioral equivalence, then what it would observe.
An observer of a system captures information about that system, information that is potentially transmitted back into the system to further affect its processing. The judicial system processes and disposes of cases, and those processes, outcomes, and their justifications are among its outputs. These are further processed by higher courts through appeals, whose decisions are in turn fed back into the judicial system. The appeals process is therefore already a mechanism by which the judicial system observes its own outputs. The judicial system as the observer could therefore assess whether two processes are equivalent by asking whether a difference between them could be the basis of an appeal (and potentially a reversal). Under this view, if two processes are identical but for a single difference, and the parties could not have appealed the case under the old system had that difference occurred, then the judicial system treats them as equivalent. Thus a replacement system is behaviorally equivalent to that which it replaces if its operation and output would not be reversed on appeal if they occurred under the old system. Note that the question is not whether the outcome under the new system would be reversed (although that is salient), but whether anything that happened (or failed to happen) differently in the new process would lead to reversal had it occurred under the old one.
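The non-appealability test lends itself to a compact restatement in code. This is a toy model: the listed grounds for reversal are hypothetical placeholders, and a real inquiry would of course turn on doctrine, not string matching. A replacement passes if nothing that happened (or failed to happen) differently would be reversible error under the old regime.

```python
# Hypothetical grounds for reversal under the old regime.
REVERSIBLE_GROUNDS = {"wrong legal standard", "denied right to counsel"}

def equivalent_under_appeal(events_that_differ):
    """Judicial-system-as-observer test: a replacement is behaviorally
    equivalent if none of its differences from the old process would
    constitute reversible error under the old regime."""
    return not any(e in REVERSIBLE_GROUNDS for e in events_that_differ)

# A replacement whose only differences are discretionary rulings passes...
assert equivalent_under_appeal(["discretionary evidentiary ruling"])
# ...while one that denies counsel does not.
assert not equivalent_under_appeal(["denied right to counsel"])
```

Note that the test inspects the differences in the process, not merely the final outcome, mirroring the point above that reversal can attach to what happened along the way.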
This formulation of the observer and its observations is appealing because it corresponds to how the judicial system currently manages itself. For example, a party cannot appeal a case simply because he was unhappy with which judge was randomly assigned to hear it. Similarly, judges are given discretion over many aspects of managing a trial, including ruling on admissibility of evidence. Parties may have difficulty appealing a case over a ruling on admissibility if the judge has carefully explained the reasoning underlying her exercise of discretion, even if the case turns on the presence or absence of that evidence or the reviewing judges might have ruled differently. In such cases, the system has determined that the parties have received a legally indistinguishable process from the one where the admissibility decision went the other way, even if the outcomes would be enormously different.
Under this formulation, the judicial system already knows how to evaluate behavioral equivalence in the legal process and has been doing so since before the founding of the country; it would just apply that evaluation process to new systems and ask whether cases under the new system would have been appealable under the old system. Furthermore, under this definition many things are already known to be behaviorally equivalent because they are currently not subject to appeal, while others are known not to be equivalent. For example, any judgments that require explanations could not be made by a system that does not generate any; nor could a system exercise discretion that judges lack. That said, the insight that current legal doctrine already defines certain equivalence relationships previews a criticism of this formulation, that the current legal system’s doctrine governing appeals ignores too much. For example, while statistics about general trends may be used to draw inferences about specific cases in some instances, this is not true generally; an automated system that significantly changed a pattern of decisions across cases but whose individual decisions were always acceptable might therefore be seen as behaviorally equivalent to the current system.
Nonetheless, using the formulation of non-appealability as equivalence, we can imagine some potentially valid replacement systems from the constraints placed on current components of the legal system. For example, jurors may already be struck from a jury pool for a variety of reasons, so a computer system that helped determine which jurors to strike (for reasons other than those forbidden by law) would not be the basis of an appeal and therefore would be behaviorally equivalent to the lawyer doing it herself, even if juries ended up with noticeably different compositions than they do now. Lawyers who meet certain criteria are treated as legally interchangeable subject to certain constraints; a party represented by a computer system that met these criteria and was approved to practice law could presumably not appeal based on its competent representation—if they could, one would ask why the computer had been approved to practice law. Computers already help judges make bail and sentencing determinations. A judge could similarly get assistance from a computer to help determine how to rule on a range of motions, and those rulings could be upheld as not being an abuse of discretion, even if they differ from how the judge would have ruled without the computer’s input.
Thus far our discussion has focused on what happens within a trial, but the idea of the judicial system as the relevant observer determining behavioral equivalence could apply beyond the walls of the courtroom. In fact, the judicial system already routinely uses this concept without framing it as such. This is a sort of de jure equivalence: binding arbitration is legally behaviorally equivalent to litigation, even though the practice is controversial and its outcomes are not equivalent to those of court cases. Similarly, when states conform to federal standards to receive federal funds, the determination that the standards are met amounts to a declaration that the states’ programs are equivalent for the purposes of federal law, no matter their differences. International treaty law performs a similar role in determining compliance with the rules by which nations agree to be governed. What these observers (and the appeals system) have in common is that they are top-down, with the observers clearly defining what they will be observing. While our discussion has focused on who the observer is, what is being observed crucially defines what will be seen as behaviorally equivalent. We now turn to these observations.
Recall that, under the formulation of the judicial system as observer, a replacement to some component of the legal process is behaviorally equivalent to it if the replacement does not lead to reversible error. Therefore an effective observer within the judicial system must monitor every source of such errors. One obvious place to start is the outcomes of cases based on their facts: if a party is ruled against but the evidence is insufficient to support a finding of liability, the case will be reversed. This narrow frame treats the judicial system as a pure input-output system, a black box. But this formulation is insufficient: even determining that the facts could not support the judgment requires peering inside the system to examine the rule or standard used to reach that judgment. The observer must therefore also have access to the rules and reasoning used to derive outcomes from inputs.
But even the inputs, outputs, and rules are not enough, because the intermediate states those inputs go through, and the process by which they are transformed, can also be the subject of an appeal. The observer must track not only the rules themselves, but how they are applied. Our judicial system is not one large system that cranks over facts and spits out a decision along with its justification, but many subsystems yoked together, and the process by which they transform and operate over things from one step to the next matters as well as what is transmitted across those steps. When a criminal defendant is denied the opportunity to object to evidence as being obtained through an illegal search, the denial of due process (the right to object) might lead to an appeal even if the evidence is otherwise admissible. The judicial system observes not only the decision to admit the evidence, but the process by which that decision was made. Thus when the judicial system observes itself, it is not enough to know what conclusion some system has come to and why, but how. Because the judicial system protects due process, the process by which rules are applied, and not only the rules and decisions themselves, must be considered an observable output of the system that factors into any determination of behavioral equivalence.
The inputs to a legal process include not only filings and evidence but the participation of the parties, and due process requirements govern what is owed to those participants. An observer evaluating behavioral equivalence in the judicial system must therefore attend not only to inputs, outputs, rules, and process, but participation too. For example, if the government were to use a new automated system to determine when to take away something in which a person has an interest, that system must give the person the opportunity to respond (and must attend to the response). Nor would an automated decision-maker be allowed to deny a criminal defendant the right to an attorney. These violations constitute reversible error in our current system; if “cannot constitute reversible error” is the standard for behavioral equivalence, the observer must include due process amongst its observations.
Due process moves our discussion beyond questions of reversibility and to core issues regarding the role of a legal system in society and how its citizens accept it. Due process implicates procedural justice: the justice inherent in the system itself rather than only its outcomes. Due process does not act only to guarantee the correct outcome—indeed, it sometimes requires letting a person the court knows is guilty walk free—but also to confer legitimacy upon the proceedings and foster public trust in them. In fact, legal systems may be more capable of ensuring equitable and fair dispute resolution than determining with certainty the facts in dispute. This evenhanded and fair treatment is crucial to the legal system’s legitimacy.
The problem with treating the judicial system’s appeals process as the relevant observer is that sometimes process violations are deemed “harmless errors” and do not lead to reversal. Our formulation of that observer would therefore sometimes treat a proceeding with a process violation as behaviorally equivalent to one without. This observer might thus confer the blessing of behavioral equivalence upon a system that, through compounding harmless errors, lacks legitimacy in the eyes of the people whose behavior it regulates.
Harmless error occurs when a reviewing court finds that a lower court made an error, but the error did not affect the outcome of the case or overly offend principles of justice and therefore does not require reversal. For example, when a court instructs the jury to apply the wrong standard to determine guilt, the error is reversible if the jury convicts using too low a standard, but not if the jury convicts using too high a standard. Under the reversibility formulation of behavioral equivalence, that rule means that applying too high a standard of guilt when finding a defendant guilty (but not when acquitting them) is exactly the same as applying the right standard.
The harmless error rule was developed to promote judicial efficiency by reversing a case only for an error that has a substantial effect on the rights of the parties. The Supreme Court has furthermore said that “the central purpose of a criminal trial is to decide the factual question of the defendant’s guilt or innocence,” meaning that when a trial record establishes guilt, a process error should be considered harmless even if it included the violation of a constitutional right. But because harmless errors do not lead to reversal, the judicial system would treat a replacement system that increased the incidence of such errors as behaviorally equivalent to one that eliminated them.
It is a sad observation that a system that routinely applies the wrong rules and standards against criminal defendants whose guilt is overwhelmingly evident would be legally indistinguishable from the legal system we have now. It also suggests that the judicial system’s appeals process is the wrong observer for evaluating behavioral equivalence, because it would treat too many things as equivalent to each other when they patently are not. Appeals might not be the only way the judicial system could observe its own operations: even though harmless error is not reversible, for example, a court can take note of whether an automated system led to more such errors. But the judicial system’s current tolerance for significant differences across processes and outcomes suggests that courts may be too insensitive to such differences to be the only observers of whether a changed legal system is faithful to the original. Litigants whose rights may be burdened by such changes (and their lawyers) have a great interest in detecting whether changing a legal process changes its outcomes. And why stop at litigants and litigators? The legal system is embedded within a larger society, and broadening our analysis to include other interests within that society reveals that many changes might affect those interests in ways the judiciary might not observe. We therefore must consider these other societal interests as potential observers of whether replacements to the legal system are behaviorally equivalent to that which they replace.
The judiciary’s appeals system is the wrong observer, indifferent as it is to certain errors and extra-judicial interests. Modern society furnishes many possible alternate candidates interested in the operation of the judicial system; one or more of these, or the union of all of them, might be the relevant observer. These interests may care about those parts of the legal process that the appeals system treats as irrelevant to disposing of a case, and would not describe a system that did not reproduce them as equivalent to the current one. Furthermore, focusing on these external observers can draw attention to the unintended, downstream consequences of making changes to the system.
If there are many such observers, why only consider the perspective of one? The proper observer may be the union of all the possible observers in society interested in the operation of the legal system. This mega-observer—what this Article calls society at large—assesses behavioral equivalence by asking whether any societal constituent observes the part of the system being changed, can differentiate the new version, and cares about the difference. Society’s observations must be attended to because the legal system acts on society’s authorization. It is literally authorized through the adoption of the Constitution that vests the judicial power in the judiciary and through the elected Congress that created and populated the lower courts which exercise that power (and through the corresponding authorization of State courts). But the legal system is also meaningfully authorized by society because society’s institutions exist to ensure justice. If that authorized system is significantly modified in a way that is felt by the body politic and that departs from what was authorized, then the new system will lack legitimacy.
Unfortunately, as this Section will demonstrate, no aspects of the legal system can be assumed to be unobserved by society at large. Because the legal process is fully embedded within the society in which it operates, a replacement system that generates the same explanations and intermediate representations as the current system in a way that satisfies litigants will still be perceived by—and may affect the interests of—other observers outside the legal system. Under this view, the legal system is like a kidney, cleansing for the body that which passes through it. Looking only at the system itself and its effects on those who pass through it is like taking the perspective of the blood being cleaned. To the blood, a dialysis machine may be equivalent to a kidney. Not so to the person attached to the machine.
Examining the interests of stakeholders in society at large draws focus back to procedural justice, the justice inherent in a system, rather than substantive justice, the just outcomes that system produces. The interests in society at large may care less about the final allocation of rights and duties than about how the allocation happens. It is especially important that procedures be fair because, though the procedures of the American legal system are designed to arrive at the right result, they may not do so perfectly (indeed, convictions do get overturned). A computer is unlikely always to arrive at the right result through perfect procedures, because it would need perfect knowledge to guarantee those results and never to make the errors the current legal system makes. Even were such a system possible, the populace subject to it—to whom the subjective experience of fair treatment by the legal system matters, regardless of their cases’ outcomes—might not perceive it as just.
There are many observers beyond the judicial system that might care about the effects of altering that system. Unless there is a reason to ignore one, all these interests should be attended to when considering such changes. This Section examines several such interests, and the implications of considering the observations of society at large—the union of all these interests together.
Litigants’ observations are closely related to those of the judicial system; they overlap with but are not equivalent to them. Litigants care not only about the outcome of the case, but about how they can argue it, and their interests go beyond what might lead to reversible error. For example, the legal system is largely unconcerned with who the attorney representing a party is, but a litigant will care whether the attorney is rude or unresponsive, in addition to how she performs as a litigator. Indigent criminal defendants are not always granted their requests for new representation, but would surely see a system that always denied those requests as meaningfully different from the current one, which denies them only sometimes. Litigants care about being heard and allowed to present their case, not just the verdict, even when no due process interests are implicated. Similarly, judges care not only about the outcome of a case and the rules with which the outcome is derived, but also about whether those rules are just. Judges might therefore be less satisfied with a rule that says “X therefore Y” than with one that says “because of interests a, b, and c, X therefore Y,” even though these rules are logically equivalent. Lower-court judges who criticize in their opinions the rule deciding a case prove that judges can apply a rule with which they disagree, but clearly care about the rules they apply.
Society extends beyond the courthouse walls and includes many more people and groups who will never be involved in a lawsuit than those who are, but who may still care about the process and procedures of justice. These include, at a minimum, members of the general public, the media, governmental institutions, corporations, interest groups, and other institutions. Exhaustively describing all such interests is beyond this piece, but even the partial listing that follows illustrates the extent to which society’s observations pierce and peer into the legal system.
Members of the general public may care about the specific outcomes of cases, just as litigants do, and about the rules applied, just as judges do; there is therefore a societal interest in keeping court proceedings public. People also care about justice being applied fairly even when they have no interest in a specific case. People might also be interested in a case for the facts made public in court proceedings. What happens in courtrooms also has secondary effects on society about which the public may care. Because the public does not want the police to torture criminal suspects, it wants to ensure confessions obtained in violation of Miranda v. Arizona are excluded from criminal cases. Members of the public who value not imprisoning defendants for years at a time before and during their trials will be interested to learn about court systems’ creative accounting of time to handle backlogs of criminal cases. Like judges, members of the public care not only about the rules applied in a trial but also about the justification for those rules and the social policies those justifications support. For example, an anti-abortion small-government conservative may feel better about a ruling striking down abortion restrictions that is grounded in preventing government overreach rather than in respecting a pregnant person’s right to make choices about their own body; a pro-choice progressive who believes the government should provide citizens’ healthcare and that abortion rights protect personal autonomy might feel the reverse. Not only the outcomes and the rules generating them but also the reasons for those rules matter to the public. Furthermore, individuals are not the only members of the public whose values are implicated by what happens within the legal system: so too are the values of various institutions that process and manage social values, such as political parties, religious organizations, and universities.
Just as individuals care about what happens in courtrooms, as a microcosm of what happens in society and a reflection of society’s rules and policies, so too will these institutions.
Interest groups care about the extent to which their interests are advanced or threatened within the legal system. Interest groups include corporate interests concerned with how courtrooms interpret and apply regulations, enforce accountability with legal rules, and interpret and enforce contracts; public-interest groups that engage in impact litigation; labor unions; and professional groups like bar associations. Because these groups protect particular interests, they are invested both in the outcomes of cases that affect them and the rules that govern their behavior. These groups, along with the arbiters and promulgators of social values, are attuned to shifts in legal doctrines that change judicial decision rules. These groups may therefore intervene in cases directly, file amicus curiae briefs, and advance cases designed to change those doctrines.
All these interests rely on the media to identify and disseminate information about what happens inside the judicial system. Aside from the media’s own interests, other members of society rely on having media that observe how the legal system implicates those members’ interests. This interest often extends beyond the courtroom. For instance: when ProPublica revealed racial disparities in bail recommendations made by a software system, public outrage stemmed not only from the disparities, but from the company’s lack of transparency about its technology.
Finally, the government observes all of this and can respond by changing the way the legal system operates and the rules it applies. The government is also directly made up of the people, and members of the legislative or executive branches sometimes leave to join the judiciary, and vice versa. Because the government both observes the legal system—directly and through the interests it serves—and can modify it, this analysis has arrived at a snake eating its own tail, for behavioral equivalence purposes. If any part of the system is of interest to some stakeholder in society, that stakeholder might convince the government to modify the system. Even if the judiciary’s constitutional foundations do not change, it is clear that no part of the judiciary goes unobserved both internally and externally: the rare parts of the legal system that are constitutionally immune from Congressional interference (for example, the right to counsel) are observed internally by the judicial system; the rest are observed and potentially changed by Congress.
Again, no single one of these observers should be attended to, but they all should be: unless there is a reason to ignore some observers’ perspectives, an automated system and the judicial process it replaces do not “work the same way” unless all these observers see them as behaviorally equivalent. Similarly, if the replacement is designed to change the system or change is unavoidable, the effects should be examined from the perspectives of all these observers. The proper “external” observer to detect and evaluate changes to the legal system is therefore the union of all observers within society, including the judiciary. This union of all societal observers—society at large—captures all aspects of the legal system observed by some interest in society.
It is possible to disentangle the observations of each of the observers in society at large. In fact, taking these observers and their observations one at a time is crucial to taking stock of the consequences of making changes to the judicial system and deciding whether those consequences are worth the change. That these observers can be interrogated individually, however, does not suggest that only one matters at a time. Because each observer is a component of a larger society, to say “the academy cares about X” or “the welders’ union cares about X” or “Jane cares about X” is equivalent to saying “some part of society cares about X.” And because the government—representing the people—observes and can change any part of the system protected (and observed) by the judiciary, we should conclude that these collective observations capture everything that occurs within the legal system.
If the observer monitoring and evaluating the legal system is society at large—the collective observation of every interest observing that system—then that agglomerated interest resembles the omniscient observer to whom no replacement can be equivalent to the original process. A judge could offload the task of ruling on the admissibility of evidence onto a computer that perfectly mimicked the judge’s own decisions, but the clerk who was accustomed to participating in that analysis, and the litigator who assumed he could help persuade the judge with a clever word, would not see that system as equivalent. Nor would the judge, who would be applying the computer’s judgment rather than her own, making her a passenger in her own courtroom. It is possible that some areas of the legal system are truly observed by no one in society, but it is not clear what those would be: even what kind of coffee is served in the jury pool waiting room is observed by someone with preferences about it. For a part of the legal system to be truly unobserved, it would either have to be something literally no member of society cares about, or it would have to be managed with no human input or oversight at all, or by humans with no capacity to report on it back to the rest of society.
Again, the point is not that everything in the legal system ought to be preserved as it is, but simply to note that if anything in the legal system changes, some member of society will notice it changed and might have preferences about that which was changed. Having noticed the change, that member of society would not view the changed system as strictly behaviorally equivalent to the original system, and therefore neither will the society at large that includes that member. The danger is not that the judicial system might change, but that the change’s impacts will only be considered from a narrow range of perspectives. The literature at the intersection of legal and AI scholarship has thus far focused on perspectives internal to the judicial system: judges, advocates, litigants, and potential litigants (i.e., people subject to the jurisdiction of laws seeking to understand their rights and obligations). These are important perspectives to consider, arguably the most important ones. When considering the effects of automating some component of the legal system, scholars and policymakers should surely look first to the effects on the direct participants in that system. But the analysis should not end there. A view that defines the current judicial system in terms of its rules, reasoning, and outcomes will ignore the effects of changes on participants and other observers of the system.
We thus arrive at an uncomfortable situation. On the one hand, this Article has described an observer that can analyze whether replacement components of the judicial system are behaviorally equivalent to the original components, and can monitor and correct those changes using the judicial system’s rules surrounding reversal on appeal. But that observer will happily treat as behaviorally equivalent systems that members of society at large will not see as faithful replacements for their predecessors. And if the observer is instead the union of those members of society with an interest in any part of the legal system, there may be no part of the legal system that could be replaced while leaving the overall system equivalent in the eyes of the observer. If we use the first observer, behavioral equivalence is achievable but might undermine society’s faith in our legal system; if we use the second, behavioral equivalence is impossible. Because society at large’s concerns about the legal system should not be discarded, we must conclude that developing a system that is truly behaviorally equivalent to the current legal system is impossible.
But that’s fine—just because some changes might be observed and cared about does not mean they are not worth making. Someone might care if the coffee in the jury pool room changed and conclude that the system generating that coffee had changed, but that doesn’t mean that the courthouse must stick with the same brand of coffee. Someone might also notice if racial disparities in charging decisions disappeared and would rightfully laud that change. In both cases, whoever effects the change must decide to do so in light of the preferences of society at large’s interests, and not only rely on courts to catch any undesirable changes. Recognizing that behavioral equivalence is an impossible goal, regardless of whether it is desirable, means that any changes made are guaranteed to have observable consequences which must be reckoned with. It also means that if anyone claims their system perfectly replicates some part of the legal process, they must be ignoring the perspective of some observer in society at large. Instead of asking whether a new system is exactly behaviorally equivalent to the old one, policymakers should evaluate the extent to which it is, carefully consider the differences, and use that analysis to consider the tradeoffs involved in making any such changes.
The judicial system will not remain static, nor should it. Behavioral equivalence does not weigh in favor of or against any change, but can help to detect and evaluate those changes. This Part frames that analysis by focusing on where changes will be observed and what tradeoffs they may implicate. But note that the first assumption with which this Article began— that a computer will someday be able to perfectly mimic any given legal subprocess—has been weakened: if the society-at-large observer detects the difference between any subprocess and its replacement, perfect mimicry is impossible. Instead, let us assume only that the reasoning, participatory processes, and outcomes of every legal subprocess might one day be emulated.
Using the judicial system’s observations is an insufficient lens through which to evaluate behavioral equivalence, given society’s interests in the legal process that go beyond those observations. Nonetheless, that analysis is a useful floor for evaluating changes to the legal system, because at a minimum the legal system must be able to determine whether a case using a new system warrants reversal. It is also important to know if a change leads to markedly different legal outcomes, because that change should only be made if those differences are desirable (or at least acceptable). This Part begins by examining what is involved in detecting differences even in the simple case. It then turns to the tradeoffs policymakers will face in deciding what changes to make when true behavioral equivalence is an impossible goal. It finishes by examining the tradeoffs implicated by other scholars’ systems as described in Section I.A.
Simply detecting the change in outcomes may be a challenge. Recall that our formulation for evaluating behavioral equivalence in the eyes of the judicial system asks whether anything that happens under a new system would have led to reversal under the old system, i.e., whether a difference between the old and new system is appealable. Such differences must be detectable for this formulation to have teeth. It would be circular to define differences as detectable based on whether an overall process is appealable, because the appeal depends on having detected the difference. Defining differences as “anything detectable” risks the conclusion that any difference the observer fails to detect does not—and should not—matter.
For processes like determining admissibility of evidence, a replacement component could be run alongside the current one, and have its outputs compared to the current one’s outputs given the same inputs. But processes that operate over humans, not data, cannot happen in parallel because humans are not duplicable. Witnesses cannot be asked the same questions twice in a row (by a human and by a computer) and be expected to respond precisely the same way, because their second response will be influenced by their having been asked twice. It would be difficult to attribute differences in witnesses’ responses to the system rather than to the way it was being tested. Nor could an entire trial be run more than once. For anything not run on the old and new systems concurrently, the result under one system will be unknown and outcomes will not be directly comparable.
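For the data-driven processes where parallel operation is possible, the comparison the text describes amounts to “shadow testing”: feed identical inputs to both components and log every divergence. The sketch below is illustrative only; the toy admissibility rules and case records are hypothetical stand-ins, not a model of actual evidentiary doctrine.

```python
# Hedged sketch of running a candidate component "in shadow" alongside the
# current one. The decide functions and case records are hypothetical.
def shadow_compare(cases, current_decide, candidate_decide):
    """Feed identical inputs to both components; collect any divergences."""
    divergences = []
    for case in cases:
        old_out = current_decide(case)
        new_out = candidate_decide(case)
        if old_out != new_out:
            divergences.append((case, old_out, new_out))
    return divergences

# Toy rules: the current component excludes all hearsay; the candidate
# additionally admits hearsay covered by a residual exception.
current = lambda c: "exclude" if c["hearsay"] else "admit"
candidate = lambda c: "admit" if (not c["hearsay"] or c.get("residual_exception")) else "exclude"

cases = [
    {"id": 1, "hearsay": False},
    {"id": 2, "hearsay": True},
    {"id": 3, "hearsay": True, "residual_exception": True},
]
for case, old, new in shadow_compare(cases, current, candidate):
    print(f"case {case['id']}: current={old}, candidate={new}")
# Only case 3 diverges: the candidate admits what the current rule excludes.
```

The harness makes each divergence an observation available for review, which is precisely what the parallel-operation strategy offers for data-driven subprocesses and cannot offer for processes that operate over non-duplicable humans.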
A statistical analysis could show whether the system computes outputs within some acceptable range. This would reveal whether a system performed acceptably overall, but not whether its performance was true to the old system in a specific case, and so might not affect whether any given case could be appealed. For processes with clear standards to which the system must conform, behavioral equivalence can be evaluated directly by determining whether the standard has been met. But this describes few parts of the legal process, because standards that depend on a judge exercising judgment and discretion lack such a clear specification. Where a generous standard of review like abuse of discretion applies, the judicial system might treat even outcomes that greatly diverge from those of the old system as behaviorally equivalent to them.
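The statistical analysis described above might take the form of an aggregate agreement rate with a confidence interval. The sketch below uses a standard normal approximation; the sample counts are invented for illustration, and, as the text notes, even a very high aggregate rate says nothing about fidelity in any particular case.

```python
# Hedged sketch: aggregate agreement between old and new rulings, with an
# approximate 95% confidence interval. The counts below are illustrative.
import math

def agreement_interval(agree, total, z=1.96):
    """Agreement rate with a normal-approximation confidence interval."""
    p = agree / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half), min(1.0, p + half)

# Suppose a candidate system matched the old one in 4,890 of 5,000 sample rulings.
p, lo, hi = agreement_interval(4890, 5000)
print(f"agreement {p:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Whether 97.8% agreement is “acceptable” is exactly the kind of policy judgment, not statistical judgment, that this Part argues must be made with all relevant observers in view.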
It thus appears that outcomes could change a great deal under a new system and the judiciary’s appeals process would treat them as equivalent to the old system. But the judicial system attends to more than outcomes, so detecting differences can be supported by the roles of explanation and due process in legal decision-making. Any explanation accompanying a decision under a new system can be evaluated against the reasoning employed under the old system; any unexplained decision is observably different from one the old system would have explained. Requiring and evaluating a quality explanation any time one would be expected under the old system could ensure most differences remain detectable. This will not help where decisions are currently not accompanied by explanations, when a policymaker decides they need not be, or when there is a risk that the new system’s explanation may not correspond to its actual reasoning. Regardless, the decisions of new systems designed to mimic the behavior of old ones should be audited, so that policymakers may decide whether any differences detected are acceptable or not. Finally, litigants can be relied on to protect their own participation and due process rights, and therefore will notify an appeals court when a system abridges those rights.
Part II discussed a system introducing harmless errors as a problem when society is the observer, but not when the judicial system is the observer: because harmless errors do not lead to reversal, the legal observer that sees differences based only on reversals will see a system that introduces harmless errors as behaviorally equivalent to the old system. The current judicial system seems to treat harmless errors as constant, such that having a dozen harmless errors is equivalent to having only one. But the possibility that a new system will introduce increasing numbers of such errors should give policymakers pause, and provides an opportunity to reevaluate the assumption that harmless errors accumulate harmlessly.
Simply adopting a new system, even one that faithfully mimics the old, may lead the new system to receive different inputs than the current one does. If the new system is more transparent or predictable, it may be more easily influenced to users’ ends than is the current system. People will attempt to manipulate such a legal system to their benefit, just as they already do through contracts requiring binding arbitration using the drafter’s preferred arbitrator, through jurisdiction shopping, or by tailoring arguments to judges. Thus even a behaviorally equivalent system might lead to differences in outcomes from the current system, not because it works differently but because people work it differently. These differences might only be detectable through post-hoc analysis across cases to identify shifting trends in the kinds of claims brought and legal arguments made.
In summary: if the behavior of some component of the legal system is governed by a detailed, top-down standard, or if its potential replacement generates as much explanation as the original—and assuming harmless errors are constant—then the component and its replacement just might generate enough information for the legal system to assess whether they are behaviorally equivalent. But behavioral equivalence is only an end in itself if building a completely faithful replacement process is desirable. Perhaps the new system is meant to be different from the original in some way. Or perhaps it is fine to accept that pure behavioral equivalence is impossible, either because the above constraints are missing or because the relevant observer is society-at-large, which observes all changes. Thinking in terms of specific observers can nonetheless help evaluate the tradeoffs those changes introduce.
Considering who the relevant observers are and what they observe about a system that may be changed can help assess the ramifications of those changes. This is true whenever any change to the system is made, not just when the replacement is a computer. Thus, for any given subprocess that is being considered for replacement, the policymaker should consider who the stakeholders are and what outputs of that process they see. For example, even if harmless errors are invisible to the legal system’s appeals process, they may matter to court observers or to lawyers who account for such errors in their strategies. This approach naturally guides policymakers towards perceiving (and weighing) the consequences of the change that matter to each stakeholder.
If every aspect of a process matters to some interested party, then no replacement system, analog or digital, can be behaviorally equivalent to the current system to all observers. But change may be desirable, and not every observation from every observer should matter. It should be fine to switch the coffee in the jury pool room. Recognizing that whatever system replaces the current one cannot, by definition, be equivalent to it (at least, not for everyone) supports conscious, deliberate decision-making about the changes introduced. Knowing that change is unavoidable frees decision-makers to focus not on whether they want change, but on which changes they want and how those changes will be perceived, and helps policymakers focus on what tradeoffs can be countenanced.
The tradeoffs described below fall into four categories: informational access, process, reasoning, and outcome. Informational access tradeoffs concern how much information comes out of a system, and who has access to it. Process tradeoffs change the legal process itself, the steps through which participants in the legal system move as they travel through it. Reasoning tradeoffs change how a system reasons its way to a conclusion. Outcome tradeoffs change legal outcomes.
Replacing components of the legal system might change what information that system reveals, and to whom. Informational access tradeoffs occur when a change to the system makes it less (or more) transparent than the current system. For example, juries are often referred to as black boxes, but jurors do sometimes speak out after trials or grand jury empanelings. This information can have an impact on the real world, as when in 2020 a grand juror publicly contradicted the Kentucky Attorney General’s claims about the indictment process after the killing of Breonna Taylor, prompting protests and demands for the AG’s resignation. If an algorithm rather than a grand jury assessed whether evidence could support an indictment, there might be no opportunity for that information to be revealed to the public. Similarly, records of trials include a courtroom transcript. If court stenographers were replaced by voice-to-text transcription software, errors in the software might create errors in the transcript, which would lead to incomplete explanations of judges’ rulings, confusion about legal arguments advanced at trial, or even errors about witness testimony. When judges publish opinions, they do so in English meant to be comprehensible at least to judges and lawyers, and possibly others. A computer-generated opinion in a logical form might encode the same information, but would be less comprehensible to as wide an audience. Finally, courtrooms are generally open to the public; any change that moved any part of the legal process out of public courtrooms would reduce the information coming out of that process.
Though constitutional due process rights set a floor on how much process can be stripped from the legal system, some changes may lead to process tradeoffs, which alter how cases proceed and affect participation in them. For example, pleading a motion orally versus in a written brief changes whether the litigant can react to a judge’s response in real time or clear up a misunderstanding. A process change might also affect who may participate in the legal system. Changes that make the legal process more efficient or cheaper might increase participation, but such benefits might be inequitably distributed. For example, an automated system that reads filings for administrative purposes could be more efficient in general but less accurate than a human at deciphering hand-written filings, negatively impacting pro se litigants.
Process tradeoffs also include the risk that knowledgeable bad actors can take advantage of a system. Someone who can predict what output they will get given some input may carefully calibrate those inputs to guarantee their preferred outcome, which would invite abuse. For example, if someone determined a set of words consistently led an evidentiary admissibility system to invoke the residual exception to the rule against hearsay, that litigant would defeat the hearsay rule in general. And computer systems’ vulnerabilities differ from those of humans. When the process changes, so too might the ways people can participate in and therefore affect it, even when due process rights are not implicated. Unfortunately, automated legal decision-making has often burdened people’s ability to participate in the decision-making process.
The judicial system reasons its way to conclusions through the thoughts and analyses of judges, juries, and other adjudicators. When a change alters how conclusions are reached, the reasoning tradeoff may implicate procedural justice. Indeed, this has already happened. If avoiding reasoning tradeoffs is desirable, engineers should design replacement systems to mimic the reasoning of the current system as closely as possible. For example, if debate is an important element of jury deliberations, juries might only be replaced by multi-agent systems that can encode disagreement and argument. Some AI systems are designed to perform rule-based or analogy-based reasoning, or to combine the two. And where reasoning matters little compared to accuracy, perhaps machine learning techniques that resist inspection and interpretation are fine for certain aspects of legal reasoning.
One reasoning tradeoff computers might introduce is a certainty tradeoff: a system that tells users how probable it finds an outcome is different from humans seeking to establish whether a threshold has been reached. For example, one study found that judges generally rated a 90% certainty as being “beyond a reasonable doubt,” but there is reason to doubt that judges or jurors use numerically precise standards in reasoning about guilt or liability. If grainy footage shows twenty-five prisoners in a prison yard, with twenty-four beating a guard to death and one trying to stop them, charging a random prisoner from the yard with murder yields a 96% chance of guilt. A computer system that indicated a defendant was 96% likely to be guilty might seem trustworthy, but if its calculation relied only on the defendant’s proximity to an event, it would not be seen as dispensing justice.
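The arithmetic behind the prison-yard hypothetical can be made explicit. Charging a prisoner selected at random from the yard yields

```latex
P(\text{guilty} \mid \text{in the yard}) \;=\; \frac{24}{25} \;=\; 0.96
```

The 96% figure rests entirely on base-rate proximity to the event and incorporates no individualized evidence about the charged defendant, which is why a system reporting that number would not be seen as dispensing justice even though the probability itself is correctly computed.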
Outcome tradeoffs occur when a change in a system leads cases to be decided differently than under the old system. These are the worst tradeoffs to have happen by accident—changes to the legal system should not unexpectedly change legal outcomes—but may be the most desirable ones to achieve on purpose. There are as many situations in which outcome tradeoffs would be the goal of altering the system as there are problems with legal outcomes under the current system: racial disparities in criminal charging and sentencing; bias and disbelief against victims of sexual assault; long delays in criminal process, and many more. Unfortunately, legal automation again has a bad track record in this area. Change is not always bad, but should be made deliberately, especially when making changes to the dispensation of justice.
Tradeoffs will generally occur across those categories, not within them. For example, a machine learning system designed to make quick, first-cut assessments on whether a lawsuit is well-pleaded may make the legal system more efficient and therefore open to more litigants but generate less comprehensible decisions, thereby improving process at the expense of reasoning and informational access. A system to reduce variations in outcomes across judges might apply rules predictably but rigidly and change judgments relative to the current system, trading outcomes for reasoning.
Policymakers can attend to the effects of replacing components of the legal system, and therefore properly consider the tradeoffs implicated, by first enumerating the specific, immediate changes wrought by the replacement. The areas where those changes occur can then be examined for secondary changes, to capture downstream effects of the primary change. Then, for each change and each relevant interest in society, the analyst asks what that interest will perceive of that change. Each alteration is thus treated as a potential observation, and the analyst asks what observers might make of it. Having a list of interests like that described above in Section II.B will help guide this analysis, but such lists will likely be incomplete, so a mechanism by which interests can identify themselves will be helpful. Although predicting the specific effect of the change for each interest may be difficult, this method can help generate potential tradeoffs implicated by the change.
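The enumeration procedure just described can be sketched as a simple change-by-interest matrix. Everything below is illustrative: the list of interests paraphrases Section II.B, the example changes and categories are hypothetical, and a real analysis would ask each interest directly rather than use a stand-in perception test.

```python
# Hedged sketch of the enumeration this Part proposes: cross each change
# introduced by a replacement with each societal interest that might
# perceive it. All entries are illustrative, not exhaustive.
interests = ["litigants", "judges", "general public", "media",
             "interest groups", "government"]

changes = {
    "opinions emitted in logical form": ["informational access"],
    "no oral argument on motions": ["process"],
    "probabilistic rather than threshold reasoning": ["reasoning"],
    "shifted sentencing distribution": ["outcome"],
}

def tradeoff_matrix(changes, interests, perceives):
    """For each (change, interest) pair the interest perceives, record the
    tradeoff categories implicated, yielding a worklist for policymakers."""
    matrix = []
    for change, categories in changes.items():
        for interest in interests:
            if perceives(interest, change):
                matrix.append((change, interest, categories))
    return matrix

# Hypothetical stand-in: every interest perceives every change except one.
perceives = lambda interest, change: not (
    interest == "litigants" and change == "opinions emitted in logical form")

for change, interest, cats in tradeoff_matrix(changes, interests, perceives):
    print(f"{interest} perceives '{change}': {', '.join(cats)}")
```

The value of the exercise is not the matrix itself but the discipline it imposes: each cell is a question put to a specific observer, and empty cells flag assumptions (that an interest does not perceive a change) that deserve scrutiny.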
Reasoning about these tradeoffs is not merely an academic exercise. One controversy over the recidivism prediction software accused of racial bias is that, because the software is a trade secret, its workings were kept from both the prisoners being denied bail and the judges issuing the denials. Judges jailed defendants pending trial because a computer told them to, without telling them why—not because the algorithm was incomprehensibly complex, but to protect its intellectual property. The goal was to improve decision-making regarding who receives bail (an outcome tradeoff), but regardless of whether that goal was achieved, it came at the expense of defendants understanding why they were sent to jail (an informational access and reasoning tradeoff). In turn, the system operating out of sight makes it difficult to know whether a negative outcome tradeoff has occurred, i.e., whether the system is perpetuating racial disparities. That this has already occurred raises concerns about whether lawmakers will consider the tradeoffs involved in modifying the legal system, as this Article urges them to do.
The foregoing discussion helps to assess the potential tradeoffs involved in implementing the systems discussed in Section I.A. What follows is not criticism of the systems or their designers: models cannot capture everything about the processes they are modeling. But in considering whether to implement a model in the real world it is crucial to attend to what the model leaves out, because therein will lie the tradeoffs.
Much of the research in AI & Law focuses on modeling judicial decision-making: how the judge uses the evidence adduced at trial to dispose of a case. One strain of research within this literature has sought to replicate both legal outcomes and the reasoning used to reach them. HYPO and rule-based precedential reasoners implement different theories of precedential reasoning, but all seek to capture the mechanisms of that reasoning in order to replicate real-world case outcomes. This design goal means these systems should minimize reasoning and outcome tradeoffs. And because they reason explicitly, step by step, they might also minimize informational access tradeoffs, provided their implementations can publicize that reasoning to those who currently observe it.
Nonetheless, these systems would involve process tradeoffs. All are designed to work their way to the “best” possible answer; indeed, they presuppose that a single right answer exists. HYPO algorithms involve argumentation and the distinguishing of cases, but the algorithm itself performs these steps, not a lawyer. Professor Horty’s system involves weighing conflicting rules and precedents, but that weighing happens before any case is considered, and leaves no room for argumentation. Professor Verheij’s system not only disallows argumentation, but requires that cases in a case model not contradict one another. All these systems take in information and reason their way to an answer without external input; all would require changes to afford participants in the judicial system the process they have now.
These systems could be modified to reduce process tradeoffs. HYPO algorithms evaluate and distinguish precedents; this mechanism could be modified to evaluate arguments and precedents furnished by litigants. Professor Horty’s system considers reasons that some rules outweigh others; the system could additionally consider reasons and rules provided by litigants. But current implementations of these systems primarily model the reasoning a judge performs using facts and precedents, so adopting such systems as they are would reduce the role litigants have in shaping that analysis.
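The factor-based style of reasoning that HYPO-family systems implement can be illustrated with a toy sketch. The factors and cases below are invented for illustration and are not drawn from HYPO itself; real systems in this family use richer representations (such as HYPO's dimensions) and full three-ply argumentation. The sketch shows the mechanical core: comparing a precedent to the current case by shared factors and by distinguishing factors.

```python
# Toy factor-based comparison in the spirit of HYPO-style reasoners.
# Cases are sets of factors; each factor favors the plaintiff ("p") or
# the defendant ("d"). All factor names and cases here are hypothetical.

FACTOR_SIDE = {
    "secrets-disclosed-outsiders": "d",
    "security-measures-taken": "p",
    "info-reverse-engineerable": "d",
    "bribed-employee": "p",
}

def shared_factors(case_a, case_b):
    """Factors the two cases have in common (grounds for analogy)."""
    return case_a & case_b

def distinctions(precedent, current, winner):
    """Factors that distinguish the precedent: factors present in the
    precedent, favoring its winner, that the current case lacks."""
    return {f for f in precedent - current if FACTOR_SIDE[f] == winner}

precedent = {"security-measures-taken", "bribed-employee"}            # plaintiff won
current = {"security-measures-taken", "info-reverse-engineerable"}

print(shared_factors(precedent, current))
print(distinctions(precedent, current, winner="p"))
```

Note that in this sketch, as in the systems discussed above, the program itself performs the analogizing and distinguishing; the modification suggested in the text would amount to letting litigants supply the candidate factors and distinctions rather than having the algorithm generate them.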
A Deep Learning system could theoretically reproduce a pattern of legal outcomes in a body of law at scale. But in addition to the process tradeoffs precedential reasoners share, such models also implicate reasoning and informational access tradeoffs. Deep Learning systems reason in fundamentally different ways from humans, and their reasoning is often uninspectable and unexplainable. Even when they generate explanations, the explanations may be untrustworthy. Branting et al.’s approach predicts which facts in a case text correspond to legally relevant features; they argue that these features can explain outcomes by pointing to the factors in a case that contributed to the result. Identifying these intermediate features could mitigate, but will not eliminate, the reasoning and informational access tradeoffs. Those tradeoffs are intertwined here: because a Deep Learning system’s reasoning is hidden within its network, that reasoning may neither track human reasoning nor be observable to the outside world. In systems like HYPO every reasoning step is inspectable; a Deep Learning system is closer to a black box. These approaches offer significant benefits: they are more scalable than other approaches, and Branting and colleagues argue that a system like theirs could help pro se litigants understand and frame their claims. But as a legal decision-maker, such a system implicates process, reasoning, and informational access tradeoffs.
Not all the AI research described above necessarily involves process tradeoffs. Professor Prakken’s formal model of argumentation preserves the role of litigants and adjudicators in litigation. Indeed, formal models of argumentation are motivated in part to preserve due process and the adversarial system. Because Professor Prakken has proposed a formal model, not implemented a system, it is difficult to evaluate what tradeoffs an implementation would bring. For example, the model does not specify how the adjudicator determines when the burden of persuasion has been met. Without a specification for the judge’s decision-making we cannot know what reasoning, outcome, or informational access tradeoffs that replacement judge might bring. The model does not account for fact-finding and witness questioning, which may implicate process and informational access tradeoffs. But the model’s representation of a trial as a dialogue between advocates and managed by an adjudicator, and its separation of these roles, should minimize process tradeoffs.
Professors Gowder and Livermore suggest that machine learning systems might one day be used to eliminate ambiguity in open-textured legal terms, or at least provide a reliable way of resolving ambiguities. Professor Livermore argues that such a system could lead to new kinds of statutes, wherein, for example, an object is a vehicle under a “no vehicles in the park” statute exactly if some specified neural network system says so. Such classifiers could involve process and outcome tradeoffs by eliminating argument and persuasion in close cases: there may be putative vehicles for which both prosecutor and defense can mount a strong case, and whose status as a vehicle for in-the-park purposes would turn only on which side presents the more finely crafted argument. A Deep Learning system would eliminate that ambiguity, changing both the outcome and the process used to reach it. Professor Gowder limits his proposed system to the elimination of factual ambiguity, to preserve the due process rights of litigants to argue what the law should be to the judge, and to preserve the legitimacy of the system by having the judge continue to determine what the rules should be and what justice entails. He also suggests that litigants might argue about which features go into the machine learning model, which reduces process tradeoffs by maintaining a role for argumentation. Such a system might implicate reasoning tradeoffs if it can tell you that something is a vehicle, but not why. For complex factual determinations it may also implicate informational access tradeoffs: providing an object’s features and a judgment of whether it is a vehicle—the inputs and outputs—may give as much information as the current system; not so when the system must synthesize and transform its inputs into intermediate representations to reach its conclusion.
On the other hand, it may provide a beneficial informational access tradeoff, if the public can learn what counts as a vehicle without getting a ticket.
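The "vehicle exactly if the model says so" statute can be made concrete with a toy sketch. The feature weights and threshold below are invented stand-ins for a trained neural network; the decisive property they share with the real thing is that the rule's output is a fixed function of the inputs, with no step at which a finely crafted argument could change the answer.

```python
# Toy stand-in for a classifier serving as a "no vehicles in the park"
# rule. The weights and threshold are invented for illustration; a real
# system would be a trained neural network, but the legally relevant
# property is the same: the close case is decided mechanically.

WEIGHTS = {"has_wheels": 0.4, "motorized": 0.5, "carries_people": 0.3}
THRESHOLD = 0.6

def is_vehicle(features):
    """The statute: an object is a vehicle iff the model's score
    for its features meets the threshold."""
    score = sum(WEIGHTS[f] for f in features if f in WEIGHTS)
    return score >= THRESHOLD

# Borderline objects over which, under the current system, prosecutor
# and defense could each mount a strong case:
skateboard = {"has_wheels", "carries_people"}
motorized_wheelchair = {"has_wheels", "motorized", "carries_people"}

print(is_vehicle(skateboard))
print(is_vehicle(motorized_wheelchair))
```

Under such a rule the informational access benefit noted above is also visible: anyone can run the function over an object's features and learn whether it counts as a vehicle without first getting a ticket.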
Professor Genesereth has argued that if consumer technology has access to data that could be used to determine whether behavior is unlawful, it should advise its users of that information. Such technology might eventually take on the role of enforcer. If it does, the informational access tradeoffs might benefit users: they would understand exactly why they were suffering an enforcement action. The outcome tradeoffs might benefit society, if additional enforcement of laws is good. If the system uses the same reasoning as would an officer writing a ticket, there is no reasoning tradeoff. But there would be an enormous process tradeoff: gone would be the ability to duck the law, argue one’s way out of a ticket, or get the ticket dismissed by going to court and not having the officer show up to defend it. Having the arm of the law sitting in every consumer’s pocket would dramatically change the process by which laws are enforced.
Professor McGinnis and Steven Wasick argue that cheap computation and large-scale data could enable dynamic rules: rules whose content changes depending on real-world conditions. Such rules will constrain different behaviors under different circumstances, leading to outcome tradeoffs. If dynamic rules closely match the rules and standards they replace (but with shifting criteria for when they apply), they might not implicate reasoning and process tradeoffs. But the authors argue that one advantage of these rules is to “thwart judicial discretion” and apply more mechanistically than standards, suggesting that they view reasoning tradeoffs as desirable. Dynamic rules might also lead to informational access tradeoffs if the rules’ changing standards are not communicated clearly to those whose behavior they govern.
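A dynamic rule of this kind can be sketched as a rule whose operative threshold is computed from current conditions. The speed-limit example below, including its adjustment amounts and floor, is hypothetical and not drawn from McGinnis and Wasick:

```python
# Sketch of a dynamic rule: a speed limit that tightens with conditions.
# The base limit, adjustments, and floor are invented for illustration.

BASE_LIMIT_MPH = 65

def dynamic_speed_limit(raining, school_in_session, congestion_level):
    """Return the operative limit given current conditions
    (congestion_level runs from 0.0 to 1.0)."""
    limit = BASE_LIMIT_MPH
    if raining:
        limit -= 10
    if school_in_session:
        limit -= 15
    limit -= int(10 * congestion_level)
    return max(limit, 25)  # a floor, so the rule never bans driving outright

def violates(speed, **conditions):
    return speed > dynamic_speed_limit(**conditions)

# The same conduct is lawful or unlawful depending on conditions the
# driver may not fully observe -- the informational access concern above.
print(violates(60, raining=False, school_in_session=False, congestion_level=0.0))
print(violates(60, raining=True, school_in_session=False, congestion_level=0.5))
```

The sketch also makes the outcome tradeoff visible: driving sixty miles per hour is permitted or prohibited depending on inputs that the replaced, static rule would have ignored.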
Professor Coglianese and David Lehr argue that administrative agencies can use Machine Learning like any other tool, including for adjudication and rulemaking. They describe several associated risks which fit within this Article’s framework. These include the risks that algorithms might introduce biases into agency decision-making—implicating reasoning and outcome tradeoffs—and that individuals challenging agency decisions will be unable to effectively scrutinize them—an informational access tradeoff. They also raise concerns regarding using quantitative rather than qualitative standards, determining acceptable error rates, emotional and dignitary costs to participants, and privacy concerns, which respectively correspond to reasoning, outcome, process, and informational access tradeoffs.
Finally, Professor Volokh argues that an AI that writes judicial opinions that pass the Turing test should be permitted to serve as a judge. The Turing test is likely the wrong test of intelligence and capability, because it depends on humans perceiving an author as human and intelligent, and humans are prone to anthropomorphizing and to perceiving intelligence where none exists. Natural language generation systems have already proven themselves better at generating convincing-seeming language than at displaying an understanding of the world. Some natural-language generation system may soon produce text that humans, even experts, will be convinced was written by a judge about a case; but I expect that close examination will reveal faults in reasoning and fact-finding that might be forgiven when attributed to a human judge, but not to a computer judge.
Professor Volokh suggests the standard should be whether human experts are persuaded by the opinion, not only convinced a human generated it. But just as humans want to believe their interlocutors are human and read intelligence into automatically-generated texts, humans can be persuaded by all sorts of arguments, including those that contain logical fallacies. Professor Volokh allows that there may be risks involved in such a system, but he does not count among them flawed reasoning that nonetheless passes his test. But a system that generates faulty-but-persuasive reasoning would implicate not only reasoning tradeoffs (at least, against a human whose reasoning is not faulty), but outcome tradeoffs too, if faulty reasoning leads to mistaken outcomes. Additionally, the reasoning relied on by the AI system may not be the same as that which it writes on the page. If the AI derives outcome X using the internal rule “A → X,” but in its opinion states that it relied on the rule “B → X,” which rule did it use? If “B → X” is persuasive to the reader, does it matter that the system actually used “A → X” in its reasoning? Assuming that human judges do not disguise the grounds for their decisions, an AI system that did so would involve informational access tradeoffs as well as reasoning tradeoffs. That said, these criticisms mostly concern Professor Volokh’s test of a successful system, not its capabilities. If an opinion-writing system accurately represented its reasoning and only took over opinion-writing, the only informational access tradeoffs involved might be the lost conversations between judge and clerk, and the process tradeoffs may only arise from limiting the judge to trial management and not adjudication.
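The divergence between the rule a system actually applies and the rule its opinion recites can be shown schematically. Everything in this sketch is hypothetical: the decision procedure applies the internal rule "A → X," while the opinion generator, consulted separately, always attributes the outcome to fact B.

```python
# Schematic of the "A -> X" versus "B -> X" problem discussed above:
# the decision procedure and the opinion generator consult different
# facts. All rules and case facts here are invented.

def decide(case):
    # The rule actually applied: if fact A is present, outcome is X.
    return "X" if case.get("A") else "not-X"

def write_opinion(case, outcome):
    # The stated rationale attributes the outcome to fact B,
    # regardless of what decide() actually relied on.
    if outcome == "X":
        return "Because fact B is present, the outcome is X."
    return "Because fact B is absent, the outcome is not X."

case = {"A": True, "B": False}
outcome = decide(case)
opinion = write_opinion(case, outcome)

# The opinion reads as a fluent justification, yet it misstates the
# operative rule: it cites B even though B is absent and A drove
# the result.
print(outcome)
print(opinion)
```

A persuasiveness test applied only to `opinion` cannot detect this mismatch; only access to the decision procedure itself, or to the case facts it consulted, reveals that the stated ground and the operative ground differ.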
The foregoing discussion shows that not only will there be tradeoffs involved in automating components of the judicial system, but the consequences will often appear in areas other than those that the replacements target for change. Even if a replacement leaves a system’s internal processes intact, it may change how people interact with that system or the information it transmits. Those considering making changes to the legal system must attend to these unintended consequences.
Behavioral equivalence is a useful concept because it shows that asking whether two systems work the same way is an insufficiently precise inquiry. The behavior of any system can only be assessed through some observer making observations about that system. Two systems may be equivalent from one observer’s perspective and quite different according to others. By focusing attention on observers and their observations, behavioral equivalence transforms the question from “are these systems the same?” to “who is looking at these systems and what do they see?”
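The observer-relativity of behavioral equivalence can be demonstrated with two toy programs. Both compute the same value, one by iteration and one by closed-form formula; the observers below, one seeing only outputs and one also seeing the work performed, are illustrative inventions rather than constructions from the process-calculus literature.

```python
# Two systems compute the sum 1 + 2 + ... + n by different means.
# Whether they are "behaviorally equivalent" depends on the observer.

def sum_by_loop(n):
    total, steps = 0, 0
    for i in range(1, n + 1):
        total += i
        steps += 1          # one observable work step per addition
    return total, steps

def sum_by_formula(n):
    # Gauss's closed form: a single observable work step.
    return n * (n + 1) // 2, 1

out_a, steps_a = sum_by_loop(100)
out_b, steps_b = sum_by_formula(100)

# Observer 1 sees only the answer: the systems are equivalent.
print(out_a == out_b)       # True for every n
# Observer 2 also observes the work performed: the systems are distinct.
print(steps_a, steps_b)     # 100 steps versus 1
```

To the first observer the two systems are interchangeable; to the second they never were. The question "are these systems the same?" has no answer until an observer is specified.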
Because equivalence is inherently subjective to the observer, the frame of reference policymakers use will determine how difficult it is to create a system that is behaviorally equivalent to the current one. When scholars and policymakers focus on the reasoning and outcomes of judicial decision-making, they implicitly treat the judicial system as the observer, an observer whose observations are confined to that which remains within the judicial system. Under this view, whether some new process is equivalent to the old one can be evaluated by asking whether the differences in their behavior would be appealable had the old process produced them. This focus on appealable differences would allow the system to tolerate a variety of changes, such as increases in harmless error, by declaring processes de jure equivalent if not de facto so.
But ignoring differences that are invisible to the judicial system is misguided if what matters is not only what the judicial system observes about itself, but also what others see in it. And if all those outside perspectives matter—if society at large is the observer—behavioral equivalence is an impossible goal because every change will be detected by some observer. This realization is liberating, because it allows policymakers to abandon any idea that technology can perfectly mimic some part of the legal process for every observer. It also reminds scholars and policymakers that if they think some new system is behaviorally equivalent to the current one, they are likely leaving out someone’s observations of that system. By assuming that any change made will matter to some interest in society, attention can be focused on what the impacts of a change will be, and who will perceive them. Expanding the frame of reference beyond the legal system and to society at large invites a careful analysis of the consequences of change and helps to consider how proposed changes will affect processes, outcomes, reasoning, and informational access.
Components of the legal system may one day be replaced by computer algorithms, and there may be significant benefits to doing so. But the consequences of making those changes will likely extend beyond whatever specific issues the systems are designed to address. Legal outcomes are not the only outputs of the legal system, and the other outputs matter to interests beyond the system’s participants. Scholars and policymakers must attend to these secondary changes—and to the observers who will perceive them—in order to consider properly the full consequences of any change. This is true regardless of whether the replacement is by a computer algorithm or simply a different system run by humans. Change is not inherently good or bad, but a change that advances some interest at the expense of another should be made deliberately and with careful consideration of that tradeoff.
Though this Article limits its analysis to scholarship proposing an active role for AI in automating the legal system and legal decision-making, a wealth of scholarship has examined the risks and potential impacts of AI and automation on the law. See, e.g., Danielle K. Citron, Technological Due Process, 85 Wash. U. L. Rev. 1249, 1278 (2008); Robert Brauneis & Ellen P. Goodman, Algorithmic Transparency for the Smart City, 20 Yale J.L. & Tech. 103, 116–18 (2018); Pauline T. Kim, Data-Driven Discrimination at Work, 58 Wm. & Mary L. Rev. 857, 883–92 (2017); Solon Barocas & Andrew D. Selbst, Big Data’s Disparate Impact, 104 Calif. L. Rev. 671, 671 (2016); Kate Crawford & Jason Schultz, Big Data and Due Process: Toward a Framework to Redress Predictive Privacy Harms, 55 B.C. L. Rev. 93, 106–09 (2014); Joshua A. Kroll et al., Accountable Algorithms, 165 U. Pa. L. Rev. 633, 633 (2017); Mireille Hildebrandt, Law as Information in the Era of Data-Driven Agency, 79 Mod. L. Rev. 1, 24–25 (2016).
PL researchers have described other equivalence relations. See generally Cédric Fournet & Georges Gonthier, The Join Calculus: A Language for Distributed Mobile Programming, in Applied Semantics: International Summer School, APPSEM 2000, at 268, 288–313 (2000). One useful equivalence is contextual equivalence, which asks whether two different subprograms behave the same way when plugged into a larger program. See Morris, supra note 69, at 17–18. An analogous inquiry in the law might ask how the larger legal system assesses similarity of legal decision-making processes that are embedded within that system. See infra Section II.A.