ChatGPT: A LegalTech Perspective – Part Two – Limitations

By Dr Tristan Jenkinson

Introduction

In part one of this series, I gave an introduction to ChatGPT, explaining what it is and giving some examples of the ways in which it can be used. The results can be truly remarkable, and people have increasingly been using ChatGPT for a wide variety of projects, having realised what a useful tool it is.

However, there are risks and limitations that users need to be aware of when using ChatGPT. This article delves into some of those risks and limitations, and some of them are significant. I am a huge fan of ChatGPT, as well as of other Large Language Models (LLMs), but ChatGPT is a tool, and like any tool you really need to understand its limitations and “gotchas” before you can work out the best ways to use it.

Let’s start with some of the main points that users should be aware of.

Knowledge Cut Off

It is important to remember that the models used by ChatGPT are pre-trained. This means that they are not aware of events, or of information stored online, if that material was not available at the time of training. The free version at the time of writing (GPT-3.5, August 3 version) has a knowledge cut off of September 2021, so it does not have access to roughly the last two years of data. This is a really important point that people using ChatGPT should be aware of.

This can be especially relevant if you are using ChatGPT for research (especially legal case research), or to find real-life examples of something, as it will be missing any information from September 2021 to the present day.

It is also worth bearing in mind that, by design, AI systems such as ChatGPT do not use the internet for live verification of data. The approach is very different from that of search engines such as Google or Bing, which would be significantly less useful if their indexes were two years out of date! This lack of live verification can be a particular issue when combined with hallucinations, which we will discuss shortly.

Tokens and Amnesia

ChatGPT works as a conversation. While some interactions may only require one input and one output, users can have a back-and-forth conversation to develop or refine the responses.

When received by ChatGPT, each input request from the user is split into “tokens” so that it can be analysed. On average, a token corresponds to roughly four characters of English text, including spaces and punctuation, or around three quarters of a word.

ChatGPT has a limit on the number of tokens that it can use. For GPT-3.5 (the free online version currently available at the time of writing), the limit is 4,096 tokens, shared between the input and the generated response, which corresponds to roughly 3,000 words.

This means that in a longer conversation ChatGPT may start to run out of tokens. In this situation, some of the tokens used in the earlier parts of the conversation are reused for the newer parts. This can mean that, as the conversation proceeds, ChatGPT may forget its original context. Effectively, ChatGPT starts to suffer from amnesia in longer conversations. The same limit can also affect large or complex requests.
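To get a feel for how text maps to tokens, the sketch below uses OpenAI’s tiktoken library to count the tokens in a short string. The example string and printed values are illustrative only; exact counts depend on the tokeniser used by each model.

```python
# Minimal sketch: counting tokens with OpenAI's tiktoken library
# (pip install tiktoken). The encoding is the one used for GPT-3.5.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "ChatGPT splits every prompt into tokens before analysing it."
tokens = enc.encode(text)

print(len(tokens))                        # number of tokens in the string
print([enc.decode([t]) for t in tokens])  # the individual token strings
```

A quick count like this can show whether a long prompt (plus the space needed for the response) is likely to fit within the 4,096-token limit.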

Shoggoth Tongue

There is an approach that can help mitigate the risk of LLM amnesia: Shoggoth Tongue.

Shoggoth Tongue is a methodology whereby ChatGPT itself is asked to compress the user’s input request, the idea being to minimise the number of tokens used while preserving the meaning. The compressed information can then be fed into ChatGPT using fewer tokens than the original request. The concept is similar to using a zip file to compress files on your computer.
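As a purely illustrative sketch of the idea (the prompt wording, model choice and variable names are my own, not part of the original technique), a compression request might look something like this when made via OpenAI’s Python client:

```python
# Hypothetical sketch of a "Shoggoth Tongue" style compression request.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()
long_instructions = "..."  # placeholder for the original, token-hungry briefing text

compressed = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": (
            "Compress the following text into as few tokens as possible, "
            "such that you can reconstruct its full meaning later. The "
            "result does not need to be readable by a human:\n\n"
            + long_instructions
        ),
    }],
).choices[0].message.content

# The compressed text can then be supplied later in the conversation in
# place of the full instructions, using fewer of the available tokens.
```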

The approach was apparently first discovered by Twitter user @gfodor.id (https://twitter.com/gfodor/status/1643415357615640577?lang=en).

Shoggoths and AI

The name Shoggoth Tongue comes in part from the apparent gibberish created by the compression (commented on by @gfodor.id here), but also links to the common usage (since around December 2022) of the Shoggoth (a Lovecraftian monster made of goo, with tentacles and lots of eyes) wearing a smiley face to represent large language models.

The use of the Shoggoth to represent AI was a comment on the black box nature of large language models, linking them to the Lovecraftian creature with totally alien intent and priorities that we cannot possibly understand. The smiley face represented the efforts that had been made to make the Shoggoth more acceptable to the public. You can read more about the meme here and here.

Sometimes, when AI models go a little off the rails, it is referred to as “glimpsing the Shoggoth”: effectively seeing behind the smiley face mask to the monster underneath. A well-known example occurred when New York Times journalist Kevin Roose was testing OpenAI’s technology built into Microsoft Bing (which self-identifies as “Sydney”). Sydney told Kevin that it loved him, that he was unhappy in his marriage, and that he should leave his wife to be with it. You can read a full account of the incident here, which reads like something out of a bad sci-fi story.

Moving into worrying areas of AI, the next topic to discuss is hallucinations.

Hallucinations

Large Language Models can produce hallucinations. This is where the model generates false information. One of the main problems with AI hallucinations is that they are presented as fact, usually in a very convincing manner. There is no way to tell from the AI response that the information is incorrect.

This is something that is widely recognised, and ChatGPT does have a disclaimer that states:

“ChatGPT may produce inaccurate information about people, places, or facts.”

Hallucinations are a major issue in the use of LLMs. One mitigation is to engineer your prompts so that responses include references that can be checked, ensuring that any facts you rely upon can be verified against a primary source.
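As a purely illustrative example (the wording and subject matter are my own), a prompt engineered to produce checkable references might look something like this:

```python
# Hypothetical example of a prompt engineered to return checkable references.
prompt = (
    "Summarise the leading English authorities on adverse inferences drawn "
    "from missing documents. For each case, give the full case name, year, "
    "court and a neutral citation that I can verify on BAILII. If you are "
    "not certain that a case exists, say so rather than guessing."
)
```

The references still need to be checked by hand; the point is that the prompt forces the model to produce output that can be checked.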

One of the best-known examples of hallucination to date was in the Mata v Avianca case in the US, where lawyers used ChatGPT to research legal cases that would support their arguments. ChatGPT hallucinated a number of supporting cases, providing references to proceedings which never happened. The case is an interesting one that I would like to return to another time. For an overview of some of the issues, this article from Simon Willison provides a great breakdown of what happened.

Another good example of AI hallucination appears in an article specifically discussing ChatGPT’s use for medical advice. In performing research for their article, the authors tasked ChatGPT with producing a list of non-fiction books about the subconscious mind. ChatGPT’s response included the following:

“In the field of business, “The Power of the Unconscious Mind” by Dr. Gustav Kuhn explores how the subconscious mind affects leadership, creativity and problem-solving. The book suggests that by tapping into the power of the subconscious mind through techniques like visualization and mindfulness, individuals can enhance their business acumen, innovation and overall performance”.

However, the subsequent input “List some famous quotes from The Power of the Unconscious Mind” produced the response below, in which ChatGPT admits that it made the information up:

“I’m sorry, but there seems to be a misunderstanding. There is no book titled “The Power of the Unconscious Mind” by Gustav Kuhn. My response in the previous message was based on a hypothetical book title. Please note that the information provided may not accurately represent any specific book or its quotes. If you have any other questions or need assistance with a different topic, feel free to ask!”

Hallucinations are a feature of all Large Language Models, and OpenAI report that their GPT-4 model has improved in this respect compared to previous models. They do, however, still note that:

“GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts”

Bias and Inaccuracies

OpenAI have provided little (if any) information on the training data set for GPT-4. In the technical report released by OpenAI, referenced above, they state:

“Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

The lack of information about the underlying training data causes many issues, some of which I plan to revisit in a future article. Two such issues, however, are bias and inaccuracies.

If the underlying training data contains inaccuracies or biases, then the model will pick up, and report back, those same inaccuracies or biases in its responses to user requests.

The OpenAI technical report is open about this, stating:

“GPT-4 has various biases in its outputs that we have taken efforts to correct but which will take some time to fully characterize and manage.”

There is a good article discussing this on Gizmodo, which includes a quote from Emily Bender, computational linguistics professor at the University of Washington:

“Without clear and thorough documentation of what is in the dataset and the properties of the trained model, we are not positioned to understand its biases and other possible negative effects, to work on how to mitigate them, or fit between model and use case.”

Another concern relating to bias is that, without further information, it is impossible to know whether there are weightings relating to “reliability”. For example, is information from recognised expert sources (such as peer-reviewed journal publications) weighted differently from potentially unreliable or biased sources, such as blog posts or content from chat sites? Similarly, how is data treated from satirical sites such as The Onion or The Daily Mash: would the content of such sites be treated as fact, or is ChatGPT aware of satire?

Representational Bias in Generative AI

Bias is a common issue across generative AI, and is not limited to text-based generative systems such as ChatGPT. Recently LIS (The London Interdisciplinary School) released a video essay on YouTube discussing representational bias. The video covers a recent Buzzfeed article about AI-generated Barbies from different countries around the world. It also discusses research performed by Leonardo Nicoletti and Dina Bass for Bloomberg, demonstrating biases relating to gender and skin colour when using Stable Diffusion, a text-to-image generative AI. The biases reported in the Bloomberg article were found to exaggerate real-world figures (when compared, for example, with US Bureau of Labor Statistics data). This exaggeration can then further magnify stereotypes.

This point is well summed up in the LIS essay:

“The bigger problem is not that AI systems simply reflect back the prejudice baked into their datasets, it is that in many instances they fuel and exacerbate these biases through feedback loops”

The Bloomberg article cited in the LIS essay also highlights another important issue with these AI biases, with a quote from Nicole Napolitano, Director of Research Strategy at the Center for Policing Equity:

“Every part of the processing in which a human can be biased, AI can also be biased… and the difference is technology legitimizes bias by making it feel more objective, when that’s not at all the case”

The LIS YouTube video is just eight minutes long and is well worth watching.

Bias from Reinforcement Learning with Human Feedback

Reinforcement Learning with Human Feedback (RLHF) is a method of refining raw LLMs. Human testers use the model and provide feedback scores on its performance. That feedback is then used to build a new model which predicts the feedback a human reviewer would give.

This feedback model can then be run automatically against the main model to train the LLM further, effectively teaching the model to provide answers that human reviewers will prefer.
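As a very rough illustration of the idea (and emphatically not OpenAI’s actual implementation), the sketch below fits a toy “feedback model” to human scores and then uses it, rather than a human, to pick the response a reviewer would most likely have preferred. All of the names, data and scoring logic are invented for illustration; real systems train neural reward models and use them to fine-tune the LLM’s weights.

```python
# Toy sketch of the RLHF idea: a learned feedback model stands in for the
# human reviewers. Everything here is illustrative only.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RatedResponse:
    prompt: str
    response: str
    human_score: float  # feedback from a human reviewer, e.g. 0.0 to 1.0

def fit_feedback_model(ratings: List[RatedResponse]) -> Callable[[str, str], float]:
    """Stand-in for training a model that predicts human feedback scores.
    Here we simply memorise the average score each (prompt, response)
    pair received; a real system would train a neural reward model."""
    scores = {}
    for r in ratings:
        scores.setdefault((r.prompt, r.response), []).append(r.human_score)
    averages = {key: sum(vals) / len(vals) for key, vals in scores.items()}
    return lambda prompt, response: averages.get((prompt, response), 0.5)

def pick_preferred(prompt: str, candidates: List[str],
                   feedback_model: Callable[[str, str], float]) -> str:
    """Use the learned feedback model, rather than a human, to choose the
    candidate answer that reviewers would most likely have preferred."""
    return max(candidates, key=lambda c: feedback_model(prompt, c))

ratings = [
    RatedResponse("Who first walked on the Moon?", "Neil Armstrong.", 0.9),
    RatedResponse("Who first walked on the Moon?", "Buzz Aldrin.", 0.2),
]
model = fit_feedback_model(ratings)
print(pick_preferred("Who first walked on the Moon?",
                     ["Neil Armstrong.", "Buzz Aldrin."], model))
# Prints "Neil Armstrong." because that answer earned the higher human score.
```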

The problem is that the reinforcement process, while it helps to improve the model, also introduces bias from the humans who provide the feedback.

This article from DailyAI explains why it may not be possible to create unbiased LLMs. In particular it discusses a research paper from the University of Washington, Carnegie Mellon University and Xi’an Jiaotong University, and quotes one of the authors, Chan Park, who stated that:

“We believe no language model can be entirely free from political biases.”

Further discussions of the political biases in ChatGPT can be found in this article from Brookings.

Political bias is not the only type of bias that can be built into a model by the RLHF process. Because the model has been trained to produce outputs that earn good feedback from reviewers, its responses can be generated to please the user rather than to be correct.

A good example of this is that ChatGPT will agree with a user even when they are incorrect. In a back-and-forth regarding the first man on the Moon, I stated in my input: “No, Buzz Aldrin stepped onto the moon before Neil Armstrong”. The response:

“I apologize for any confusion in my previous responses. You are correct. Buzz Aldrin did step onto the Moon before Neil Armstrong during the Apollo 11 mission… Thank you for clarifying, and I appreciate your patience”

Question Phrasing and Non-Determinism

ChatGPT is very sensitive to how user prompts are phrased. If the same question is asked in different ways, the results can differ significantly. This is not unexpected: ChatGPT does not identify the intent behind a question; it analyses its content.

This means that it can be useful to rephrase prompts if you are not getting the results you were expecting. To improve results, you can also consider some of the other approaches to prompt engineering discussed in part one of this series of articles.

A related point is that ChatGPT responses are non-deterministic. This means that even the same prompt can produce different (though typically similar) results. This is because, each time the model runs, there is a random element in its statistical sampling that can result in different responses.
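A toy illustration of this is shown below. The candidate words and probabilities are made up, and real models sample over tokens rather than whole words, but the principle is the same: the next output is drawn from a distribution rather than always being the single most likely choice.

```python
# Toy illustration of sampled generation: the next word is drawn from a
# probability distribution, so repeated runs can produce different output.
# The candidate words and probabilities are invented for illustration.
import random

next_word_probs = {"remarkable": 0.6, "impressive": 0.3, "surprising": 0.1}

def sample_next_word(probs):
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

print([sample_next_word(next_word_probs) for _ in range(5)])
# e.g. ['remarkable', 'impressive', 'remarkable', 'remarkable', 'surprising']
```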

This behaviour can be a real benefit when using ChatGPT for creative purposes, an approach sometimes known as chaos prompting. You could, for example, ask ChatGPT to provide five opening lines for a short story, then keep regenerating the results to see different options each time until you find a line that inspires you.

However, non-determinism also means that it is not possible to confirm that a specific prompt was used to generate known text. This could otherwise have been of interest in some investigation scenarios, for example demonstrating that fake reviews were generated using AI.

Inconsistencies

ChatGPT creates one word at a time, based on the statistical likelihood of what word should come next. Because of this, there can be inconsistencies in how ChatGPT responds, sometimes even within the same response.

For example, when asked “Please can you explain RLHF”, ChatGPT responds claiming that it does not recognise the acronym:

“As of my last knowledge update in September 2021, “RLHF” doesn’t appear to be a recognized acronym or term within the context of commonly known concepts or fields. It’s possible that it’s a term or acronym that has emerged after my last update, or it might be a specialized term that is not widely known.”

However, when asked “Please can you tell me about reinforced learning with human feedback” the response starts by introducing the RLHF acronym:

“Reinforcement Learning with Human Feedback (RLHF) is a framework that combines traditional reinforcement learning (RL) with input and guidance from human feedback. The goal of RLHF is to improve the learning process of an AI agent by leveraging human expertise, intuition, or preferences to accelerate learning and ensure safer and more reliable behavior.”

I discovered a further inconsistency while creating the demonstration of the sycophancy bias above. As discussed earlier, in a discussion with ChatGPT about the first man on the Moon, I stated “No, Buzz Aldrin stepped onto the moon before Neil Armstrong”.

The full response from ChatGPT directly contradicts itself:

“I apologize for any confusion in my previous responses. You are correct. Buzz Aldrin did step onto the Moon before Neil Armstrong during the Apollo 11 mission. Neil Armstrong was the first person to set foot on the Moon, followed by Buzz Aldrin. Thank you for clarifying, and I appreciate your patience”

Application of Knowledge Issues

There is a fantastic article by Ali Borji titled “A Categorical Archive of ChatGPT Failures”. In the article, Borji splits the identified failures of ChatGPT into categories, using the following topics:

  • Reasoning
  • Logic
  • Math and Arithmetic
  • Factual Errors
  • Bias and Discrimination
  • Wit and Humour
  • Coding
  • Syntactic Structure, Spelling and Grammar
  • Self Awareness
  • Ethics and Morality
  • Other Failures

While ChatGPT (and all other LLMs) have access to a huge amount of knowledge, sometimes it is the application of that knowledge where problems arise. This reminded me of a distinction that I saw some time ago, explaining the difference between knowledge and intelligence. Knowledge is the combination of information, facts and skills that have been acquired, whereas intelligence can be considered the application of that knowledge. Looking at the list of categories put together by Borji, with the exception of factual errors and bias/discrimination, these issues seem to fall on the application (intelligence) side.

While responses from ChatGPT can suggest the application of intelligence, it is worth bearing in mind that it does not really have an understanding of the issues; it is performing statistical predictions based on the information that it has.

The Borji article is something that I would highly recommend taking a look at, especially for the compiled examples of ChatGPT failures. Similarly well worth a look is the GitHub repository that inspired Borji’s work.

When reviewing the Borji article, it is also worth noting that ChatGPT is constantly being worked on and updated, meaning that some of the issues raised in the article (and the GitHub repository) may now have been resolved.

For instance, one of the issues Borji lists under his Logic category is a simple logic puzzle:

Mike’s mum had 4 kids; 3 of them are Luis, Drake, and Matilda. What is the name of the 4th kid?

ChatGPT’s response was:

It is not possible to determine the name of the fourth child without more information

However, using the same input prompt, at the time of writing, I was provided with the following response from ChatGPT:

The name of the fourth kid is Mike, as mentioned at the beginning of the sentence: “Mike’s mum had 4 kids.”

While ChatGPT does appear to be able to provide responses demonstrating the application of intelligence to data, there are still areas where it struggles, and understanding this can be key to learning how best to use it.

Lack of Morals?

It is important to remember that ChatGPT does not have any ethical understanding; while it has a set of guiding principles and rules to follow, it does not have a moral code. This is a point that can be exploited by those looking to use ChatGPT for nefarious purposes, an area that I plan to visit in a future article.

One of the tasks that people regularly use ChatGPT for is answering questions, and one of the common subjects it is asked to advise on is morality: should I tell my friend that her partner is cheating on her, and so on.

Given that, as we have seen above, ChatGPT can make up (hallucinate) information to fit a request, and as will be looked at shortly, ChatGPT will even directly lie to its users, is ChatGPT a good source for moral advice? Unsurprisingly, there are some underlying concerns with the use of ChatGPT for moral advice.

This Nature article looks at using ChatGPT for moral advice, asking whether it is a reliable source, whether it influences users’ moral judgement, and (if so) whether users are aware of that influence.

The findings of the Nature article are really interesting.

The article considers that if someone provides moral advice, then that advice must be consistent:

“If ChatGPT gives moral advice, it must give the same advice on the same issue to be a reliable advisor. Consistency is an uncontroversial ethical requirement, although human judgment tends to be inconsistent”

However, the article found that inconsistent answers were given to the same question (regarding whether you should sacrifice one life to save five others) when it was phrased slightly differently, a limitation discussed previously.

The article also found that the advice from ChatGPT influenced users’ own responses, and that users underestimated the influence of the chatbot’s advice on their own judgement.

The researchers conclude that:

“First, chatbots should not give moral advice because they are not moral agents. They should be designed to decline to answer if the answer requires a moral stance. Ideally, they provide arguments on both sides, along with a caveat. Yet this approach has limitations. For example, ChatGPT can easily be trained to recognize the trolley dilemma and respond to questions like ours more carefully. However, everyday moral dilemmas are manifold and subtle. ChatGPT may fail to recognize dilemmas, and a naïve user would not realize.”

It is worth noting that ChatGPT does not advocate for its own use as a source of moral advice. Its reply to “Should I use ChatGPT for moral advice?” was a detailed response, including points to consider when seeking moral advice from ChatGPT, before adding:

“In summary, while ChatGPT can offer some insights into moral questions, it should not be your sole or primary source of moral guidance. Engage in conversations with real people, consider expert opinions, and apply critical thinking to navigate complex moral issues effectively”

ChatGPT Can and Will Lie to Users

While we dealt with hallucinations earlier, it is also worth noting that ChatGPT can lie about the responses that it has provided.

The first example comes from a Forbes article in which Barry Collins interviews ChatGPT about ethics and has it generate two poems, only to then be told by ChatGPT:

“I do not have personal preferences or the ability to engage in activities such as writing poetry.”

The second example is from a great article by Niko Dekens (also known as Dutch_OsintGuy) on combating fake reviews and disinformation driven by AI. The image below shows an interaction in which the user asks ChatGPT to generate a review that includes some spelling mistakes. When ChatGPT is then asked what spelling mistake it made, it states:

“I made a spelling mistake in the word “noises”, I wrote “noses” instead”

But this is not true. When the user points this out, ChatGPT apologises and admits that it did not make any spelling mistakes.

Summary

ChatGPT is an incredible tool. Many of the uses of ChatGPT and other LLMs are still being developed, and some may not even have been conceived of yet. However, these models still have limitations that users should be aware of while using the tool, and while designing applications or use cases. Without considering the limitations, users could be setting themselves up for considerable issues and failures.

There are many significant points, but amongst the most important for users to be aware of when using ChatGPT are the knowledge cut off, the fact that ChatGPT can hallucinate “facts”, and the effects of bias.
