
Defining, measuring and establishing LLM trustworthiness

· 5 min read
Ben Johns
Founder of complyleft

The primary use case for LLMs is generative AI: a user provides an input, a “prompt”, which can be a text string or an image, encoded into a series of tokens. The LLM takes those tokens and predicts the tokens most likely to follow; that prediction, the generated information, becomes the output of the LLM. All of this rests on the data the LLM was pre-trained and fine-tuned on, along with the data used for reinforcement learning.
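As a minimal sketch of that prediction loop, here is a toy example in Python. A hand-built bigram probability table stands in for the neural network; real models learn these probabilities from their training data, but the generation loop has the same shape (all names here are illustrative, not a real API):

```python
# Toy illustration (not a real LLM): greedy next-token prediction over a
# hand-built bigram table. Real models learn these probabilities from
# pre-training data; the generation loop has the same shape.

BIGRAM_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "<end>": 0.1},
    "down": {"<end>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10):
    """Repeatedly predict the most likely next token (greedy decoding)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = BIGRAM_PROBS.get(tokens[-1])
        if not candidates:
            break
        # Pick the highest-probability continuation (greedy decoding).
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```

The point to notice: the model's output is entirely determined by the probabilities it absorbed from its training data, which is exactly why the trust question keeps coming back to the data.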

Question: Do we trust people to provide correct, accurate, and trustworthy information?

This is an almost impossible question to answer.

If that's the case, then how could we possibly trust LLMs? After all, a significant portion of their training data primarily comes from vast and often unverified sources on the Internet.

Who created all that data on the Internet? We did. That's our information, opinions, ideas, theories, and work.

I’ll ask the question again: Do we trust people to provide correct, accurate, and trustworthy information?

Does that have any correlation to “how could we possibly trust LLMs”?

Yes, there is some correlation, and to me, this sounds like a risk.

There are positive impacts to gain from using and developing LLMs - I don’t think I need to spell them out - but negative impacts need to be addressed.

Risk management is always about deciding whether it is worth taking a risk and what we can do to minimise the negative impacts while still gaining positive impacts.

I feel it’s entirely possible to establish trust in the tech systems we use to build, manage, deploy, and host LLMs.

However, on the data side, trust looks like a cat-and-mouse game. The more we know and understand about the inner workings of the black box, the neural networks within the LLMs, the more we can do to reduce the potential negative impacts.

I’ll ask another more pointed question:

Do we really trust the data we are using to train these models?

Is the data good enough to train our LLMs so they provide correct, accurate and trustworthy outputs?

This is a new frontier, and we don't have all the answers today. As science advances and new technology is deployed, trustworthiness can only follow; there will always be a period in which the unknown can happen.

It's not all doom and gloom; after all, this risk has immense potential upside.

Rest assured, companies leading the way, like Anthropic and OpenAI, are managing these risks at the bleeding edge. Out of that work have come some great publications that I feel are the starting point for proving that controls and measures are in place to reduce the potential negative impacts of a lack of trust.

Please see:

Anthropic: Constitutional AI

https://www.anthropic.com/news/claudes-constitution

Anthropic: Responsible Scaling Policy

https://www.anthropic.com/news/anthropics-responsible-scaling-policy

OpenAI: Preparedness Framework

https://cdn.openai.com/openai-preparedness-framework-beta.pdf

In addition to this groundbreaking work, there are newly published frameworks to help us all manage the risks within the black box. See here:

Databricks AI Security Framework

https://www.databricks.com/sites/default/files/2024-03/databricks-ai-security-framework-dasf-whitepaper-v4-final.pdf

OWASP Top Ten for Large Language Model Applications

https://owasp.org/www-project-top-10-for-large-language-model-applications/

OWASP Top Ten LLM Application Checklist

https://owasp.org/www-project-top-10-for-large-language-model-applications/llm-top-10-governance-doc/LLM_AI_Security_and_Governance_Checklist-v1.pdf

OWASP LLM Security Verification Standard

https://owasp.org/www-project-llm-verification-standard/

NIST AI Risk Management Framework

https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf

MITRE ATT&CK | ATLAS

https://atlas.mitre.org/

ISO/IEC 42001 Artificial Intelligence Management System (AIMS)

https://www.isms.online/iso-42001/

The biggest risk we face is the quality of the data we use to train LLMs. It all comes back to the data. Can we trust the data?

Establishing trust should not be a one-off process; rather, it should be revisited and reassessed. What matters here are the outputs of LLMs, the predictions the models make. Trust can only be gained by seeing or feeling the impact of the model's output.

Therefore, trustworthiness must be a lifecycle, upheld by principles. The criteria for trust must be well defined and measurable. Trust must be established through assurance to be proven effective. Once established, trust must be maintained as advancements and changes are made to AI systems.

Criteria of trustworthiness:

Ethics, honesty, morals and value alignment.
Transparency, interpretability and explainability.
Robustness, reliability and dependability.
Privacy and security.
Human oversight, control and feedback.
Monitoring and feedback loops.
Governance and accountability.

Principles of trustworthiness:

Establish confidence that the results are true, credible and believable from the perspective of the audience.
Ensure the findings are dependable and repeatable if the inquiry reoccurs.
The confidence that is formed can be corroborated by other forms of authority.
The results of establishing trustworthiness can be generalised and transferred to other contexts or settings.
The lifecycle of trustworthiness must follow the entire lifespan of the AI system: from data collection and validation, to data cleaning and categorisation, design and development, testing and validation, deployment and monitoring, feedback and adjustment, and finally, retirement or transition.

All the best!

Ben