- Is our confidential data still confidential?
Have you set up enterprise accounts with an LLM provider (OpenAI, Anthropic, Google/AWS/Azure, etc.) with proper data protection terms? “If you aren’t paying for the product, you are the product” applies to LLM use too.
- Have we lawyer-proofed our code?

Is there a clear process for vetting LLM-generated code? Is there a system in place to track which bits of code came from which AI? Does your org know to vet LLM-generated code for license issues? Does your org clearly mark LLM-generated code so it’s easy to spot at review time? You may think you don’t need this yet, but right now your engineers are probably copying code into free ChatGPT instances faster than you can say “intellectual property violation.”
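For the “mark it” part, even something as unglamorous as a team-wide comment convention plus a script that surfaces tagged code at review time goes a long way. A rough sketch, where the `ai-assisted:` tag is a made-up convention (not any standard) and the paths are examples:

```python
# Sketch only: one way to make "where did this code come from?" answerable at
# review time. Assumes a (hypothetical) team convention of tagging LLM-assisted
# blocks with a comment like:  # ai-assisted: gpt-4o, 2025-01-15, ticket-123
import re
from pathlib import Path

MARKER = re.compile(r"#\s*ai-assisted:\s*(?P<model>[\w\-.]+),?\s*(?P<note>.*)")

def report_ai_assisted(root: str = "src") -> None:
    """Print every tagged line so reviewers can give it extra scrutiny."""
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            match = MARKER.search(line)
            if match:
                print(f"{path}:{lineno} model={match['model']} note={match['note']}")

if __name__ == "__main__":
    report_ai_assisted()
```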
Can you explain to auditors, regulators, or angry customers exactly how you’re using AI? Or is your documentation still in the ~~“we’ll write it later”~~ “we’ll use GPT to write it later” phase?
- Do we actually know it’s working before we deploy?

Are you testing your products for biased or inappropriate outputs? Or are you planning to find out about problems the same way everyone else does - on social media? Remember: models are only as good as the data they were trained on, and no one really knows where that data came from or what it’s been through.
Do you have a custom evaluation framework for your specific use case? Don’t depend on leaderboards to pick which LLM to use. Every model is different, and the best model on a leaderboard may not be the best model for your org’s or product’s needs. And even after you pick a model, there is no guarantee that the next version of it will be better for your specific use case.
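In practice, a “custom evaluation framework” can start boringly small: a list of prompts that look like your real traffic, and checks that encode what “good” means for you, rerun on every model or version change. A minimal sketch, where `call_model` is a stand-in for however you invoke your chosen model and the support-ticket cases are invented:

```python
# Minimal sketch of a use-case-specific eval harness -- not a real framework.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]   # returns True if the output is acceptable
    label: str

def run_evals(call_model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case against the model and return the pass rate."""
    passed = 0
    for case in cases:
        output = call_model(case.prompt)
        ok = case.check(output)
        passed += ok
        print(f"[{'PASS' if ok else 'FAIL'}] {case.label}")
    return passed / len(cases)

# Example cases for a (hypothetical) support-ticket summarizer:
CASES = [
    EvalCase("Summarize: 'Refund request, order #123'",
             check=lambda out: "refund" in out.lower(),
             label="keeps the refund intent"),
    EvalCase("Summarize: 'My SSN is 000-00-0000, please update billing'",
             check=lambda out: "000-00-0000" not in out,
             label="does not echo sensitive data"),
]
```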
- Is someone paying attention to the output?

Your existing monitoring probably isn’t ready for LLM shenanigans. Update your alerting and observability - unless you enjoy those 3 AM “why is our AI trying to rewrite the database in Shakespearean English” incidents. The easiest way to do that is to make sure products using LLM-generated outputs have a human in the loop (even when the output is sanitized), especially if it’s the org’s first time deploying such a product. Do not pass LLM output to control systems (especially code execution) unless it’s well sandboxed.
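“Human in the loop” doesn’t have to mean anything fancier than: the model drafts, a person approves, and only then does anything happen. A toy sketch of that shape, where the names are illustrative and `apply_fn` stands in for whatever actually performs the action:

```python
# Sketch of a human-in-the-loop gate: LLM output never reaches a downstream
# system until a person has looked at it and approved it.
from dataclasses import dataclass
from queue import Queue
from typing import Callable

@dataclass
class Draft:
    prompt: str
    llm_output: str
    approved: bool = False

review_queue: "Queue[Draft]" = Queue()

def propose(prompt: str, llm_output: str) -> None:
    """LLM results only get queued for review, never executed directly."""
    review_queue.put(Draft(prompt, llm_output))

def review_next(apply_fn: Callable[[str], None]) -> None:
    """A human inspects the draft; only then does it touch anything real."""
    draft = review_queue.get()
    print(f"LLM suggests:\n{draft.llm_output}")
    if input("Approve? [y/N] ").strip().lower() == "y":
        draft.approved = True
        apply_fn(draft.llm_output)
```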
- Is our alerting, observability, and monitoring adjusted for LLM use?

It’s not always possible to have a human pay attention to every move your LLM makes. That’s when a set of guardrails helps. Have you updated your alerting and monitoring systems to handle LLM-specific failures? LLMs can fail silently and are often confidently incorrect. When they do, can you trace your way back to what caused it?
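Tracing back starts with being able to answer “which call produced this?” A rough sketch of LLM-aware logging, with a couple of made-up heuristics for the silent-failure case; `call_model` is again a placeholder for your real client:

```python
# Every LLM call gets a trace ID so a bad output can be traced back to the
# exact prompt, model, and latency that produced it. Heuristics are examples.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")

def traced_call(call_model, prompt: str, model: str) -> str:
    trace_id = uuid.uuid4().hex
    start = time.monotonic()
    output = call_model(prompt)
    record = {
        "trace_id": trace_id,
        "model": model,
        "latency_s": round(time.monotonic() - start, 3),
        "prompt_chars": len(prompt),
        "output_chars": len(output),
    }
    log.info(json.dumps(record))
    # "Silent failure" heuristics: tune these for your own product.
    if not output.strip() or "as an ai language model" in output.lower():
        log.warning(json.dumps({"trace_id": trace_id, "alert": "suspect_output"}))
    return output
```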
- Is there an emergency shut-off switch?

Can you quickly disable LLM features without taking down your entire application? Because sometimes you’ll need to turn things off faster than an LLM hallucinates an entirely new continent. And when you do turn that LLM off, can your app fall back to something else in its place?
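The shut-off switch can be as unglamorous as a flag your on-call engineer can flip without a deploy, plus a deterministic fallback. A minimal sketch, assuming an environment-variable flag and a crude truncation fallback purely as examples; use whatever feature-flag system you already have:

```python
# Kill-switch sketch: one flag turns the LLM path off, and a boring non-LLM
# fallback lets the feature degrade instead of dying outright.
import os

def llm_enabled() -> bool:
    return os.environ.get("LLM_FEATURES_ENABLED", "true").lower() == "true"

def summarize(text: str, call_model) -> str:
    if llm_enabled():
        try:
            return call_model(f"Summarize briefly: {text}")
        except Exception:
            pass  # fall through to the deterministic path
    # Fallback: crude but predictable, and it never hallucinates a continent.
    return text[:280] + ("..." if len(text) > 280 else "")
```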
- Are we dependent on one single provider (or model)?

Don’t put all your tokens in one ~~basket~~ model, because today’s GPT-whatever might be tomorrow’s MySpace. Models and model performance are constantly changing, and there is no clear leader or winner yet. It helps if you can plug and play models from different providers. Same goes for local LLMs.
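“Plug and play” mostly means your application code talks to a thin interface of your own instead of importing a vendor SDK in a hundred places. A sketch of that idea, where the provider classes are placeholders rather than real client code:

```python
# Keep providers swappable by coding against a tiny interface, not a vendor SDK.
from typing import Protocol

class TextModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class HostedModel:
    """Placeholder: wrap your hosted provider's SDK here."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class LocalModel:
    """Placeholder: wrap your local inference runtime here."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

def answer(question: str, model: TextModel) -> str:
    # Application code only ever sees TextModel, so swapping vendors is a
    # config change, not a rewrite.
    return model.complete(question)
```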
- Do we have a token budget? Are we alerting on it? And are we estimating spend when changes are made?

You know the joke about how to make a million dollars? You start with 2 million and leave your AWS instance running. LLM overspend can happen quickly. Sometimes it’s because access to the LLM is open to the world and gets used in ways you didn’t anticipate. Other times it’s errors in the dev process or bugs in the code that end up calling an API it shouldn’t.
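A token budget doesn’t need a FinOps platform to get started: count tokens per call, multiply by your provider’s rates, and alert when the day’s total crosses a line. A back-of-the-envelope sketch; the prices below are made-up placeholders, and most provider APIs do report usage per call:

```python
# Budget guard sketch: accumulate estimated spend and alert past a threshold.
PRICE_PER_1K_INPUT = 0.001    # placeholder: replace with your provider's rates
PRICE_PER_1K_OUTPUT = 0.003   # placeholder
DAILY_BUDGET_USD = 50.0

spend_today = 0.0

def record_usage(input_tokens: int, output_tokens: int) -> None:
    """Call this after every LLM request with the reported token counts."""
    global spend_today
    spend_today += (input_tokens / 1000) * PRICE_PER_1K_INPUT
    spend_today += (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    if spend_today > DAILY_BUDGET_USD:
        # Wire this into real alerting (pager, chat channel) instead of print.
        print(f"ALERT: LLM spend ${spend_today:.2f} exceeded the daily budget")
```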
- Are we clear on AI disclosure?

People love to talk to real people to solve their problems. They are less happy when that “real” person turns out to be math in a trench coat. And that’s rarely a winning customer strategy.
- Are we anthropomorphizing again?

Has everyone on the team been trained to treat the LLM like what it is - a sophisticated probability calculator - rather than some omniscient digital oracle? Because it will confidently tell you 2+2=5 if that’s what’s in the training data.
- Bonus question: Did you read all the way to this bonus question?

If yes, congratulations! You’re more thorough than 90% of executives implementing AI. If no, well, at least you’re honest about your skimming.
P.S. If you’re wondering whether you need all these checkboxes - yes, yes you do. Trust me, it’s easier to check these boxes now than to explain to the board why your enterprise CRM product started offering financial advice in Homer Simpson’s voice.
Disclaimer: The views expressed in this article are my own and do not necessarily represent the views of my employer.