Software runs. Much like a motor car, software applications are initiated (keys in ignition), driven forwards to execute their functions (foot on the gas), steered in various directions depending on the user’s requirements (asked to perform specific functions and instructions) and sometimes accelerated (additional processing power), remodelled (we call it code refactoring) and eventually stopped.
As we now drive our software systems with increased use of generative Artificial Intelligence (gen-AI) and direct this new automated smartness with the new controls available to us, our fuel will largely be pumped from the Large Language Model (LLM) datasets that propagate and populate our machine brains’ know-how. But this vehicle will still be subject to wear and tear, and it will still need maintenance and upgrades.
To drive the analogy home one more time, much like running a vehicle, a business needs to know that its LLM is operating at peak performance and efficiency at all times. While there is always testing and maintenance in the software industry, we don’t always think about this process with the same rigor that a good mechanic might apply to a vehicle inspection test.
AI model cascades with Mixture of Thought
Subir Mansukhani, staff data scientist at Domino Data Lab, agrees with the comparison and says that one of the key ways we can test roadworthiness is through a novel approach known as Model Cascade with Mixture of Thought (MoT).
To break down those terms, an AI model ‘cascade’ pipeline is AI software built with an implicit understanding that some questions are complex and some are comparatively simple. With this knowledge, we can channel simpler questions toward less expensive Large Language Models (typically smaller, perhaps less exhaustively checked and with weaker answer consistency) and push the more complex questions to stronger, higher-cost Large Language Models – hence, a cascade effect is created, with some AI work and computation carried out at a higher level than the rest. Because AI LLMs are accessed via Application Programming Interfaces (APIs), software application developers have the power to change gears and hit the turbocharge where needed.
“This is meaningful stuff, especially when you consider our increasing reliance on LLM providers like OpenAI or Anthropic,” said Mansukhani, speaking to press and analysts at a deep dive summit this month. “As companies scale up the use of prompt engineering and Retrieval-Augmented Generation (RAG) – [a technique used to ground AI responses in additional retrieved data], they rapidly increase the number of queries they send to these LLMs. As expected, the more advanced LLMs, like GPT-4 or Anthropic’s Claude Opus, cost more money to use. The price gap between these leading-edge models and older or less sophisticated ones can be huge – 60 times in the case of Anthropic. That should be reason enough to look into how LLM Cascades with MoT can save organizations money.”
Get to know LLM Cascades
LLM Cascades involve using multiple LLMs in sequence, where queries are submitted to LLMs in order of increasing cost and computational power until a satisfactory answer is achieved. The primary challenge in this space thus far has come down to determining whether the response from an LLM is adequate without needing further escalation.
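To make that concrete, here is a minimal sketch of the cascade loop in Python. The `ask` and `is_adequate` helpers are hypothetical stand-ins for an LLM API call and for the adequacy test discussed below, and the model names and per-call prices are purely illustrative rather than the study's figures.

```python
# Illustrative cascade: try models in order of increasing cost until an
# answer passes the adequacy check. `ask` and `is_adequate` are hypothetical
# stand-ins for an LLM API call and the consistency test described below.
MODELS = [
    {"name": "cheap-model", "cost_per_call": 0.001},
    {"name": "strong-model", "cost_per_call": 0.03},
]

def cascade(question, ask, is_adequate):
    total_cost = 0.0
    for tier in MODELS:
        answer = ask(tier["name"], question)
        total_cost += tier["cost_per_call"]
        # Stop at the first tier whose answer looks good enough;
        # always accept the final (strongest) tier's answer.
        if is_adequate(answer) or tier is MODELS[-1]:
            return answer, total_cost
```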
A study by researchers from George Mason University, Microsoft and Virginia Tech titled “Large Language Model Cascades With Mixture of Thought Representations for Cost-Efficient Reasoning” introduces a function that assesses the adequacy of an LLM’s response without relying on additional models, thereby reducing costs significantly. Mansukhani notes that the team at Domino was able to recreate and confirm this conclusion.
AI answer consistency
“Mixture of Thought (MoT) utilizes two LLMs – GPT-3.5 Turbo as the weaker model and GPT-4 as the stronger. The technique relies on the principle of ‘answer consistency’: LLMs will produce the same answers to a prompt at ‘high’ temperatures when queried multiple times if they are confident in the results. If the weaker model’s responses are consistent, there is no need to engage the stronger model, saving both money and time. This method is particularly useful when the weaker model’s answers are inconsistent, indicating the need to query the more powerful model to ensure accuracy,” explained Mansukhani.
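Read as code, that principle might look like the sketch below: sample the weaker model several times at a high temperature and only escalate when the samples disagree. The `sample` function is a hypothetical stand-in for a call to the weaker model, and the sample count and threshold are illustrative, not the study's settings.

```python
from collections import Counter

def consistent_answer(question, sample, n_samples=5, threshold=0.6):
    """Sample the weaker model n_samples times at high temperature.

    Returns the majority answer if its share of the samples meets the
    threshold (i.e. the model looks 'confident'), otherwise None,
    signalling that the stronger model should be queried instead.
    """
    answers = [sample(question, temperature=1.0) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / n_samples >= threshold else None
```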
The researchers also utilized prompting techniques designed specifically for reasoning tasks – Chain of Thought (CoT) prompting and Program of Thought (PoT) prompting. CoT encourages LLMs to explain their reasoning steps, which enhances accuracy on complex tasks. PoT extends this by having the models output program code-like logic, which sharpens the model’s reasoning capabilities.
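As a rough illustration of the difference (the study's actual few-shot prompts are more elaborate), a CoT prompt asks for step-by-step reasoning in natural language, while a PoT prompt asks the model to emit code whose execution produces the answer.

```python
# Illustrative prompt templates; the study's actual prompts differ.
COT_PROMPT = (
    "Question: {question}\n"
    "Let's think step by step, then give the final answer after 'Answer:'."
)

POT_PROMPT = (
    "Question: {question}\n"
    "Write a short Python program that computes the answer and stores it "
    "in a variable named `answer`."
)
```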
“The study introduced two methods to evaluate answer consistency: voting and verification. Voting involves generating multiple answers from a prompt at a high-temperature setting and comparing their similarity to identify the most consistent answer. You can adjust this method by setting a flexible consistency threshold to meet budget constraints. Verification, on the other hand, compares answers across different prompting techniques, accepting the weaker model’s response if the results match. The bottom line: a Mixture of Thought approach saves money,” said Mansukhani.
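In sketch form, voting and verification might be expressed as follows. The string-equality check for agreement is a simplification on my part; Domino's own implementation, noted below, measures agreement with embedding similarity instead.

```python
from collections import Counter

def vote(answers, threshold):
    """Voting: accept the most common answer if its share of all
    sampled answers clears the (budget-tunable) threshold."""
    best, count = Counter(answers).most_common(1)[0]
    return best if count / len(answers) >= threshold else None

def verify(cot_answer, pot_answer):
    """Verification: accept the weaker model's answer only when two
    prompting styles (e.g. CoT and PoT) agree with each other."""
    return cot_answer if cot_answer == pot_answer else None
```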
The researchers used the following sum to calculate the cost (a toy version of this sum is sketched after the list):
- The cost of prompting the weaker model (which is prompted several times).
- The cost of the answer evaluation process.
- If the evaluation process rejects the answer, the cost of prompting the stronger model.
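Here is a toy version of that sum, with entirely illustrative per-call prices standing in for real API rates:

```python
def cascade_cost(weak_calls, weak_price, eval_price, escalated, strong_price):
    """Total cost of one cascaded query, per the breakdown above.

    weak_calls: number of times the weaker model was sampled
    escalated:  True if the evaluation rejected the weaker model's answer
    Prices are illustrative placeholders, not real API rates.
    """
    cost = weak_calls * weak_price + eval_price
    if escalated:
        cost += strong_price
    return cost

# Example: 5 weak-model samples, a cheap evaluation step, no escalation needed.
print(cascade_cost(weak_calls=5, weak_price=0.001, eval_price=0.0005,
                   escalated=False, strong_price=0.03))
```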
The results of this work showed that using Mixture of Thought variants – combining voting and verification with CoT and PoT – can deliver comparable performance at 40% of the cost of using GPT-4 alone. In testing against the CREPE Q&A dataset, Mixture of Thought outperformed GPT-4 at 47% of its cost. The Domino team advises that mixing PoT and CoT improves decision-making compared to using either technique alone. Increasing the threshold when using the voting method did not significantly impact quality despite the additional cost. The consistency model proved reliable in identifying correct LLM answers, successfully predicting when to fall back to the strong model to obtain optimal results.
Domino adapted the paper in this GitHub repository, using Langchain for the thought representations and adding Tree of Thought (ToT) as a further thought representation for complex reasoning tasks. It also used cosine similarity on the embeddings computed from responses to measure answer consistency.
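A minimal sketch of that embedding-based consistency check, assuming the embeddings for each response have already been computed, might look like the following; the helper uses plain NumPy and an illustrative similarity threshold rather than Domino's actual code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answers_consistent(embeddings, threshold=0.9):
    """Treat a set of answers as consistent if every pair of their
    embeddings is at least `threshold` similar (threshold is illustrative)."""
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) < threshold:
                return False
    return True
```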
As generative AI matures, let’s optimize
“Generative AI is now in its second year in the spotlight. As it matures, gen-AI projects will face the additional scrutiny that all maturing innovations face as they move into the mainstream. IT and AI leaders must demonstrate the value of their projects and answer the eternal question: Does this make us money? Lowering costs while increasing model reliability will help win support for future efforts and accelerate gen-AI adoption. LLM Cascades with MoT can help accomplish that goal, but it demands effort. The cost savings and performance improvement are more than worth it,” enthused an upbeat Mansukhani.
The thoughts presented here fall very much in line with the way large-scale enterprise software systems develop over time, i.e. they adapt and specialize towards a more precision-tuned level of operation (one last motoring analogy there) so that they can perform better, run faster and make sharper turns.
Gentlepeople, start your AI engines.