November 7th, 2022
Ankur Goyal, Alana Anderson | 7 minute read
Large Language Models (LLMs) are essentially a shiny new tool for processing textual inputs and producing textual outputs. They are capable of performing tasks that were previously very difficult for computers (e.g. text summarization, reasoning) using simple natural language instructions (i.e. "prompts"). Some have compared LLMs to unix pipes, which can also perform an infinite number of tasks with text input/output. In this analogy, a "prompt", like a unix program, is a piece of text that instructs the machine (language model or CPU) to do something.
Compilers offer a similar value proposition: a user provides text, which the machine can interpret to produce textual outputs. Unlike LLMs, compilers require highly structured input: a programming language defined by a formal grammar1. Under the hood, a compiler works by applying transformations that are mathematically guaranteed to preserve the program's meaning. As a result, compilers can generate highly performant code by repeatedly rewriting high-level statements into a series of equivalent lower-level instructions that execute much faster.
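As a toy illustration of this kind of semantics-preserving rewriting, here is a sketch (our own, not any particular compiler's internals) of a constant-folding pass built on Python's `ast` module:

```python
import ast
import operator

# Map AST operator nodes to their Python semantics.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

class ConstantFolder(ast.NodeTransformer):
    """A semantics-preserving rewrite: constant subtrees become their values."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if (
            isinstance(node.left, ast.Constant)
            and isinstance(node.right, ast.Constant)
            and type(node.op) in OPS
        ):
            value = OPS[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value), node)
        return node

tree = ConstantFolder().visit(ast.parse("x = (2 + 3) * 4"))
print(ast.unparse(ast.fix_missing_locations(tree)))  # x = 20
```

Each rewrite is small and obviously correct, but composed over thousands of passes, the cumulative speedup is enormous.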
In many ways, LLMs and compilers emulate different aspects of human intellect. In Kahneman's popular framework, today's LLMs mimic our System 1 mode (automatic, flexible, error-prone), while compilers automate our System 2 reasoning (deep, complex, logical, precise). A natural area of overlap between LLMs and compilers is code generation, i.e. "take natural language instructions as input and write me some code" (Copilot, CodeGen, Repl.it). However, we believe that code generation is just scratching the surface.
Given the incredible volume of AI research over the last few months, there's naturally been a lot of chatter about LLMs, and in particular, generative AI. While there's no shortage of cool applications being built, we wanted to go a level deeper. As database nerds, we focused on the intersection of LLMs and compilers and discovered the following projects and ideas along the way. We would love to hear from you if any of these concepts resonate with you.
No code (and business intelligence) tools have enabled a wide range of non-programmers to build software. These tools are highly visual: they take several clicks to assemble, but the results are very simple for our brains to understand.
To alleviate this tradeoff, many tools (e.g. Airtable, Retool, Webflow) include templates, which accelerate the setup process. LLMs provide an interesting opportunity to turbocharge these templates. Imagine inputting "set up an invoice template with a mobile app for submitting payments" or "create a website that looks like this template but uses this other website's color scheme". In many ways, this is more powerful than text-based code generation: visual programs are time consuming to create by hand, but easier to visually verify and tweak, creating a high value feedback loop that can improve the underlying model.
We're not aware of any specific research on this topic, partly because no code tools tend to be proprietary and inaccessible to researchers. However, we'd bet that generic code generation models will straightforwardly generalize to generate no code "programs", potentially without requiring fine tuning2.
Computer programs have many brittle components: handling/generating errors, parsing human input, communicating with external systems, etc. LLMs are a natural fit for these "messy" problems. In general, we expect to see LLM libraries flourish and eventually be used in just about every piece of software. This trend will resemble the shift to API-centric code over the last decade. A recent example shows GPT-3 integrated into Google Sheets.
The prevailing toolsets are HuggingFace's Transformers library, which allows you to leverage pre-trained models in simple Python code, and OpenAI's GPT-3 API. However, models often return incorrect results, which is difficult to handle with modern software engineering patterns. As a result, machine learning powered software tends to feel buggy.
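For instance, the Transformers `pipeline` API reduces a summarization task to a few lines (a minimal example; the default model, and therefore the output, will vary):

```python
from transformers import pipeline

# Downloads a pre-trained summarization model on first use.
summarizer = pipeline("summarization")

article = (
    "Large Language Models (LLMs) are capable of performing tasks that were "
    "previously very difficult for computers, such as summarization and "
    "reasoning, using simple natural language instructions."
)

result = summarizer(article, max_length=30, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```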
Recently, more advanced techniques like Chain of Thought prompt a model to explain its reasoning step by step, which can lead to higher accuracy and more explainable results. Factored Cognition and Cascades take this one step further by formalizing LLM prompt chains into a probabilistic programming framework, which forces you to write code that robustly handles uncertainty. Other interesting projects include LangChain, which aims to make prompt engineering composable, and Binder, which allows a model to solve ambiguous tasks by generating code with callbacks to the model.
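To make chain-of-thought concrete, here is a minimal sketch against OpenAI's completions API (the prompt and model choice are illustrative, and error handling is omitted):

```python
import openai

openai.api_key = "YOUR_API_KEY"

# Appending "Let's think step by step" nudges the model to emit its
# intermediate reasoning, which tends to improve multi-step accuracy.
prompt = (
    "Q: A store sold 23 apples in the morning and twice as many in the "
    "afternoon. How many apples did it sell in total?\n"
    "A: Let's think step by step."
)

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=128,
    temperature=0,  # keep the reasoning as deterministic as possible
)
print(response.choices[0].text)
```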
From a business standpoint, it's not clear to us yet where the value will accrue. Will it be the underlying APIs, or the apps built on top of them? The best models, or the best model serving infrastructure? Can open source products monetize at scale? These are all open questions that will play out over the coming years.
On the flipside, models can learn to execute code like a traditional program, and then be applied to solve more open-ended reasoning problems. TAPEX first learns how to execute SQL queries and is then fine-tuned to answer open-ended questions about tables, beating previous state-of-the-art performance. POET takes this one step further by learning how to execute SQL and Python, and then answering open-ended reasoning questions.
One of the key benefits of this approach is that you can access an unlimited quantity of training data by generating random (valid) programs, capturing their output, and then training a model to produce the same output3. You can think of these models as flexible approximations of computer programs that can be prompted with natural language and tolerate some ambiguity. We expect this technique to flourish in logic-heavy domains like business intelligence, for example letting a user look at a chart of monthly revenue and ask "which month had the most MRR growth?"
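Here's a minimal sketch of that data-generation loop, using sqlite3 to execute randomly sampled queries and record (program, output) pairs (the table and the tiny query grammar are our own toy assumptions):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (month TEXT, mrr INTEGER)")
conn.executemany(
    "INSERT INTO revenue VALUES (?, ?)",
    [(f"2022-{m:02d}", random.randint(50, 150)) for m in range(1, 13)],
)

def random_query():
    # Sample from a tiny grammar of valid SQL; a real pipeline needs far
    # more diversity (see footnote 3).
    agg = random.choice(["SUM", "MAX", "MIN", "AVG", "COUNT"])
    return f"SELECT {agg}(mrr) FROM revenue"

# Each (query, result) pair becomes a supervised example teaching the
# model to "execute" SQL.
training_pairs = []
for _ in range(1000):
    query = random_query()
    result = conn.execute(query).fetchone()[0]
    training_pairs.append((query, str(result)))
```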
Compilers have been remarkably effective at optimizing code. You can write a high-level SQL query, Python script, Rust program, etc. and trust a compiler to generate high-performance code. However, searching over the set of possible optimizations is expensive and limited by the imagination of (human) compiler authors.
AI is an excellent tool for efficiently proposing many candidate solutions to a problem, which can lead to more optimized code. DeepMind's AlphaTensor discovered new, faster matrix multiplication algorithms that human researchers had missed. OtterTune automatically tunes your database for optimal performance. The compiler of the future may have an LLM "copilot", guiding it through optimization passes.
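As a toy version of that search, here is a sketch that times random orderings of hypothetical (here, no-op) optimization passes; in the copilot scenario above, a learned model would propose promising orderings instead of sampling blindly:

```python
import random
import time

# Hypothetical passes: in a real compiler these would be inlining,
# loop unrolling, dead code elimination, etc. Here they are no-ops.
PASSES = {
    "fold_constants": lambda prog: prog,
    "inline_functions": lambda prog: prog,
    "eliminate_dead_code": lambda prog: prog,
}

def benchmark(program):
    start = time.perf_counter()
    program()  # run the compiled artifact
    return time.perf_counter() - start

def search(program, trials=100):
    best_order, best_time = None, float("inf")
    for _ in range(trials):
        order = random.sample(list(PASSES), k=len(PASSES))
        candidate = program
        for name in order:
            candidate = PASSES[name](candidate)
        elapsed = benchmark(candidate)
        if elapsed < best_time:
            best_order, best_time = order, elapsed
    return best_order, best_time

print(search(lambda: sum(range(100_000))))
```

Even this brute-force version illustrates why the search is expensive: the number of pass orderings grows factorially, which is exactly where a learned proposer can help.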
Of note: these optimization techniques use machine learning but, to our knowledge, not large language models; the other use cases listed in this post do.
Today's premier models encode both reasoning and knowledge. For example, you can ask GPT-3 "who was the 35th president of the United States?", and it has both the capability to interpret the question and the knowledge to answer it. However, this comes at the cost of model size (GPT-3 is 175B parameters) and rigidity (LLMs cannot incorporate new information without significant retraining).
An emerging alternative is to factor out the "memory" of a large language model, so that its reasoning can be applied to many different datasets. DeepMind released a paper (Improving language models by retrieving from trillions of tokens) describing an approach that, to our knowledge, has no open source implementation. Recently, MEMIT showed that you can directly edit the facts stored in a transformer's weights.
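One way to approximate this factoring today is retrieval: keep the dataset's embeddings outside the model, fetch the passages most relevant to a question, and let the LLM reason over only what was retrieved. A minimal sketch (the embedding model and documents are illustrative):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# The "memory": documents live outside the model and can be updated
# freely, with no retraining.
documents = [
    "Q1 2020 sales were $1.2M.",
    "The 35th president of the United States was John F. Kennedy.",
    "MRR grew fastest in March.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = encoder.encode(documents, normalize_embeddings=True)

def retrieve(question, k=1):
    """Return the k documents most similar to the question (cosine similarity)."""
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_embeddings @ q
    return [documents[i] for i in np.argsort(-scores)[:k]]

# The retrieved context would be prepended to the LLM prompt, so the
# model supplies reasoning while the external store supplies knowledge.
print(retrieve("What were our sales in Q1 2020?"))
```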
An LLM that can reason about an arbitrary input dataset is essentially a new kind of database, which recently prompted a discussion on Twitter.
Whether it's productized as an index or an entirely new database query engine, this idea has the potential to revolutionize how enterprises consume data. Imagine asking "what were our sales in Q1 2020?" without even having to locate a dashboard or dataset.
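One plausible shape for such a query engine is to hand the model your schema, let it write SQL, and execute the result. A sketch (the prompt format is our own invention, and generated SQL should of course be validated before running):

```python
import sqlite3
import openai

SCHEMA = "CREATE TABLE sales (quarter TEXT, year INTEGER, amount REAL)"

def ask(question, conn):
    # The model translates the natural language question into SQL over
    # the provided schema.
    prompt = f"Schema:\n{SCHEMA}\n\nWrite a SQL query to answer: {question}\nSQL:"
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, max_tokens=100, temperature=0
    )
    sql = response.choices[0].text.strip()
    # In production, validate and sandbox the generated SQL first.
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute("INSERT INTO sales VALUES ('Q1', 2020, 1200000)")
print(ask("What were our sales in Q1 2020?", conn))
```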
We're very excited about how programming use cases can be reimagined from first principles in the presence of LLMs. Compilers and programming languages are at the core of technology and innovation, so improvements have the potential to make a massive economic impact. Although Andrej's Software 3.0 meme is more than 2 years old, it's finally starting to come true. If you are thinking about this space, please reach out!
Thanks to Michael Fester, Richard Stebbing, Qian Liu, David Dohan, and Shubhro Saha for feedback on drafts of this post.
Certain natural languages can also be defined by a formal grammar, most notably Sanskrit. ↩
One simple way to think about visual code generation is to use an off-the-shelf code generator to print code in a well known language (e.g. Python) and reverse-compile it into a visual language. ↩
The caveat, of course, is that you must model a sufficient amount of diversity in the code generation process. ↩