Last updated: September 2023
How should a non-technical person or a machine learning engineer get up to speed on the fast-moving space of Generative AI? Below are the resources I used and would recommend, mostly from my self-study. I also work with multiple teams of engineers at Meta building GenAI models at scale, with a focus on the LLMs that often act as the foundation and connective tissue.
I offer five stages to jump in:
Do some background reading on prior machine learning (deep learning, RL, etc);
Read some non-technical pieces on GenAI and play with current tools;
Learn the GenAI technical basics via courses, books, and papers;
Start reading papers in the GenAI field and take more advanced courses;
Build your own GenAI models from scratch.
I expect that over time the generative AI field will overtake traditional AI (ranking, prediction, classification, etc.), because generative use cases cover a much wider range of human activity.
1) Background Reading for Machine Learning, Before GenAI
Below I’ve put together a collection of resources on older (pre-2023) machine learning and AI books, papers, videos, and podcasts. I recommend starting with the foundational material in this section before jumping into GenAI.
Primer on AI and Machine Learning (Part 1) — Beginners Level (Non-Technical): Artificial Intelligence (AI) will be one of the greatest developments of science and civilization, as machines approach, augment, and exceed human performance on a wide range of cognitive tasks over the next few decades.
This lays out how a smart, college-educated person with a fascination with AI but no technical background can get up to speed on it.
Machine Learning (including Deep Learning and Reinforcement Learning) for Engineers — A Technical Primer (Part 2): This lays out how a software developer, hacker, or coding cross-functional partner (engineering manager, product manager, project manager, etc.) who is fascinated by AI but has no ML-specific background can build a base.
2) Non-Technical Writing on GenAI, plus Current Tools
The Technology of GenAI
A Generative AI Primer, from Its Origins (2023): Understanding the current state of technology requires understanding its origins. This reading list provides sources relevant to the form of generative AI that led to natural language processing (NLP) models such as ChatGPT.
Google: A generative AI primer for busy executives (2023): A very high-level and simplified explanation of the field and why it matters. A place to start if you know nothing about this space.
McKinsey: What’s the future of generative AI? An early view in 15 charts (2023): A quick take from many reports, with links to them.
A guide to Generative AI terminology: A short list of some key terms in the field. Also see this decent Adobe GenAI Glossary.
On the opportunities and risks of foundation models (2021): Stanford overview paper on Foundation Models. Long and opinionated, but this shaped the term.
State of AI Report: An annual roundup of everything going on in AI, including technology breakthroughs, industry development, politics/regulation, economic implications, safety, and predictions for the future.
GPTs are GPTs: An early look at the labor market impact potential of large language models: This paper from researchers at OpenAI, OpenResearch, and the University of Pennsylvania predicts that “around 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of LLMs, while approximately 19% of workers may see at least 50% of their tasks impacted.”
Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality: An experiment involving 758 BCG consultants, about 7% of the company’s individual-contributor-level consultants. After establishing a performance baseline on a similar task, subjects were randomly assigned to one of three conditions: no AI access, GPT-4 access, or GPT-4 access with a prompt-engineering overview. On a set of 18 realistic consulting tasks within the frontier of AI capabilities, consultants using AI were significantly more productive (they completed 12.2% more tasks on average and completed tasks 25.1% more quickly) and produced significantly higher-quality results (more than 40% higher quality than the control group).
What Is ChatGPT Doing … and Why Does It Work? (2023): A very detailed explanation from Wolfram that starts simply and ends up getting into the details.
The Business of GenAI
Behind the Hype: A Deep Dive into the AI Value Chain: We are near the local peak of a 2023 GenAI hype bubble, with ZIRP-era money still propping up AI startups. This piece maps the AI value chain and profit pools (what matters, and who will win, remains open), surveys the fast-growing AI ecosystem (MAMANA, ML Ops, startups), and posits where the best investments are.
Who owns the generative AI platform?: Our flagship assessment of where value is accruing, and might accrue, at the infrastructure, model, and application layers of generative AI.
Navigating the high cost of AI compute: A detailed breakdown of why generative AI models require so many computing resources, and how to think about acquiring those resources (i.e., the right GPUs in the right quantity, at the right cost) in a high-demand market.
The economic potential of generative AI | McKinsey: Estimates that generative AI could add the equivalent of $2.6 trillion to $4.4 trillion annually across the 63 use cases we analyzed—by comparison, the United Kingdom’s entire GDP in 2021 was $3.1 trillion. About 75% of the value that generative AI use cases could deliver falls across four areas: Customer operations, marketing and sales, software engineering, and R&D.
Generative AI: A Creative New World (Sequoia): Certain functions may be completely replaced by generative AI, while others are more likely to thrive from a tight iterative creative cycle between human and machine—but generative AI should unlock better, faster and cheaper creation across a wide range of end markets. The dream is that generative AI brings the marginal cost of creation and knowledge work down towards zero, generating vast labor productivity and economic value—and commensurate market cap.
Generative AI: Hype, or Truly Transformative (Goldman Sachs): GS economists assess the technology’s potentially large impact on productivity and growth, which their equity strategists estimate could translate into significant upside for US equities over the medium-to-longer term, though our strategists also warn that past productivity booms have resulted in equity bubbles that ultimately burst.
AI Canon by a16z: A curated mix of basic to advanced articles on AI, put together by the VCs at a16z.
Legal Concerns
Currently, there is much uncertainty about the legality of GenAI tools, especially as they push the boundaries of copyright law when it comes to training models, along with other issues of liability.
The high-level copyright issues and ambiguities are presented by CRS GenAI and Copyright Primer (2023) and Samuelson GenAI Meets Copyright (2023). Three sets of ongoing litigation are detailed in TC Current GenAI Legal Cases, or you can read the complaints in Github Copilot Litigation and Silverman complaints against OpenAI. Prof. Sag highlights Copyright Safety for Generative AI, where memorization of the training data is more likely and copyright is problematic (e.g. models are trained on many duplicates of the same work; images are associated with unique text descriptions; and the ratio of the size of the model to the training data is relatively large).
Some solutions and technical directions to get around the copyright issues are presented in Lemley on Fair Learning, Foundation Models and Fair Use, and Provable Copyright for GenAI. There were some interesting discussions about this in ICML this year, per Gen AI and Law, ICML-23.
Best-in-Class GenAI Tools
I’ve used a lot of the existing tools on the market; these are all available (not vaporware) and are among the best.
Text/LLMs
ChatGPT (esp GPT-4): GPT-4 is still the overall best LLM available right now.
Claude 2: Very good at mimicking style, and approaching GPT-4; one of the top 3 LLMs.
Bing AI (Sydney): Likely built off GPT-4, or something close, and free; it’s great because it pairs the LLM with search and citations.
Github Copilot: The best code generation tool right now.
OpenAI Code Interpreter: A close second to GitHub Copilot, and better for some tasks.
Replit Ghostwriter: Replit’s in-editor AI assistant for code completion and chat.
Image
MidJourney: The best text-to-image generation tool, but with a clunky Discord interface.
Dall-E 2: A close second to MidJourney, with a simpler interface.
Adobe Firefly: The first Big Tech image generator launched widely.
Stability AI (Stable Diffusion): An open-source image generation tool, but one facing copyright disputes.
Video
RunwayML (uses GPT-4): A text-to-video generation and AI-assisted video editing suite.
Synthesia: Creates AI-generated presenter videos with synthetic avatars and voiceovers.
Invideo AI: Generates and edits marketing-style videos from text prompts.
Audio
Splash: An AI music generation tool.
Meta AudioCraft: Meta’s open-source family of generative audio models (MusicGen for music, AudioGen for sound effects).
Murf.AI: A tool to make studio-quality voiceovers in minutes. Use Murf’s lifelike AI voices for podcasts, videos, and professional presentations.
Other / Productivity
Tome (Slides Creator): An AI-assisted presentation and slides creator.
Microsoft Office 365 Copilot: An AI assistant embedded across the Office 365 apps (Word, Excel, PowerPoint, Outlook).
Google Workspace Duet AI: Google’s AI assistant built into the Workspace apps (Docs, Sheets, Slides, Gmail).
Copy.AI: An AI copywriter for marketing and sales content.
Jasper AI: An AI writer and AI marketing software for enterprise teams. Creates blog posts, marketing copy, etc.
3) GenAI Basics (Technical)
This is a list of resources and guides to get coders started on learning the fundamental algos behind GPT, diffusion, and GAN models. The courses offer some of the best papers, readings, and videos together, while I’ve curated some of the best LLM papers.
Guides
The illustrated transformer: A visual, step-by-step overview of the transformer architecture by Jay Alammar.
The annotated transformer: This in-depth post covers transformers at a source-code level. Requires some knowledge of PyTorch.
Let’s build GPT: from scratch, in code, spelled out: For engineers, Karpathy does a video walkthrough of how to build a GPT model.
The illustrated Stable Diffusion: Introduction to latent diffusion models, the most common type of generative AI model for images.
The Illustrated VQGAN with CLIP: VQGAN stands for Vector Quantized Generative Adversarial Network, while CLIP stands for Contrastive Language-Image Pre-training. “VQGAN-CLIP” refers to the interaction between these two separate models working in tandem: VQGAN generates the images, while CLIP judges how well an image matches the text prompt.
RLHF: Reinforcement Learning from Human Feedback: Chip Huyen explains RLHF, which can make LLMs behave in more predictable and human-friendly ways. This is one of the most important but least well-understood aspects of systems like ChatGPT.
Reinforcement learning from human feedback: Computer scientist and OpenAI cofounder John Shulman goes deeper in this great talk on the current state, progress, and limitations of LLMs with RLHF.
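The guides above all center on the transformer’s attention mechanism. As a companion to Alammar’s and Karpathy’s walkthroughs, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the dimensions and random weights are toy values for illustration, and masking and multi-head logic are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq, seq) attention logits
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
seq, d_model, d_head = 4, 8, 8
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one output vector per input position
```

Each output position is a learned weighted average over every input position, which is the core idea the illustrated and annotated transformer posts build on.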
Courses
Stanford CS25: Transformers United, an online seminar on Transformers.
Stanford CS324: Large Language Models with Percy Liang, Tatsu Hashimoto, and Chris Re, covering a wide range of technical and non-technical aspects of LLMs.
GenAI Short Courses from DeepLearning.ai (2023) (e.g. Diffusion Models, LangChain, etc)
LLM Background
Eight Things to Know about Large Language Models (2023): Probably my favorite LLM survey position as it takes some strong stances and comprehensively cites the literature.
What are embeddings (2023)
The Illustrated Word2vec - A Gentle Intro to Word Embeddings in Machine Learning (YouTube) (2022)
GPT-4 technical report (2023): The latest and greatest paper from OpenAI, known mostly for how little it reveals! (blog post). The GPT-4 system card sheds some light on how OpenAI treats hallucinations, privacy, security, and other issues.
State of GPT, by Karpathy (2023)
Alpaca: A strong, replicable instruction-following model (2023): Out of Stanford, this model demonstrates the power of instruction tuning, especially in smaller open-source models, compared to pure scale.
PaLM: Scaling language modeling with pathways (2022): PaLM, from Google, utilized a new system for training LLMs across thousands of chips and demonstrated larger-than-expected improvements for certain tasks as model size scaled up. (blog post). See also the PaLM-2 technical report.
Llama 2: Open Foundation and Fine-Tuned Chat Models (Meta) (2023)
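The embeddings readings above boil down to one operation: representing text as vectors and comparing them by angle. A tiny sketch of cosine similarity over made-up 3-dimensional vectors (real embedding models use hundreds to thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for learned word embeddings.
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.7, 0.2])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen) > cosine_similarity(king, apple))  # True
```

Semantically related words end up with nearby vectors, which is what makes embeddings useful for search, clustering, and retrieval.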
4) GenAI Papers and Advanced Courses (Technical)
There’s only one GenAI textbook worth reading, and that is Foster’s (for now). The scaling hypothesis and Chinchilla papers pair well with it. Some of these LLM courses are newer and I’m still going through them (there’s not much out on the market).
In-Depth Concepts
Textbook: Foster, Generative Deep Learning (2nd Edition): Probably the single best technical book to get acquainted with the key GenAI models, from a great transformers overview to sections on VAEs, GANs, LLMs, audio models, and more.
The scaling hypothesis: One of the most surprising aspects of LLMs is that scaling—adding more data and compute—just keeps increasing accuracy. GPT-3 was the first model to demonstrate this clearly, and Gwern’s post does a great job explaining the intuition behind it.
Chinchilla’s scaling implications: Nominally an explainer of the important Chinchilla paper (see below), this post gets to the heart of the big question in LLM scaling: are we running out of data? This builds on the post above and gives a refreshed view on scaling laws.
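The Chinchilla result is often summarized as a rule of thumb: train on roughly 20 tokens per parameter for compute-optimal models. A sketch of that back-of-envelope calculation (the 20:1 ratio is the commonly cited approximation, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

for n in (70e9, 175e9):
    print(f"{n / 1e9:.0f}B params -> ~{chinchilla_optimal_tokens(n) / 1e12:.1f}T tokens")
```

Run it and a 175B-parameter model implies roughly 3.5T training tokens, far more than GPT-3's ~300B, which is exactly the “are we running out of data?” worry the post above explores.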
Advanced Courses
Princeton COS 597G (Fall 2022): Understanding Large Language Models
UC Berkeley CS294/194-196: Responsible GenAI and Decentralized Intelligence
CMU LLMs Course - Large Language Models Methods and Applications (11-667) (2023)
LLM Papers
Eight Things to Know about Large Language Models (2023): Also listed above under LLM Background; a strongly opinionated survey that comprehensively cites the literature.
A survey of large language models (2023): Comprehensive breakdown of current LLMs, including development timeline, size, training strategies, training data, hardware, and more.
Sparks of artificial general intelligence: Early experiments with GPT-4: Early analysis from Microsoft Research on the capabilities of GPT-4, the current most advanced LLM, relative to human intelligence.
Attention is all you need (2017): The original transformer work and research paper from Google Brain that started it all. (blog post)
BERT: Pre-training of deep bidirectional transformers for language understanding (2018): One of the first publicly available LLMs, with many variants still in use today. (blog post)
Improving language understanding by generative pre-training (2018): The first paper from OpenAI covers the GPT architecture, which has become the dominant development path in LLMs. (blog post)
Language models are few-shot learners (2020): The OpenAI paper that describes GPT-3 and the decoder-only architecture of modern LLMs.
RLHF: Training language models to follow instructions with human feedback (2022): OpenAI’s paper explaining InstructGPT, which utilizes humans in the loop to train models and, thus, better follow the instructions in prompts. This was one of the key unlocks that made LLMs accessible to consumers (e.g., via ChatGPT). (blog post)
LaMDA: language models for dialog applications (2022): A model from Google specifically designed for free-flowing dialog between a human and chatbot across a wide variety of topics. (blog post)
Training compute-optimal large language models (2022): The Chinchilla paper. It makes the case that most models are data-limited, not compute-limited, and changed the consensus on LLM scaling. (blog post)
A Survey on Evaluation of Large Language Models (2023): Model evals will be crucial to helping improve models - this is the best survey yet.
Summary of ChatGPT/GPT-4 (2023)
NLP reasoning survey (2023)
Theory of Mind in LLMs (2023)
Emergent abilities of LLMs (2022)
Multimodal LLMs (2023)
Model enhancements (e.g. fine-tuning, retrieval, attention)
Deep reinforcement learning from human preferences (2017): Research on reinforcement learning in gaming and robotics contexts that turned out to be a fantastic tool for LLMs.
Retrieval-augmented generation for knowledge-intensive NLP tasks (2020): Developed by Facebook, RAG is one of the two main research paths for improving LLM accuracy via information retrieval. (blog post)
Improving language models by retrieving from trillions of tokens (2021): RETRO, for “Retrieval Enhanced TRansfOrmers,” is another approach—this one by DeepMind—to improve LLM accuracy by accessing information not included in their training data. (blog post)
LoRA: Low-rank adaptation of large language models (2021): This research out of Microsoft introduced a more efficient alternative to fine-tuning for training LLMs on new data. It’s now become a standard for community fine-tuning, especially for image models.
Constitutional AI (2022): The Anthropic team introduces the concept of reinforcement learning from AI Feedback (RLAIF). The main idea is that we can develop a harmless AI assistant with the supervision of other AIs.
FlashAttention: Fast and memory-efficient exact attention with IO-awareness (2022): This research out of Stanford opened the door for state-of-the-art models to understand longer sequences of text (and higher-resolution images) without exorbitant training times and costs. (blog post)
Hungry Hungry Hippos: Towards Language Modeling with State Space Models (2022): Again from Stanford, this paper describes one of the leading alternatives to attention in language modeling. This is a promising path to better scaling and training efficiency. (blog post)
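The LoRA idea from the list above is simple enough to sketch numerically: keep the pretrained weight matrix frozen and learn only a low-rank additive update. A minimal NumPy illustration (toy dimensions; a real implementation also scales the update and applies it inside attention layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 8

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))               # init at zero so the update starts as a no-op

def lora_forward(x):
    """y = Wx + B(Ax): the full weight stays frozen, only A and B are trained."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
# With B initialized to zero, LoRA output equals the frozen model's output.
print(np.allclose(lora_forward(x), W @ x))  # True

full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(lora_params / full_params)  # 0.03125: ~3% of the parameters are trainable
```

This is why LoRA fine-tunes fit on modest hardware: the trainable parameter count scales with the rank, not with the full weight matrices.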
Image generation models
The Annotated Diffusion Model (2022): Takes a deeper look into Denoising Diffusion Probabilistic Models (also known as DDPMs, diffusion models, score-based generative models or simply autoencoders) as researchers have been able to achieve remarkable results with them for (un)conditional image/audio/video generation. Also see the Awesome Diffusion Models repo for papers, tutorials, and videos.
The recent rise of diffusion-based models (2022): A very recent history of solving the text-to-image generation problem, explaining the latest developments in diffusion models, which play a huge role in the new state-of-the-art architectures.
Introduction to Diffusion Models for Machine Learning (2022): Examines the theoretical foundations of diffusion models, then demonstrates how to generate images with a diffusion model in PyTorch.
What are Diffusion Models? (2021): Compares them to GANs and VAEs. Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).
Learning transferable visual models from natural language supervision (2021): Paper that introduces a base model—CLIP—that links textual descriptions to images. One of the first effective, large-scale uses of foundation models in computer vision. (blog post)
Zero-shot text-to-image generation (2021): This is the paper that introduced DALL-E, a model that combines the aforementioned CLIP and GPT-3 to automatically generate images based on text prompts. Its successor, DALL-E 2, would kick off the image-based generative AI boom in 2022. (blog post)
High-resolution image synthesis with latent diffusion models (2021): The paper that described the latent diffusion architecture behind Stable Diffusion (it preceded Stable Diffusion’s launch and explosive open-source growth).
Photorealistic text-to-image diffusion models with deep language understanding (2022): Imagen was Google’s foray into AI image generation. More than a year after its announcement, the model has yet to be released publicly as of the publish date of this piece. (website)
DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation (2022): DreamBooth is a system, developed at Google, for training models to recognize user-submitted subjects and apply them to the context of a prompt (e.g. [USER] smiling at the Eiffel Tower). (website)
Adding conditional control to text-to-image diffusion models (2023): This paper from Stanford introduces ControlNet, a now very popular tool for exercising fine-grained control over image generation with latent diffusion models.
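The forward (noising) process that the diffusion readings above all start from has a simple closed form in the DDPM paper: x_t is a blend of the original sample and Gaussian noise, with the mix controlled by the cumulative noise schedule. A minimal NumPy sketch using the standard linear beta schedule:

```python
import numpy as np

def forward_diffusion(x0, t, alphas_cumprod, rng):
    """DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alphas_cumprod[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule from the DDPM paper
alphas_cumprod = np.cumprod(1.0 - betas)  # a_bar_t: how much signal survives to step t

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))              # stand-in for an image
x_early = forward_diffusion(x0, 10, alphas_cumprod, rng)
x_late = forward_diffusion(x0, 999, alphas_cumprod, rng)

# Early steps keep most of the signal; by t near T the sample is almost pure noise.
print(alphas_cumprod[10] > 0.9, alphas_cumprod[999] < 1e-4)  # True True
```

Training a diffusion model then amounts to learning to predict (and remove) that injected noise, reversing the chain step by step.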
Agents
A path towards autonomous machine intelligence (2022): A proposal from Meta AI lead and NYU professor Yann LeCun on how to build autonomous and intelligent agents that truly understand the world around them.
ReAct: Synergizing reasoning and acting in language models (2022): A project out of Princeton and Google to test and improve the reasoning and planning abilities of LLMs. (blog post)
Generative agents: Interactive simulacra of human behavior (2023): Researchers at Stanford and Google used LLMs to power agents, in a setting akin to “The Sims,” whose interactions are emergent rather than programmed.
Reflexion: an autonomous agent with dynamic memory and self-reflection (2023): Work from researchers at Northeastern University and MIT on teaching LLMs to solve problems more reliably by learning from their mistakes and past experiences.
Toolformer: Language models can teach themselves to use tools (2023): This Meta project trained LLMs to use external tools (APIs, in this case, pointing to things like search engines and calculators) in order to improve accuracy without increasing model size.
Auto-GPT: An autonomous GPT-4 experiment: An open-source experiment to expand on the capabilities of GPT-4 by giving it a collection of tools (internet access, file storage, etc.) and choosing which ones to use in order to solve a specific task.
BabyAGI: This Python script utilizes GPT-4 and vector databases (to store context) in order to plan and execute a series of tasks that solve a broader objective.
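ReAct, Toolformer, Auto-GPT, and BabyAGI all share the same skeleton: a loop that alternates model "thoughts," tool calls, and observations until the model declares an answer. A minimal sketch of that loop; `fake_llm` is a scripted stand-in (a real agent would call an actual model there), and the one tool is a sandboxed calculator:

```python
# A minimal ReAct-style loop with a stubbed "LLM" and one tool (a calculator).
def calculator(expression: str) -> str:
    # eval with builtins stripped: only arithmetic expressions will work.
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def fake_llm(history: str) -> str:
    """Scripted responses standing in for real model output."""
    if "Observation" not in history:
        return "Thought: I need to compute 17 * 23.\nAction: calculator[17 * 23]"
    return "Thought: I have the result.\nFinal Answer: 391"

def react_loop(question: str, max_steps: int = 5) -> str:
    history = f"Question: {question}"
    for _ in range(max_steps):
        reply = fake_llm(history)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[1].strip()
        # Parse "Action: tool[input]" and run the named tool.
        action = reply.split("Action:")[1].strip()
        tool, arg = action.split("[", 1)
        observation = TOOLS[tool.strip()](arg.rstrip("]"))
        history += f"\n{reply}\nObservation: {observation}"
    return "gave up"

print(react_loop("What is 17 * 23?"))  # 391
```

Everything the papers above add, such as reflection, memory, and learned tool selection, is an elaboration of this think/act/observe cycle.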
Code generation
Evaluating large language models trained on code (2021): This is OpenAI’s research paper for Codex, the code-generation model behind the GitHub Copilot product. (blog post)
Competition-level code generation with AlphaCode (2021): This research from DeepMind demonstrates a model capable of writing better code than human programmers. (blog post)
CodeGen: An open large language model for code with multi-turn program synthesis (2022): CodeGen comes out of the AI research arm at Salesforce, and currently underpins the Replit Ghostwriter product for code generation. (blog post)
Code Llama: Open Foundation Models for Code (2023): The new open-source Llama-based code model from Meta.
Video generation
Make-A-Video: Text-to-video generation without text-video data (2022): A model from Meta that creates short videos from text prompts, but also adds motion to static photo inputs or creates variations of existing videos. (blog post)
Imagen Video: High-definition video generation with diffusion models (2022): A version of Google’s image-based Imagen model optimized for producing short videos from text prompts. (website)
Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising (2023): This study explores extending text-driven generation and editing to long, multi-text-conditioned videos. It introduces a paradigm dubbed Gen-L-Video, which extends off-the-shelf short-video diffusion models to generate and edit videos of hundreds of frames with diverse semantic segments, without additional training, while preserving content consistency.
5) Build your own GenAI models from Scratch
Practical Guides to Building with LLMs
Build a GitHub support bot with GPT3, LangChain, and Python: One of the earliest public explanations of the modern LLM app stack. Some of the advice in here is dated, but it kicked off widespread adoption and experimentation of new AI apps.
Building LLM applications for production: Chip Huyen discusses many of the key challenges in building LLM apps, how to address them, and what types of use cases make the most sense.
Prompt Engineering Guide: For anyone writing LLM prompts—including app devs—this is the most comprehensive guide, with specific examples for a handful of popular models. For a lighter, more conversational treatment, try Brex’s prompt engineering guide.
Prompt injection: What’s the worst that can happen? Prompt injection is a potentially serious security vulnerability lurking for LLM apps, with no perfect solution yet. Simon Willison gives the definitive description of the problem in this post. Nearly everything Simon writes on AI is outstanding.
OpenAI cookbook: For developers, this is the definitive collection of guides and code examples for working with the OpenAI API. It’s updated continually with new code examples.
Vector Embeddings - Pinecone Learning Center: Many LLM apps are based around a vector search paradigm. Pinecone’s learning center—despite being branded vendor content—offers some of the most useful instructions on how to build with vector search.
LangChain docs: As the default orchestration layer for LLM apps, LangChain connects to just about all other pieces of the stack. So their docs are a real reference for the full stack and how the pieces fit together.
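The vector-search paradigm behind many of these LLM apps is easy to prototype: embed every document, embed the query, and return the documents whose vectors are closest. A toy in-memory index; the `embed` function here is a hash-seeded placeholder, not a real embedding model, and production apps would use a model API plus a vector database such as Pinecone or FAISS:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Fake deterministic 'embedding': hash-seeded random unit vector.
    Stands in for a real embedding model; consistent within one process."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class VectorIndex:
    def __init__(self):
        self.texts, self.vectors = [], []

    def add(self, text: str):
        self.texts.append(text)
        self.vectors.append(embed(text))

    def search(self, query: str, k: int = 1):
        # Dot product of unit vectors = cosine similarity.
        scores = np.stack(self.vectors) @ embed(query)
        top = np.argsort(scores)[::-1][:k]
        return [self.texts[i] for i in top]

idx = VectorIndex()
for doc in ["LLMs predict the next token.",
            "Diffusion models iteratively denoise images.",
            "Agents call external tools."]:
    idx.add(doc)

# The matching document scores highest (similarity 1.0 with itself).
print(idx.search("Diffusion models iteratively denoise images.", k=1))
```

Retrieval-augmented apps then stuff the top-k retrieved texts into the LLM prompt as context, which is the pattern the Pinecone and LangChain materials above cover in depth.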