Generative AI

Generative AI refers to machine learning systems that produce new content – text, images, video, audio, 3D models, code, and more – rather than simply classifying or analyzing existing data. These systems are trained on large datasets and learn to generate plausible new outputs by modeling the statistical patterns in that data.

Large language models (LLMs) are the most prominent example, generating text and code. But generative AI encompasses other modalities too.

Fundamentally, all of these models are developed in much the same way, by training on large datasets to capture the statistical patterns in the data. What differs is mainly the type of media they are trained on and the architecture best suited to it. Just as LLMs specialize in text, diffusion models specialize in image generation (see below).
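As a rough illustration of "model the statistical patterns, then generate new output", the sketch below trains a character-level bigram model on a tiny made-up corpus and samples new text from it. This is a deliberately minimal stand-in, not how an LLM is built: the corpus, the function names, and the bigram approach itself are assumptions chosen for brevity.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return counts


def generate(counts, start, length, seed=0):
    """Sample new text one character at a time from the learned counts."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        followers = counts.get(output[-1])
        if not followers:  # no observed continuation: stop early
            break
        chars = list(followers.keys())
        weights = list(followers.values())
        # Pick the next character in proportion to how often it was seen.
        output.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(output)


# Hypothetical toy corpus; real models train on vastly larger datasets.
corpus = "the cat sat on the mat. the rat sat on the cat. "
model = train_bigram_model(corpus)
sample = generate(model, start="t", length=40)
print(sample)
```

The output is new text in the sense that it need not appear verbatim in the corpus, yet every adjacent character pair in it was observed during training. Larger models replace the count table with a neural network and condition on far more context, but the train-then-sample loop is analogous.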

This section covers the broader landscape of generative AI, including image, video, and audio generation.

Image generation

Image generation models produce photorealistic or stylized images from text prompts, reference images, or other inputs. Most modern image generators are based on diffusion models, which learn to turn random noise into a coherent image by removing the noise a little at a time over many steps.
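The iterative denoising loop can be sketched in a few lines. The example below is a toy under strong assumptions: the "image" is a hypothetical list of eight values, the noise schedule is an arbitrary linear one, and the noise predictor is an oracle that already knows the target, standing in for the neural network a real diffusion model would train. The update rule is a deterministic, DDIM-style step; actual tools use learned predictors and more elaborate samplers.

```python
import math
import random

# Hypothetical 8-value "image" standing in for the data a real model learns from.
TARGET = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
STEPS = 50

# Signal level per step: 1.0 at t=0 (clean image), 0.001 at t=STEPS (almost pure noise).
ALPHA_BAR = [1.0 - 0.999 * t / STEPS for t in range(STEPS + 1)]


def predict_noise(x, t):
    """Oracle stand-in for the trained network: the noise present in x at step t.

    A real diffusion model learns this mapping from data. Here it is computed
    directly from TARGET, which only works because the toy "dataset" is one image.
    Only called for t >= 1, where the denominator is non-zero.
    """
    a = ALPHA_BAR[t]
    return [(xi - math.sqrt(a) * ti) / math.sqrt(1.0 - a) for xi, ti in zip(x, TARGET)]


def denoise_step(x, t):
    """One deterministic (DDIM-style) update from step t to step t - 1."""
    a_t, a_prev = ALPHA_BAR[t], ALPHA_BAR[t - 1]
    eps = predict_noise(x, t)
    # Estimate the clean image implied by the current sample and predicted noise.
    x0 = [(xi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t) for xi, ei in zip(x, eps)]
    # Re-noise that estimate to the slightly lower noise level of step t - 1.
    return [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
            for x0i, ei in zip(x0, eps)]


def sample(seed=0):
    """Start from random noise and denoise it step by step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for t in range(STEPS, 0, -1):
        x = denoise_step(x, t)
    return x


image = sample()
print([round(v, 3) for v in image])
```

Because the oracle is exact, the loop recovers the target up to floating-point error regardless of the starting noise. With a learned predictor trained on many images, the same loop instead lands on a new image that is plausible under the training data.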

Notable tools and platforms include:

Midjourney: High-quality image generation accessible via Discord and a web interface.
GPT Image: OpenAI’s current, natively integrated image generation model (gpt-image-1) and the successor to DALL·E, noted for accurate rendering of text within images.
DALL·E: OpenAI’s earlier image generation line, integrated into ChatGPT.
Stable Diffusion: Open-weight diffusion model from Stability AI, widely used as a base for fine-tuned variants.
Flux: Image models from Black Forest Labs, several of them released with open weights, supporting high-resolution generation and prompt-based editing.
Ideogram: Image generation noted for rendering precise, legible text within images.
Adobe Firefly: Adobe’s generative image models, integrated into Creative Cloud and designed to be commercially safe.
ComfyUI: Open-source, node-based visual workflow engine for running diffusion models locally, supporting image, video, 3D, and audio generation pipelines.

Video generation

Video generation models extend image generation into the temporal dimension, producing short clips from text or image prompts.

Notable tools and platforms include:

Sora: OpenAI’s text-to-video model.
Google Veo: Google DeepMind’s text-to-video model, with synchronized audio generation.
Kling: Kuaishou’s video generation model, noted for cinematic, realistic output.
Runway: Commercial video generation and editing platform.
Luma Dream Machine: Text-to-video and image-to-video generation.

Audio generation

Audio generation covers music, sound effects, and voice synthesis from text or other audio inputs.

Notable tools and platforms include:

Suno: AI music generation from text prompts.
Udio: Music generation platform.
ElevenLabs: Voice synthesis and cloning, widely used for narration and dubbing.

References

Generative Deep Learning (2nd ed.), David Foster – Teaching machines to paint, write, compose, and play: GANs, VAEs, and diffusion models.
