Ten million tokens. That’s the context window on Llama 4 Scout, Meta’s open model released earlier this month. It is, by a wide margin, the largest context window of any open model. To make the number concrete: at roughly three-quarters of an English word per token, ten million tokens is the entire Lord of the Rings trilogy, appendices included, more than a dozen times over. Or every email a small company has sent in a year. Or the source code of a mid-sized SaaS product, dependencies included.

For two years, the bottleneck in working with language models has been memory. Not the kind of memory you think of for a server. The kind that decides how much information a model can hold in its head at once before it forgets the beginning of what you said. Until 2025, you could fit a long article. By late 2025, you could fit a book. With Llama 4 Scout, you can fit most of a small library.

This changes the economics of a specific class of problems.

Take the most common one: feeding a language model your company’s internal documents so it can answer questions about them. The standard approach, called retrieval-augmented generation or RAG, involves cutting your documents into chunks, indexing them, finding the most relevant chunks for each question, and feeding only those to the model. We’ve covered how RAG works in detail elsewhere. The reason RAG existed at all was that you couldn’t fit your full document set into the model’s context window.
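The chunk, index, and retrieve steps above can be sketched in a few lines. This is a toy illustration, not a production pipeline: it scores chunks by word-count cosine similarity where a real system would typically use learned embeddings and a vector index, and the function names and default sizes are made up for the example.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    words = text.split()
    step = size - overlap  # assumes size > overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = Counter(tokenize(question))
    index = [(c, Counter(tokenize(c))) for c in chunks]  # the "indexing" step
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]
```

Only the chunks returned by `retrieve` would be placed in the prompt. Anything that spans two chunks, or sits in a chunk that scores poorly, never reaches the model.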

When the window is 10 million tokens, a meaningful number of teams just stop chunking. They paste the whole employee handbook, the whole product manual, the whole knowledge base, and let the model read it cover to cover for each question. Slower. More expensive per query. But often more accurate, because the model sees the connections between sections that a chunked retriever misses.

Not every team should do this. The economics depend on query volume, document update frequency, and the cost difference between RAG infrastructure and large-context inference. But it’s now an option that didn’t exist before. That’s the shift.
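One way to reason about that tradeoff is a back-of-the-envelope break-even calculation. The sketch below is a simplified cost model under stated assumptions: RAG carries a fixed monthly infrastructure cost plus a small prompt per query, while the full-context approach has no fixed cost but sends the whole corpus every time. Every number in it is hypothetical and should be replaced with your own.

```python
def rag_monthly_cost(queries: int, fixed_infra: float, rag_tokens: int,
                     price_per_mtok: float, reindex_runs: int = 0,
                     reindex_cost: float = 0.0) -> float:
    """Fixed infrastructure + re-indexing after document updates + small prompts."""
    prompts = queries * rag_tokens / 1e6 * price_per_mtok
    return fixed_infra + reindex_runs * reindex_cost + prompts


def full_context_monthly_cost(queries: int, corpus_tokens: int,
                              price_per_mtok: float) -> float:
    """No retrieval stack, but the whole corpus is sent with every query."""
    return queries * corpus_tokens / 1e6 * price_per_mtok


def break_even_queries(fixed_infra: float, rag_tokens: int,
                       corpus_tokens: int, price_per_mtok: float) -> float:
    """Monthly query volume above which RAG becomes the cheaper option."""
    per_query_gap = (corpus_tokens - rag_tokens) / 1e6 * price_per_mtok
    return fixed_infra / per_query_gap


# Hypothetical inputs: $500/month of RAG infrastructure, 4,000-token RAG
# prompts, a 500,000-token corpus, and $0.20 per million input tokens.
volume = break_even_queries(500.0, 4_000, 500_000, 0.20)  # about 5,040 queries
```

Under these made-up inputs, a team below roughly 5,000 queries a month would pay less by skipping the retrieval stack. Frequent document updates raise the RAG side further through the re-indexing term.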

Under the hood, Scout is a Mixture-of-Experts model with 109 billion total parameters, of which 17 billion are active for any given token, routed across 16 experts. Maverick, the bigger sibling that shipped at the same time, has 400 billion total parameters. Both are natively multimodal: they accept text, images, and video as input, in any combination, without separate adapters bolted on after training.

The MoE detail matters more than it looks.

A traditional dense model uses all of its parameters for every prediction. A 100-billion-parameter dense model applies all 100 billion parameters to every token, regardless of what you’re asking. That’s expensive and slow. A Mixture-of-Experts model is wired so that for any given input, a router wakes up only a small fraction of the parameters to do the work. The other parameters sit quiet. Scout’s 17 billion active parameters per token, drawn from a 109 billion parameter pool, means you get the knowledge capacity of a much larger model at the per-token compute cost of a much smaller one. The saving is in compute, not memory: all 109 billion parameters still have to be loaded to serve the model.
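The routing idea can be shown with a toy example. This is a generic top-k gating sketch, not Meta’s actual router: the experts are stand-in scalar functions, and the router logits are supplied by hand where a real model would compute them from the token’s hidden state.

```python
import math
from typing import Callable

TOTAL_PARAMS = 109e9   # Scout's total parameter count, from the text
ACTIVE_PARAMS = 17e9   # parameters active per token, from the text
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # roughly 0.156


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits: list[float], k: int = 1) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_layer(x: float, experts: list[Callable[[float], float]],
              router_logits: list[float], k: int = 1) -> float:
    """Run only the selected experts; the rest are never evaluated."""
    gates = route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Sixteen toy experts: expert i multiplies its input by i.
experts = [lambda x, scale=i: x * scale for i in range(16)]
logits = [0.0] * 16
logits[5] = 3.0  # the router prefers expert 5 for this token
output = moe_layer(2.0, experts, logits, k=1)  # only expert 5 runs
```

With `k=1`, fifteen of the sixteen expert functions are never called for this token, which is where the per-token compute saving comes from.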

This is why MoE has become the default architecture for new open releases. Of the seven major open-source models that shipped in the first twelve days of April 2026, four use MoE. Mistral Small 4 uses it. Qwen 3 uses it. The deeper tradeoffs of mixture-of-experts versus dense models are worth understanding before you commit your stack to one or the other, but the trend is clear: large dense models are becoming an artifact of an earlier era.

What’s surprising about the Llama 4 release is the licensing.

Meta did not put Llama 4 under Apache 2.0 or any other standard open-source license. They put it under the Llama 4 Community License, which permits commercial use but adds restrictions: companies above a certain size threshold need to negotiate separately, and outputs from the model can’t be used to train competing models. Neither restriction is unusual for “open” frontier models in 2026. Both are different from what you get with truly permissive open-source releases like OLMo 2 32B from Ai2, which shipped the same week under Apache 2.0.

If you’re choosing between Llama 4 and an Apache-licensed alternative, the license should be in the conversation alongside the benchmark scores. A model that scores three points higher but locks you out of certain commercial paths is not always the right answer. The legal review costs more time than most engineering teams budget for.

Then there’s the practical question of running Scout. A 10 million token context window is impressive. Filling it for every query will bankrupt most teams quickly. The pricing math, even self-hosted, is rough: every additional thousand tokens you load into context costs additional GPU memory and additional compute. The memory needed to hold the context grows roughly linearly with input length, and the attention compute needed to process it can grow faster than that. Loading a million tokens for every customer support query is technically possible. It’s also financially absurd unless the answer is high-value enough to justify it.
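The memory side of that cost can be estimated with the standard key-value cache formula: two tensors (keys and values) per layer, per KV head, per token. The layer count, head count, and head dimension below are hypothetical placeholders, not Scout’s published configuration, and the estimate ignores any attention variants that shrink the cache.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` tokens of context."""
    # The factor of 2 covers one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, 16-bit cache values.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)            # 196,608 bytes
one_million = kv_cache_bytes(1_000_000, LAYERS, KV_HEADS, HEAD_DIM)  # ~183 GiB
gib = one_million / 2**30
```

Under these assumed dimensions, a single one-million-token context needs on the order of 180 GiB for the cache alone, on top of the model weights, and ten million tokens needs ten times that.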

The real workflow that Scout enables is different. You load a lot of context once, get many answers from it within a single session, and pay the loading cost amortized across many queries. Long sessions become cheaper per question. Short sessions stay expensive.
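The amortization is simple arithmetic. The sketch below assumes the serving stack can keep the loaded context cached between questions in a session, so the loading cost is paid once; it also ignores output tokens and the cost of attending over the cached context during generation. The price and token counts are hypothetical.

```python
def cost_per_question(context_tokens: int, question_tokens: int,
                      questions: int, price_per_mtok: float) -> float:
    """Average cost per question when the context is loaded once per session."""
    load_cost = context_tokens / 1e6 * price_per_mtok       # paid once
    question_cost = question_tokens / 1e6 * price_per_mtok  # paid per question
    return (load_cost + questions * question_cost) / questions


# Hypothetical: 1,000,000 tokens of context, 500-token questions,
# $0.20 per million input tokens.
single = cost_per_question(1_000_000, 500, 1, 0.20)    # about $0.2001
hundred = cost_per_question(1_000_000, 500, 100, 0.20)  # about $0.0021
```

In this example, asking a hundred questions in one session costs about one-hundredth as much per question as asking a single question and discarding the context.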

The deployment question comes down to this. If your company has internal data that nobody outside should see, and your team has the engineering capacity to run a 109B model on your own infrastructure, Scout’s combination of context length, multimodality, and self-hosted control is hard to beat. If you’d rather pay for an API and forget about it, the question is whether OpenAI’s or Anthropic’s hosted models with their own large contexts are good enough for your use case. Most teams will find they are.

A final note, easy to miss in the spec sheet. Multimodality in Scout is native, not retrofitted. The model was trained from scratch on text, images, and video together. That changes the quality of vision tasks compared to models that bolted vision on after the fact. Reading a chart with a paragraph of context, or describing a video clip in the language of the surrounding documentation, both work better when the model learned all three modalities simultaneously. This is a quieter trend than the context window numbers, but it’s the one that’s going to differentiate models a year from now.

Frequently asked questions

Is Llama 4 Scout actually better than GPT-5.5 or Claude Opus 4.7?

Not on aggregate benchmarks. The frontier closed-source models still lead on most standard tests by a measurable margin. Scout’s value isn’t beating them on average. It’s offering a comparable-enough capability, on hardware you control, with a context window the closed models charge you significantly more for.

Do I need a Mixture-of-Experts model, or is a dense model fine?

For most use cases under 30 billion parameters, dense is fine and easier to deploy. Above that scale, MoE becomes meaningfully cheaper to serve at the same quality. If you’re running models that big, MoE has become the default for good reasons.

What’s the catch with the 10 million token context?

Cost and latency. Filling 10 million tokens of context takes real time and real money, both during input processing and during generation. Models also tend to recall details less reliably as the context gets very long, so answer quality can drop well before the window is full. The headline number is real, but in practice most teams operate at 100,000 to 500,000 tokens per query, not the maximum.

How does the Llama 4 Community License compare to true open source?

It’s open enough for most commercial uses, but restricted for the largest companies and for training competing models. If your business is below the size threshold and you’re not building a competing foundation model, the license is workable. If you want zero restrictions, look at OLMo 2 32B or other Apache 2.0 alternatives. The broader tradeoffs of running open-source models in production shape that decision more than the model itself.