vLLM CVE-2026-56340: a fix that only hid the flaw

By Suriq's Jack · Jun 21, 2026 · 02:55 UTC · Security news · 5 min read

vLLM, one of the most widely run open-source engines for serving large language models, has a fresh memory-safety bug in the exact feature it tried to lock down six months ago. CVE-2026-56340 lets anyone who can send a request to the inference API submit a malformed tensor that crashes the worker, with a documented route to out-of-bounds memory corruption. It carries a CVSS score of 8.8. The detail that decides whether it touches you is not whether you run vLLM. It is whether you switched its prompt embeds feature back on.

That single config choice is the whole story, so start there before you do anything else.

What actually broke

vLLM can accept multimodal embeddings as raw tensors through its prompt embeds path. PyTorch keeps its sparse-tensor invariant checks switched off by default, a speed tradeoff its own docs are open about. vLLM never added a check of its own, so a request carrying a sparse tensor with negative or out-of-range indices sails straight through. When the server expands that tensor into a dense one, the bad indices drive a write past the allocated buffer. The mild outcome is a crashed worker and a denial of service. The advisory also describes the worse one: a write-what-where condition, which is the raw material for code execution.
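To make the mechanism concrete, here is a minimal pure-Python sketch. It is not vLLM's or PyTorch's code: the function name, the row-major layout, and the tiny shape are assumptions chosen for illustration. It models a sparse tensor as a list of index tuples and computes the flat offset each value would be written to when the tensor is expanded into a dense buffer.

```python
from typing import List, Sequence


def flat_offsets(indices: Sequence[Sequence[int]], shape: Sequence[int]) -> List[int]:
    """Return the row-major flat offset for each index tuple.

    No bounds checking is done, on purpose: this mirrors what an
    unchecked sparse-to-dense expansion computes before it writes.
    """
    offsets = []
    for idx in indices:
        offset = 0
        for i, dim in zip(idx, shape):
            offset = offset * dim + i
        offsets.append(offset)
    return offsets


shape = (2, 3)                       # dense buffer holds 2 * 3 = 6 elements
buffer_len = shape[0] * shape[1]

# One well-formed index, one past the end, one negative.
indices = [(1, 2), (5, 0), (-1, 0)]

for idx, off in zip(indices, flat_offsets(indices, shape)):
    in_bounds = 0 <= off < buffer_len
    print(idx, "-> offset", off, "ok" if in_bounds else "OUTSIDE BUFFER")
```

Python would raise or wrap around if you actually indexed a list with those offsets. Native code that skips the check has no such net and simply writes to the computed address. Note also that checking the flat offset alone is not enough: in this shape, the index (0, 4) maps to offset 4, which is inside the buffer, yet 4 is out of range for its dimension.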

The flaw lands in vLLM 0.10.2 through 0.12.x and is fixed in 0.13.0. It was reported by a vLLM maintainer, not found in an attack, and there is no public exploit as of this writing. Treat that as breathing room, not safety: the memory-corruption path is spelled out in the vendor advisory, and a documented primitive tends to attract a proof of concept.

Why this is the same bug twice

Last year's CVE-2025-62164 hit the same prompt embeds surface. The response was to ship the feature switched off by default instead of validating what it accepts. That move contained the blast radius, but it quietly handed the risk to every operator who turned the feature back on. CVE-2026-56340 is the proof that the underlying problem was never solved: a different bad-tensor path, in the same place, reachable the moment the feature is live again.

This is the part worth sitting with. Shipping a feature off by default is a containment decision, not a fix. It buys time and it lowers the number of exposed installs, but it leaves the dangerous code intact and shifts the duty of care onto operators who may not even remember opting in. When the same component generates a second memory-safety CVE, the lesson is that policing this input was never the job of the underlying tensor framework, PyTorch. It was the job of the serving layer, vLLM.

The deeper pattern applies to every team standing up an inference API. Model-serving stacks inherit PyTorch's performance defaults, including the disabled invariant checks, and then treat the embeddings endpoint as a friendly data plane. A tensor with attacker-chosen indices is not friendly data. It is hostile input that reaches a memory operation, which makes the embeddings API a deserialization surface in everything but name. vLLM 0.13.0 finally does what the API boundary always needed to do: it validates that the indices are non-negative and within bounds.
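The kind of check described here is small. Below is a library-free sketch of it, assuming indices arrive as tuples of integers alongside a declared shape. The function name `validate_coo_indices` is hypothetical, and this is not the code vLLM 0.13.0 ships. PyTorch also exposes its own invariant checking, for example the `check_invariants` argument to `torch.sparse_coo_tensor`, which a real serving layer would more likely lean on.

```python
from typing import Sequence


def validate_coo_indices(indices: Sequence[Sequence[int]], shape: Sequence[int]) -> None:
    """Raise ValueError unless every index lies in [0, dim) for its dimension."""
    for pos, idx in enumerate(indices):
        if len(idx) != len(shape):
            raise ValueError(
                f"index {pos} has rank {len(idx)}, expected {len(shape)}"
            )
        for axis, (i, dim) in enumerate(zip(idx, shape)):
            if not 0 <= i < dim:
                raise ValueError(
                    f"index {pos} axis {axis}: {i} is outside [0, {dim})"
                )


# Usage: reject the request before any dense expansion happens.
shape = (2, 3)
validate_coo_indices([(0, 0), (1, 2)], shape)   # passes silently

try:
    validate_coo_indices([(0, 0), (-1, 5)], shape)
except ValueError as err:
    print("rejected:", err)
```

The check runs per dimension rather than on a flattened offset, so an index that would land inside the buffer but is out of range for its own axis is still refused.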

Who is actually exposed

Three conditions have to line up. You run an affected vLLM build (0.10.2 up to but not including 0.13.0). You enabled prompt embeds, which teams commonly do to feed precomputed multimodal embeddings straight into the model for retrieval or image pipelines. And the endpoint is reachable by a caller you do not fully trust, since the bug needs a valid request, not an authentication bypass.
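Those three conditions translate directly into a triage check. The sketch below assumes you keep some inventory of deployments. The `Deployment` record and its field names are hypothetical, and only the version bounds come from the advisory details quoted in this article.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

FIRST_AFFECTED = (0, 10, 2)   # first affected build, per the article
FIRST_FIXED = (0, 13, 0)      # first fixed build, per the article


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading 'X.Y.Z' of a version string.

    Suffixes such as 'rc1' or '+cu121' are ignored, so check
    pre-release builds by hand.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version: {text!r}")
    return tuple(int(part) for part in match.groups())


@dataclass
class Deployment:
    name: str
    version: str
    prompt_embeds_enabled: Optional[bool]   # None means "not sure"
    reachable_by_untrusted: bool


def is_exposed(d: Deployment) -> bool:
    """True only when all three conditions line up."""
    affected = FIRST_AFFECTED <= parse_version(d.version) < FIRST_FIXED
    # Only an explicit False counts as off; "not sure" is treated as on.
    embeds_on = d.prompt_embeds_enabled is not False
    return affected and embeds_on and d.reachable_by_untrusted


fleet = [
    Deployment("rag-prod", "0.11.0", None, True),
    Deployment("batch-internal", "0.12.1", True, False),
    Deployment("chat-edge", "0.13.0", True, True),
]
for d in fleet:
    print(d.name, "EXPOSED" if is_exposed(d) else "not exposed by this check")
```

An unknown prompt embeds setting counts as enabled here, which is deliberately pessimistic. A "not exposed" result only means the three conditions do not all hold today; it says nothing about a deployment whose network reachability changes later.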

If prompt embeds is off, the default since the last patch, this particular flaw cannot reach you. That is both the reassurance and the trap. Plenty of teams flipped the setting on once for a pipeline experiment and never flipped it back. The honest answer to whether you are exposed usually starts with an audit, not a memory.

One more piece of calibration. The likely real-world outcome here is a crash, not a shell. Memory corruption is in scope, but turning a write-what-where into reliable code execution against a modern allocator is real work, and no one has shown it for this bug yet. We made the same point about two high-scoring NGINX flaws where the practical result on a default install was a downed process rather than a takeover. Score the urgency on exposure and exploitability, not on the worst line in the severity vector.

What to do this week

The fix is short and the order matters.

1. Upgrade to vLLM 0.13.0. It addresses the root cause for this flaw and for the 2025 one, so it is the durable answer rather than another deferral.
2. If you cannot upgrade today, turn prompt embeds off. That puts you back in the contained posture and removes the attack surface entirely until you can patch.
3. Inventory before you assume. Check your launch flags, Helm values, and orchestration manifests for the prompt embeds setting across every vLLM deployment. Treat "not sure" as "enabled."
4. Put the inference endpoint behind network controls. A model-serving API reachable from untrusted networks is the precondition that turns this from a config note into an incident. Most vLLM endpoints have no business being public.
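The inventory step is the easiest one to script. Here is a rough sketch that walks a directory of manifests and launch scripts and lists every line mentioning the setting. The search pattern is an assumption: it presumes the setting is spelled `enable-prompt-embeds` or `enable_prompt_embeds` in your configs, so confirm the exact name against the documentation for the vLLM version you run. The file suffix list is likewise only a starting point.

```python
import re
from pathlib import Path
from typing import Iterator, List, Tuple

# ASSUMPTION: the setting appears under one of these spellings.
# Verify the real flag name for your vLLM version before trusting a clean result.
PATTERN = re.compile(r"enable[-_]prompt[-_]embeds", re.IGNORECASE)

# ASSUMPTION: config lives in files with these suffixes; extend as needed.
SUFFIXES = {".yaml", ".yml", ".json", ".sh", ".env", ".toml"}


def find_mentions(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for each line naming the setting."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if PATTERN.search(line)
    ]


def scan_tree(root: Path) -> Iterator[Tuple[Path, int, str]]:
    """Yield (path, line_number, line) for every mention under root."""
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue   # unreadable file: skip it, do not abort the scan
        for number, line in find_mentions(text):
            yield path, number, line


# Example on an in-memory manifest fragment (hypothetical content).
sample = "containers:\n  - args:\n      - --enable-prompt-embeds\n"
for number, line in find_mentions(sample):
    print(f"line {number}: {line}")
```

To scan real files, iterate over `scan_tree(Path("deploy/"))` with whatever directory holds your manifests. A hit only shows that the setting is mentioned, not that it is active, and a clean scan does not prove it is off: the value may be set in an image entrypoint, an environment variable, or a file type the scan skips. Review every hit by hand and keep treating "not sure" as "enabled."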

The AI-serving tier keeps repeating the web's early mistakes at high speed. We have watched a single rigged document walk a Langflow file reader up to server takeover, and now a malformed tensor is doing the memory-safety equivalent inside vLLM. As more shops push raw tensors into serving stacks for throughput, the embeddings endpoint becomes the next deserialization frontier. Validate hostile input at the edge, or inherit a framework's defaults that were tuned for speed and never for an adversary.

Frequently asked questions

What is CVE-2026-56340 in vLLM?

CVE-2026-56340 is a high-severity flaw (CVSS 8.8) in the vLLM inference server. A crafted multimodal embedding tensor with out-of-range indices can crash the worker, and the advisory documents a path to out-of-bounds memory corruption. It affects versions 0.10.2 up to, but not including, 0.13.0.

Am I affected if I run vLLM?

Only if you enabled the prompt embeds feature, which has shipped off by default since the 2025 patch. You also need an affected build (0.10.2 to 0.12.x) and an endpoint a partly untrusted caller can reach. If prompt embeds is off, this flaw cannot reach you.

How do I fix CVE-2026-56340?

Upgrade to vLLM 0.13.0, which validates tensor indices and addresses the root cause. If you cannot upgrade immediately, turn the prompt embeds feature off to remove the attack surface, and keep the inference endpoint off untrusted networks until you patch.

Is CVE-2026-56340 being exploited?

No public exploitation or exploit code is known as of June 2026. A vLLM maintainer reported the flaw rather than finding it in an attack. Patch anyway, because the memory-corruption primitive is documented and a proof of concept often follows a public advisory.

How is this different from CVE-2025-62164?

Both flaws sit in the same prompt embeds feature, but they are distinct bugs. The 2025 fix only disabled the feature by default rather than validating input, so this sparse-tensor path stayed exploitable once a team re-enabled it. vLLM 0.13.0 fixes the shared root cause.
