vLLM, one of the most widely run open-source engines for serving large language models, has a fresh memory-safety bug in the exact feature it tried to lock down six months ago. CVE-2026-56340 lets anyone who can send a request to the inference API submit a malformed tensor that crashes the worker, with a documented route to out-of-bounds memory corruption. It carries a CVSS score of 8.8. The detail that decides whether it touches you is not whether you run vLLM. It is whether you switched its prompt embeds feature back on.
That single config choice is the whole story, so start there before you do anything else.
What actually broke
vLLM can accept multimodal embeddings as raw tensors through its prompt embeds path. PyTorch keeps its sparse-tensor invariant checks switched off by default, a speed tradeoff its own docs are open about. vLLM never added a check of its own, so a request carrying a sparse tensor with negative or out-of-range indices sails straight through. When the server expands that tensor into a dense one, the bad indices drive a write past the allocated buffer. The mild outcome is a crashed worker and a denial of service. The advisory also describes the worse one: a write-what-where condition, which is the raw material for code execution.
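To make the failure mode concrete, here is a minimal pure-Python model of where a dense expansion would write for a given sparse index. It is an illustration under simplified assumptions (a row-major buffer and a hypothetical `flat_offset` helper), not vLLM or PyTorch code.

```python
from math import prod


def flat_offset(index, shape):
    """Row-major offset a dense write would target for one sparse index."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset


shape = (4, 8)            # the dense buffer holds prod(shape) == 32 elements
buffer_len = prod(shape)

good = flat_offset((3, 7), shape)       # last valid element
bad_high = flat_offset((9, 0), shape)   # row index beyond the 4 rows
bad_neg = flat_offset((-2, 0), shape)   # negative row index

for name, off in [("good", good), ("bad_high", bad_high), ("bad_neg", bad_neg)]:
    inside = 0 <= off < buffer_len
    print(f"{name}: offset={off}, inside buffer={inside}")
```

In this model the valid index lands at offset 31, while the two bad ones land at 72 and -16, both outside the 32-element buffer. Native code that trusts those offsets writes to memory it does not own. Note also that an index like `(0, 9)` produces offset 9, which is inside the buffer but is the wrong cell, so a check on the final offset alone would not be enough.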
The flaw lands in vLLM 0.10.2 through 0.12.x and is fixed in 0.13.0. It was reported by a vLLM maintainer, not found in an attack, and there is no public exploit as of this writing. Treat that as breathing room, not safety: the memory-corruption path is spelled out in the vendor advisory, and a documented primitive tends to attract a proof of concept.
Why this is the same bug twice
Last year's CVE-2025-62164 hit the same prompt embeds surface. The response was to ship the feature switched off by default instead of validating what it accepts. That move contained the blast radius, but it quietly handed the risk to every operator who turned the feature back on. CVE-2026-56340 is the proof that the underlying problem was never solved: a different bad-tensor path, in the same place, reachable the moment the feature is live again.
This is the part worth dwelling on. Shipping a feature off by default is a containment decision, not a fix. It buys time and it lowers the number of exposed installs, but it leaves the dangerous code intact and shifts the duty of care onto operators who may not even remember opting in. When the same component generates a second memory-safety CVE, the lesson is that policing this input was never PyTorch's job. It was the serving layer's.
The deeper pattern is one every team standing up an inference API should sit with. Model-serving stacks inherit PyTorch's performance defaults, including the disabled invariant checks, and then treat the embeddings endpoint as a friendly data plane. A tensor with attacker-chosen indices is not friendly data. It is hostile input that reaches a memory operation, which makes the embeddings API a deserialization surface in everything but name. vLLM 0.13.0 finally does what the API boundary always needed to do: it validates that the indices are non-negative and within bounds.
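A boundary check of that kind is small. The sketch below is a pure-Python illustration of the rule "every index is non-negative and within bounds", using a hypothetical `validate_coo_indices` helper and a simple list-of-tuples layout. It is not the 0.13.0 patch and makes no claim about how vLLM implements the check.

```python
def validate_coo_indices(indices, shape):
    """Raise ValueError unless every coordinate is an int in [0, dim).

    indices: iterable of coordinate tuples, one per non-zero element
             (a hypothetical layout chosen for this sketch).
    shape:   tuple of dense dimension sizes.
    """
    for pos, index in enumerate(indices):
        if len(index) != len(shape):
            raise ValueError(
                f"element {pos}: expected {len(shape)} coordinates, got {len(index)}"
            )
        for axis, (i, dim) in enumerate(zip(index, shape)):
            # bool is a subclass of int in Python, so reject it explicitly
            if isinstance(i, bool) or not isinstance(i, int):
                raise ValueError(f"element {pos}, axis {axis}: non-integer index {i!r}")
            if not 0 <= i < dim:
                raise ValueError(
                    f"element {pos}, axis {axis}: index {i} outside [0, {dim})"
                )


# Accepted: both coordinates fall inside a 4 x 8 tensor.
validate_coo_indices([(0, 0), (3, 7)], (4, 8))

# Rejected before any dense buffer is touched.
try:
    validate_coo_indices([(0, 0), (-2, 0)], (4, 8))
except ValueError as exc:
    print("rejected:", exc)
```

The design point is ordering: the check runs per dimension and before any expansion, so a hostile index never reaches a memory operation.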
Who is actually exposed
Three conditions have to line up. You run an affected vLLM build (0.10.2 up to but not including 0.13.0). You enabled prompt embeds, which teams commonly do to feed precomputed multimodal embeddings straight into the model for retrieval or image pipelines. And the endpoint is reachable by a caller you do not fully trust, since the bug needs only the ability to send an API request, not an authentication bypass.
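Those three conditions reduce to a simple conjunction. The following is a rough triage sketch, assuming plain `X.Y.Z` version strings; the helper names are hypothetical and pre-release suffixes such as `rc1` are not handled.

```python
AFFECTED_FROM = (0, 10, 2)   # first affected release, per the advisory range above
FIXED_IN = (0, 13, 0)        # first fixed release


def parse_version(text):
    """Turn '0.12.1' into (0, 12, 1). Raises ValueError on non-numeric parts."""
    return tuple(int(part) for part in text.strip().split("."))


def is_exposed(version, prompt_embeds_enabled, untrusted_callers):
    """True only when all three conditions hold at once."""
    affected = AFFECTED_FROM <= parse_version(version) < FIXED_IN
    return affected and prompt_embeds_enabled and untrusted_callers


print(is_exposed("0.12.1", True, True))    # affected build, feature on, reachable
print(is_exposed("0.12.1", False, True))   # feature off
print(is_exposed("0.13.0", True, True))    # patched build
```

Only the first call reports exposure. Flipping any single input to the safe value is enough to clear it, which is why the feature flag matters as much as the version number.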
If prompt embeds is off, the default since the response to CVE-2025-62164, this particular flaw cannot reach you. That is both the reassurance and the trap. Plenty of teams flipped the setting on once for a pipeline experiment and never flipped it back. The honest answer to whether you are exposed usually starts with an audit, not a memory.
One more piece of calibration. The likely real-world outcome here is a crash, not a shell. Memory corruption is in scope, but turning a write-what-where into reliable code execution against a modern allocator is real work, and no one has shown it for this bug yet. We made the same point about two high-scoring NGINX flaws where the practical result on a default install was a downed process rather than a takeover. Score the urgency on exposure and exploitability, not on the worst line in the severity vector.
What to do this week
The fix is short and the order matters.
- Upgrade to vLLM 0.13.0. It addresses the root cause for this flaw and for the 2025 one, so it is the durable answer rather than another deferral.
- If you cannot upgrade today, turn prompt embeds off. That puts you back in the contained posture and removes the attack surface entirely until you can patch.
- Inventory before you assume. Check your launch flags, Helm values, and orchestration manifests for the prompt embeds setting across every vLLM deployment. Treat "not sure" as "enabled."
- Put the inference endpoint behind network controls. A model-serving API reachable from untrusted networks is the precondition that turns this from a config note into an incident. Most vLLM endpoints have no business being public.
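The inventory step can be scripted. This is a minimal sketch that classifies deployments from their launch arguments and treats an unknown config as enabled. The flag name `--enable-prompt-embeds` is an assumption here, as are the deployment names; confirm the actual setting against the documentation for the vLLM version you run.

```python
# Assumption: the name of the launch flag. Verify it for your vLLM version.
ASSUMED_FLAG = "--enable-prompt-embeds"


def prompt_embeds_status(launch_args, flag=ASSUMED_FLAG):
    """Classify one deployment from its launch arguments.

    launch_args: list of argv strings, or None when the config was not found.
    Any appearance of the flag, including 'flag=value', counts as enabled,
    which errs on the side of caution.
    """
    if launch_args is None:
        return "enabled"   # "not sure" is treated as "enabled"
    for arg in launch_args:
        if arg == flag or arg.startswith(flag + "="):
            return "enabled"
    return "disabled"


# Hypothetical deployments, for illustration only.
deployments = {
    "chat-prod": ["--model", "example/model", "--port", "8000"],
    "rag-experiment": ["--model", "example/model", ASSUMED_FLAG],
    "legacy-batch": None,   # manifest could not be located
}

for name, args in deployments.items():
    print(f"{name}: {prompt_embeds_status(args)}")
```

A real audit would pull the arguments from Helm values or orchestration manifests rather than a hard-coded dictionary, but the classification rule stays the same.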
The AI-serving tier keeps repeating the web's early mistakes at high speed. We have watched a single rigged document walk a Langflow file reader up to server takeover, and now a malformed tensor doing the memory-safety equivalent inside vLLM. As more shops push raw tensors into serving stacks for throughput, the embeddings endpoint becomes the next deserialization frontier. Validate hostile input at the edge, or inherit a framework's defaults that were tuned for speed and never for an adversary.