Windows Developer Blog:
AI is moving closer to the edge, and Copilot+ PCs are leading the way. With cloud-hosted DeepSeek R1 now available on Azure AI Foundry, we’re bringing NPU-optimized versions of DeepSeek-R1 directly to Copilot+ PCs, starting with Qualcomm Snapdragon X and followed by Intel Core Ultra 200V and others. The first release, DeepSeek-R1-Distill-Qwen-1.5B (Source), will be available in AI Toolkit, with the 7B (Source) and 14B (Source) variants arriving soon. These optimized models let developers build and deploy AI-powered applications that run efficiently on-device, taking full advantage of the powerful NPUs in Copilot+ PCs.
The Neural Processing Unit (NPU) on Copilot+ PCs offers a highly efficient engine for model inferencing, unlocking a paradigm where generative AI can execute not just when invoked, but also power semi-continuously running services. This empowers developers to tap into powerful reasoning engines to build proactive and sustained experiences. With our work on Phi Silica, we were able to harness highly efficient inferencing, delivering very competitive time to first token and throughput rates while minimally impacting battery life and PC resource consumption. The optimized DeepSeek models for the NPU take advantage of several key learnings and techniques from that effort, including how we separate out the various parts of the model to drive the best tradeoffs between performance and efficiency, low-bit-rate quantization, and mapping transformers to the NPU. Additionally, we take advantage of the Windows Copilot Runtime (WCR) to scale across the diverse Windows ecosystem with the ONNX QDQ format.
Get ready to play!
First things first…let’s give it a whirl.
To see DeepSeek in action on your Copilot+ PC, simply download the AI Toolkit VS Code extension. The DeepSeek model optimized in the ONNX QDQ format will soon be available in AI Toolkit’s model catalog, pulled directly from Azure AI Foundry. You can download it locally by clicking the “Download” button. Once downloaded, experimenting with the model is as simple as opening the Playground, loading the “deepseek_r1_1_5” model, and sending it prompts.
In addition to the ONNX model optimized for Copilot+ PCs, you can also try the cloud-hosted source model in Azure AI Foundry by clicking the “Try in Playground” button under “DeepSeek R1”.
AI Toolkit is part of your developer workflow as you experiment with models and get them ready for deployment. With this playground, you can effortlessly test the DeepSeek models available in Azure AI Foundry for local deployment. Get started with AI Toolkit for Visual Studio Code | Microsoft Learn.
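If you prefer to drive the locally downloaded model from code rather than the Playground UI, AI Toolkit can also serve downloaded models behind a local OpenAI-compatible REST endpoint. The sketch below is illustrative only: the port (5272) and the model identifier are assumptions based on the extension’s defaults, so check what your AI Toolkit installation actually reports before running it.

```python
# Minimal sketch: calling a locally downloaded AI Toolkit model through its
# OpenAI-compatible local endpoint. The base_url port and the model name are
# assumptions; adjust them to match your AI Toolkit setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5272/v1/",  # assumed default local inference endpoint
    api_key="unused",                      # the local server does not validate the key
)

response = client.chat.completions.create(
    model="deepseek_r1_1_5",  # name as shown in the AI Toolkit model catalog
    messages=[{"role": "user", "content": "Explain why the sky is blue in two sentences."}],
    max_tokens=512,
)

print(response.choices[0].message.content)
```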
Silicon Optimizations
The distilled Qwen 1.5B model consists of a tokenizer, an embedding layer, a context processing model, a token iteration model, a language model head, and a detokenizer.
We use 4-bit blockwise quantization for the embeddings and language model head and run these memory-access-heavy operations on the CPU. We focus the bulk of our NPU optimization efforts on the compute-heavy transformer block containing the context processing and token iteration, where we employ int4 per-channel quantization and selective mixed precision for the weights, alongside int16 activations.
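To make the two quantization styles concrete, here is a toy numpy sketch contrasting 4-bit blockwise quantization (one scale per block of values, as used for the embeddings and language model head) with int4 per-channel quantization (one scale per output channel, as used for the transformer weights). The block size and the symmetric-quantization choice are illustrative assumptions, not the exact recipe used in the shipped model.

```python
import numpy as np

def quantize_blockwise_int4(w, block_size=32):
    """4-bit blockwise quantization: each block of `block_size` values gets its own scale."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0   # symmetric int4 range [-8, 7]
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q, scale

def quantize_per_channel_int4(w):
    """int4 per-channel quantization: one scale per output channel (row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

# Embedding / LM-head style weights -> blockwise; transformer weights -> per-channel.
emb = np.random.randn(1024, 256).astype(np.float32)
proj = np.random.randn(256, 256).astype(np.float32)

q_emb, s_emb = quantize_blockwise_int4(emb)
q_proj, s_proj = quantize_per_channel_int4(proj)

# Dequantize and report mean reconstruction error for each scheme.
print("blockwise error:  ", np.abs(q_emb * s_emb - emb.reshape(q_emb.shape)).mean())
print("per-channel error:", np.abs(q_proj * s_proj - proj).mean())
```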
While the Qwen 1.5B release from DeepSeek does have an int4 variant, it does not map directly to the NPU due to the presence of dynamic input shapes and behavior, all of which required optimization to make them compatible and to extract the best efficiency. Additionally, we use the ONNX QDQ format to enable scaling across the variety of NPUs in the Windows ecosystem, and we work out an optimal operator layout between the CPU and NPU for maximum power efficiency and speed.
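In practice, a QDQ-format ONNX model is loaded with ONNX Runtime and dispatched to whichever execution provider the device exposes, with CPU fallback for operators the NPU does not cover. A minimal sketch under assumptions (a hypothetical local model path and the QNN execution provider used on Snapdragon X devices):

```python
# Sketch: load a QDQ ONNX model and let ONNX Runtime place operators on the NPU
# where possible, falling back to CPU otherwise. Model path and provider options
# are illustrative assumptions.
import onnxruntime as ort

session = ort.InferenceSession(
    "deepseek_r1_distill_qwen_1_5b_qdq.onnx",                      # hypothetical local path
    providers=[
        ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),  # Snapdragon NPU backend
        "CPUExecutionProvider",                                    # fallback for unsupported ops
    ],
)

# Inspect which inputs the exported graph expects before wiring up tokenization.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```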
To achieve the dual goals of low memory footprint and fast inference, much like Phi Silica, we make two key changes. First, we leverage a sliding window design that unlocks super-fast time to first token and long context support despite the lack of dynamic tensor support in the hardware stack. Second, we use the 4-bit QuaRot quantization scheme to truly take advantage of low-bit processing. QuaRot employs Hadamard rotations to remove outliers in weights and activations, making the model easier to quantize, and it significantly improves quantization accuracy compared to existing methods such as GPTQ, particularly for low-granularity settings such as per-channel quantization. The combination of low-bit quantization and hardware optimizations such as the sliding window design helps deliver the behavior of a larger model within the memory footprint of a compact model. With these optimizations in place, the model achieves a time to first token of 130 ms and a throughput of 16 tokens/s for short prompts (<64 tokens).
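The intuition behind the Hadamard rotations is that multiplying weights and activations by an orthogonal Hadamard matrix spreads outlier values across channels without changing the layer’s output (the rotation can be folded into adjacent weights), so the rotated tensors quantize with far less error. The toy sketch below illustrates that effect; the outlier pattern and the simple per-tensor int4 quantizer are purely illustrative, not the actual QuaRot implementation.

```python
import numpy as np
from scipy.linalg import hadamard

def int4_round_trip(x):
    """Symmetric per-tensor int4 quantize/dequantize, to measure quantization error."""
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

n = 64                               # Hadamard matrices exist for powers of two
H = hadamard(n) / np.sqrt(n)         # orthogonal rotation: H @ H.T == I

x = np.random.randn(1024, n)
x[:, 3] *= 25.0                      # a single outlier channel dominates the dynamic range

err_plain = np.abs(int4_round_trip(x) - x).mean()

x_rot = x @ H                        # rotation spreads the outlier energy across channels
err_rot = np.abs(int4_round_trip(x_rot) @ H.T - x).mean()   # rotate back, compare to original

print(f"quantization error without rotation: {err_plain:.4f}")
print(f"quantization error with Hadamard rotation: {err_rot:.4f}")
```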
We include examples of the original and quantized model responses below to show the minor differences between the two variants, with the latter being both fast and power-efficient:
Figure 1: Qualitative comparison. Sample responses from the original model (left) vs NPU-optimized model (right) for the same prompt, including the model’s reasoning capability. The model follows a similar reasoning pattern, and reaches the same answer, demonstrating that the optimized model retains the reasoning ability of the original model.
With the speed and power characteristics of the NPU-optimized versions of the DeepSeek R1 models, users will be able to interact with these ground-breaking models entirely locally.
Source:
Running Distilled DeepSeek R1 models locally on Copilot+ PCs, powered by Windows Copilot Runtime
blogs.windows.com