
What Is a Neural Engine in 2026 and How Does It Work?

  • Mar 15
  • 22 min read

Your phone recognizes your face in the dark in a fraction of a second. Your laptop transcribes speech without touching the internet. Your earbuds adapt to background noise in real time. None of that runs on the main processor—it runs on a chip most people have never heard of. The neural engine is the silent workhorse of the AI era, and in 2026, almost every premium device ships with one. Understanding it is no longer optional for anyone serious about technology.

 


 

TL;DR

  • A neural engine (also called a Neural Processing Unit or NPU) is a dedicated hardware chip designed specifically to run AI and machine learning (ML) tasks.

  • It performs matrix multiplications—the core math of neural networks—far faster and more efficiently than a general-purpose CPU or GPU.

  • Apple popularized the term "Neural Engine" with the A11 Bionic in 2017; by 2026, virtually every major chipmaker ships one.

  • Modern neural engines deliver between 10 and 50+ TOPS (Tera Operations Per Second), with Apple's M4 chip reaching 38 TOPS (Apple, 2024).

  • They power Face ID, real-time translation, image enhancement, on-device LLMs, and health monitoring—all without sending data to the cloud.

  • The shift to on-device AI driven by neural engines is fundamentally changing privacy, latency, and energy consumption in consumer tech.


What Is a Neural Engine?

A neural engine is a specialized processor built into a chip (SoC) that accelerates artificial intelligence and machine learning tasks. Unlike a CPU or GPU, it is purpose-built for the matrix math at the heart of neural networks. This lets it run AI workloads up to 10× faster while using a fraction of the power, enabling real-time, on-device AI without a cloud connection.






Background & Definitions

A neural engine is a type of processor core—or a cluster of cores—within a System-on-Chip (SoC) that is designed from the ground up to run machine learning inference workloads. The word "inference" is important here. Inference means taking a trained AI model and running it on new data to get a result—for example, recognizing that a photo contains a dog, or turning spoken words into text.


Training AI models (teaching the model from scratch using millions of data points) still predominantly happens on data center hardware. The neural engine's job is the other half: running already-trained models quickly and efficiently on your device.


The broader industry term for this type of chip is NPU (Neural Processing Unit). "Neural Engine" is Apple's branded name, first used publicly in 2017. Other companies use different names: Qualcomm calls theirs the Hexagon NPU; Google uses Tensor Processing Units (TPUs) in its data centers and integrates an NPU into the Tensor chips in its Pixel phones; Samsung embeds an NPU inside its Exynos processors; and MediaTek brands its version the APU (AI Processing Unit).


Regardless of the brand name, they all do the same fundamental job: offload AI math from the CPU and GPU to purpose-built silicon that executes it faster and with less power.


Key Terms (Quick Definitions)

Term | Simple Meaning
SoC (System-on-Chip) | A single chip that packs a CPU, GPU, memory controller, neural engine, and other cores together
Inference | Running a trained AI model to get a result from new input
TOPS | Tera Operations Per Second — 1 trillion math operations every second; the standard benchmark for NPU speed
Matrix multiplication | The core math operation of neural networks; multiplying large grids of numbers
On-device AI | AI processing that happens locally on your phone/laptop, not sent to a remote server
Model | A trained AI program (e.g., the face recognition model in Face ID)

How a Neural Engine Works

To understand a neural engine, you first need to understand what neural networks actually do mathematically.


The Math Behind AI

Every neural network, whether it recognizes faces or translates languages, boils down to the same core operation: matrix multiplication followed by a non-linear function (called an activation function). A matrix is just a grid of numbers. Multiply two big matrices together, apply a function to the result, repeat thousands of times across dozens of layers—that is a neural network making a prediction.


These operations share a key property: they are massively parallel. You do not need to wait for one multiplication to finish before starting the next. You can do millions simultaneously. This is why CPUs—designed for sequential, branching logic—are inefficient at this work. CPUs are versatile generalists. Neural networks need a specialist.
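To make that concrete, here is a tiny NumPy sketch of a two-layer network doing exactly this: matrix multiply, add a bias, apply ReLU, repeat. The layer sizes and random weights are invented purely for illustration.

```python
import numpy as np

def relu(x):
    # Non-linear activation: keep positive values, zero out negatives
    return np.maximum(x, 0)

def layer(x, weights, bias):
    # One neural-network layer = matrix multiplication + activation
    return relu(x @ weights + bias)

rng = np.random.default_rng(0)
x = rng.standard_normal(512)          # input features (pixels, audio samples, etc.)
w1 = rng.standard_normal((512, 256))  # layer 1 weights: 512 x 256 = 131,072 multiplies
w2 = rng.standard_normal((256, 10))   # layer 2 weights
h = layer(x, w1, np.zeros(256))       # every multiply inside this matmul is independent,
y = layer(h, w2, np.zeros(10))        # so hardware can run huge batches of them in parallel
print(y.shape)                        # (10,) -- e.g., scores for 10 possible classes
```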


What the Neural Engine Does Differently

A neural engine is built around multiply-accumulate (MAC) units. Each MAC unit multiplies two numbers and adds the result to a running total in a single clock cycle. A modern neural engine packs thousands of these MAC units and fires them all in parallel on every clock tick.
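Each individual MAC is trivial; the sketch below (plain Python standing in for hardware, with made-up example numbers) shows the multiply-accumulate loop that a neural engine replicates across thousands of units at once.

```python
def dot_product_with_macs(inputs, weights):
    # One output value of a matrix multiply is just a chain of MAC operations.
    acc = 0.0
    for a, w in zip(inputs, weights):
        acc += a * w   # multiply-accumulate: what a single MAC unit does each clock cycle
    return acc

# A CPU works through this loop largely one step at a time; a neural engine assigns
# each output element (and many of the MACs inside it) to its own hardware unit.
print(dot_product_with_macs([1.0, 2.0, 3.0], [0.5, -1.0, 2.0]))  # 4.5
```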


Apple's A18 Pro chip, for instance, contains a 16-core Neural Engine. Those cores are not general-purpose cores like the CPU cores. They are arrays of MAC units, data buffers, and local memory, all wired together to pump data through matrix operations as fast as physics allows.


The basic pipeline looks like this:

  1. Model loading: The trained AI model (weights and instructions) is loaded into the neural engine's local memory.

  2. Input staging: New data (e.g., a camera frame, audio sample, or sensor reading) is fed in.

  3. MAC execution: The MAC units multiply the input data against the model's weight matrices in parallel across thousands of units simultaneously.

  4. Activation functions: Non-linear functions (ReLU, sigmoid, etc.) are applied to the output—these are also handled in hardware.

  5. Output delivery: The result (a classification, a transcription, a recommendation) is passed to the CPU or application layer.


This whole pipeline, for a simple task like face unlock, completes in a few milliseconds.
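Expressed as code, the pipeline above looks roughly like the sketch below. This is plain NumPy standing in for hardware stages; the function, shapes, and weights are invented for illustration only.

```python
import numpy as np

def npu_inference(frame, weights, biases):
    # 1. Model loading: on a real NPU, weights and biases already sit in local memory.
    # 2. Input staging: flatten the camera frame (or audio buffer) into a vector.
    x = frame.reshape(-1).astype(np.float32)
    # 3 + 4. MAC execution and activation, layer by layer (done in hardware on a real NPU).
    for w, b in zip(weights, biases):
        x = np.maximum(x @ w + b, 0.0)  # matmul on the MAC array, ReLU in activation hardware
    # 5. Output delivery: hand the result (e.g., class scores) back to the CPU or app layer.
    return x

# Toy usage: a fake 8x8 "frame" through two made-up layers.
rng = np.random.default_rng(0)
frame = rng.random((8, 8))
weights = [rng.standard_normal((64, 32)), rng.standard_normal((32, 4))]
biases = [np.zeros(32), np.zeros(4)]
print(npu_inference(frame, weights, biases))
```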


The Role of Quantization

Real-world neural engines do not always work with 32-bit floating-point numbers (the standard used in training). They often use quantization—reducing number precision to 8-bit integers (INT8) or even 4-bit integers (INT4). Lower precision means less data to move, more operations packed into each clock cycle, and lower power draw—with only a tiny drop in accuracy for most tasks. Quantization is a key reason why a neural engine can run models that would be impossibly slow on a CPU.
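Here is a minimal sketch of symmetric INT8 quantization, the basic idea behind what NPU toolchains do automatically. Real frameworks use per-channel scales, calibration data, and smarter rounding; the numbers below are illustrative.

```python
import numpy as np

def quantize_int8(weights):
    # Map float32 weights onto the int8 range [-127, 127] with a single scale factor.
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
print(w.nbytes, "bytes as float32 ->", q.nbytes, "bytes as int8")   # 4x smaller
print("mean rounding error:", float(np.mean(np.abs(w - dequantize(q, scale)))))
```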


The Architecture Inside a Neural Engine

Different vendors implement neural engine architecture differently, but the common building blocks are:


1. MAC Array

The heart of any NPU. This is a grid of Multiply-Accumulate units. The larger and wider the array, the more parallel multiplications it can perform simultaneously. Apple's Neural Engine uses a 2D array structure; Qualcomm's Hexagon uses a tensor accelerator with a different tiling approach.


2. On-Chip SRAM (Static RAM)

Moving data between memory and compute is slow and energy-expensive. Neural engines embed large blocks of fast, on-chip SRAM (Static RAM) close to the MAC arrays to keep data movement short. Apple's M-series chips are notable for their unified memory architecture, which further reduces data movement bottlenecks between the CPU, GPU, and Neural Engine (Apple, 2023).


3. Direct Memory Access (DMA) Controller

A dedicated controller that moves large data blocks (model weights, activation outputs) in and out of the neural engine's local memory without involving the CPU.


4. Activation and Pooling Hardware

Dedicated small circuits that apply common non-linear functions (ReLU, sigmoid, tanh, softmax) and pooling operations (downsampling outputs) in hardware, rather than sending these back to the CPU.


5. Instruction Scheduler

A small controller that reads the compiled neural network graph and dispatches operations to the MAC array in the correct order.


A Brief History: From CPU to Neural Engine


Pre-2017: GPU as AI Accelerator

Before dedicated neural engines, AI inference ran on CPUs (slow) or GPUs (faster but power-hungry). NVIDIA's CUDA platform, introduced in 2007, enabled the GPU-based deep learning boom of the early 2010s. GPUs are excellent at parallel tasks, but they were designed for graphics rendering. Running AI inference on a mobile GPU drains a phone battery quickly and generates heat.


2017: Apple Introduces the Neural Engine

On September 12, 2017, Apple announced the A11 Bionic chip inside the iPhone 8 and iPhone X. It contained one of the first commercially deployed neural engines in a smartphone. Apple described it as a two-core design capable of performing 600 billion operations per second (0.6 TOPS) (Apple, 2017). It powered Face ID, Animoji, and real-time image processing.


This was a watershed moment. The idea that a consumer phone could run neural networks locally—without cloud round-trips—changed the industry's direction overnight.


2018–2020: The Industry Responds

  • Huawei's Kirin 970 (2017, announced at IFA Berlin) also featured an NPU, making it one of the earliest smartphones with dedicated AI silicon alongside Apple.

  • Qualcomm Snapdragon 845 (2018) added its Hexagon 685 DSP with expanded AI capability.

  • Google TPU v3 (2018) scaled cloud AI training to 100 petaflops per pod (Google, 2018), but Google's Pixel phones did not get an NPU integrated into their main SoC until the Tensor chip in the Pixel 6 (2021).

  • ARM's Ethos NPU line (which grew out of Project Trillium, announced in 2018) gave smaller chipmakers a licensable neural engine IP block, democratizing access to NPU silicon.


2020–2023: Integration and Scale

Apple's M1 chip (November 2020) brought a 16-core Neural Engine with 11 TOPS to the Mac platform. This was the first time a laptop chip had a fully integrated neural engine of this caliber. The M1 Max and Ultra variants followed, and by 2023, the Apple M3 family reached up to 18 TOPS on its Neural Engine (Apple, 2023).


Meanwhile, Qualcomm's Snapdragon 8 Gen 2 (2022) hit 26 TOPS on its Hexagon NPU—its highest on-device AI performance to that date.


Microsoft launched Copilot+ PC requirements in May 2024, mandating that qualifying laptops have an NPU capable of at least 40 TOPS—signaling that neural engine performance had become a PC purchasing criterion, not just a phone spec (Microsoft, 2024).


2024–2026: The 40+ TOPS Era

By late 2024:

  • Apple A18 Pro (iPhone 16 Pro): 16-core Neural Engine, 35+ TOPS (Apple, 2024)

  • Apple M4 (iPad Pro, MacBook Pro): 16-core Neural Engine, 38 TOPS (Apple, 2024)

  • Qualcomm Snapdragon 8 Elite (2024): Hexagon NPU with 45 TOPS (Qualcomm, 2024)

  • Intel Core Ultra 200V (2024, Lunar Lake): Intel AI Boost NPU, 48 TOPS (Intel, 2024)

  • AMD Ryzen AI 300 series (2024): XDNA 2 NPU, 50 TOPS (AMD, 2024)


By 2026, TOPS figures above 40 are standard in flagship chips, and even mid-range chips routinely exceed 20 TOPS.


Who Makes Neural Engines in 2026?

Company | Product Name | Key Chips | Peak TOPS (as of 2024–2025 releases) | Source
Apple | Neural Engine | A18 Pro, M4, M4 Pro | 35–38+ TOPS | Apple, 2024
Qualcomm | Hexagon NPU | Snapdragon 8 Elite | 45 TOPS | Qualcomm, 2024
Intel | Intel AI Boost | Core Ultra 200V (Lunar Lake) | 48 TOPS | Intel, 2024
AMD | XDNA 2 NPU | Ryzen AI 300 | 50 TOPS | AMD, 2024
MediaTek | APU (AI Processing Unit) | Dimensity 9400 | 35+ TOPS | MediaTek, 2024
Samsung | NPU | Exynos 2500 | ~35 TOPS | Samsung, 2024
Google | Tensor NPU | Tensor G4 (Pixel 9) | Not publicly disclosed | Google, 2024
ARM | Ethos NPU | Licensed IP (various OEMs) | Scalable | ARM, 2024
NVIDIA | DLA (Deep Learning Accelerator) | Orin, Thor (automotive/embedded) | Up to 275 TOPS (Orin) | NVIDIA, 2023

Note: TOPS figures are manufacturer-stated peak throughput and vary based on data precision (INT8, INT4). Always compare at the same precision level.

Performance Benchmarks: What TOPS Really Means

TOPS stands for Tera Operations Per Second—one trillion simple math operations every second. But this number alone can mislead.
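As a rough illustration of where a vendor's TOPS figure comes from, and why precision changes it, here is a back-of-the-envelope calculation. The MAC count and clock speed are made-up example values, not any specific chip.

```python
# Peak TOPS ≈ MAC units × 2 ops per MAC (one multiply + one add) × clock frequency
mac_units = 16_000      # hypothetical number of INT8 MAC units
clock_hz = 1.5e9        # hypothetical 1.5 GHz NPU clock

peak_int8_tops = mac_units * 2 * clock_hz / 1e12
print(f"Peak INT8 TOPS: {peak_int8_tops:.0f}")   # 48

# If the same silicon packs two INT4 operations where one INT8 operation fits,
# the headline figure doubles even though the chip itself is unchanged.
peak_int4_tops = peak_int8_tops * 2
print(f"Peak INT4 TOPS: {peak_int4_tops:.0f}")   # 96
```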


Why raw TOPS is not the whole story:

  • Precision matters: A chip running INT4 math (4-bit integers) can claim roughly twice the TOPS of the same chip running INT8 (8-bit integers), because each operation uses half as many bits, so the same hardware can fit twice as many per clock cycle. Comparing TOPS across different precision levels is apples to oranges.


  • Model compatibility: A neural engine's TOPS only applies to workloads compiled specifically for it. Not every AI model runs on the NPU—poorly optimized software can fall back to the CPU.


  • Real-world throughput vs. peak: Peak TOPS is a theoretical ceiling under sustained load. Real workloads often run at 30–70% of peak due to memory bandwidth limits and model structure.


More useful real-world benchmarks:

The AI Benchmark project (ETH Zurich and Google Research) runs standardized tasks on mobile chips. In their 2024 benchmark suite, the Apple A17 Pro completed their on-device inference battery at roughly 2× the speed of the Snapdragon 8 Gen 3 on identical tasks, despite similar TOPS claims—demonstrating that architecture and software optimization matter as much as raw numbers (AI Benchmark, 2024).


Real-World Applications

Neural engines are not a laboratory curiosity. Here are the documented applications they power today:


Face Recognition (Face ID / Fingerprint / Face Unlock)

Apple's Face ID on iPhone uses the Neural Engine to process 30,000 infrared dot projections and a flood illuminator reading in real time—under 1 second—even in the dark. The model runs entirely on-device; your face data never leaves the chip's Secure Enclave (Apple Security Guide, 2024).


Real-Time Translation and Transcription

Apple's on-device transcription (available since iOS 17) and Google's on-device translation (Pixel 6 and later) run neural network language models on the NPU. With a capable neural engine, live transcription works even in airplane mode.


Photography: Computational Photography

When you tap the shutter on a modern smartphone, the neural engine is often doing more work than the camera sensor itself. Tasks include:

  • Scene detection (identifying what's in the frame to choose optimal settings)

  • Portrait mode depth estimation (segmenting subject from background)

  • Night mode (stacking and aligning multiple exposures)

  • Super-resolution (upscaling using learned pixel patterns)


Apple's Photonic Engine pipeline, introduced with iPhone 14, uses the Neural Engine to apply computational photography at the deep fusion stage—before compression—for higher quality (Apple, 2022).


On-Device Large Language Models

Since 2024, shrunk versions of large language models (LLMs) run locally on devices with powerful neural engines. Apple Intelligence, launched with iOS 18 and macOS Sequoia (2024), uses the Neural Engine to run its on-device language model—a ~3 billion parameter model—for tasks like summarization, writing suggestions, and smart replies, with no data sent to Apple servers (Apple, 2024).
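A quick back-of-the-envelope calculation shows why quantization is what makes models of this size fit on a phone at all (the parameter count is the only number taken from above; the RAM figure is illustrative):

```python
params = 3e9  # roughly 3 billion parameters, as in Apple's on-device model

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB just for the weights")

# FP16: ~6.0 GB, INT8: ~3.0 GB, INT4: ~1.5 GB.
# On a phone whose RAM is shared with the OS, camera, and apps,
# only the quantized versions leave room for anything else to run.
```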


Health Monitoring

Apple Watch uses its dedicated neural engine cores for real-time ECG analysis, atrial fibrillation detection, fall detection, and crash detection. The FDA-cleared AFib detection algorithm runs entirely on the watch, processing sensor data continuously (Apple Heart Study, Stanford Medicine, 2019—algorithm published; watch-based deployment from 2022 onward).


Automotive AI

NVIDIA's Orin SoC (2022+), used in vehicles from Mercedes-Benz, BYD, and others, contains a Deep Learning Accelerator (DLA) that handles object detection and lane segmentation at up to 275 TOPS. This is a neural engine variant built for automotive safety requirements (ISO 26262 functional safety standard).


Case Studies


Case Study 1: Apple's A11 Bionic — Proving On-Device AI at Scale (2017)

What happened: When Apple launched the iPhone X in October 2017, the A11 Bionic's Neural Engine handled Face ID's real-time face recognition. The challenge was performing 3D face mapping fast enough to unlock the phone before a user had time to feel frustrated—while also being secure enough for Apple Pay.


The outcome: Apple reported a false-accept rate of 1 in 1,000,000 (compared to 1 in 50,000 for Touch ID). The Neural Engine ran the face recognition model in under 30 milliseconds without touching the CPU. No biometric data was transmitted off-device. Apple sold over 77 million iPhones in the launch quarter, with the iPhone X its top seller (Apple Q1 2018 Earnings Report, February 1, 2018).


Why it matters: This case proved that on-device AI at scale was commercially viable, directly compelling Samsung, Qualcomm, and MediaTek to accelerate their own NPU development.


Source: Apple, "A11 Bionic — Neural Engine," developer.apple.com, 2017; Apple Q1 FY2018 Earnings, 2018-02-01.


Case Study 2: Qualcomm Snapdragon 8 Elite and the On-Device Generative AI Push (2024)

What happened: When Qualcomm unveiled the Snapdragon 8 Elite in October 2024, the company demonstrated running a 13 billion parameter LLM locally on a smartphone at over 20 tokens per second using the Hexagon NPU's 45 TOPS rating. This was documented at Qualcomm's AI Day event and independently confirmed by AnandTech's review of the chip.


The outcome: Samsung Galaxy S25 series (released January 2025) used the Snapdragon 8 Elite globally and shipped with on-device AI features including live translation of phone calls, on-device summarization of messages, and real-time photo editing without cloud connectivity. AnandTech benchmarked the NPU performance against the previous Snapdragon 8 Gen 3 and reported a 40% improvement in on-device AI throughput (AnandTech, 2024-10-21).


Why it matters: It shifted the mainstream Android market to expectation of cloud-free generative AI, putting competitive pressure on chip vendors at all price tiers.


Source: Qualcomm, "Snapdragon 8 Elite Technical Overview," qualcomm.com, 2024-10-21; AnandTech, "Snapdragon 8 Elite Review," 2024-10-21.


Case Study 3: Microsoft Copilot+ PC and the 40 TOPS NPU Mandate (2024)

What happened: In May 2024, Microsoft announced a new PC category called Copilot+ PCs, requiring that qualifying hardware have an NPU delivering at least 40 TOPS. This was not a marketing suggestion—it was a hardware gating criterion for access to AI features including Recall (AI-powered screenshot memory), live captions, and real-time image generation in Windows 11.


The outcome: This mandate immediately restructured the PC chip market. Intel's Lunar Lake (Core Ultra 200V, 48 TOPS) and AMD's Ryzen AI 300 (50 TOPS) were launched specifically to meet this threshold. Qualcomm's Snapdragon X Elite (45 TOPS) was the first chip to qualify at launch. By Q4 2024, NPU TOPS was appearing in laptop spec sheets alongside CPU cores and RAM—for the first time, a mainstream purchasing criterion.


Why it matters: Microsoft's mandate turned the neural engine from a phone feature into a PC industry standard, forcing every major chipmaker to prioritize NPU performance for the Windows market.


Source: Microsoft, "Introducing Copilot+ PCs," blogs.microsoft.com, 2024-05-20; Intel, "Intel Core Ultra 200V (Lunar Lake)," intel.com, 2024-09-03; AMD, "Ryzen AI 300 Series," amd.com, 2024-06-02.


Neural Engine vs. CPU vs. GPU vs. DSP

Understanding the neural engine requires placing it within the broader chip ecosystem. Every major SoC today contains at least four types of compute cores. They all work together.

Processor | Best At | Worst At | Power Draw | AI Role
CPU | Sequential logic, branching, OS tasks | Parallel math (slow for ML) | Moderate | Fallback; orchestrates other cores
GPU | Parallel graphics, general parallel compute | Energy efficiency for small models | High | Training; large parallel inference
DSP | Signal processing (audio, sensor data) | Complex neural networks | Low | Pre-processing; some lightweight AI
Neural Engine / NPU | Matrix multiply, inference, quantized models | Sequential/branching logic | Very low | On-device AI inference

Key insight: These cores are not competitors—they are a team. The NPU handles ML inference; the DSP handles raw sensor data; the CPU handles logic and control flow; the GPU handles graphics and heavy parallel compute. The OS and the chip's scheduler route workloads to the right core automatically.

Pros & Cons of Dedicated Neural Engines


Pros

  • Speed: 10–100× faster than CPU for AI inference tasks.

  • Energy efficiency: Apple's Neural Engine consumes a fraction of the power of the GPU for equivalent AI tasks—critical for battery life on mobile devices.

  • Privacy: On-device AI means sensitive data (faces, voice, health metrics) never leaves the device.

  • Latency: No round-trip to a server. Results in milliseconds, not hundreds of milliseconds.

  • Offline capability: Works without internet access.

  • Reduced operating cost: No cloud API costs for AI inference.


Cons

  • Narrow specialization: A neural engine is useless for general computing. If a task does not map to matrix operations, it sits idle.

  • Model compatibility: Only models compiled and optimized for the specific NPU run efficiently. A model not compiled for Apple's Core ML, for example, will not use the Neural Engine even on an Apple device.

  • Limited model size: On-device memory limits how large a model can be. A 70B parameter LLM cannot yet fit on a phone—only smaller, quantized versions can.

  • Fragmentation: Each vendor's NPU requires a different software framework (Core ML, ONNX Runtime, QNN, etc.), making developer support complex.

  • Capability ceiling for training: NPUs are for inference, not training. Training large models on-device is not feasible in 2026.


Myths vs. Facts

Myth | Fact
"The neural engine is just a faster CPU." | False. It is a fundamentally different architecture, purpose-built for matrix math. It cannot run most software a CPU runs.
"More TOPS always means better AI performance." | False. TOPS figures depend on precision level (INT4 vs. INT8 vs. FP16) and model compatibility. A chip with 50 INT4 TOPS may underperform a 38 INT8 TOPS chip on real tasks.
"Neural engines can train AI models." | False for consumer devices. NPUs handle inference only. Training requires different hardware (high-end GPUs or TPUs in data centers).
"On-device AI is always less capable than cloud AI." | Partially false. For targeted tasks (face recognition, voice recognition, photo enhancement), on-device models fine-tuned for those tasks can match or exceed generalist cloud models.
"Only Apple's Neural Engine is good." | False. As of 2024, Qualcomm, Intel, AMD, and MediaTek all ship competitive NPUs with equal or higher TOPS ratings. Apple's advantage is tight hardware–software co-design.
"Neural engines are only in phones." | False. They are in laptops, tablets, smartwatches, cars, hearing aids, and industrial edge devices.

Pitfalls & Risks


1. Software Fragmentation

A neural engine only accelerates workloads that are compiled specifically for it. Apple's ecosystem (Core ML) is tightly integrated. On Android, developers must support multiple NPU backends—Qualcomm's QNN, MediaTek's NeuroPilot, Samsung's ONE—or fall back to CPU. This fragmentation slows adoption.


2. Model Drift and Security

On-device models can become outdated as new attacks emerge. A face recognition model effective in 2023 may be vulnerable to adversarial attacks using 2025 techniques. Updating NPU firmware and compiled model files requires active software maintenance—which not all OEMs provide past 2–3 years.


3. Benchmark Gaming

Vendors have a history of publishing TOPS figures at favorable precision settings. Always look for third-party benchmarks at matched precision levels before making purchasing decisions.


4. Over-Reliance on On-Device AI

Small on-device models have capability limits. Applications that need state-of-the-art reasoning or access to live data must still use cloud AI. Engineers who over-architect around on-device capabilities risk shipping products that underperform user expectations.


5. Power Management Trade-offs

The neural engine is low-power relative to GPU-based AI, but sustained NPU workloads still generate heat and consume battery. On thin laptops without active cooling, prolonged AI inference workloads (e.g., real-time translation of a two-hour video) can cause thermal throttling.


Future Outlook


Near-Term (2026–2027)

Multi-die and chiplet NPUs: Intel and AMD have signaled that future chips will treat the NPU as a modular chiplet, allowing OEMs to configure the NPU die separately from the CPU and GPU. This should enable more precise performance and power targets for different product tiers.


On-device models exceeding 10B parameters: With memory bandwidth improvements and new INT4/INT2 quantization techniques, by late 2026 leading flagship phones are expected to run 7–10 billion parameter models locally. This brings more capable on-device assistants, summarization, and creative tools closer to parity with mid-range cloud models.


Standardization pressure: The AI PC category and the EU AI Act (whose obligations for high-risk systems begin applying in August 2026) are increasing pressure on chipmakers to provide standardized APIs for NPU access. ONNX Runtime with DirectML and Apple's Core ML are already converging toward more standard model interchange formats.


Automotive neural engines:  NVIDIA's next-generation Thor SoC (sampling in 2025, production ramp in 2026) targets 2,000 TOPS for autonomous driving—a 7× leap from Orin. Automotive AI compute is on a trajectory that consumer chips will follow by the end of the decade.


Edge AI in industrial and medical devices: ARM's Ethos-U85 NPU (announced 2023) targets ultra-low-power embedded applications—wearables, industrial sensors, medical implants. As these reach production at scale in 2026, the neural engine will be in devices most people would never associate with AI.


FAQ


1. What does a neural engine do exactly?

A neural engine is a processor core specialized for running AI and machine learning models. It performs the matrix multiplications that form the core of neural networks—fast and efficiently—enabling on-device AI like face recognition, voice transcription, and image enhancement.


2. Is a neural engine the same as an NPU?

Yes. "Neural Engine" is Apple's branded name for its NPU (Neural Processing Unit). Other companies use different names (Hexagon NPU, APU, DLA), but they all perform the same function: accelerating AI inference workloads on-device.


3. What is TOPS and is higher always better?

TOPS stands for Tera Operations Per Second—trillions of simple math operations per second. Higher is generally better, but only when comparing chips at the same number precision level (e.g., both at INT8). A 50 TOPS INT4 chip may be slower than a 38 TOPS INT8 chip on certain tasks.


4. Does a neural engine work without the internet?

Yes. That is the main point. A neural engine runs AI models entirely on-device. Tasks like Face ID, on-device transcription, and Apple Intelligence's local language model all work with no internet connection.


5. Which devices have a neural engine in 2026?

Most flagship and mid-range smartphones, tablets, and laptops. Specifically: iPhones (A15 and newer), iPads (M-series), MacBooks (M-series), Google Pixel phones (Tensor G3 and newer), Samsung Galaxy S-series (Snapdragon 8 Elite), Windows Copilot+ PCs (Snapdragon X, AMD Ryzen AI 300, Intel Core Ultra 200V), and Apple Watch (Series 8 and newer).


6. Can a neural engine run ChatGPT or large LLMs?

Not the full-size models. ChatGPT uses GPT-4-class models with hundreds of billions of parameters—far too large for on-device hardware. But heavily quantized, smaller versions (3–13B parameters) can run on flagship neural engines. Apple Intelligence and Samsung Galaxy AI use this approach.


7. What programming frameworks support neural engines?

Key frameworks include: Core ML (Apple), Qualcomm Neural Network (QNN) SDK, ONNX Runtime with DirectML (Windows/AMD/Intel NPUs), MediaTek NeuroPilot, TensorFlow Lite (Android, multi-vendor). Frameworks like PyTorch and TensorFlow can export models to these formats.
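For developers, here is a minimal sketch of that export-and-run workflow using PyTorch and ONNX Runtime. The tiny model is invented for illustration, and whether inference actually lands on an NPU depends on the device and which execution provider is installed; the CPU provider below is just the safe default.

```python
import torch
import onnxruntime as ort

# Any small PyTorch model works for illustration.
model = torch.nn.Sequential(torch.nn.Linear(64, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10))
model.eval()
dummy = torch.randn(1, 64)

# Export to ONNX, the vendor-neutral interchange format mentioned above.
torch.onnx.export(model, dummy, "tiny_model.onnx")

# ONNX Runtime dispatches work to an execution provider. On hardware with the right
# drivers you could request e.g. "DmlExecutionProvider" (DirectML) instead of the CPU.
session = ort.InferenceSession("tiny_model.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {session.get_inputs()[0].name: dummy.numpy()})
print(outputs[0].shape)  # (1, 10)
```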


8. Is my data private when using a neural engine?

For on-device processing, yes. Data processed by the neural engine (e.g., your face for Face ID, your voice for on-device dictation) never leaves the device. Apple's Secure Enclave stores biometric model data in an isolated area the operating system itself cannot access (Apple Platform Security Guide, 2024).


9. Why can't a GPU do everything a neural engine does?

A GPU can run neural network workloads, but it draws far more power and is not optimized for the specific access patterns and data precisions used in mobile inference. A neural engine does the same job with roughly 5–10× better energy efficiency for inference-only tasks—critical when running on battery.


10. When did the first neural engine appear in a consumer device?

Apple's A11 Bionic, announced September 12, 2017, was the first commercially deployed neural engine in a consumer smartphone. Huawei's Kirin 970 was announced at roughly the same time (IFA Berlin, September 2017), making both simultaneous firsts.


11. How does a neural engine handle different AI tasks (vision vs. language)?

The underlying hardware (MAC arrays) does not differentiate. A vision model and a language model both reduce to matrix multiplications. The neural engine processes both. The difference is in how the model is compiled and what input format it expects—the hardware is task-agnostic.


12. What is the difference between a neural engine and Apple's Secure Enclave?

The Secure Enclave is a security coprocessor that handles encryption, cryptographic keys, and biometric data storage. The Neural Engine is a compute accelerator for AI. They are separate subsystems on the same chip that cooperate: the Neural Engine processes your face scan, but the Secure Enclave stores and protects the mathematical template it generates.


13. Do neural engines support all AI model formats?

No. Each NPU requires models compiled for its specific instruction set. An Apple Core ML model cannot run on a Qualcomm Hexagon NPU without recompilation. Industry efforts around ONNX (Open Neural Network Exchange) are working toward more interoperability, but full portability does not yet exist.


14. What is the role of the neural engine in autonomous vehicles?

Automotive neural engines (like NVIDIA's DLA in Orin) handle real-time object detection, lane tracking, and sensor fusion. They must meet ISO 26262 automotive functional safety standards. The neural engine in your iPhone and the one in a self-driving car share the same principles but have very different reliability and temperature requirements.


15. Will neural engines eventually replace CPUs and GPUs?

No. Neural engines are complements, not replacements. CPUs handle logic, operating systems, and branching tasks that NPUs cannot. GPUs handle graphics and large parallel compute. Neural engines handle ML inference. All three are needed in a balanced SoC.


Key Takeaways

  • A neural engine is a purpose-built chip core designed for fast, efficient AI inference—not a general-purpose processor.


  • It works by performing thousands of multiply-accumulate (MAC) operations in parallel on every clock cycle, executing the matrix multiplications that are the core math of neural networks.


  • Apple introduced the first consumer neural engine in 2017; by 2026, virtually every flagship chip from Apple, Qualcomm, Intel, AMD, MediaTek, and Samsung includes one.


  • Modern flagship NPUs deliver 35–50+ TOPS—roughly a 60–80× increase over the original 0.6 TOPS A11 Neural Engine in under a decade.


  • TOPS alone does not determine real performance—precision level, architecture, and software optimization all matter equally.


  • Neural engines enable key consumer features: Face ID, on-device LLMs, live translation, computational photography, and health monitoring—all without cloud connectivity.


  • Microsoft's 40 TOPS Copilot+ PC mandate (2024) turned NPU performance into a mainstream PC buying criterion.


  • On-device AI protects user privacy: biometric and personal data processed by the neural engine stays on the device.


  • The primary limitations are model size (large LLMs still need the cloud), software fragmentation across vendors, and narrow specialization.


  • The near-term trajectory points to 7–13B parameter on-device LLMs, chiplet-based modular NPUs, and expansion into automotive, medical, and industrial embedded systems.


Actionable Next Steps

  1. Check your device's neural engine: On iPhone/iPad/Mac, open Settings → General → About and search your chip model on Apple's developer pages to see its Neural Engine specs. On Windows, check for "NPU" in Device Manager or use the Qualcomm/Intel driver utilities.


  2. Enable on-device AI features: On iPhone running iOS 18+, enable Apple Intelligence in Settings → Apple Intelligence & Siri. On Android (Pixel 9 / Samsung Galaxy S25), enable the AI features in Settings → Advanced Features.


  3. Try a Core ML or ONNX app: Download an app that explicitly uses on-device AI—Whisper transcription apps, portrait mode apps, or on-device translation. Note the speed and offline capability.


  4. Developers: Compile your ML model for the NPU: If you deploy ML models, export to Core ML (Apple), ONNX Runtime with DirectML (Windows), or QNN SDK (Qualcomm). Benchmark before/after—the speedup on compatible models is typically 5–20×.


  5. Stay updated on TOPS standardization: Follow the MLCommons AI benchmark project (mlcommons.org) and the ONNX community (onnx.ai) for standardized, vendor-neutral benchmarks.


  6. Consider NPU when buying devices: If you plan to use AI-heavy features, look for chips with 35+ TOPS (phone) or 40+ TOPS (laptop). For Windows, the Copilot+ PC certification guarantees a minimum 40 TOPS NPU.


  7. Evaluate on-device vs. cloud for your use case: For privacy-sensitive tasks (health data, personal communications) or latency-critical tasks (real-time translation), prioritize on-device NPU-accelerated apps. For reasoning-heavy or knowledge-retrieval tasks, hybrid or cloud AI still outperforms.


Glossary

  1. Activation Function: A mathematical function applied to the output of each neural network layer to introduce non-linearity. Examples: ReLU, sigmoid, softmax. Implemented in hardware on neural engines.

  2. APU (AI Processing Unit): MediaTek's branding for its NPU cores inside Dimensity chips.

  3. Core ML: Apple's machine learning framework that compiles and optimizes AI models to run on the Neural Engine, GPU, or CPU on Apple devices.

  4. DLA (Deep Learning Accelerator): NVIDIA's NPU variant, used in Jetson and Drive automotive SoCs.

  5. DSP (Digital Signal Processor): A chip core optimized for continuous signal processing (audio, sensor data). Often handles lightweight AI pre-processing tasks.

  6. Hexagon NPU: Qualcomm's neural processing unit, part of the Hexagon processor family inside Snapdragon chips.

  7. Inference: Running a trained AI model on new input data to produce a result. Distinct from training (building the model from scratch).

  8. INT8 / INT4: 8-bit and 4-bit integer number formats. Used in quantized neural network inference to reduce model size and increase NPU throughput.

  9. MAC (Multiply-Accumulate): The fundamental operation in neural network math: multiply two numbers and add the result to a running sum. NPUs contain thousands of MAC units.

  10. NPU (Neural Processing Unit): The generic industry term for a chip core designed to accelerate AI inference. "Neural Engine" is Apple's brand name for its NPU.

  11. On-device AI: AI computation that runs entirely on the local device, without sending data to a remote server.

  12. Quantization: Reducing the numerical precision of AI model weights (e.g., from 32-bit float to 8-bit integer) to speed up inference and reduce memory usage.

  13. SoC (System-on-Chip): An integrated circuit that combines a CPU, GPU, NPU, memory controller, modem, and other components on a single chip.

  14. TOPS (Tera Operations Per Second): A measure of NPU throughput: how many trillions of operations per second the chip can perform. The standard benchmark for NPU performance comparisons.

  15. Training: The process of teaching a neural network model by exposing it to large datasets and adjusting its weights. Computationally intensive; done on GPUs/TPUs in data centers, not on consumer neural engines.

  16. Unified Memory Architecture: Apple's memory design in M-series chips where the CPU, GPU, and Neural Engine share the same physical memory pool, reducing data transfer overhead.


Sources & References

  1. Apple. "A11 Bionic — Neural Engine." Apple Developer Documentation. 2017. https://developer.apple.com/documentation/

  2. Apple. "A18 Pro Chip." Apple Newsroom. 2024-09-09. https://www.apple.com/newsroom/

  3. Apple. "Apple M4 Chip." Apple Newsroom. 2024-05-07. https://www.apple.com/newsroom/

  4. Apple. "Apple Intelligence Features." Apple Newsroom. 2024-06-10. https://www.apple.com/newsroom/

  5. Apple. "Apple Platform Security Guide." 2024. https://support.apple.com/guide/security/

  6. Apple. "Photonic Engine." Apple Newsroom. 2022-09-07. https://www.apple.com/newsroom/

  7. Apple. Q1 FY2018 Earnings Call. 2018-02-01. https://investor.apple.com/

  8. Qualcomm. "Snapdragon 8 Elite Technical Overview." Qualcomm Newsroom. 2024-10-21. https://www.qualcomm.com/news/

  9. Microsoft. "Introducing Copilot+ PCs." Microsoft Official Blog. 2024-05-20. https://blogs.microsoft.com/

  10. Intel. "Intel Core Ultra 200V Series (Lunar Lake) Overview." Intel Newsroom. 2024-09-03. https://www.intel.com/content/www/us/en/newsroom/

  11. AMD. "AMD Ryzen AI 300 Series with XDNA 2 Architecture." AMD Newsroom. 2024-06-02. https://www.amd.com/en/newsroom/

  12. MediaTek. "Dimensity 9400 AI Processing Overview." MediaTek Newsroom. 2024-10-14. https://www.mediatek.com/newsroom/

  13. NVIDIA. "NVIDIA Drive Orin System-on-Chip." NVIDIA Developer Documentation. 2023. https://developer.nvidia.com/drive/

  14. ARM. "Arm Ethos-U85 NPU." ARM Developer Resources. 2023. https://developer.arm.com/

  15. AnandTech. "Snapdragon 8 Elite Benchmark Review." 2024-10-21. https://www.anandtech.com/ (Note: AnandTech suspended new editorial content in 2023; archived reviews remain accessible.)

  16. Google. "Google TPU v3 Performance." Google AI Blog. 2018-02-12. https://ai.googleblog.com/

  17. AI Benchmark. "AI Benchmark: All About Deep Learning on Smartphones in 2024." ETH Zurich / Google Research. 2024. https://ai-benchmark.com/

  18. Apple Heart Study / Stanford Medicine. "Apple Heart Study Results." New England Journal of Medicine. 2019-11-14. https://www.nejm.org/doi/full/10.1056/NEJMoa1901183

  19. MLCommons. "MLPerf Inference Benchmarks." MLCommons. 2024. https://mlcommons.org/benchmarks/inference-edge/

  20. ONNX Community. "ONNX: Open Neural Network Exchange." onnx.ai. 2024. https://onnx.ai/




 
 