The Secret Apple M1 Coprocessor | by Erik Engheim | The Startup | Jan, 2021

Through reverse engineering, developer Dougall Johnson has uncovered a secret, powerful coprocessor inside the M1 chip, dubbed AMX: the Apple Matrix coprocessor.

Stories about the Apple Matrix coprocessor (AMX) are already out there, but they are not exactly told in a beginner-friendly manner. That is what I try to do here: bring you the story buried under thick layers of technical jargon, without treating you like an idiot.

To tell this story we need to clarify the basics such as what is a coprocessor? What is a matrix? And why should you even care about any of this?

More importantly, why do none of Apple’s slides talk about this coprocessor? Why is it seemingly a secret? If you have read about the Neural Engine inside the M1 System-on-a-Chip (SoC), you may be confused about what makes Apple’s Matrix coprocessor (AMX) different.

Before we get to the big question, let me start with the basic concepts, such as what a matrix and a coprocessor are.

A matrix is basically just a table of numbers. If you have worked with spreadsheets such as Microsoft Excel, you have worked with something very similar to matrices. The key difference is that in math such tables of numbers have a laundry list of operations they support and specific behavior. A matrix can come in different flavors. A matrix with just one row is usually called a row vector. If it has just one column, we call it a column vector.

We can add, subtract, scale and multiply matrices. Addition is pretty easy: you just add every element separately. Multiplication is a bit more involved. I am just showing the simple case here.
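To make addition and multiplication concrete, here is a minimal sketch in plain Python. The function names `mat_add` and `mat_mul` are my own, chosen for illustration, and matrices are represented as simple lists of rows. Real libraries do this far faster, but the arithmetic is the same.

```python
def mat_add(a, b):
    """Add two matrices of the same shape, element by element."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def mat_mul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix.

    Each element of the result is a row of `a` combined with a
    column of `b`: multiply pairwise, then sum.
    """
    columns_of_b = list(zip(*b))  # transpose b so we can walk its columns
    return [[sum(x * y for x, y in zip(row, col)) for col in columns_of_b]
            for row in a]


a = [[1, 2],
     [3, 4]]
b = [[5, 6],
     [7, 8]]

print(mat_add(a, b))  # [[6, 8], [10, 12]]
print(mat_mul(a, b))  # [[19, 22], [43, 50]]
```

Notice how much more work multiplication is: every element of the result needs a whole row and a whole column. That is why hardware help for it matters.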

More in depth: Why Does Matrix Multiplication Work the Way it Does?

Using matrices to rotate and scale: Explaining Affine Rotation (this is pretty math geeky).

The reason matrices are important is because they are heavily used in:

  • Image processing
  • Machine learning
  • Speech and handwriting recognition
  • Face recognition
  • Compression
  • Multimedia: audio and video

Machine learning in particular has been hot these last years. Just adding more cores to the CPU will not make it run fast enough, as it is really demanding. You really need specialized hardware. Regular tasks such as browsing the internet, writing email, word processing and spreadsheets have been running fast enough for years. It is for specialized tasks that we really need to boost the processing power.

You could spend your silicon real estate (transistors) on more CPU cores or on specialized hardware.

On any given chip, Apple has a maximum number of transistors to spend on building different kinds of hardware. They could add more CPU cores, but that really just speeds up regular tasks, which already run fast enough. Thus they have chosen to spend transistors on specialized hardware to tackle image processing, video decoding and machine learning. This specialized hardware consists of coprocessors and accelerators.

More talk about coprocessors and accelerators: Apple M1 foreshadows Rise of RISC-V.

If you have read about the Neural Engine, you will know that it also does matrix operations to help with machine learning tasks. So why do we need the Matrix coprocessor? Or are they actually just the same thing? Am I just confused? No, let me clarify how Apple’s Matrix Coprocessor differs from the Neural Engine and why we need both.

The main processor (CPU), coprocessors and accelerators can usually exchange data over a shared data bus. The CPU usually controls memory access, while an accelerator such as a GPU often has its own dedicated memory.

I admit that in past stories I have often used the terms coprocessor and accelerator interchangeably, but they are not the same. A GPU, as found in your Nvidia graphics card, and the Neural Engine are both a type of accelerator.

In both cases you have special areas of memory which the CPU has to fill up with the data it wants processed, as well as another part of memory which it fills up with a list of instructions that the accelerator should perform. It is time-consuming for a CPU to set up this kind of processing. There is a lot of coordination, filling in data, and then waiting to get results back.

Thus this only pays off for larger tasks. For smaller tasks the overhead will be too high.
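A bit of back-of-the-envelope arithmetic shows why. The sketch below uses numbers I made up purely for illustration (they are not measurements of any real chip): the accelerator is five times faster per item, but pays a fixed setup cost before it can start.

```python
# Made-up costs in arbitrary time units, purely for illustration.
SETUP_COST = 1000        # copying data and commands over to the accelerator
ACCEL_COST_PER_ITEM = 1  # the accelerator is fast once it gets going
CPU_COST_PER_ITEM = 5    # the CPU is slower per item but needs no setup


def cpu_time(items):
    """Time to process everything directly on the CPU."""
    return items * CPU_COST_PER_ITEM


def accelerator_time(items):
    """Time to hand the job to the accelerator, setup included."""
    return SETUP_COST + items * ACCEL_COST_PER_ITEM


def offload_pays_off(items):
    """True when the accelerator finishes sooner than the CPU would."""
    return accelerator_time(items) < cpu_time(items)


for items in (10, 100, 1000, 10000):
    print(items, cpu_time(items), accelerator_time(items),
          offload_pays_off(items))
```

With these particular numbers the break-even point is 250 items. Below that, the setup cost eats up everything the accelerator gains.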

Coprocessors, unlike accelerators, spy on the stream of instructions read from memory into the main processor. Accelerators, in contrast, don’t observe the instructions the CPU is pulling from memory.

This is where coprocessors have a benefit over accelerators. Coprocessors sit and spy on the stream of machine code instructions being fed from memory (or, more specifically, cache) into the CPU. Coprocessors are made to react to the particular instructions they were designed to process. The CPU, meanwhile, has been made to mostly ignore these instructions or to help facilitate the handling of them by a coprocessor.

What we gain from this is that instructions carried out by the coprocessor can be placed inside your regular code. This is different from, say, a GPU. If you have done GPU programming, you know that shader programs are placed into separate buffers of memory, and you have to explicitly transport these shader programs to the GPU. You cannot place GPU-specific instructions inside your regular code. Thus for smaller workloads involving matrix processing, AMX will be better than the Neural Engine.
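Here is a toy model of that idea, assuming a made-up instruction set in which anything starting with `cop.` belongs to the coprocessor. None of these names are real ARM or AMX instructions. The point is only that one single stream of instructions feeds both units, with no separate buffer to ship anywhere.

```python
def run(program):
    """Walk one instruction stream and record which unit handles what.

    Instructions prefixed with "cop." are a hypothetical stand-in for
    coprocessor instructions; everything else goes to the CPU.
    """
    cpu_log, coprocessor_log = [], []
    for instruction in program:
        if instruction.startswith("cop."):
            # The coprocessor recognizes its own instructions in the stream.
            coprocessor_log.append(instruction)
        else:
            # The CPU executes the rest and skips the coprocessor's ones.
            cpu_log.append(instruction)
    return cpu_log, coprocessor_log


# Regular code with coprocessor instructions mixed right in.
program = ["load", "add", "cop.matmul", "store", "cop.load", "branch"]

cpu_log, coprocessor_log = run(program)
print(cpu_log)          # ['load', 'add', 'store', 'branch']
print(coprocessor_log)  # ['cop.matmul', 'cop.load']
```

Real hardware is of course far more involved than a string prefix check, but the division of labor is the same.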

What is the catch? You need to actually define the instructions in the instruction-set architecture (ISA) of your microprocessor. Thus you need much tighter integration with the CPU when using a coprocessor than when using an accelerator.

ARM Ltd., the creators of the ARM instruction-set architecture (ISA), long resisted adding custom instructions to their ISA. This is one of the advantages of RISC-V: What Is Innovative About RISC-V?

However, due to pressure from customers, ARM relented and announced in 2019 that they would allow extensions. EE Times reports:

The new instructions are interleaved with standard Arm instructions. To avoid software fragmentation and maintain a coherent software development environment, Arm expects customers to use the custom instructions mostly in called library functions.

This may help explain why AMX instructions are not described in official documentation. ARM Ltd. expects its customers (Apple in this case) to keep these kinds of instructions inside libraries they provide themselves.

It is easy to confuse something like a matrix coprocessor with a SIMD vector engine, which you find inside most modern processors today including ARM processors. SIMD stands for Single Instruction Multiple Data.

Single Instruction Single Data (SISD) vs Single Instruction Multiple Data (SIMD)

SIMD is a way of getting higher performance when you need to perform the same operation on multiple elements. This is closely related to matrix operations. In fact SIMD instructions such as ARM’s Neon instructions or Intel x86 SSE or AVX are often used to speed up matrix multiplications.
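The difference is easy to show with a small simulation. This is only a sketch in plain Python that counts how many "instructions" each approach needs; it does not use real SIMD hardware. The default of four lanes is my assumption, matching four 32-bit values in a 128-bit register such as Neon's.

```python
def add_sisd(a, b):
    """One add instruction per pair of elements."""
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def add_simd(a, b, lanes=4):
    """One add instruction per group of `lanes` elements."""
    result, instructions = [], 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # In hardware, all lanes of the chunk are added at the same time.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]

print(add_sisd(a, b))  # ([11, 22, 33, 44, 55, 66, 77, 88], 8)
print(add_simd(a, b))  # ([11, 22, 33, 44, 55, 66, 77, 88], 2)
```

Same result, a quarter of the instructions. Since matrix math is the same operation repeated over lots of numbers, it fits this model very well.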

Read more: RISC-V Vector Instructions vs ARM and x86 SIMD.

However, a SIMD vector engine is part of a microprocessor core, just like the ALU (Arithmetic Logic Unit) and FPU (Floating Point Unit) are part of the CPU. Inside the microprocessor there is an instruction decoder which will pick apart an instruction and decide what functional unit to activate (gray boxes).

Inside a CPU you have the ALU, the FPU and the SIMD vector engines (not shown) as separate parts activated by the instruction decoder. A coprocessor is external.

A coprocessor, in contrast, is external to a microprocessor core. In fact one of the early ones, Intel’s 8087, was a physically separate chip designed to speed up floating point calculations.

Intel 8087. One of the early coprocessors used for performing floating point calculations.

Now you may wonder why anyone would want to complicate CPU design by having a separate chip like this, which has to sniff the data flowing from memory to the CPU to see if anything is a floating point instruction.

The reason was simple: the original 8086 CPU in the first PCs contained 29,000 transistors. The 8087, in contrast, was far more complex at 45,000 transistors. It was really hard to make anything with that many transistors. Combining these two chips into one would have been really hard and expensive.

But as manufacturing technology improved, it was not a problem to put floating point units (FPUs) inside the CPU. Thus FPUs replaced the floating point coprocessors.

Why the AMX is not simply a part of the Firestorm cores on the M1 is not clear to me. They are all on the same silicon die anyway. I can only offer some speculation. By being a coprocessor, it may be easier for the CPU to continue running in parallel. Apple may also have liked to keep non-standard ARM stuff outside of their ARM CPU cores.

If AMX is not described in official documentation, how do we even know about it? We owe that to developer Dougall Johnson, who has done an amazing job reverse engineering the M1 to discover this coprocessor. His efforts are described here. For matrix-related math operations Apple has special libraries or frameworks such as Accelerate, which is made up of:

  • vImage: higher-level image processing, such as converting between formats and image manipulation.
  • BLAS: a sort of industry standard for linear algebra (what we call the math dealing with matrices and vectors).
  • BNNS: used for running and training neural networks.
  • vDSP: digital signal processing, such as Fourier transforms and convolution. These are mathematical operations important in image processing, or really for any signal, including audio.
  • LAPACK: higher-level linear algebra functions, e.g. for solving linear equations.

Dougall Johnson knew these libraries would use the AMX coprocessor to speed up their calculations. Thus he wrote special programs to analyze and observe what these libraries did, in order to discover the special undocumented AMX machine code instructions.

But why doesn’t Apple document this and let us use these instructions directly? As mentioned earlier, this is something ARM Ltd. would like to avoid. If custom instructions are widely used it could fragment the ARM ecosystem.

More importantly, however, this is an advantage to Apple. By only letting their own libraries use these special instructions, Apple retains the freedom to radically change how this hardware works later. They could remove or add AMX instructions. Or they could let the Neural Engine do the job. Either way they make the job easier for developers. Developers only need to use the Accelerate framework and can ignore how Apple specifically speeds up matrix calculations.

This is one of the big advantages Apple has by being vertically integrated. By controlling both the hardware and the software, they can pull these kinds of tricks. So the next question is how big a deal is this? What does this buy Apple in terms of performance and capabilities?

Nod Labs is a company that works on machine interaction, intelligence and perception. Fast matrix operations are naturally in their interest. They have written a highly technical blog post about performance tests of AMX: Comparing Apple’s M1 matmul performance — AMX2 vs NEON.

What they do is compare the performance of similar code written using AMX and using the Neon instructions, which are officially supported by ARM. Neon is a set of SIMD instructions.

What Nod Labs found was that by using AMX they were able to get twice the performance of Neon instructions for matrix operations. It doesn’t mean AMX is better for everything, but at least for machine learning and high performance computing (HPC) type of work, we can expect that AMX gives an edge over the competition.

The Apple Matrix Coprocessor looks like a rather impressive piece of hardware, giving Apple’s ARM processor an edge in machine learning and HPC-related tasks. Further investigation will give us a more complete picture, and I can update this story with more details.
