I'm one of the lucky ones who have the pleasure of working on a 2014 Mac Pro (yes, the trash can) every day. I love it to bits. It's blazingly fast, never misses a beat and no matter what I feel like doing, resources are never a constraint. However, my particular Mac Pro has been a bit wonky from the start. Ever so often (somewhere between once a week and three times a day), the kernel will start spinning out of control on one of the cores. As a side effect, applications will start hanging, and the system basically becomes useless until I force a reboot (once the kernel finds itself in an endless loop, a graceful shutdown goes out the window — the best you can do is try to issue a sync to make sure the file system will come back up in a reasonable state, but with HFS+, you never really know.)

The dreaded kernel_task at 100 % CPU usage

The usual remedies haven't helped. Reset this, reset that, full reinstall and so on. For the longest time, then, I've just ignored the issue. It wasn't sufficiently annoying to warrant working from my laptop for however long it would take to fix the issue. However, after a series of all too frequent incidents in the last couple of weeks, I decided to take affair. Given that I haven't been able to isolate the trigger of the problem, I decided to try and gather as much data as I could, as searches on the matter of kernel_task running at 100 % yielded so many different explanations, none of which seemed directly applicable to my situation. Giving Apple's support a call and saying "I occasionally get kernel_task stuck on a core" would probably not be of much use to anyone, then. What I needed to do, was to figure out exactly what was going on in the kernel at these times.

Luckily, Mac OS X ships with every developer's wet dream, dtrace. Even better, Brendan Gregg made a simple Perl script back in 2006 to sample calls in the kernel using dtrace, which almost works today. Given the "almost" part, and the fact that I like more readable output for debugging, I decided to whip up a quick improvement in Go, osxkerneltracer, which I've been using for the past couple of weeks to gather evidence whenever this problem occurred.

On my machine in its normal state, the output looks something like this, after sampling for 5 seconds:

Module            | Method                                                 | Calls |      Share
------------------+--------------------------------------------------------+-------+-----------
kernel            | machine_idle+0x1fd                                     | 49756 |  98.4819 %
kernel            | 0xffffff801cd52880+0x313                               |   240 |   0.4750 %
kernel            | processor_idle+0xc5                                    |   101 |   0.1999 %
kernel            | processor_idle+0xb4                                    |    41 |   0.0812 %
kernel            | do_mfence+0x3                                          |    27 |   0.0534 %
kernel            | dcache_incoherent_io_store64+0xab                      |    23 |   0.0455 %
kernel            | ipc_mqueue_post+0x227                                  |    18 |   0.0356 %
kernel            | machine_idle+0x1ff                                     |    18 |   0.0356 %
kernel            | 0xffffff801cd52880+0x3bf                               |    17 |   0.0336 %
kernel            | processor_idle+0xd9                                    |    10 |   0.0198 %
kernel            | processor_idle+0xab                                    |     7 |   0.0139 %
kernel            | user_trap+0x9f                                         |     7 |   0.0139 %
kernel            | wait_queue_wakeup_all+0xb9                             |     7 |   0.0139 %
AMD7000Controller | _OSReadInt32(void const volatile*, unsigned long)+0x1a |     6 |   0.0119 %
kernel            | thread_block_reason+0xbe                               |     6 |   0.0119 %
kernel            | processor_idle+0xbb                                    |     5 |   0.0099 %
kernel            | 0xffffff801cd6acc0+0x2d0                               |     4 |   0.0079 %
kernel            | __bzero+0xb                                            |     4 |   0.0079 %
[...]

This is basically what you'd expect from a system at almost idle; most of the time in the kernel is spent in an idle function — in this case, the kernel's machine_idle function (as a side note, for a kick, read up on how this actually works.) When things go awry, it's always the same thing that happens. That innocent 32 bit read function to the AMD 7000 controller you can see in the call trace above suddenly represents roughly 6.25 % of the calls traced — the full share of one of the 16 hyper-threaded cores on this machine. In other words, the kernel seems stuck in an endless loop trying to read data from the graphics controller.

With this data in hand, I've now contacted Apple support, who immediately bumped the issue to a guy who understood what not only a kernel is, but also knew that dtrace existed (in other words: data expedites support when dealing with intermittent issues.) I agreed to ship a bunch of logs for them to look at before taking the next step of actually taking the machine in, and I'm now waiting to see what they come back with.

If you're experiencing similar, intermittent issues, I suggest you do the same thing. It'll save you and probably also the immediate support staff a lot of time fumbling in the dark and will much more easily get your issue bumped to the right level, if it's actually Apple's problem.