So you want uncontended unlocks to make a syscall to wake the threads that aren’t there?
That syscall will be worse than a CAS just because it’s a syscall. Not to mention it’ll have to lock some kernel data structure just to figure out that there aren’t any threads to wake.
I think I'm not doing a great job of explaining it.
There are three states of the mutex: unlocked=0, locked=1, and sleeping=2. A thread that comes along to a locked mutex atomically moves it to sleeping and then sleeps on it via futex. The freeing thread unconditionally writes 0 with an atomic exchange and looks at the old value: if it was 2, it wakes all the waiters; if it was 1, there is nobody to wake, so it makes no syscall at all.
Maybe that exchange is as expensive as a CAS, but I don't think it ultimately is.
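
In case code is clearer than my description, here is a rough sketch of what I mean. It assumes Linux and C11 atomics; the names `mutex3`, `mutex3_lock`, `mutex3_unlock` and the little `futex` wrapper are made up for illustration, and I haven't tuned or benchmarked any of it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The three states described above. */
enum { UNLOCKED = 0, LOCKED = 1, SLEEPING = 2 };

/* Hypothetical type name, for illustration only. */
typedef struct { _Atomic uint32_t state; } mutex3;

/* Thin wrapper over the raw futex syscall (glibc does not export one). */
static long futex(_Atomic uint32_t *addr, int op, uint32_t val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void mutex3_lock(mutex3 *m) {
    uint32_t expected = UNLOCKED;
    /* Fast path: 0 -> 1 with a single CAS, no syscall. */
    if (atomic_compare_exchange_strong(&m->state, &expected, LOCKED))
        return;
    /* Slow path: mark the mutex as having sleepers. If the old value was 0,
       the lock was free and we now own it (held in state 2). Otherwise
       sleep until the word stops being 2, then try again. */
    while (atomic_exchange(&m->state, SLEEPING) != UNLOCKED)
        futex(&m->state, FUTEX_WAIT_PRIVATE, SLEEPING);
}

static void mutex3_unlock(mutex3 *m) {
    /* Unconditionally write 0 and look at what was there before. */
    uint32_t old = atomic_exchange(&m->state, UNLOCKED);
    assert(old != UNLOCKED); /* unlocking an unheld mutex is a caller bug */
    /* Only enter the kernel if somebody recorded that they went to sleep. */
    if (old == SLEEPING)
        futex(&m->state, FUTEX_WAKE_PRIVATE, INT_MAX); /* wake all */
}
```

The uncontended path is one CAS to lock and one exchange to unlock, with no syscall on either side. One wrinkle: a thread that acquired the lock through the slow path holds it in state 2, so its unlock will issue a wake even if nobody is still waiting. That is wasteful but harmless.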