I think they’re choosing to do it that way. Raspberry Pis can easily handle wake word recognition on device (I think they are also working on that). ESPs, on the other hand, can only stream audio to the server and not much more. Since ESPs are far cheaper than putting a Raspberry Pi in each room, they are focusing on doing wake word detection on the server rather than on device.
I think the main problem is that checking whether someone can learn a skill you need is a lot harder than just making the skill a requirement for the job right off the bat.