One question we have to consider when designing LLM architectures is:
How much of this work will be obsolete as model capabilities improve?
Models are improving rapidly (we’re talking step changes every other month), and hand-optimizing components like attention may turn out to be a short-lived effort that adds unnecessary complexity down the road.
Our approach: optimize for performance, but also for the ability to quickly throw work away and adapt to future advances.
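
To make this concrete, here is a minimal sketch of one way to keep an optimized component disposable: hide it behind a small dispatch layer so a hand-tuned kernel can be swapped out or deleted without touching the rest of the model. Everything here (the registry, the backend names, the function signatures) is illustrative rather than taken from any particular codebase, and it assumes a PyTorch 2.x environment.

```python
import torch
import torch.nn.functional as F

# Hypothetical backend registry; names and structure are illustrative.
ATTENTION_BACKENDS = {}

def register_backend(name):
    def decorator(fn):
        ATTENTION_BACKENDS[name] = fn
        return fn
    return decorator

@register_backend("sdpa")
def sdpa_attention(q, k, v):
    # PyTorch 2.x fused kernel; dispatches to FlashAttention-style
    # implementations when the hardware and shapes allow it.
    return F.scaled_dot_product_attention(q, k, v)

@register_backend("naive")
def naive_attention(q, k, v):
    # Plain reference implementation, kept around for debugging
    # and as a fallback when a fused kernel misbehaves.
    scale = q.size(-1) ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v

def attention(q, k, v, backend="sdpa"):
    # The rest of the model only ever calls this function, so swapping
    # or deleting a backend is a one-line change at the call site.
    return ATTENTION_BACKENDS[backend](q, k, v)
```

The point is less the registry itself than the boundary: the day a better kernel (or a different architecture altogether) makes a hand-tuned backend obsolete, it can be removed without a rewrite of everything that depended on it.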