- Render::GL::Backend::execute accounts for little of the frame time, while QMatrix4x4 multiplies, QVector3D ops, and QQuaternion::rotationTo (via Render::Geom::cylinder_between) consume most of it.
- render/humanoid/rig.cpp does formation offsets + jitter + per-soldier matrices + pose solve every frame; this is the primary bottleneck.
- sample_anim_state and render loops keep per-entity overhead high.
- Horses (render/horse/rig.cpp) and elephants (render/elephant/rig.cpp) are procedural and expensive; they must be cached alongside humanoids.
- (renderer_id, owner_id, humanoid LOD, mount LOD, anim state + combat phase + frame, variant + attack_variant) in render/template_cache.*.
- render/humanoid/rig.cpp).
- Renderer::prewarm_unit_templates() builds all templates for troop types across owners/nations/LODs/variants/animation keys; it is invoked after load in app/core/level_orchestrator.cpp.
- clear_humanoid_caches() plus TemplateCache::clear() to avoid stale pointer reuse.

1) Add cache metrics (hits/misses, template count, average commands) and verify no runtime template builds after prewarm.
2) Visual verification: idle/move/run/attack/construct/heal/hit states for humanoids, mounted horses, and elephants; check shadows + LOD parity.
3) Reduce placement cost further by grouping soldiers by template key per unit and minimizing per-soldier hash lookups.
4) Decide horse LOD policy (currently tied to humanoid LOD) and adjust if separate horse LOD is required.
5) If still slow, move hottest procedural math to POD fast paths (Render::Geom::cylinder_between and equipment/horse/humanoid geometries).
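
The key tuple, the metrics in step 1, and the grouping in step 3 could be sketched roughly as follows. This is a standalone illustration using only the standard library, not the code in render/template_cache.*. Every type and member name here (TemplateKey, UnitTemplate, MeteredTemplateCache, group_by_template, mark_prewarm_done) is hypothetical, and the field widths are assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <tuple>
#include <unordered_map>
#include <vector>

// Hypothetical key mirroring the fields listed above; widths are assumed.
struct TemplateKey {
  std::uint32_t renderer_id = 0, owner_id = 0;
  std::uint8_t humanoid_lod = 0, mount_lod = 0;
  std::uint8_t anim_state = 0, combat_phase = 0;
  std::uint16_t frame = 0;
  std::uint8_t variant = 0, attack_variant = 0;

  auto tie() const {
    return std::tie(renderer_id, owner_id, humanoid_lod, mount_lod, anim_state,
                    combat_phase, frame, variant, attack_variant);
  }
  bool operator==(const TemplateKey &o) const { return tie() == o.tie(); }
};

struct TemplateKeyHash {
  std::size_t operator()(const TemplateKey &k) const noexcept {
    std::size_t h = 0;
    // Boost-style hash_combine applied to each field in turn.
    auto mix = [&h](std::size_t v) {
      h ^= v + 0x9e3779b9u + (h << 6) + (h >> 2);
    };
    mix(k.renderer_id); mix(k.owner_id); mix(k.humanoid_lod);
    mix(k.mount_lod); mix(k.anim_state); mix(k.combat_phase);
    mix(k.frame); mix(k.variant); mix(k.attack_variant);
    return h;
  }
};

// Stand-in for a recorded template; each int represents one draw command.
struct UnitTemplate {
  std::vector<int> commands;
};

struct CacheMetrics {
  std::size_t hits = 0, misses = 0, runtime_builds = 0;
};

class MeteredTemplateCache {
public:
  using Builder = std::function<UnitTemplate(const TemplateKey &)>;

  // Returns the cached template, building it on a miss. Misses that occur
  // after mark_prewarm_done() are counted separately as runtime builds.
  const UnitTemplate &get_or_build(const TemplateKey &key, const Builder &build) {
    auto it = m_templates.find(key);
    if (it != m_templates.end()) {
      ++m_metrics.hits;
      return it->second;
    }
    ++m_metrics.misses;
    if (m_prewarm_done) ++m_metrics.runtime_builds;
    return m_templates.emplace(key, build(key)).first->second;
  }

  void mark_prewarm_done() { m_prewarm_done = true; }
  const CacheMetrics &metrics() const { return m_metrics; }
  std::size_t template_count() const { return m_templates.size(); }

  double average_commands() const {
    if (m_templates.empty()) return 0.0;
    std::size_t total = 0;
    for (const auto &entry : m_templates) total += entry.second.commands.size();
    return static_cast<double>(total) / static_cast<double>(m_templates.size());
  }

private:
  std::unordered_map<TemplateKey, UnitTemplate, TemplateKeyHash> m_templates;
  CacheMetrics m_metrics;
  bool m_prewarm_done = false;
};

using SoldierGroups =
    std::unordered_map<TemplateKey, std::vector<std::size_t>, TemplateKeyHash>;

// Buckets soldier indices by key so each distinct template needs only one
// cache lookup per unit instead of one per soldier.
inline SoldierGroups group_by_template(const std::vector<TemplateKey> &keys) {
  SoldierGroups groups;
  for (std::size_t i = 0; i < keys.size(); ++i) groups[keys[i]].push_back(i);
  return groups;
}
```

With counters like these, "no runtime template builds after prewarm" becomes a check that runtime_builds stays at zero during play. The grouping helper trades one small per-unit map for fewer lookups in the shared cache; whether that is a net win would need to be measured.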