.. _doc_gpu_optimization:

GPU Optimizations
=================

Introduction
~~~~~~~~~~~~

The demand for new graphics features and progress almost guarantees that you
will encounter graphics bottlenecks. Some of these can be on the CPU side, for
instance in calculations inside the Godot engine to prepare objects for
rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
sorts instructions to pass to the GPU, and in the transfer of these
instructions. Finally, bottlenecks can also occur on the GPU itself.

Where bottlenecks occur in rendering is highly hardware-specific. Mobile GPUs in
particular may struggle with scenes that run easily on desktop.

Understanding and investigating GPU bottlenecks is slightly different from the
situation on the CPU. You can often only change performance indirectly, by
changing the instructions you give to the GPU, and it may be more difficult to
take measurements. Often, the only way of measuring performance is by examining
changes in frame rate.

Drawcalls, state changes, and APIs
==================================

.. note:: The following section is not relevant to end-users, but is useful to
          provide background information that is relevant in later sections.

Godot sends instructions to the GPU via a graphics API (OpenGL, GLES2, GLES3,
Vulkan). The communication and driver activity involved can be quite costly,
especially in OpenGL. If we can provide these instructions in a way that is
preferred by the driver and GPU, we can greatly increase performance.

Nearly every API command in OpenGL requires a certain amount of validation to
make sure the GPU is in the correct state. Even seemingly simple commands can
lead to a flurry of behind-the-scenes housekeeping. Therefore, the name of the
game is to reduce these instructions to a bare minimum, and to group similar
objects together as much as possible so they can be rendered together, or with
the minimum number of these expensive state changes.

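To make the idea of grouping similar objects more concrete, here is a minimal
Python sketch. It is not how Godot's renderer is implemented. The draw list, the
object names and the material IDs are made up for illustration, and a "state
change" is simplified to "the material differs from the previous item".

```python
from typing import List, Optional, Tuple

# Each draw item is (object_name, material_id). Both values are hypothetical.
DrawItem = Tuple[str, int]


def count_state_changes(items: List[DrawItem]) -> int:
    """Count how often the bound material changes while walking the list in order."""
    changes = 0
    current: Optional[int] = None
    for _name, material in items:
        if material != current:
            changes += 1
            current = material
    return changes


def sort_by_material(items: List[DrawItem]) -> List[DrawItem]:
    """Group items that share a material so they are drawn back to back."""
    return sorted(items, key=lambda item: item[1])


scene = [("rock_a", 2), ("tree_a", 1), ("rock_b", 2), ("tree_b", 1), ("rock_c", 2)]
unsorted_changes = count_state_changes(scene)
sorted_changes = count_state_changes(sort_by_material(scene))
print(unsorted_changes, sorted_changes)
```

In this toy scene, drawing in the original order switches material for every
item, while the sorted order only switches once per distinct material.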
2D batching
~~~~~~~~~~~

In 2D, the cost of treating each item individually can be prohibitively high,
since there can easily be thousands of items on screen. This is why 2D batching
is used: multiple similar items are grouped together and rendered in a batch,
via a single draw call, rather than making a separate draw call for each item.
In addition, this means that state changes, such as material and texture
changes, can be kept to a minimum.

For more information on 2D batching, see :ref:`doc_batching`.

3D batching
~~~~~~~~~~~

In 3D, we still aim to minimize draw calls and state changes. However, it can be
more difficult to batch several objects together into a single draw call. 3D
meshes tend to comprise hundreds or thousands of triangles, and combining large
meshes at runtime is prohibitively expensive. The cost of joining them quickly
exceeds any benefit as the number of triangles per mesh grows. A much better
alternative is to join meshes ahead of time (meshes that are static in relation
to each other). This can be done either by artists or programmatically within
Godot.

There is also a cost to batching objects together in 3D. Several objects
rendered as one cannot be culled individually. An entire city that is off screen
will still be rendered if it is joined to a single blade of grass that is on
screen. So attempts to batch 3D objects together should take account of their
location and the effect on culling. Despite this, the benefits of joining static
objects often outweigh other considerations, especially for large numbers of
low-poly objects.

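One possible way to respect culling when joining static objects is to only
merge objects that are close to each other. The Python sketch below groups
objects by a coarse grid cell on the ground plane. The object names, positions
and ``cell_size`` are assumptions made for illustration, and the actual mesh
merging step is left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Position = Tuple[float, float, float]  # (x, y, z) in world units
Cell = Tuple[int, int]


def group_by_cell(positions: Dict[str, Position], cell_size: float) -> Dict[Cell, List[str]]:
    """Group object names by the grid cell that contains them on the XZ plane."""
    groups: Dict[Cell, List[str]] = defaultdict(list)
    for name, (x, _y, z) in positions.items():
        cell = (int(x // cell_size), int(z // cell_size))
        groups[cell].append(name)
    return dict(groups)


# Hypothetical static objects: two blades of grass and a distant city block.
static_objects = {
    "grass_1": (1.0, 0.0, 2.0),
    "grass_2": (3.0, 0.0, 4.0),
    "city_block": (900.0, 0.0, 40.0),
}
merge_groups = group_by_cell(static_objects, cell_size=50.0)
print(merge_groups)
```

Each group is a candidate for joining into one mesh. Because the city block
lands in a different cell than the grass, it can still be culled on its own.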
For more information on 3D-specific optimizations, see
:ref:`doc_optimizing_3d_performance`.

Reuse Shaders and Materials
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Godot renderer is designed to minimize GPU state changes as much as
possible, which sets it apart from many other renderers. :ref:`SpatialMaterial
<class_SpatialMaterial>` does a good job of reusing materials that need similar
shaders. If custom shaders are used, however, make sure to reuse them as much as
possible. Godot's priorities are:

- **Reusing Materials**: The fewer different materials in the
  scene, the faster the rendering will be. If a scene has a huge number
  of objects (in the hundreds or thousands), try reusing the materials
  or, in the worst case, use atlases.
- **Reusing Shaders**: If materials can't be reused, at least try to
  reuse shaders (or SpatialMaterials with different parameters but the same
  configuration).

If a scene has, for example, ``20,000`` objects, each with a different material
(``20,000`` materials in total), rendering will be slow. If the same scene has
``20,000`` objects but only uses ``100`` materials, rendering will be much
faster.

Pixel cost vs vertex cost
=========================

You may have heard that the lower the number of polygons in a model, the faster
it will be rendered. This is *really* relative and depends on many factors.

On a modern PC or console, vertex cost is low. GPUs originally only rendered
triangles, so every frame all the vertices:

1. Had to be transformed by the CPU (including clipping).
2. Had to be sent to the GPU memory from the main RAM.

Now all this is handled inside the GPU, so performance is much higher. 3D
artists often have a misleading intuition about polycount performance, because
3D DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory so that
it can be edited, which reduces actual performance. Game engines rely more on
the GPU, so they can render many triangles much more efficiently.

On mobile devices, the story is different. PC and console GPUs are
brute-force monsters that can pull as much electricity as they need from
the power grid. Mobile GPUs are limited to a tiny battery, so they need
to be a lot more power-efficient.

To be more efficient, mobile GPUs attempt to avoid *overdraw*, which occurs when
the same pixel on the screen is rendered more than once. Imagine a town with
several buildings. GPUs don't know what is visible and what is hidden until they
draw it. A house might be drawn and then another house drawn in front of it
(so rendering happened twice for the same pixel!). PC GPUs normally don't care
much about this and just add more pixel processors to the hardware to increase
performance (but this also increases power consumption).

Using more power is not an option on mobile, so mobile devices use a technique
called "Tile Based Rendering", which divides the screen into a grid. Each cell
keeps the list of triangles drawn to it and sorts them by depth to minimize
*overdraw*. This technique improves performance and reduces power consumption,
but takes a toll on vertex performance. As a result, fewer vertices and
triangles can be processed for drawing.

Additionally, Tile Based Rendering struggles when there are small objects with a
lot of geometry within a small portion of the screen. This forces mobile GPUs to
put a lot of strain on a single screen tile, which considerably decreases
performance, as all the other cells must wait for it to complete before the
frame can be displayed.

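The effect of dense geometry on a single tile can be illustrated with a rough
Python sketch of the binning step. This is a simplification under stated
assumptions: triangles are reduced to their screen-space bounding boxes, the
tile size of 32 pixels is arbitrary, and real hardware uses more precise overlap
tests.

```python
from collections import Counter
from typing import Iterable, Tuple

# A triangle reduced to its bounding box in pixels: (min_x, min_y, max_x, max_y).
BBox = Tuple[int, int, int, int]


def bin_triangles(bboxes: Iterable[BBox], tile_size: int) -> Counter:
    """Count how many triangle bounding boxes touch each (tile_x, tile_y) cell."""
    counts: Counter = Counter()
    for min_x, min_y, max_x, max_y in bboxes:
        for tile_y in range(min_y // tile_size, max_y // tile_size + 1):
            for tile_x in range(min_x // tile_size, max_x // tile_size + 1):
                counts[(tile_x, tile_y)] += 1
    return counts


# One large triangle spanning two tiles, plus a tiny, very detailed object
# (100 small triangles) that sits entirely inside one tile.
large_triangle = [(0, 0, 63, 31)]
tiny_dense_object = [(40, 5, 42, 7)] * 100
tile_load = bin_triangles(large_triangle + tiny_dense_object, tile_size=32)
print(tile_load)
```

Here one tile ends up with 101 triangles while its neighbor has a single one,
which is the kind of imbalance described above.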
In summary, do not worry too much about vertex count on mobile, but avoid
concentrating vertices in small parts of the screen. If a character, NPC,
vehicle, etc. is far away (so it looks tiny), use a lower level of detail (LOD)
model.

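Choosing a LOD model by distance can be sketched as follows. This is only an
illustration in Python, not a Godot API. The function name and the distance
thresholds are assumptions, and index ``0`` stands for the most detailed model.

```python
from typing import Sequence


def select_lod(distance: float, thresholds: Sequence[float]) -> int:
    """Return the LOD index for a camera distance (0 is the most detailed).

    `thresholds` must be sorted in ascending order. An object closer than
    thresholds[i] uses LOD i, and anything beyond the last threshold uses
    the coarsest model.
    """
    for index, limit in enumerate(thresholds):
        if distance < limit:
            return index
    return len(thresholds)


# Hypothetical thresholds in world units, giving four LOD levels (0 to 3).
lod_thresholds = [10.0, 30.0, 80.0]
print(select_lod(5.0, lod_thresholds))    # close to the camera
print(select_lod(200.0, lod_thresholds))  # far away, looks tiny
```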
Pay attention to the additional vertex processing required when using:

- Skinning (skeletal animation)
- Morphs (shape keys)
- Vertex-lit objects (common on mobile)

Pixel / fragment shaders - fill rate
====================================

In contrast to vertex processing, the cost of fragment shading has increased
dramatically over the years. Screen resolutions have increased: the area of a 4K
screen is ``8,294,400`` pixels, versus ``307,200`` for an old ``640x480`` VGA
screen, which is 27x the area. The complexity of fragment shaders has also
exploded. Physically based rendering requires complex calculations for each
fragment.

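The pixel counts above can be checked with a few lines of Python. The sketch
assumes that "4K" means 3840x2160. The ``fragments_per_second`` helper and its
``overdraw`` factor are illustrative additions, giving a rough lower bound on
how many fragments must be shaded per second.

```python
def pixel_count(width: int, height: int) -> int:
    """Number of pixels on a screen of the given size."""
    return width * height


def fragments_per_second(width: int, height: int, fps: float, overdraw: float = 1.0) -> float:
    """Rough number of fragments shaded per second.

    `overdraw` is the average number of times each pixel is shaded per frame.
    """
    return pixel_count(width, height) * fps * overdraw


uhd_pixels = pixel_count(3840, 2160)  # assumed meaning of "4K"
vga_pixels = pixel_count(640, 480)
area_ratio = uhd_pixels / vga_pixels
print(uhd_pixels, vga_pixels, area_ratio)
print(fragments_per_second(3840, 2160, fps=60))
```

Even with no overdraw at all, a 4K screen at 60 frames per second requires
close to half a billion fragment shader invocations per second.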
You can test whether a project is fill rate limited quite easily. Turn off vsync
to prevent capping the frames per second, then compare the frames per second
when running with a large window to running with a postage-stamp-sized window
(you may also benefit from similarly reducing your shadow map size if using
shadows). Usually, you will find that the FPS increases quite a bit with a small
window, which indicates you are to some extent fill rate limited. If, on the
other hand, there is little to no increase in FPS, then your bottleneck lies
elsewhere.

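If you note down the two frame rates, the comparison can be expressed as a
small Python helper. The function names and the ``1.2`` threshold are arbitrary
assumptions for illustration. There is no fixed cutoff, so treat the result as a
hint rather than a verdict.

```python
def fill_rate_speedup(fps_large_window: float, fps_small_window: float) -> float:
    """How many times faster the project runs in the small window."""
    return fps_small_window / fps_large_window


def looks_fill_rate_limited(
    fps_large_window: float, fps_small_window: float, threshold: float = 1.2
) -> bool:
    """Heuristic: a clear speedup in a tiny window suggests a fill rate bottleneck."""
    return fill_rate_speedup(fps_large_window, fps_small_window) >= threshold


# Hypothetical measurements taken with vsync turned off.
print(looks_fill_rate_limited(45.0, 140.0))  # big gain in the small window
print(looks_fill_rate_limited(60.0, 62.0))   # almost no gain
```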
You can increase performance in a fill rate limited project by reducing the
amount of work the GPU has to do. You can do this by simplifying the shader
(perhaps by turning off expensive options if you are using a
:ref:`SpatialMaterial <class_SpatialMaterial>`), or by reducing the number and
size of textures used.

Consider shipping simpler shaders for mobile.

Reading textures
~~~~~~~~~~~~~~~~

The other factor in fragment shaders is the cost of reading textures. Reading
textures is an expensive operation, especially when reading from several
textures in a single fragment shader. Filtering may add further expense
(trilinear filtering between mipmaps, and averaging). Reading textures is also
expensive in terms of power, which is a big issue on mobile devices.

Texture compression
~~~~~~~~~~~~~~~~~~~

Godot compresses textures of 3D models when imported (VRAM compression) by
default. Video RAM compression is not as efficient in size as PNG or JPG when
stored, but increases performance enormously when drawing.

This is because the main goal of texture compression is bandwidth reduction
between memory and the GPU.

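To get a feel for the bandwidth savings, the Python sketch below compares the
in-memory size of an uncompressed texture with two widely used block-compressed
formats. The formats are examples only. Which format Godot actually selects
depends on the platform and import settings, and mipmaps are ignored here.

```python
# Bits per texel for a few common GPU texture formats.
BITS_PER_TEXEL = {
    "RGBA8 (uncompressed)": 32,
    "BC1 / DXT1": 4,
    "BC3 / DXT5": 8,
}


def texture_bytes(width: int, height: int, bits_per_texel: int) -> int:
    """Size in bytes of a single mip level stored in video memory."""
    return width * height * bits_per_texel // 8


for name, bits in BITS_PER_TEXEL.items():
    print(name, texture_bytes(1024, 1024, bits))

uncompressed = texture_bytes(1024, 1024, BITS_PER_TEXEL["RGBA8 (uncompressed)"])
bc1 = texture_bytes(1024, 1024, BITS_PER_TEXEL["BC1 / DXT1"])
savings_factor = uncompressed / bc1
```

Fewer bytes per texel means less data has to travel between memory and the GPU
every time the texture is sampled.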
In 3D, the shapes of objects depend more on the geometry than the texture, so
compression is generally not noticeable. In 2D, the visible shapes depend more
on the contents of the textures, so compression artifacts are more noticeable.

As a warning, most Android devices do not support compression of textures with
transparency (only opaque textures), so keep this in mind.

Post processing / shadows
~~~~~~~~~~~~~~~~~~~~~~~~~

Post processing effects and shadows can also be expensive in terms of fragment
shading activity. Always test the impact of these on different hardware.

Reducing the size of shadow maps can increase performance, both in terms of
writing and reading the maps.

Transparency / blending
=======================

Transparent items present particular problems for rendering efficiency. Opaque
items (especially in 3D) can essentially be rendered in any order, and the
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
blended objects are different. In most cases, they cannot rely on the Z-buffer
and must be rendered in "painter's order" (i.e. from back to front) in order to
look correct.

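The ordering described above can be sketched in Python as follows. The item
names and depths are made up, and drawing opaque items front to back is an
assumption of this sketch (a common choice that lets the Z-buffer reject hidden
fragments early), since any order gives a correct image for opaque items.

```python
from typing import List, NamedTuple


class Item(NamedTuple):
    name: str
    depth: float       # distance from the camera
    transparent: bool


def draw_order(items: List[Item]) -> List[str]:
    """Opaque items first (front to back), then transparent items back to front."""
    opaque = sorted((i for i in items if not i.transparent), key=lambda i: i.depth)
    blended = sorted(
        (i for i in items if i.transparent), key=lambda i: i.depth, reverse=True
    )
    return [i.name for i in opaque + blended]


# Hypothetical scene contents.
scene_items = [
    Item("wall", 10.0, False),
    Item("window", 6.0, True),
    Item("crate", 4.0, False),
    Item("smoke", 2.0, True),
    Item("glass", 8.0, True),
]
order = draw_order(scene_items)
print(order)
```

The transparent items come out farthest first, so each one is blended over
whatever lies behind it.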
Transparent items are also particularly bad for fill rate, because every
item has to be drawn, even if other transparent items will later be drawn on
top. Opaque items don't have to do this. They can usually take advantage of the
Z-buffer by first writing only to the Z-buffer, then running the fragment shader
only on the "winning" fragment, the one that is at the front at a particular
pixel.

Transparency is particularly expensive where multiple transparent items overlap.
It is usually better to use as small a transparent area as possible in order to
minimize these fill rate requirements, especially on mobile, where fill rate is
very expensive. Indeed, in many situations, rendering more complex opaque
geometry can end up being faster than using transparency to "cheat".

Multi-Platform Advice
=====================

If you are aiming to release on multiple platforms, test *early* and test
*often* on all your platforms, especially mobile. Developing a game on desktop
but attempting to port it to mobile at the last minute is a recipe for disaster.

In general, you should design your game for the lowest common denominator, then
add optional enhancements for more powerful platforms. For example, you may want
to use the GLES2 backend for both desktop and mobile platforms where you target
both.

Mobile / tile renderers
=======================

GPUs on mobile devices work in dramatically different ways from GPUs on desktop.
Most mobile devices use tile renderers. Tile renderers split up the screen into
regular-sized tiles that fit into super-fast cache memory, which reduces the
reads and writes to main memory.

There are some downsides, though. Tile rendering can make certain techniques
much more complicated and expensive to perform. Tiles that rely on the results
of rendering in different tiles, or on the results of earlier operations being
preserved, can be very slow. Be very careful to test the performance of shaders,
viewport textures and post processing.