5 years ago · 55b1a4fb03
--- a/tutorials/3d/index.rst
+++ b/tutorials/3d/index.rst
@@ -7,7 +7,6 @@
 
				 
			
 
				    introduction_to_3d
			
 
				    using_transforms
			
 
				-   optimizing_3d_performance
			
 
				    3d_rendering_limitations
			
 
				    standard_material_3d
			
 
				    lights_and_shadows
			
--- a/tutorials/3d/optimizing_3d_performance.rst
+++ b/tutorials/3d/optimizing_3d_performance.rst
@@ -1,192 +0,0 @@
 
				-.. meta::
			
 
				-    :keywords: optimization
			
 
				-
			
 
				-.. _doc_optimizing_3d_performance:
			
 
				-
			
 
				-Optimizing 3D performance
			
 
				-=========================
			
 
				-
			
 
				-Introduction
			
 
				-~~~~~~~~~~~~
			
 
				-
			
 
				-Godot follows a balanced performance philosophy. In the performance world,
			
 
				-there are always trade-offs, which consist of trading speed for
			
 
				-usability and flexibility. Some practical examples of this are:
			
 
				-
			
 
				--  Rendering objects efficiently in high amounts is easy, but when a
			
 
				-   large scene must be rendered, it can become inefficient. To solve
			
 
				-   this, visibility computation must be added to the rendering, which
			
 
				-   makes rendering less efficient, but, at the same time, fewer objects are
			
 
				-   rendered, so efficiency overall improves.
			
 
				--  Configuring the properties of every material for every object that
			
 
				-   needs to be rendered is also slow. To solve this, objects are sorted
			
 
				-   by material to reduce the costs, but at the same time sorting has a
			
 
				-   cost.
			
 
				--  In 3D physics a similar situation happens. The best algorithms to
			
 
				-   handle large amounts of physics objects (such as SAP) are slow
			
 
				-   at insertion/removal of objects and ray-casting. Algorithms that
			
 
				-   allow faster insertion and removal, as well as ray-casting, will not
			
 
				-   be able to handle as many active objects.
			
 
				-
			
 
				-And there are many more examples of this! Game engines strive to be
			
 
				-general purpose in nature, so balanced algorithms are always favored
			
 
				-over algorithms that might be fast in some situations and slow in
			
 
				-others, or algorithms that are fast but make usability more difficult.
			
 
				-
			
 
				-Godot is not an exception and, while it is designed to have backends
			
 
				-swappable for different algorithms, the default ones (or more like, the
			
 
				-only ones that are there for now) prioritize balance and flexibility
			
 
				-over performance.
			
 
				-
			
 
				-With this clear, the aim of this tutorial is to explain how to get the
			
 
				-maximum performance out of Godot.
			
 
				-
			
 
				-Rendering
			
 
				-~~~~~~~~~
			
 
				-
			
 
				-3D rendering is one of the most difficult areas to get performance from,
			
 
				-so this section will have a list of tips.
			
 
				-
			
 
				-Reuse shaders and materials
			
 
				----------------------------
			
 
				-
			
 
				-The Godot renderer is a little different to what is out there. It's designed
			
 
				-to minimize GPU state changes as much as possible.
			
 
				-:ref:`class_StandardMaterial3D`
			
 
				-does a good job at reusing materials that need similar shaders but, if
			
 
				-custom shaders are used, make sure to reuse them as much as possible.
			
 
				-Godot's priorities will be like this:
			
 
				-
			
 
				--  **Reusing Materials**: The fewer different materials in the
			
 
				-   scene, the faster the rendering will be. If a scene has a huge amount
			
 
				-   of objects (in the hundreds or thousands) try reusing the materials
			
 
				-   or in the worst case use atlases.
			
 
				--  **Reusing Shaders**: If materials can't be reused, at least try to
			
 
				-   re-use shaders (or StandardMaterial3Ds with different parameters but the same
			
 
				-   configuration).
			
 
				-
			
 
				-If a scene has, for example, 20,000 objects with 20,000 different
				-materials, rendering will be slow. If the same scene has
				-20,000 objects, but only uses 100 materials, rendering will be blazingly
			
 
				-fast.
			
 
				-
			
 
				-Pixel cost vs vertex cost
			
 
				--------------------------
			
 
				-
			
 
				-It is a common thought that the lower the number of polygons in a model, the
			
 
				-faster it will be rendered. This is *really* relative and depends on
			
 
				-many factors.
			
 
				-
			
 
				-On a modern PC and console, vertex cost is low. GPUs
			
 
				-originally only rendered triangles, so all the vertices:
			
 
				-
			
 
				-1. Had to be transformed by the CPU (including clipping).
			
 
				-
			
 
				-2. Had to be sent to the GPU memory from the main RAM.
			
 
				-
			
 
				-Nowadays, all this is handled inside the GPU, so the performance is
			
 
				-extremely high. 3D artists usually have the wrong feeling about
			
 
				-polycount performance because 3D DCCs (such as Blender, Max, etc.) need
			
 
				-to keep geometry in CPU memory in order for it to be edited, reducing
			
 
				-actual performance. Truth is, a model rendered by a 3D engine is much
			
 
				-more optimal than how 3D DCCs display them.
			
 
				-
			
 
				-On mobile devices, the story is different. PC and Console GPUs are
			
 
				-brute-force monsters that can pull as much electricity as they need from
			
 
				-the power grid. Mobile GPUs are limited to a tiny battery, so they need
			
 
				-to be a lot more power efficient.
			
 
				-
			
 
				-To be more efficient, mobile GPUs attempt to avoid *overdraw*. This
			
 
				-means, the same pixel on the screen being rendered (as in, with lighting
			
 
				-calculation, etc.) more than once. Imagine a town with several buildings,
			
 
				-GPUs don't know what is visible and what is hidden until they
			
 
				-draw it. A house might be drawn and then another house in front of it
			
 
				-(rendering happened twice for the same pixel!). PC GPUs normally don't
			
 
				-care much about this and just throw more pixel processors to the
			
 
				-hardware to increase performance (but this also increases power
			
 
				-consumption).
			
 
				-
			
 
				-On mobile, pulling more power is not an option, so a technique called
			
 
				-"Tile Based Rendering" is used (almost every mobile hardware uses a
			
 
				-variant of it), which divides the screen into a grid. Each cell keeps the
			
 
				-list of triangles drawn to it and sorts them by depth to minimize
			
 
				-*overdraw*. This technique improves performance and reduces power
			
 
				-consumption, but takes a toll on vertex performance. As a result, fewer
			
 
				-vertices and triangles can be processed for drawing.
			
 
				-
			
 
				-Generally, this is not so bad, but there is a corner case on mobile that
			
 
				-must be avoided, which is to have small objects with a lot of geometry
			
 
				-within a small portion of the screen. This forces mobile GPUs to put a
			
 
				-lot of strain on a single screen cell, considerably decreasing
			
 
				-performance (as all the other cells must wait for it to complete in
			
 
				-order to display the frame).
			
 
				-
			
 
				-To make it short, do not worry about vertex count so much on mobile, but
			
 
				-avoid concentration of vertices in small parts of the screen. If, for
			
 
				-example, a character, NPC, vehicle, etc. is far away (so it looks tiny),
			
 
				-use a smaller level of detail (LOD) model instead.
			
 
				-
			
 
				-An extra situation where vertex cost must be considered is objects that
			
 
				-have extra processing per vertex, such as:
			
 
				-
			
 
				--  Skinning (skeletal animation)
			
 
				--  Morphs (shape keys)
			
 
				--  Vertex Lit Objects (common on mobile)
			
 
				-
			
 
				-Texture compression
			
 
				--------------------
			
 
				-
			
 
				-Godot offers to compress textures of 3D models when imported (VRAM
			
 
				-compression). Video RAM compression is not as efficient in size as PNG
			
 
				-or JPG when stored, but increases performance enormously when drawing.
			
 
				-
			
 
				-This is because the main goal of texture compression is bandwidth
			
 
				-reduction between memory and the GPU.
			
 
				-
			
 
				-In 3D, the shapes of objects depend more on the geometry than the
			
 
				-texture, so compression is generally not noticeable. In 2D, compression
			
 
				-depends more on shapes inside the textures, so the artifacts resulting
			
 
				-from 2D compression are more noticeable.
			
 
				-
			
 
				-As a warning, most Android devices do not support texture compression of
			
 
				-textures with transparency (only opaque), so keep this in mind.
			
 
				-
			
 
				-Transparent objects
			
 
				--------------------
			
 
				-
			
 
				-As mentioned before, Godot sorts objects by material and shader to
			
 
				-improve performance. This, however, can not be done on transparent
			
 
				-objects. Transparent objects are rendered from back to front to make
			
 
				-blending with what is behind work. As a result, please try to keep
			
 
				-transparent objects to a minimum! If an object has a small section with
			
 
				-transparency, try to make that section a separate material.
			
 
				-
			
 
				-Level of detail (LOD)
			
 
				----------------------
			
 
				-
			
 
				-As also mentioned before, using objects with fewer vertices can improve
			
 
				-performance in some cases. Godot has a simple system to change level
			
 
				-of detail,
			
 
				-:ref:`GeometryInstance <class_GeometryInstance>`
			
 
				-based objects have a visibility range that can be defined. Having
			
 
				-several GeometryInstance objects in different ranges works as LOD.
			
 
				-
			
 
				-Use instancing (MultiMesh)
			
 
				---------------------------
			
 
				-
			
 
				-If several identical objects have to be drawn in the same place or
			
 
				-nearby, try using :ref:`MultiMesh <class_MultiMesh>`
			
 
				-instead. MultiMesh allows the drawing of dozens of thousands of objects at
			
 
				-very little performance cost, making it ideal for flocks, grass,
			
 
				-particles, etc.
			
 
				-
			
 
				-Bake lighting
			
 
				--------------
			
 
				-
			
 
				-Small lights are usually not a performance issue. Shadows a little more.
			
 
				-In general, if several lights need to affect a scene, it's ideal to bake
			
 
				-it (:ref:`doc_baked_lightmaps`). Baking can also improve the scene quality by
			
 
				-adding indirect light bounces.
			
 
				-
			
 
				-If working on mobile, baking to texture is recommended, since this
			
 
				-method is even faster.
			
--- a/tutorials/optimization/cpu_optimization.rst
+++ b/tutorials/optimization/cpu_optimization.rst
@@ -0,0 +1,277 @@
 
				+.. _doc_cpu_optimization:
			
 
				+
			
 
				+CPU optimization
			
 
				+================
			
 
				+
			
 
				+Measuring performance
			
 
				+=====================
			
 
				+
			
 
				+We have to know where the "bottlenecks" are to know how to speed up our program.
			
 
				+Bottlenecks are the slowest parts of the program that limit the rate at which
			
 
				+everything can progress. Focusing on bottlenecks allows us to concentrate our
			
 
				+efforts on optimizing the areas which will give us the greatest speed
			
 
				+improvement, instead of spending a lot of time optimizing functions that will
			
 
				+lead to small performance improvements.
			
 
				+
			
 
				+For the CPU, the easiest way to identify bottlenecks is to use a profiler.
			
 
				+
			
 
				+CPU profilers
			
 
				+=============
			
 
				+
			
 
				+Profilers run alongside your program and take timing measurements to work out
			
 
				+what proportion of time is spent in each function.
			
 
				+
			
 
				+The Godot IDE conveniently has a built-in profiler. It does not run every time
			
 
				+you start your project: it must be manually started and stopped. This is
			
 
				+because, like most profilers, recording these timing measurements can
			
 
				+slow down your project significantly.
			
 
				+
			
 
				+After profiling, you can look back at the results for a frame.
			
 
				+
			
 
				+.. figure:: img/godot_profiler.png
			
 
			
 
				+   :alt: Screenshot of the Godot profiler
			
 
				+
			
 
				+   Results of a profile of one of the demo projects.
			
 
				+
			
 
				+.. note:: We can see the cost of built-in processes such as physics and audio,
			
 
				+          as well as seeing the cost of our own scripting functions at the
			
 
				+          bottom.
			
 
				+
			
 
				+          Time spent waiting for various built-in servers may not be counted in
			
 
				+          the profilers. This is a known bug.
			
 
				+
			
 
				+When a project is running slowly, you will often see an obvious function or
			
 
				+process taking a lot more time than others. This is your primary bottleneck, and
			
 
				+you can usually increase speed by optimizing this area.
			
 
				+
			
 
				+For more info about using Godot's built-in profiler, see
			
 
				+:ref:`doc_debugger_panel`.
			
 
				+
			
 
				+External profilers
			
 
				+~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+Although the Godot IDE profiler is very convenient and useful, sometimes you
			
 
				+need more power, and the ability to profile the Godot engine source code itself.
			
 
				+
			
 
				+You can use a number of third party profilers to do this including
			
 
				+`Valgrind <https://www.valgrind.org/>`__,
			
 
				+`VerySleepy <http://www.codersnotes.com/sleepy/>`__,
			
 
				+`HotSpot <https://github.com/KDAB/hotspot>`__,
			
 
				+`Visual Studio <https://visualstudio.microsoft.com/>`__ and
			
 
				+`Intel VTune <https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html>`__.
			
 
				+
			
 
				+.. note:: You will need to compile Godot from source to use a third-party profiler.
			
 
				+          This is required to obtain debugging symbols. You can also use a debug
			
 
				+          build, however, note that the results of profiling a debug build will
			
 
				+          be different to a release build, because debug builds are less
			
 
				+          optimized. Bottlenecks are often in a different place in debug builds,
			
 
				+          so you should profile release builds whenever possible.
			
 
				+
			
 
				+.. figure:: img/valgrind.png
			
 
				+   :alt: Screenshot of Callgrind
			
 
				+
			
 
				+   Example results from Callgrind, which is part of Valgrind.
			
 
				+
			
 
				+From the left, Callgrind is listing the percentage of time within a function and
			
 
				+its children (Inclusive), the percentage of time spent within the function
			
 
				+itself, excluding child functions (Self), the number of times the function is
			
 
				+called, the function name, and the file or module.
			
 
				+
			
 
				+In this example, we can see nearly all time is spent under the
			
 
				+``Main::iteration()`` function. This is the master function in the Godot source
			
 
				+code that is called repeatedly. It causes frames to be drawn, physics ticks to
			
 
				+be simulated, and nodes and scripts to be updated. A large proportion of the
			
 
				+time is spent in the functions to render a canvas (66%), because this example
			
 
				+uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
			
 
				+outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver).
			
 
				+This tells us that a large proportion of CPU time is being spent in the
			
 
				+graphics driver.
			
 
				+
			
 
				+This is actually an excellent example because, in an ideal world, only a very
			
 
				+small proportion of time would be spent in the graphics driver. This is an
			
 
				+indication that there is a problem with too much communication and work being
			
 
				+done in the graphics API. This specific profiling led to the development of 2D
			
 
				+batching, which greatly speeds up 2D rendering by reducing bottlenecks in this
			
 
				+area.
			
 
				+
			
 
				+Manually timing functions
			
 
				+=========================
			
 
				+
			
 
				+Another handy technique, especially once you have identified the bottleneck
			
 
				+using a profiler, is to manually time the function or area under test.
			
 
				+The specifics vary depending on the language, but in GDScript, you would do
			
 
				+the following:
			
 
				+
			
 
				+::
			
 
				+
			
 
				+    var time_start = OS.get_ticks_usec()
			
 
				+
			
 
				+    # Your function you want to time
			
 
				+    update_enemies()
			
 
				+
			
 
				+    var time_end = OS.get_ticks_usec()
			
 
				+    print("update_enemies() took %d microseconds" % (time_end - time_start))
			
 
				+
			
 
				+When manually timing functions, it is usually a good idea to run the function
			
 
				+many times (1,000 or more times), instead of just once (unless it is a very slow
			
 
				+function). The reason for doing this is that timers often have limited accuracy.
			
 
				+Moreover, CPUs will schedule processes in a haphazard manner. Therefore, an
			
 
				+average over a series of runs is more accurate than a single measurement.
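				+
				+As a rough sketch of this averaging approach, you could time a batch of calls
				+and divide by the number of runs. ``update_enemies()`` is the placeholder
				+function from the snippet above (stubbed out here), while ``RUNS`` and
				+``time_update_enemies()`` are hypothetical names chosen for illustration:
				+
				+::
				+
				+    const RUNS = 1000
				+
				+    # Placeholder for the function under test.
				+    func update_enemies():
				+        pass
				+
				+    func time_update_enemies():
				+        var time_start = OS.get_ticks_usec()
				+
				+        # Call the function many times so that limited timer accuracy
				+        # and scheduling noise are averaged out.
				+        for i in range(RUNS):
				+            update_enemies()
				+
				+        var total_usec = OS.get_ticks_usec() - time_start
				+        print("update_enemies() took %f microseconds on average" % (float(total_usec) / RUNS))
				+
				+Newer Godot versions expose the same counter as ``Time.get_ticks_usec()``.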
			
 
				+
			
 
				+As you attempt to optimize functions, be sure to either repeatedly profile or
			
 
				+time them as you go. This will give you crucial feedback as to whether the
			
 
				+optimization is working (or not).
			
 
				+
			
 
				+Caches
			
 
				+======
			
 
				+
			
 
				+CPU caches are something else to be particularly aware of, especially when
			
 
				+comparing timing results of two different versions of a function. The results
			
 
				+can be highly dependent on whether the data is in the CPU cache or not. CPUs
			
 
				+don't load data directly from the system RAM, even though it's huge in
			
 
				+comparison to the CPU cache (several gigabytes instead of a few megabytes). This
			
 
				+is because system RAM is very slow to access. Instead, CPUs load data from a
			
 
				+smaller, faster bank of memory called cache. Loading data from cache is very
			
 
				+fast, but every time you try to load a memory address that is not stored in
			
 
				+cache, the cache must make a trip to main memory and slowly load in some data.
			
 
				+This delay can result in the CPU sitting around idle for a long time, and is
			
 
				+referred to as a "cache miss".
			
 
				+
			
 
				+This means that the first time you run a function, it may run slowly because the
			
 
				+data is not in the CPU cache. The second and later times, it may run much faster
			
 
				+because the data is in the cache. Due to this, always use averages when timing,
			
 
				+and be aware of the effects of cache.
			
 
				+
			
 
				+Understanding caching is also crucial to CPU optimization. If you have an
			
 
				+algorithm (routine) that loads small bits of data from randomly spread out areas
			
 
				+of main memory, this can result in a lot of cache misses. Much of the time, the
			
 
				+CPU will be waiting around for data instead of doing any work. Instead, if you
			
 
				+can make your data accesses localised, or even better, access memory in a linear
			
 
				+fashion (like a continuous list), then the cache will work optimally and the CPU
			
 
				+will be able to work as fast as possible.
			
 
				+
			
 
				+Godot usually takes care of such low-level details for you. For example, the
			
 
				+Server APIs make sure data is optimized for caching already for things like
			
 
				+rendering and physics. Still, you should be especially aware of caching when
			
 
				+using :ref:`GDNative <toc-tutorials-gdnative>`.
			
 
				+
			
 
				+Languages
			
 
				+=========
			
 
				+
			
 
				+Godot supports a number of different languages, and it is worth bearing in mind
			
 
				+that there are trade-offs involved. Some languages are designed for ease of use
			
 
				+at the cost of speed, and others are faster but more difficult to work with.
			
 
				+
			
 
				+Built-in engine functions run at the same speed regardless of the scripting
			
 
				+language you choose. If your project is making a lot of calculations in its own
			
 
				+code, consider moving those calculations to a faster language.
			
 
				+
			
 
				+GDScript
			
 
				+~~~~~~~~
			
 
				+
			
 
				+:ref:`GDScript <toc-learn-scripting-gdscript>` is designed to be easy to use and iterate,
			
 
				+and is ideal for making many types of games. However, in this language, ease of
			
 
				+use is considered more important than performance. If you need to make heavy
			
 
				+calculations, consider moving some of your project to one of the other
			
 
				+languages.
			
 
				+
			
 
				+C#
			
 
				+~~
			
 
				+
			
 
				+:ref:`C# <toc-learn-scripting-C#>` is popular and has first-class support in Godot. It
			
 
				+offers a good compromise between speed and ease of use. Beware of possible
			
 
				+garbage collection pauses and leaks that can occur during gameplay, though. A
			
 
				+common approach to work around issues with garbage collection is to use *object
			
 
				+pooling*, which is outside the scope of this guide.
			
 
				+
			
 
				+Other languages
			
 
				+~~~~~~~~~~~~~~~
			
 
				+
			
 
				+Third parties provide support for several other languages, including `Rust
			
 
				+<https://github.com/godot-rust/godot-rust>`_ and `JavaScript
			
 
				+<https://github.com/GodotExplorer/ECMAScript>`_.
			
 
				+
			
 
				+C++
			
 
				+~~~
			
 
				+
			
 
				+Godot is written in C++. Using C++ will usually result in the fastest code.
			
 
				+However, on a practical level, it is the most difficult to deploy to end users'
			
 
				+machines on different platforms. Options for using C++ include
			
 
				+:ref:`GDNative <toc-tutorials-gdnative>` and
			
 
				+:ref:`custom modules <doc_custom_modules_in_c++>`.
			
 
				+
			
 
				+Threads
			
 
				+=======
			
 
				+
			
 
				+Consider using threads when making a lot of calculations that can run in
			
 
				+parallel to each other. Modern CPUs have multiple cores, each one capable of
			
 
				+doing a limited amount of work. By spreading work over multiple threads, you can
			
 
				+move further towards peak CPU efficiency.
			
 
				+
			
 
				+The disadvantage of threads is that you have to be incredibly careful. As each
			
 
				+CPU core operates independently, they can end up trying to access the same
			
 
				+memory at the same time. One thread can be reading from a variable while another
			
 
				+is writing: this is called a *race condition*. Before you use threads, make sure
			
 
				+you understand the dangers and how to try and prevent these race conditions.
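				+
				+A minimal sketch of guarding shared data with a ``Mutex`` is shown below. The
				+names ``counter`` and ``_add_to_counter()`` are hypothetical, and the
				+``Thread.start(instance, method, userdata)`` form is an assumption about the
				+Godot version in use (newer versions take a ``Callable`` instead):
				+
				+::
				+
				+    var counter = 0
				+    var mutex = Mutex.new()
				+    var threads = []
				+
				+    func _ready():
				+        for i in range(4):
				+            var thread = Thread.new()
				+            thread.start(self, "_add_to_counter", 1000)
				+            threads.append(thread)
				+
				+    func _add_to_counter(amount):
				+        for i in range(amount):
				+            # Only one thread at a time can hold the lock, so the
				+            # read-modify-write on counter cannot be interleaved.
				+            mutex.lock()
				+            counter += 1
				+            mutex.unlock()
				+
				+    func _exit_tree():
				+        # Wait for every thread to finish before the node goes away.
				+        for thread in threads:
				+            thread.wait_to_finish()
				+
				+Locking on every increment is itself costly. In real code, it is usually
				+better to accumulate a result locally in each thread and lock only once to
				+merge it.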
			
 
				+
			
 
				+Threads can also make debugging considerably more difficult. The GDScript
			
 
				+debugger doesn't support setting up breakpoints in threads yet.
			
 
				+
			
 
				+For more information on threads, see :ref:`doc_using_multiple_threads`.
			
 
				+
			
 
				+SceneTree
			
 
				+=========
			
 
				+
			
 
				+Although Nodes are an incredibly powerful and versatile concept, be aware that
			
 
				+every node has a cost. Built-in functions such as ``_process()`` and
			
 
				+``_physics_process()`` propagate through the tree. This housekeeping can reduce
			
 
				+performance when you have very large numbers of nodes (usually in the thousands).
			
 
				+
			
 
				+Each node is handled individually in the Godot renderer. Therefore, a smaller
			
 
				+number of nodes with more in each can lead to better performance.
			
 
				+
			
 
				+One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
			
 
				+get much better performance by removing nodes from the SceneTree, rather than by
			
 
				+pausing or hiding them. You don't have to delete a detached node. You can, for
			
 
				+example, keep a reference to a node, detach it from the scene tree using
			
 
				+:ref:`Node.remove_child(node) <class_Node_method_remove_child>`, then reattach
			
 
				+it later using :ref:`Node.add_child(node) <class_Node_method_add_child>`.
			
 
				+This can be very useful for adding and removing areas from a game, for example.
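				+
				+A minimal sketch of this pattern, assuming a hypothetical child node named
				+``Area1`` and hypothetical helper functions ``detach_area()`` and
				+``reattach_area()``:
				+
				+::
				+
				+    var detached_area = null
				+
				+    func detach_area():
				+        var area = get_node("Area1")
				+        # Once removed, the node and its children are no longer
				+        # processed or drawn, but they stay in memory.
				+        remove_child(area)
				+        detached_area = area
				+
				+    func reattach_area():
				+        if detached_area != null:
				+            add_child(detached_area)
				+            detached_area = null
				+
				+Note that a detached node is not freed automatically. When it is no longer
				+needed, free it yourself (for example with ``free()``) to avoid leaking it.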
			
 
				+
			
 
				+You can avoid the SceneTree altogether by using Server APIs. For more
			
 
				+information, see :ref:`doc_using_servers`.
			
 
				+
			
 
				+Physics
			
 
				+=======
			
 
				+
			
 
				+In some situations, physics can end up becoming a bottleneck. This is
			
 
				+particularly the case with complex worlds and large numbers of physics objects.
			
 
				+
			
 
				+Here are some techniques to speed up physics:
			
 
				+
			
 
				+- Try using simplified versions of your rendered geometry for collision shapes.
			
 
				+  Often, this won't be noticeable for end users, but can greatly increase
			
 
				+  performance.
			
 
				+- Try removing objects from physics when they are out of view / outside the
			
 
				+  current area, or reusing physics objects (maybe you allow 8 monsters per area,
			
 
				+  for example, and reuse these).
			
 
				+
			
 
				+Another crucial aspect to physics is the physics tick rate. In some games, you
			
 
				+can greatly reduce the tick rate, and instead of, for example, updating physics
			
 
				+60 times per second, you may update them only 30 or even 20 times per second.
			
 
				+This can greatly reduce the CPU load.
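				+
				+As a sketch, the tick rate could be lowered from a script as follows. The
				+property name ``Engine.iterations_per_second`` is an assumption tied to the
				+Godot version in use (newer versions name it
				+``Engine.physics_ticks_per_second``), and the same value can be changed in the
				+Project Settings instead:
				+
				+::
				+
				+    func _ready():
				+        # Run 30 physics ticks per second instead of the default 60.
				+        Engine.iterations_per_second = 30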
			
 
				+
			
 
				+The downside of changing physics tick rate is you can get jerky movement or
			
 
				+jitter when the physics update rate does not match the frames per second
			
 
				+rendered. Also, decreasing the physics tick rate will increase input lag.
			
 
				+It's recommended to stick to the default physics tick rate (60 Hz) in most games
			
 
				+that feature real-time player movement.
			
 
				+
			
 
				+The solution to jitter is to use *fixed timestep interpolation*, which involves
			
 
				+smoothing the rendered positions and rotations over multiple frames to match the
			
 
				+physics. You can either implement this yourself or use a
			
 
				+`third-party addon <https://github.com/lawnjelly/smoothing-addon>`__.
			
 
				+Performance-wise, interpolation is a very cheap operation compared to running a
			
 
				+physics tick. It's orders of magnitude faster, so this can be a significant
			
 
				+performance win while also reducing jitter.
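				+
				+If you implement this yourself, a minimal sketch might look like the
				+following. It assumes a ``Node2D`` used purely for visuals that follows a
				+separate physics body (the exported ``target_path`` is hypothetical). It also
				+assumes that ``Engine.get_physics_interpolation_fraction()`` and
				+``Vector2.linear_interpolate()`` are available in your Godot version (newer
				+versions name the latter ``lerp()``):
				+
				+::
				+
				+    extends Node2D
				+
				+    # Hypothetical: path to the physics body this visual node follows.
				+    export(NodePath) var target_path
				+
				+    var _target
				+    var _previous_position = Vector2()
				+    var _current_position = Vector2()
				+
				+    func _ready():
				+        _target = get_node(target_path)
				+        _previous_position = _target.global_position
				+        _current_position = _previous_position
				+
				+    func _physics_process(_delta):
				+        # Remember the body's position on the last two physics ticks.
				+        _previous_position = _current_position
				+        _current_position = _target.global_position
				+
				+    func _process(_delta):
				+        # Fraction (0.0 to 1.0) of the way through the current physics tick.
				+        var fraction = Engine.get_physics_interpolation_fraction()
				+        global_position = _previous_position.linear_interpolate(_current_position, fraction)
				+
				+This sketch only smooths position. A complete solution would also interpolate
				+rotation.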
			
--- a/tutorials/optimization/general_optimization.rst
+++ b/tutorials/optimization/general_optimization.rst
@@ -0,0 +1,297 @@
 
				+.. _doc_general_optimization:
			
 
				+
			
 
				+General optimization tips
			
 
				+=========================
			
 
				+
			
 
				+Introduction
			
 
				+~~~~~~~~~~~~
			
 
				+
			
 
				+In an ideal world, computers would run at infinite speed. The only limit to
			
 
				+what we could achieve would be our imagination. However, in the real world, it's
			
 
				+all too easy to produce software that will bring even the fastest computer to
			
 
				+its knees.
			
 
				+
			
 
				+Thus, designing games and other software is a compromise between what we would
			
 
				+like to be possible, and what we can realistically achieve while maintaining
			
 
				+good performance.
			
 
				+
			
 
				+To achieve the best results, we have two approaches:
			
 
				+
			
 
				+- Work faster.
			
 
				+- Work smarter.
			
 
				+
			
 
				+And preferably, we will use a blend of the two.
			
 
				+
			
 
				+Smoke and mirrors
			
 
				+^^^^^^^^^^^^^^^^^
			
 
				+
			
 
				+Part of working smarter is recognizing that, in games, we can often get the
			
 
				+player to believe they're in a world that is far more complex, interactive, and
			
 
				+graphically exciting than it really is. A good programmer is a magician, and
			
 
				+should strive to learn the tricks of the trade while trying to invent new ones.
			
 
				+
			
 
				+The nature of slowness
			
 
				+^^^^^^^^^^^^^^^^^^^^^^
			
 
				+
			
 
				+To the outside observer, performance problems are often lumped together.
			
 
				+But in reality, there are several different kinds of performance problems:
			
 
				+
			
 
				+- A slow process that occurs every frame, leading to a continuously low frame
			
 
				+  rate.
			
 
				+- An intermittent process that causes "spikes" of slowness, leading to
			
 
				+  stalls.
			
 
				+- A slow process that occurs outside of normal gameplay, for instance,
			
 
				+  when loading a level.
			
 
				+
			
 
				+Each of these is annoying to the user, but in different ways.
			
 
				+
			
 
				+Measuring performance
			
 
				+=====================
			
 
				+
			
 
				+Probably the most important tool for optimization is the ability to measure
			
 
				+performance: to identify where bottlenecks are, and to measure the success of
			
 
				+our attempts to speed them up.
			
 
				+
			
 
				+There are several methods of measuring performance, including:
			
 
				+
			
 
				+- Putting a start/stop timer around code of interest.
			
 
				+- Using the Godot profiler.
			
 
				+- Using external third-party CPU profilers.
			
 
				+- Using GPU profilers/debuggers such as
			
 
				+  `NVIDIA Nsight Graphics <https://developer.nvidia.com/nsight-graphics>`__
			
 
				+  or `apitrace <https://apitrace.github.io/>`__.
			
 
				+- Checking the frame rate (with V-Sync disabled).
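				+
				+As a sketch of the last method, a script could disable V-Sync and print the
				+frame rate once per second. The ``OS.vsync_enabled`` property is an assumption
				+about the Godot version in use (newer versions configure V-Sync through
				+``DisplayServer`` instead), and ``elapsed`` is a hypothetical variable name:
				+
				+::
				+
				+    var elapsed = 0.0
				+
				+    func _ready():
				+        # With V-Sync enabled, the frame rate is capped to the monitor
				+        # refresh rate, which hides the real cost of a frame.
				+        OS.vsync_enabled = false
				+
				+    func _process(delta):
				+        elapsed += delta
				+        if elapsed >= 1.0:
				+            elapsed = 0.0
				+            print("FPS: ", Engine.get_frames_per_second())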
			
 
				+
			
 
				+Be very aware that the relative performance of different areas can vary on
			
 
				+different hardware. It's often a good idea to measure timings on more than one
			
 
				+device. This is especially the case if you're targeting mobile devices.
			
 
				+
			
 
				+Limitations
			
 
				+~~~~~~~~~~~
			
 
				+
			
 
				+CPU profilers are often the go-to method for measuring performance. However,
			
 
				+they don't always tell the whole story.
			
 
				+
			
 
				+- Bottlenecks are often on the GPU, "as a result" of instructions given by the
			
 
				+  CPU.
			
 
				+- Spikes can occur in the operating system processes (outside of Godot) "as a
			
 
				+  result" of instructions used in Godot (for example, dynamic memory allocation).
			
 
				+- You may not always be able to profile specific devices like a mobile phone
			
 
				+  due to the initial setup required.
			
 
				+- You may have to solve performance problems that occur on hardware you don't
			
 
				+  have access to.
			
 
				+
			
 
				+As a result of these limitations, you often need to use detective work to find
			
 
				+out where bottlenecks are.
			
 
				+
			
 
				+Detective work
			
 
				+~~~~~~~~~~~~~~
			
 
				+
			
 
				+Detective work is a crucial skill for developers (both in terms of performance,
			
 
				+and also in terms of bug fixing). This can include hypothesis testing, and
			
 
				+binary search.
			
 
				+
			
 
				+Hypothesis testing
			
 
				+^^^^^^^^^^^^^^^^^^
			
 
				+
			
 
				+Say, for example, that you believe sprites are slowing down your game.
			
 
				+You can test this hypothesis by:
			
 
				+
			
 
				+- Measuring the performance when you add more sprites, or take some away.
			
 
				+
			
 
				+This may lead to a further hypothesis: does the size of the sprite determine
			
 
				+the performance drop?
			
 
				+
			
 
				+- You can test this by keeping everything the same, but changing the sprite
			
 
				+  size, and measuring performance.
			
 
				+
			
 
				+Binary search
			
 
				+^^^^^^^^^^^^^
			
 
				+
			
 
				+If you know that frames are taking much longer than they should, but you're
			
 
				+not sure where the bottleneck lies, you could begin by commenting out
			
 
				+approximately half the routines that occur on a normal frame. Has the
			
 
				+performance improved more or less than expected?
			
 
				+
			
 
				+Once you know which of the two halves contains the bottleneck, you can
			
 
				+repeat this process until you've pinned down the problematic area.
			
 
				+
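The halving procedure can be sketched as follows. The routine names and their costs are invented for illustration, and the sketch assumes a single dominant bottleneck; in a real project, ``frame_time_ms`` would be an actual measurement taken with part of the frame disabled.

```python
# Hypothetical per-routine costs in milliseconds (simulated, not measured).
ROUTINE_COST_MS = {
    "input": 0.2,
    "ai": 1.1,
    "pathfinding": 14.0,
    "physics": 2.0,
    "animation": 0.9,
    "audio": 0.3,
    "particles": 1.5,
    "ui": 0.4,
}


def frame_time_ms(enabled):
    """Simulated frame time with only the given routines enabled."""
    return sum(ROUTINE_COST_MS[name] for name in enabled)


def find_bottleneck(routines):
    """Halve the candidate set until one routine remains."""
    candidates = list(routines)
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first, second = candidates[:middle], candidates[middle:]
        # Keep whichever half costs more when run on its own.
        if frame_time_ms(first) >= frame_time_ms(second):
            candidates = first
        else:
            candidates = second
    return candidates[0]


print(find_bottleneck(ROUTINE_COST_MS))
```

With eight routines, three rounds of measurement are enough to isolate the expensive one, instead of testing each routine individually.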
			
 
				+Profilers
			
 
				+=========
			
 
				+
			
 
				+Profilers allow you to time your program while running it. Profilers then
			
 
				+provide results telling you what percentage of time was spent in different
			
 
				+functions and areas, and how often functions were called.
			
 
				+
			
 
				+This can be very useful both to identify bottlenecks and to measure the results
			
 
				+of your improvements. Sometimes, attempts to improve performance can backfire
			
 
				+and lead to slower performance.
			
 
				+**Always use profiling and timing to guide your efforts.**
			
 
				+
			
 
				+For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
			
 
				+
			
 
				+Principles
			
 
				+==========
			
 
				+
			
 
				+`Donald Knuth <https://en.wikipedia.org/wiki/Donald_Knuth>`__ said:
			
 
				+
			
 
				+    *Programmers waste enormous amounts of time thinking about, or worrying
			
 
				+    about, the speed of noncritical parts of their programs, and these attempts
			
 
				+    at efficiency actually have a strong negative impact when debugging and
			
 
				+    maintenance are considered. We should forget about small efficiencies, say
			
 
				+    about 97% of the time: premature optimization is the root of all evil. Yet
			
 
				+    we should not pass up our opportunities in that critical 3%.*
			
 
				+
			
 
				+The messages are very important:
			
 
				+
			
 
				+- Developer time is limited. Instead of blindly trying to speed up
			
 
				+  all aspects of a program, we should concentrate our efforts on the aspects
			
 
				+  that really matter.
			
 
				+- Efforts at optimization often end up with code that is harder to read and
			
 
				+  debug than non-optimized code. It is in our interests to limit this to areas
			
 
				+  that will really benefit.
			
 
				+
			
 
				+Just because we *can* optimize a particular bit of code, it doesn't necessarily
			
 
				+mean that we *should*. Knowing when and when not to optimize is a great skill to
			
 
				+develop.
			
 
				+
			
 
				+One misleading aspect of the quote is that people tend to focus on the subquote
			
 
				+*"premature optimization is the root of all evil"*. While *premature*
			
 
				+optimization is (by definition) undesirable, performant software is the result
			
 
				+of performant design.
			
 
				+
			
 
				+Performant design
			
 
				+~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+The danger with encouraging people to ignore optimization until necessary, is
			
 
				+that it conveniently ignores that the most important time to consider
			
 
				+performance is at the design stage, before a key has even hit a keyboard. If the
			
 
				+design or algorithms of a program are inefficient, then no amount of polishing
			
 
				+the details later will make it run fast. It may run *faster*, but it will never
			
 
				+run as fast as a program designed for performance.
			
 
				+
			
 
				+This tends to be far more important in game or graphics programming than in
			
 
				+general programming. A performant design, even without low-level optimization,
			
 
				+will often run many times faster than a mediocre design with low-level
			
 
				+optimization.
			
 
				+
			
 
				+Incremental design
			
 
				+~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+Of course, in practice, unless you have prior knowledge, you are unlikely to
			
 
				+come up with the best design the first time. Instead, you'll often make a series
			
 
				+of versions of a particular area of code, each taking a different approach to
			
 
				+the problem, until you come to a satisfactory solution. It's important not to
			
 
				+spend too much time on the details at this stage until you have finalized the
			
 
				+overall design. Otherwise, much of your work will be thrown out.
			
 
				+
			
 
				+It's difficult to give general guidelines for performant design because this is
			
 
				+so dependent on the problem. One point worth mentioning though, on the CPU side,
			
 
				+is that modern CPUs are nearly always limited by memory bandwidth. This has led
			
 
				+to a resurgence in data-oriented design, which involves designing data
			
 
				+structures and algorithms for *cache locality* of data and linear access, rather
			
 
				+than jumping around in memory.
			
 
				+
			
 
				+The optimization process
			
 
				+~~~~~~~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+Assuming we have a reasonable design, and taking our lessons from Knuth, our
			
 
				+first step in optimization should be to identify the biggest bottlenecks - the
			
 
				+slowest functions, the low-hanging fruit.
			
 
				+
			
 
				+Once we've successfully improved the speed of the slowest area, it may no
			
 
				+longer be the bottleneck. So we should test/profile again and find the next
			
 
				+bottleneck on which to focus.
			
 
				+
			
 
				+The process is thus:
			
 
				+
			
 
				+1. Profile / Identify bottleneck.
			
 
				+2. Optimize bottleneck.
			
 
				+3. Return to step 1.
			
 
				+
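As a rough sketch, the loop above can be simulated with made-up timings. The category names, the frame budget and the fixed ``speedup`` factor are all assumptions chosen for illustration; the point is that the slowest area changes after each round.

```python
def optimize_until(budget_ms, timings_ms, speedup=2.0, max_rounds=20):
    """Repeat: find the slowest entry, speed it up, measure again."""
    timings = dict(timings_ms)
    history = []
    for _ in range(max_rounds):
        total = sum(timings.values())  # step 1: profile
        if total <= budget_ms:
            break
        slowest = max(timings, key=timings.get)  # step 1: identify bottleneck
        timings[slowest] /= speedup  # step 2: optimize (simulated)
        history.append(slowest)  # step 3: go around again
    return history, timings


history, final = optimize_until(
    budget_ms=16.0,
    timings_ms={"rendering": 12.0, "physics": 8.0, "scripts": 4.0},
)
print(history, final)
```

In this simulation, rendering is optimized first, after which physics becomes the largest cost and is tackled next.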
			
 
				+Optimizing bottlenecks
			
 
				+~~~~~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+Some profilers will even tell you which parts of a function (which data accesses,
			
 
				+calculations) are slowing things down.
			
 
				+
			
 
				+As with design, you should concentrate your efforts first on making sure the
			
 
				+algorithms and data structures are the best they can be. Data access should be
			
 
				+local (to make best use of CPU cache), and it can often be better to use compact
			
 
				+storage of data (again, always profile to test results). Often, you can precalculate
			
 
				+heavy computations ahead of time. This can be done by performing the computation
			
 
				+when loading a level, by loading a file containing precalculated data or simply
			
 
				+by storing the results of complex calculations into a script constant and
			
 
				+reading its value.
			
 
				+
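A minimal sketch of precalculation, written in Python for illustration: a sine lookup table is built once (for example, at load time) and read afterwards. The table size and function name are assumptions, and whether a lookup actually beats recomputing the value depends on the platform, so it should be profiled.

```python
import math

TABLE_SIZE = 360
# Computed once, e.g. when a level loads, instead of every frame.
SIN_TABLE = [math.sin(math.radians(degree)) for degree in range(TABLE_SIZE)]


def fast_sin_degrees(degree):
    """Look up a precalculated sine for a whole number of degrees."""
    return SIN_TABLE[degree % TABLE_SIZE]


print(fast_sin_degrees(90))
```

The trade-off is memory and precision: the table only covers whole degrees, which is acceptable for some effects and not for others.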
			
 
				+Once algorithms and data are good, you can often make small changes in routines
			
 
				+which improve performance. For instance, you can move some calculations outside
			
 
				+of loops or transform nested ``for`` loops into non-nested loops.
			
 
				+(This should be feasible if you know a 2D array's width or height in advance.)
			
 
				+
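Both changes can be illustrated with a small, hypothetical image-brightening routine in Python. The first version recomputes a loop-invariant value for every pixel inside nested loops; the second hoists it out and walks a flat list, which is possible because the width and height are known in advance.

```python
def brighten_nested(pixels, width, height, gain, gamma):
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # gain ** gamma is recomputed for every pixel.
            out[y][x] = pixels[y][x] * (gain ** gamma)
    return out


def brighten_flat(flat_pixels, width, height, gain, gamma):
    factor = gain ** gamma  # loop-invariant, so computed once
    # One loop over a flat list; pixel (x, y) lives at index y * width + x.
    return [flat_pixels[i] * factor for i in range(width * height)]


grid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flat = [value for row in grid for value in row]
nested_result = brighten_nested(grid, 3, 2, gain=2.0, gamma=2.0)
flat_result = brighten_flat(flat, 3, 2, gain=2.0, gamma=2.0)
print(flat_result)
```

Both versions produce the same values; only timing measurements can tell whether the rewrite is worth the change in a given project.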
			
 
				+Always retest your timing/bottlenecks after making each change. Some changes
			
 
				+will increase speed, others may have a negative effect. Sometimes, a small
			
 
				+positive effect will be outweighed by the negatives of more complex code, and
			
 
				+you may choose to leave out that optimization.
			
 
				+
			
 
				+Appendix
			
 
				+========
			
 
				+
			
 
				+Bottleneck math
			
 
				+~~~~~~~~~~~~~~~
			
 
				+
			
 
				+The proverb *"a chain is only as strong as its weakest link"* applies directly to
			
 
				+performance optimization. If your project is spending 90% of the time in
			
 
				+function ``A``, then optimizing ``A`` can have a massive effect on performance.
			
 
				+
			
 
				+.. code-block:: none
			
 
				+
			
 
				+    A: 9 ms
			
 
				+    Everything else: 1 ms
			
 
				+    Total frame time: 10 ms
			
 
				+
			
 
				+.. code-block:: none
			
 
				+
			
 
				+    A: 1 ms
			
 
				+    Everything else: 1 ms
			
 
				+    Total frame time: 2 ms
			
 
				+
			
 
				+In this example, improving this bottleneck ``A`` by a factor of 9× decreases
			
 
				+overall frame time by 5× while increasing frames per second by 5×.
			
 
				+
			
 
				+However, if something else is running slowly and also bottlenecking your
			
 
				+project, then the same improvement can lead to less dramatic gains:
			
 
				+
			
 
				+.. code-block:: none
			
 
				+
			
 
				+    A: 9 ms
			
 
				+    Everything else: 50 ms
			
 
				+    Total frame time: 59 ms
			
 
				+
			
 
				+.. code-block:: none
			
 
				+
			
 
				+    A: 1 ms
			
 
				+    Everything else: 50 ms
			
 
				+    Total frame time: 51 ms
			
 
				+
			
 
				+In this example, even though we have hugely optimized function ``A``,
			
 
				+the actual gain in terms of frame rate is quite small.
			
 
				+
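The arithmetic behind both examples can be checked with a few lines of Python. The function names are illustrative only; the numbers are the ones used above.

```python
def frame_stats(a_ms, rest_ms):
    """Return (frame time in ms, frames per second)."""
    total_ms = a_ms + rest_ms
    return total_ms, 1000.0 / total_ms


def overall_speedup(a_before_ms, a_after_ms, rest_ms):
    """How many times faster the whole frame gets when only A improves."""
    before_ms, _ = frame_stats(a_before_ms, rest_ms)
    after_ms, _ = frame_stats(a_after_ms, rest_ms)
    return before_ms / after_ms


# First example: A dominates the frame.
print(overall_speedup(9.0, 1.0, rest_ms=1.0))
# Second example: A is a small part of a 59 ms frame.
print(round(overall_speedup(9.0, 1.0, rest_ms=50.0), 2))
```

In the second case the frame only becomes about 1.16 times faster, going from roughly 16.9 to 19.6 frames per second, even though ``A`` itself became 9 times faster.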
			
 
				+In games, things become even more complicated because the CPU and GPU run
			
 
				+independently of one another. Your total frame time is determined by the slower
			
 
				+of the two.
			
 
				+
			
 
				+.. code-block:: none
			
 
				+
			
 
				+    CPU: 9 ms
			
 
				+    GPU: 50 ms
			
 
				+    Total frame time: 50 ms
			
 
				+
			
 
				+.. code-block:: none
			
 
				+
			
 
				+    CPU: 1 ms
			
 
				+    GPU: 50 ms
			
 
				+    Total frame time: 50 ms
			
 
				+
			
 
				+In this example, we optimized the CPU hugely again, but the frame time didn't
			
 
				+improve because we are GPU-bottlenecked.
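
The CPU/GPU case can be modeled in the same spirit. This is a deliberately simplified sketch that assumes the CPU and GPU overlap perfectly, so the slower of the two sets the frame time; real pipelines also have synchronization costs that this ignores.

```python
def frame_time_ms(cpu_ms, gpu_ms):
    """Simplified model: the slower processor sets the pace."""
    return max(cpu_ms, gpu_ms)


def bottleneck(cpu_ms, gpu_ms):
    """Name the limiting processor (ties are reported as CPU)."""
    return "GPU" if gpu_ms > cpu_ms else "CPU"


for cpu_ms in (9.0, 1.0):
    print(cpu_ms, frame_time_ms(cpu_ms, 50.0), bottleneck(cpu_ms, 50.0))
```

Both iterations report a 50 ms frame limited by the GPU, which is why the CPU-side optimization has no visible effect here.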
			
--- a/tutorials/optimization/gpu_optimization.rst
+++ b/tutorials/optimization/gpu_optimization.rst
@@ -0,0 +1,280 @@
 
				+.. _doc_gpu_optimization:
			
 
				+
			
 
				+GPU optimization
			
 
				+================
			
 
				+
			
 
				+Introduction
			
 
				+~~~~~~~~~~~~
			
 
				+
			
 
				+The demand for new graphics features and progress almost guarantees that you
			
 
				+will encounter graphics bottlenecks. Some of these can be on the CPU side, for
			
 
				+instance in calculations inside the Godot engine to prepare objects for
			
 
				+rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
			
 
				+sorts instructions to pass to the GPU, and in the transfer of these
			
 
				+instructions. And finally, bottlenecks also occur on the GPU itself.
			
 
				+
			
 
				+Where bottlenecks occur in rendering is highly hardware-specific.
			
 
				+Mobile GPUs in particular may struggle with scenes that run easily on desktop.
			
 
				+
			
 
				+Understanding and investigating GPU bottlenecks is slightly different to the
			
 
				+situation on the CPU. This is because, often, you can only change performance
			
 
				+indirectly by changing the instructions you give to the GPU. Also, it may be
			
 
				+more difficult to take measurements. In many cases, the only way of measuring
			
 
				+performance is by examining changes in the time spent rendering each frame.
			
 
				+
			
 
				+Draw calls, state changes, and APIs
			
 
				+===================================
			
 
				+
			
 
				+.. note:: The following section is not relevant to end-users, but is useful to
			
 
				+          provide background information that is relevant in later sections.
			
 
				+
			
 
				+Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or
			
 
				+Vulkan). The communication and driver activity involved can be quite costly,
			
 
				+especially in OpenGL and OpenGL ES. If we can provide these instructions in a
			
 
				+way that is preferred by the driver and GPU, we can greatly increase
			
 
				+performance.
			
 
				+
			
 
				+Nearly every API command in OpenGL requires a certain amount of validation to
			
 
				+make sure the GPU is in the correct state. Even seemingly simple commands can
			
 
				+lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
			
 
				+reduce these instructions to a bare minimum and group together similar objects
			
 
				+as much as possible so they can be rendered together, or with the minimum number
			
 
				+of these expensive state changes.
			
 
				+
			
 
				+2D batching
			
 
				+~~~~~~~~~~~
			
 
				+
			
 
				+In 2D, the costs of treating each item individually can be prohibitively high -
			
 
				+there can easily be thousands of them on the screen. This is why 2D *batching*
			
 
				+is used. Multiple similar items are grouped together and rendered in a batch,
			
 
				+via a single draw call, rather than making a separate draw call for each item.
			
 
				+In addition, this means state changes, material and texture changes can be kept
			
 
				+to a minimum.
			
 
				+
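The idea can be sketched in a few lines of Python. This is a conceptual illustration only, not Godot's actual batching code: the item names and textures are invented, and only consecutive items are merged so that the 2D draw order is preserved.

```python
from itertools import groupby

# Hypothetical draw list in back-to-front order: (item name, texture).
draw_list = [
    ("grass_1", "tiles.png"),
    ("grass_2", "tiles.png"),
    ("rock_1", "tiles.png"),
    ("player", "hero.png"),
    ("coin_1", "items.png"),
    ("coin_2", "items.png"),
]


def build_batches(items):
    """Merge consecutive items that share a texture into one batch."""
    return [
        (texture, [name for name, _ in group])
        for texture, group in groupby(items, key=lambda item: item[1])
    ]


batches = build_batches(draw_list)
print(f"{len(draw_list)} items -> {len(batches)} draw calls")
```

Six items collapse into three draw calls here. If the player were drawn between two tiles, the run of ``tiles.png`` items would be split, which is one reason texture atlases help batching.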
			
 
				+3D batching
			
 
				+~~~~~~~~~~~
			
 
				+
			
 
				+In 3D, we still aim to minimize draw calls and state changes. However, it can be
			
 
				+more difficult to batch together several objects into a single draw call. 3D
			
 
				+meshes tend to comprise hundreds or thousands of triangles, and combining large
			
 
				+meshes in real-time is prohibitively expensive. The cost of joining them quickly
			
 
				+exceeds any benefits as the number of triangles grows per mesh. A much better
			
 
				+alternative is to **join meshes ahead of time** (static meshes in relation to each
			
 
				+other). This can either be done by artists, or programmatically within Godot.
			
 
				+
			
 
				+There is also a cost to batching together objects in 3D. Several objects
			
 
				+rendered as one cannot be individually culled. An entire city that is off-screen
			
 
				+will still be rendered if it is joined to a single blade of grass that is on
			
 
				+screen. Thus, you should always take objects' location and culling into account
			
 
				+when attempting to batch 3D objects together. Despite this, the benefits of
			
 
				+joining static objects often outweigh other considerations, especially for large
			
 
				+numbers of distant or low-poly objects.
			
 
				+
			
 
				+For more information on 3D specific optimizations, see
			
 
				+:ref:`doc_optimizing_3d_performance`.
			
 
				+
			
 
				+Reuse Shaders and Materials
			
 
				+~~~~~~~~~~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+The Godot renderer is a little different to what is out there. It's designed to
			
 
				+minimize GPU state changes as much as possible. :ref:`StandardMaterial3D
			
 
				+<class_StandardMaterial3D>` does a good job at reusing materials that need similar
			
 
				+shaders. If custom shaders are used, make sure to reuse them as much as
			
 
				+possible. Godot's priorities are:
			
 
				+
			
 
				+-  **Reusing Materials:** The fewer different materials in the
			
 
				+   scene, the faster the rendering will be. If a scene has a huge amount
			
 
				+   of objects (in the hundreds or thousands), try reusing the materials.
			
 
				+   In the worst case, use atlases to decrease the amount of texture changes.
			
 
				+-  **Reusing Shaders:** If materials can't be reused, at least try to
			
 
				+   reuse shaders (or StandardMaterial3Ds with different parameters but the same
			
 
				+   configuration).
			
 
				+
			
 
				+If a scene has, for example, ``20,000`` objects that each use a different
			
 
				+material, rendering will be slow. If the same scene has ``20,000``
			
 
				+objects, but only uses ``100`` materials, rendering will be much faster.
			
 
				+
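To see why fewer materials help, the following Python sketch counts how often the active material changes while drawing a list of opaque objects. The object and material names are hypothetical, and the model ignores everything except material switches.

```python
def count_material_switches(objects):
    """Count how often the bound material changes while drawing in order."""
    switches = 0
    current = None
    for _, material in objects:
        if material != current:
            switches += 1
            current = material
    return switches


# Hypothetical opaque objects: (name, material).
scene = [
    ("crate_1", "wood"), ("lamp_1", "metal"), ("crate_2", "wood"),
    ("lamp_2", "metal"), ("crate_3", "wood"), ("lamp_3", "metal"),
]

unsorted_switches = count_material_switches(scene)
sorted_switches = count_material_switches(sorted(scene, key=lambda obj: obj[1]))
print(unsorted_switches, sorted_switches)
```

Sorting by material reduces six switches to two in this example. With one material per object, sorting cannot help at all, because every draw still requires a switch.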
			
 
				+Pixel cost versus vertex cost
			
 
				+=============================
			
 
				+
			
 
				+You may have heard that the lower the number of polygons in a model, the faster
			
 
				+it will be rendered. This is *really* relative and depends on many factors.
			
 
				+
			
 
				+On a modern PC and console, vertex cost is low. GPUs originally only rendered
			
 
				+triangles. This meant that every frame:
			
 
				+
			
 
				+1. All vertices had to be transformed by the CPU (including clipping).
			
 
				+2. All vertices had to be sent to the GPU memory from the main RAM.
			
 
				+
			
 
				+Nowadays, all this is handled inside the GPU, greatly increasing performance.
			
 
				+3D artists usually have the wrong feeling about polycount performance because 3D
			
 
				+DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory for it to
			
 
				+be edited, reducing actual performance. Game engines rely on the GPU more, so
			
 
				+they can render many triangles much more efficiently.
			
 
				+
			
 
				+On mobile devices, the story is different. PC and console GPUs are
			
 
				+brute-force monsters that can pull as much electricity as they need from
			
 
				+the power grid. Mobile GPUs are limited to a tiny battery, so they need
			
 
				+to be a lot more power efficient.
			
 
				+
			
 
				+To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
			
 
				+when the same pixel on the screen is being rendered more than once. Imagine a
			
 
				+town with several buildings. GPUs don't know what is visible and what is hidden
			
 
				+until they draw it. For example, a house might be drawn and then another house
			
 
				+in front of it (which means rendering happened twice for the same pixel). PC
			
 
				+GPUs normally don't care much about this and just throw more pixel processors at
			
 
				+the hardware to increase performance (which also increases power consumption).
			
 
				+
			
 
				+Using more power is not an option on mobile so mobile devices use a technique
			
 
				+called *tile-based rendering* which divides the screen into a grid. Each cell
			
 
				+keeps the list of triangles drawn to it and sorts them by depth to minimize
			
 
				+*overdraw*. This technique improves performance and reduces power consumption,
			
 
				+but takes a toll on vertex performance. As a result, fewer vertices and
			
 
				+triangles can be processed for drawing.
			
 
				+
			
 
				+Additionally, tile-based rendering struggles when there are small objects with a
			
 
				+lot of geometry within a small portion of the screen. This forces mobile GPUs to
			
 
				+put a lot of strain on a single screen tile, which considerably decreases
			
 
				+performance as all the other cells must wait for it to complete before
			
 
				+displaying the frame.
			
 
				+
			
 
				+To summarize, don't worry too much about vertex count on mobile, but
			
 
				+**avoid concentration of vertices in small parts of the screen**.
			
 
				+If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
			
 
				+a smaller level of detail (LOD) model. Even on desktop GPUs, it's preferable to
			
 
				+avoid having triangles smaller than the size of a pixel on screen.
			
 
				+
			
 
				+Pay attention to the additional vertex processing required when using:
			
 
				+
			
 
				+-  Skinning (skeletal animation)
			
 
				+-  Morphs (shape keys)
			
 
				+-  Vertex-lit objects (common on mobile)
			
 
				+
			
 
				+Pixel/fragment shaders and fill rate
			
 
				+====================================
			
 
				+
			
 
				+In contrast to vertex processing, the costs of fragment (per-pixel) shading have
			
 
				+increased dramatically over the years. Screen resolutions have increased (the
			
 
				+area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA
			
 
				+screen, which is 27× the area), but also the complexity of fragment shaders has
			
 
				+exploded. Physically-based rendering requires complex calculations for each
			
 
				+fragment.
			
 
				+
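The resolution figures can be verified directly. The sketch below also estimates the minimum number of fragments shaded per second, assuming exactly one shaded fragment per pixel per frame (no overdraw) and a 60 FPS target, both of which are simplifying assumptions.

```python
def pixel_count(width, height):
    return width * height


def fragments_per_second(width, height, fps):
    """Lower bound: one shaded fragment per pixel per frame."""
    return pixel_count(width, height) * fps


uhd_4k = pixel_count(3840, 2160)
vga = pixel_count(640, 480)
print(uhd_4k, vga, uhd_4k / vga)
print(fragments_per_second(3840, 2160, 60))
```

At 4K and 60 FPS, that lower bound is already close to half a billion fragment shader invocations per second, before any overdraw is counted.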
			
 
				+You can test whether a project is fill rate-limited quite easily. Turn off
			
 
				+V-Sync to prevent capping the frames per second, then compare the frames per
			
 
				+second when running with a large window, to running with a very small window.
			
 
				+You may also benefit from similarly reducing your shadow map size if using
			
 
				+shadows. Usually, you will find the FPS increases quite a bit using a small
			
 
				+window, which indicates you are to some extent fill rate-limited. On the other
			
 
				+hand, if there is little to no increase in FPS, then your bottleneck lies
			
 
				+elsewhere.
			
 
				+
			
 
				+You can increase performance in a fill rate-limited project by reducing the
			
 
				+amount of work the GPU has to do. You can do this by simplifying the shader
			
 
				+(perhaps turn off expensive options if you are using a :ref:`StandardMaterial3D
			
 
				+<class_StandardMaterial3D>`), or reducing the number and size of textures used.
			
 
				+
			
 
				+**When targeting mobile devices, consider using the simplest possible shaders
			
 
				+you can reasonably afford to use.**
			
 
				+
			
 
				+Reading textures
			
 
				+~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+The other factor in fragment shaders is the cost of reading textures. Reading
			
 
				+textures is an expensive operation, especially when reading from several
			
 
				+textures in a single fragment shader. Also, consider that filtering may slow it
			
 
				+down further (trilinear filtering between mipmaps, and averaging). Reading
			
 
				+textures is also expensive in terms of power usage, which is a big issue on
			
 
				+mobiles.
			
 
				+
			
 
				+**If you use third-party shaders or write your own shaders, try to use
			
 
				+algorithms that require as few texture reads as possible.**
			
 
				+
			
 
				+Texture compression
			
 
				+~~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+By default, Godot compresses textures of 3D models when imported using video RAM
			
 
				+(VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
			
 
				+JPG when stored, but increases performance enormously when drawing large enough
			
 
				+textures.
			
 
				+
			
 
				+This is because the main goal of texture compression is bandwidth reduction
			
 
				+between memory and the GPU.
			
 
				+
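The bandwidth saving can be estimated from bits per pixel. The sketch below compares an uncompressed 2048×2048 RGBA8 texture with two common block-compressed formats. Which format is actually used depends on the platform and import settings, so treat the format list as an example rather than a description of what Godot will pick.

```python
def texture_bytes(width, height, bits_per_pixel):
    """Size of a single mip level, ignoring mipmaps and padding."""
    return width * height * bits_per_pixel // 8


SIZE = 2048
formats = {
    "RGBA8 (uncompressed)": 32,
    "BC3 / DXT5 (with alpha)": 8,
    "BC1 / DXT1 (opaque)": 4,
}
for name, bits in formats.items():
    mib = texture_bytes(SIZE, SIZE, bits) / (1024 * 1024)
    print(f"{name}: {mib:.0f} MiB")
```

The compressed versions take 4 MiB and 2 MiB instead of 16 MiB, and they stay compressed in video memory, so every texture read moves less data.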
			
 
				+In 3D, the shapes of objects depend more on the geometry than the texture, so
			
 
				+compression is generally not noticeable. In 2D, compression depends more on
			
 
				+shapes inside the textures, so the artifacts resulting from 2D compression are
			
 
				+more noticeable.
			
 
				+
			
 
				+As a warning, most Android devices do not support texture compression of
			
 
				+textures with transparency (only opaque), so keep this in mind.
			
 
				+
			
 
				+.. note::
			
 
				+
			
 
				+   Even in 3D, "pixel art" textures should have VRAM compression disabled as it
			
 
				+   will negatively affect their appearance, without improving performance
			
 
				+   significantly due to their low resolution.
			
 
				+
			
 
				+
			
 
				+Post-processing and shadows
			
 
				+~~~~~~~~~~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+Post-processing effects and shadows can also be expensive in terms of fragment
			
 
				+shading activity. Always test the impact of these on different hardware.
			
 
				+
			
 
				+**Reducing the size of shadowmaps can increase performance**, both in terms of
			
 
				+writing and reading the shadowmaps. On top of that, the best way to improve
			
 
				+performance of shadows is to turn shadows off for as many lights and objects as
			
 
				+possible. Smaller or distant OmniLights/SpotLights can often have their shadows
			
 
				+disabled with only a small visual impact.
			
 
				+
			
 
				+Transparency and blending
			
 
				+=========================
			
 
				+
			
 
				+Transparent objects present particular problems for rendering efficiency. Opaque
			
 
				+objects (especially in 3D) can be essentially rendered in any order and the
			
 
				+Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
			
 
				+blended objects are different. In most cases, they cannot rely on the Z-buffer
			
 
				+and must be rendered in "painter's order" (i.e. from back to front) to look
			
 
				+correct.
			
 
				+
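Painter's order amounts to sorting by distance from the camera, farthest first. The following Python sketch uses each object's origin as its position, which is a simplifying assumption; the object names and coordinates are invented.

```python
def squared_distance(a, b):
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


def back_to_front(objects, camera_position):
    """Sort (name, position) pairs so the farthest object is drawn first."""
    return sorted(
        objects,
        key=lambda obj: squared_distance(obj[1], camera_position),
        reverse=True,
    )


camera = (0.0, 0.0, 0.0)
transparent = [
    ("window", (0.0, 0.0, -2.0)),
    ("smoke", (1.0, 0.0, -10.0)),
    ("glass", (0.0, 1.0, -5.0)),
]
print([name for name, _ in back_to_front(transparent, camera)])
```

Because the order depends on the camera position, this sort has to be redone as the view changes, and it prevents grouping the objects by material.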
			
 
				+Transparent objects are also particularly bad for fill rate, because every item
			
 
				+has to be drawn even if other transparent objects will be drawn on top
			
 
				+later on.
			
 
				+
			
 
				+Opaque objects don't have to do this. They can usually take advantage of the
			
 
				+Z-buffer by writing only depth values first, then only running the
			
 
				+fragment shader on the "winning" fragment, the object that is at the front at a
			
 
				+particular pixel.
			
 
				+
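The saving can be illustrated with a toy one-dimensional "screen". This Python sketch counts fragment shader runs with and without a depth-only first pass; it assumes every object has a distinct depth and is a conceptual model, not a description of how any particular GPU or renderer schedules work.

```python
def shade_without_prepass(objects, width):
    """Draw in list order with a depth test; return fragment shader runs."""
    depth = [float("inf")] * width
    shaded = 0
    for start, end, z in objects:
        for x in range(start, end):
            if z < depth[x]:
                depth[x] = z
                shaded += 1  # shaded now, possibly overwritten later
    return shaded


def shade_with_prepass(objects, width):
    """Fill depth first, then shade only the nearest fragment per pixel."""
    depth = [float("inf")] * width
    for start, end, z in objects:  # pass 1: depth only, no shading
        for x in range(start, end):
            depth[x] = min(depth[x], z)
    shaded = 0
    for start, end, z in objects:  # pass 2: shade where depth matches
        for x in range(start, end):
            if z == depth[x]:
                shaded += 1
    return shaded


# One row of 8 pixels; each object is (first pixel, end pixel, depth),
# listed far-to-near, which is the worst case for overdraw.
row = [(0, 8, 30.0), (2, 8, 20.0), (4, 8, 10.0)]
print(shade_without_prepass(row, 8), shade_with_prepass(row, 8))
```

With the depth-only pass, each of the 8 pixels is shaded exactly once, compared with 18 shader runs when drawing far-to-near without it. Blended objects cannot use this trick, because every layer contributes to the final color.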
			
 
				+Transparency is particularly expensive where multiple transparent objects
			
 
				+overlap. It is usually better to use transparent areas as small as possible to
			
 
				+minimize these fill rate requirements, especially on mobile, where fill rate is
			
 
				+very expensive. Indeed, in many situations, rendering more complex opaque
			
 
				+geometry can end up being faster than using transparency to "cheat".
			
 
				+
			
 
				+Multi-platform advice
			
 
				+=====================
			
 
				+
			
 
				+If you are aiming to release on multiple platforms, test *early* and test
			
 
				+*often* on all your platforms, especially mobile. Developing a game on desktop
			
 
				+but attempting to port it to mobile at the last minute is a recipe for disaster.
			
 
				+
			
 
				+In general, you should design your game for the lowest common denominator, then
			
 
				+add optional enhancements for more powerful platforms. For example, you may want
			
 
				+to use the GLES2 backend for both desktop and mobile platforms where you target
			
 
				+both.
			
 
				+
			
 
				+Mobile/tiled renderers
			
 
				+======================
			
 
				+
			
 
				+As described above, GPUs on mobile devices work in dramatically different ways
			
 
				+from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
			
 
				+split up the screen into regular-sized tiles that fit into super fast cache
			
 
				+memory, which reduces the number of read/write operations to the main memory.
			
 
				+
			
 
				+There are some downsides though. Tiled rendering can make certain techniques
			
 
				+much more complicated and expensive to perform. Tiles that rely on the results
			
 
				+of rendering in different tiles or on the results of earlier operations being
			
 
				+preserved can be very slow. Be very careful to test the performance of shaders,
			
 
				+viewport textures and post-processing.
			
--- a/tutorials/optimization/img/godot_profiler.png
+++ b/tutorials/optimization/img/godot_profiler.png
--- a/tutorials/optimization/img/lights_overlap.png
+++ b/tutorials/optimization/img/lights_overlap.png
--- a/tutorials/optimization/img/lights_separate.png
+++ b/tutorials/optimization/img/lights_separate.png
--- a/tutorials/optimization/img/overlap1.png
+++ b/tutorials/optimization/img/overlap1.png
--- a/tutorials/optimization/img/overlap2.png
+++ b/tutorials/optimization/img/overlap2.png
--- a/tutorials/optimization/img/scissoring.png
+++ b/tutorials/optimization/img/scissoring.png
--- a/tutorials/optimization/img/valgrind.png
+++ b/tutorials/optimization/img/valgrind.png
--- a/tutorials/optimization/index.rst
+++ b/tutorials/optimization/index.rst
@@ -1,9 +1,76 @@
 
				 Optimization
			
 
				 =============
			
 
				 
			
 
				+Introduction
			
 
				+------------
			
 
				+
			
 
				+Godot follows a balanced performance philosophy. In the performance world,
			
 
				+there are always trade-offs, which consist of trading speed for usability
			
 
				+and flexibility. Some practical examples of this are:
			
 
				+
			
 
				+-  Rendering large amounts of objects efficiently is easy, but when a
			
 
				+   large scene must be rendered, it can become inefficient. To solve this,
			
 
				+   visibility computation must be added to the rendering. This makes rendering
			
 
				+   less efficient, but at the same time, fewer objects are rendered. Therefore,
			
 
				+   the overall rendering efficiency is improved.
			
 
				+
			
 
				+-  Configuring the properties of every material for every object that
			
 
				+   needs to be rendered is also slow. To solve this, objects are sorted by
			
 
				+   material to reduce the costs. At the same time, sorting has a cost.
			
 
				+
			
 
				+-  In 3D physics, a similar situation happens. The best algorithms to
			
 
				+   handle large amounts of physics objects (such as SAP) are slow at
			
 
				+   insertion/removal of objects and raycasting. Algorithms that allow faster
			
 
				+   insertion and removal, as well as raycasting, will not be able to handle as
			
 
				+   many active objects.
			
 
				+
			
 
				+And there are many more examples of this! Game engines strive to be
			
 
				+general-purpose in nature. Balanced algorithms are always favored over
			
 
				+algorithms that might be fast in some situations and slow in others, or
			
 
				+algorithms that are fast but are more difficult to use.
			
 
				+
			
 
				+Godot is not an exception to this. While it is designed to have backends
			
 
				+swappable for different algorithms, the default backends prioritize balance and
			
 
				+flexibility over performance.
			
 
				+
			
 
				+With this clear, the aim of this tutorial section is to explain how to get the
			
 
				+maximum performance out of Godot. While the tutorials can be read in any order,
			
 
				+it is a good idea to start from :ref:`doc_general_optimization`.
			
 
				+
			
 
				+Common
			
 
				+------
			
 
				+
			
 
				 .. toctree::
			
 
				    :maxdepth: 1
			
 
				-   :name: toc-learn-features-optimization
			
 
				+   :name: toc-learn-features-general-optimization
			
 
				 
			
 
				+   general_optimization
			
 
				    using_servers
			
 
				+
			
 
				+CPU
			
 
				+---
			
 
				+
			
 
				+.. toctree::
			
 
				+   :maxdepth: 1
			
 
				+   :name: toc-learn-features-cpu-optimization
			
 
				+
			
 
				+   cpu_optimization
			
 
				+
			
 
				+GPU
			
 
				+---
			
 
				+
			
 
				+.. toctree::
			
 
				+   :maxdepth: 1
			
 
				+   :name: toc-learn-features-gpu-optimization
			
 
				+
			
 
				+   gpu_optimization
			
 
				    using_multimesh
			
 
				+
			
 
				+3D
			
 
				+--
			
 
				+
			
 
				+.. toctree::
			
 
				+   :maxdepth: 1
			
 
				+   :name: toc-learn-features-3d-optimization
			
 
				+
			
 
				+   optimizing_3d_performance
			
--- a/tutorials/optimization/optimizing_3d_performance.rst
+++ b/tutorials/optimization/optimizing_3d_performance.rst
@@ -0,0 +1,152 @@
 
				+.. meta::
			
 
				+    :keywords: optimization
			
 
				+
			
 
				+.. _doc_optimizing_3d_performance:
			
 
				+
			
 
				+Optimizing 3D performance
			
 
				+=========================
			
 
				+
			
 
				+Culling
			
 
				+=======
			
 
				+
			
 
				+Godot will automatically perform view frustum culling in order to prevent
			
 
				+rendering objects that are outside the viewport. This works well for games that
			
 
				+take place in a small area. However, things can quickly become problematic in
			
 
				+larger levels.
			
 
				+
			
 
				+Occlusion culling
			
 
				+~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+Walking around a town for example, you may only be able to see a few buildings
			
 
				+in the street you are in, as well as the sky and a few birds flying overhead. As
			
 
				+far as a naive renderer is concerned however, you can still see the entire town.
			
 
				+It won't just render the buildings in front of you, it will render the street
			
 
				+behind that, with the people on that street, the buildings behind that. You
			
 
				+quickly end up in situations where you are attempting to render 10× or 100× more
			
 
				+than what is visible.
			
 
				+
			
 
				+Things aren't quite as bad as they seem, because the Z-buffer usually allows the
			
 
				+GPU to only fully shade the objects that are at the front. This is called *depth
			
 
				+prepass* and is enabled by default in Godot when using the GLES3 renderer.
			
 
				+However, unneeded objects are still reducing performance.
			
 
				+
			
 
				+One way we can potentially reduce the amount to be rendered is to take advantage
			
 
				+of occlusion. As of Godot 3.2.2, there is no built-in support for occlusion in
			
 
				+Godot. However, with careful design you can still get many of the advantages.
			
 
				+
			
 
				+For instance, in our city street scenario, you may be able to work out in advance
			
 
				+that you can only see two other streets, ``B`` and ``C``, from street ``A``.
			
 
				+Streets ``D`` to ``Z`` are hidden. In order to take advantage of occlusion, all
			
 
				+you have to do is work out when your viewer is in street ``A`` (perhaps using
			
 
				+Godot Areas), then you can hide the other streets.
			
 
				+
			
 
				+This is a manual version of what is known as a "potentially visible set". It is
			
 
				+a very powerful technique for speeding up rendering. You can also use it to
			
 
				+restrict physics or AI to the local area, and speed these up as well as
			
 
				+rendering.
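
The street example above can be sketched as a hand-authored lookup table. This is a minimal, engine-agnostic sketch in Python rather than Godot code: the street names, ``VISIBLE_FROM``, and both helper functions are hypothetical. In a real project, the current street would come from something like Area signals, and each boolean would be applied to the matching node's visibility.

```python
# Hypothetical hand-authored table: which streets can be seen from each street.
VISIBLE_FROM = {
    "A": {"A", "B", "C"},
    "B": {"A", "B", "D"},
}
ALL_STREETS = {"A", "B", "C", "D", "E"}


def streets_to_show(current_street):
    """Return the streets that should stay visible for the viewer's street.

    Streets missing from the table fall back to showing everything,
    which is slower but never hides something that should be seen.
    """
    return VISIBLE_FROM.get(current_street, ALL_STREETS)


def update_visibility(current_street, visibility):
    """Set visibility[name] to True only for potentially visible streets."""
    visible = streets_to_show(current_street)
    for name in visibility:
        visibility[name] = name in visible
    return visibility
```

The same table could also be used to decide which streets run physics or AI, as described above.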
			
 
				+
			
 
				+.. note::
			
 
				+
			
 
				+    In some cases, you may have to adapt your level design to add more occlusion
			
 
				+    opportunities. For example, you may have to add more walls to prevent the player
			
 
				+    from seeing too far away, which would decrease performance due to the lost
			
 
				+    opportunities for occlusion culling.
			
 
				+
			
 
				+Other occlusion techniques
			
 
				+~~~~~~~~~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+There are other occlusion techniques such as portals, automatic PVS, and
			
 
				+raster-based occlusion culling. Some of these may be available through add-ons
			
 
				+and may be available in core Godot in the future.
			
 
				+
			
 
				+Transparent objects
			
 
				+~~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
			
 
				+<class_Shader>` to improve performance. This, however, cannot be done with
			
 
				+transparent objects. Transparent objects are rendered from back to front to make
			
 
				+blending with what is behind work. As a result,
			
 
				+**try to use as few transparent objects as possible**. If an object has a
			
 
				+small section with transparency, try to make that section a separate surface
			
 
				+with its own material.
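
To illustrate why transparent objects cannot share the material-based sort, here is a simplified, engine-agnostic model in Python. It is a sketch and not Godot's actual renderer: the object dictionaries, the single ``z`` depth value, and ``draw_order`` are all assumptions made for the example.

```python
def draw_order(objects, camera_z):
    """Return object names in a plausible draw order.

    Opaque objects are grouped by material to reduce state changes.
    Transparent objects are ordered farthest-first so that blending
    with whatever lies behind them works, regardless of material.
    """
    opaque = [o for o in objects if not o["transparent"]]
    transparent = [o for o in objects if o["transparent"]]
    # Stable sort: objects sharing a material end up next to each other.
    opaque.sort(key=lambda o: o["material"])
    # Back to front: largest distance from the camera is drawn first.
    transparent.sort(key=lambda o: abs(o["z"] - camera_z), reverse=True)
    return [o["name"] for o in opaque + transparent]


scene = [
    {"name": "wall", "material": "brick", "z": 5.0, "transparent": False},
    {"name": "floor", "material": "stone", "z": 2.0, "transparent": False},
    {"name": "wall2", "material": "brick", "z": 9.0, "transparent": False},
    {"name": "glass_near", "material": "glass", "z": 3.0, "transparent": True},
    {"name": "glass_far", "material": "glass", "z": 8.0, "transparent": True},
]
order = draw_order(scene, camera_z=0.0)
```

In this model, the two brick walls are drawn together, while the glass panes are ordered purely by depth.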
			
 
				+
			
 
				+For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
			
 
				+doc.
			
 
				+
			
 
				+Level of detail (LOD)
			
 
				+=====================
			
 
				+
			
 
				+In some situations, particularly at a distance, it can be a good idea to
			
 
				+**replace complex geometry with simpler versions**. The end user will probably
			
 
				+not be able to see much difference. Consider looking at a large number of trees
			
 
				+in the far distance. There are several strategies for replacing models at
			
 
				+varying distances. You could use lower-poly models, or use transparency to
			
 
				+simulate more complex geometry.
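
One common way to choose between versions is a list of distance thresholds. The following is a minimal sketch in Python, not Godot API: the threshold values and model names are hypothetical and would need tuning per project.

```python
from bisect import bisect_right

# Hypothetical distances (in world units) at which the model is swapped.
LOD_DISTANCES = [20.0, 60.0, 150.0]
# One more entry than thresholds: the last one is used beyond 150 units.
LOD_MODELS = ["tree_high", "tree_medium", "tree_low", "tree_billboard"]


def pick_lod(distance):
    """Return the model name to use at the given distance from the viewer."""
    # bisect_right counts how many thresholds the distance has reached.
    return LOD_MODELS[bisect_right(LOD_DISTANCES, distance)]
```

In practice, you may want some hysteresis around each threshold so that models do not flicker between two levels when the viewer hovers near a boundary.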
			
 
				+
			
 
				+Billboards and imposters
			
 
				+~~~~~~~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+The simplest version of using transparency to deal with LOD is billboards. For
			
 
				+example, you can use a single transparent quad to represent a tree at distance.
			
 
				+This can be very cheap to render unless, of course, there are many trees in
			
 
				+front of each other, in which case transparency may start eating into fill rate
			
 
				+(for more information on fill rate, see :ref:`doc_gpu_optimization`).
			
 
				+
			
 
				+An alternative is to render not just one tree, but a number of trees together as
			
 
				+a group. This can be especially effective if you can see an area but cannot
			
 
				+physically approach it in a game.
			
 
				+
			
 
				+You can make imposters by pre-rendering views of an object at different angles.
			
 
				+Or you can even go one step further, and periodically re-render a view of an
			
 
				+object onto a texture to be used as an imposter. At a distance, you need to move
			
 
				+the viewer a considerable distance for the angle of view to change
			
 
				+significantly. This can be complex to get working, but may be worth it depending
			
 
				+on the type of project you are making.
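
When views are pre-rendered at evenly spaced angles, the view to display can be picked from the direction between the object and the viewer. This is a minimal sketch in Python under that assumption; ``imposter_frame`` and the use of 2D ground-plane positions are hypothetical simplifications.

```python
import math


def imposter_frame(object_pos, viewer_pos, frame_count):
    """Pick which pre-rendered view to show.

    Positions are (x, z) pairs on the ground plane. Frame 0 is assumed
    to be rendered from the +X direction, with frames evenly spaced
    around the object.
    """
    dx = viewer_pos[0] - object_pos[0]
    dz = viewer_pos[1] - object_pos[1]
    angle = math.atan2(dz, dx) % (2.0 * math.pi)  # wrap into 0 .. 2*pi
    step = 2.0 * math.pi / frame_count
    # Round to the nearest frame and wrap the last half-step back to 0.
    return int(round(angle / step)) % frame_count
```

With 8 frames, a viewer 1000 units away can move 50 units sideways and still see the same frame, which is why distant imposters rarely need updating.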
			
 
				+
			
 
				+Use instancing (MultiMesh)
			
 
				+~~~~~~~~~~~~~~~~~~~~~~~~~~
			
 
				+
			
 
				+If several identical objects have to be drawn in the same place or nearby, try
			
 
				+using :ref:`MultiMesh <class_MultiMesh>` instead. MultiMesh allows the drawing
			
 
				+of many thousands of objects at very little performance cost, making it ideal
			
 
				+for flocks, grass, particles, and anything else where you have thousands of
			
 
				+identical objects.
			
 
				+
			
 
				+Also see the :ref:`Using MultiMesh <doc_using_multimesh>` doc.
			
 
				+
			
 
				+Bake lighting
			
 
				+=============
			
 
				+
			
 
				+Lighting objects is one of the most costly rendering operations. Realtime
			
 
				+lighting, shadows (especially with multiple lights), and GI are especially expensive.
			
 
				+They may simply be too much for lower-power mobile devices to handle.
			
 
				+
			
 
				+**Consider using baked lighting**, especially for mobile. This can look fantastic,
			
 
				+but has the downside that it will not be dynamic. Sometimes, this is a trade-off
			
 
				+worth making.
			
 
				+
			
 
				+In general, if several lights need to affect a scene, it's best to use
			
 
				+:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
			
 
				+indirect light bounces.
			
 
				+
			
 
				+Animation and skinning
			
 
				+======================
			
 
				+
			
 
				+Animation and vertex animation such as skinning and morphing can be very
			
 
				+expensive on some platforms. You may need to lower the polycount considerably
			
 
				+for animated models or limit the number of them on screen at any one time.
			
 
				+
			
 
				+Large worlds
			
 
				+============
			
 
				+
			
 
				+If you are making large worlds, there are different considerations than what you
			
 
				+may be familiar with from smaller games.
			
 
				+
			
 
				+Large worlds may need to be built in tiles that can be loaded on demand as you
			
 
				+move around the world. This can prevent memory use from getting out of hand, and
			
 
				+also limit the processing needed to the local area.
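
The bookkeeping for on-demand tiles can be sketched as follows. This is an engine-agnostic Python sketch: ``TILE_SIZE``, ``LOAD_RADIUS``, and both functions are hypothetical, and the actual loading and freeing of scenes is left out.

```python
import math

TILE_SIZE = 256.0  # hypothetical tile edge length in world units
LOAD_RADIUS = 1    # keep this many tiles loaded in each direction


def tiles_around(x, z):
    """Return the set of (column, row) tiles that should be loaded."""
    cx = math.floor(x / TILE_SIZE)
    cz = math.floor(z / TILE_SIZE)
    return {
        (cx + dx, cz + dz)
        for dx in range(-LOAD_RADIUS, LOAD_RADIUS + 1)
        for dz in range(-LOAD_RADIUS, LOAD_RADIUS + 1)
    }


def plan_streaming(loaded, x, z):
    """Return (tiles to load, tiles to unload) for the player position."""
    wanted = tiles_around(x, z)
    return wanted - loaded, loaded - wanted
```

Calling ``plan_streaming`` each time the player crosses a tile boundary keeps both memory use and per-frame processing limited to the local area.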
			
 
				+
			
 
				+There may also be rendering and physics glitches due to floating point error in
			
 
				+large worlds. You may be able to use techniques such as orienting the world
			
 
				+around the player (rather than the other way around), or shifting the origin
			
 
				+periodically to keep things centered around ``Vector3(0, 0, 0)``.
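
The sketch below shows both the problem and the origin-shifting idea in plain Python. It assumes positions are stored as 32-bit floats; ``to_float32``, ``SHIFT_THRESHOLD``, and ``shift_origin`` are hypothetical names used only for illustration.

```python
import struct


def to_float32(value):
    """Round a Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


SHIFT_THRESHOLD = 1000.0  # hypothetical distance that triggers a shift


def shift_origin(player, objects):
    """Re-centre the world on the player once they stray too far.

    player is an (x, y, z) tuple and objects is a list of such tuples.
    Returns the new player position and the new object positions.
    """
    px, py, pz = player
    if max(abs(px), abs(py), abs(pz)) < SHIFT_THRESHOLD:
        return player, objects
    # Subtract the player's position from everything, so relative
    # positions are preserved while absolute values become small again.
    shifted = [(x - px, y - py, z - pz) for (x, y, z) in objects]
    return (0.0, 0.0, 0.0), shifted
```

Here, ``to_float32(100000.001)`` collapses to ``100000.0``, so a millimeter-scale offset is lost far from the origin, while ``to_float32(1.001)`` still differs from ``1.0``.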