Initial port of optimization tutorials to master

clayjohn 4 years ago
parent
commit
55b1a4fb03

+ 0 - 1
tutorials/3d/index.rst

@@ -7,7 +7,6 @@
 
    introduction_to_3d
    using_transforms
-   optimizing_3d_performance
    3d_rendering_limitations
    standard_material_3d
    lights_and_shadows

+ 0 - 192
tutorials/3d/optimizing_3d_performance.rst

@@ -1,192 +0,0 @@
-.. meta::
-    :keywords: optimization
-
-.. _doc_optimizing_3d_performance:
-
-Optimizing 3D performance
-=========================
-
-Introduction
-~~~~~~~~~~~~
-
-Godot follows a balanced performance philosophy. In the performance world,
-there are always trade-offs, which consist of trading speed for
-usability and flexibility. Some practical examples of this are:
-
--  Rendering objects efficiently in high amounts is easy, but when a
-   large scene must be rendered, it can become inefficient. To solve
-   this, visibility computation must be added to the rendering, which
-   makes rendering less efficient, but, at the same time, fewer objects are
-   rendered, so efficiency overall improves.
--  Configuring the properties of every material for every object that
-   needs to be rendered is also slow. To solve this, objects are sorted
-   by material to reduce the costs, but at the same time sorting has a
-   cost.
--  In 3D physics a similar situation happens. The best algorithms to
-   handle large amounts of physics objects (such as SAP) are slow
-   at insertion/removal of objects and ray-casting. Algorithms that
-   allow faster insertion and removal, as well as ray-casting, will not
-   be able to handle as many active objects.
-
-And there are many more examples of this! Game engines strive to be
-general purpose in nature, so balanced algorithms are always favored
-over algorithms that might be fast in some situations and slow in
-others, or algorithms that are fast but make usability more difficult.
-
-Godot is not an exception and, while it is designed to have backends
-swappable for different algorithms, the default ones (or more like, the
-only ones that are there for now) prioritize balance and flexibility
-over performance.
-
-With this clear, the aim of this tutorial is to explain how to get the
-maximum performance out of Godot.
-
-Rendering
-~~~~~~~~~
-
-3D rendering is one of the most difficult areas to get performance from,
-so this section will have a list of tips.
-
-Reuse shaders and materials
----------------------------
-
-The Godot renderer is a little different to what is out there. It's designed
-to minimize GPU state changes as much as possible.
-:ref:`class_StandardMaterial3D`
-does a good job at reusing materials that need similar shaders but, if
-custom shaders are used, make sure to reuse them as much as possible.
-Godot's priorities will be like this:
-
--  **Reusing Materials**: The fewer different materials in the
-   scene, the faster the rendering will be. If a scene has a huge amount
-   of objects (in the hundreds or thousands) try reusing the materials
-   or in the worst case use atlases.
--  **Reusing Shaders**: If materials can't be reused, at least try to
-   re-use shaders (or StandardMaterial3Ds with different parameters but the same
-   configuration).
-
-If a scene has, for example, 20,000 objects, each with a different
-material, rendering will be slow. If the same scene has 20,000 objects
-but only uses 100 materials, rendering will be much faster.
-
-Pixel cost vs vertex cost
--------------------------
-
-It is a common thought that the lower the number of polygons in a model, the
-faster it will be rendered. This is *really* relative and depends on
-many factors.
-
-On a modern PC and console, vertex cost is low. GPUs
-originally only rendered triangles, so all the vertices:
-
-1. Had to be transformed by the CPU (including clipping).
-
-2. Had to be sent to the GPU memory from the main RAM.
-
-Nowadays, all this is handled inside the GPU, so the performance is
-extremely high. 3D artists usually have the wrong feeling about
-polycount performance because 3D DCCs (such as Blender, Max, etc.) need
-to keep geometry in CPU memory in order for it to be edited, reducing
-actual performance. Truth is, a model rendered by a 3D engine is much
-more optimal than how 3D DCCs display them.
-
-On mobile devices, the story is different. PC and Console GPUs are
-brute-force monsters that can pull as much electricity as they need from
-the power grid. Mobile GPUs are limited to a tiny battery, so they need
-to be a lot more power efficient.
-
-To be more efficient, mobile GPUs attempt to avoid *overdraw*. This
-means, the same pixel on the screen being rendered (as in, with lighting
-calculation, etc.) more than once. Imagine a town with several buildings,
-GPUs don't know what is visible and what is hidden until they
-draw it. A house might be drawn and then another house in front of it
-(rendering happened twice for the same pixel!). PC GPUs normally don't
-care much about this and just throw more pixel processors to the
-hardware to increase performance (but this also increases power
-consumption).
-
-On mobile, pulling more power is not an option, so a technique called
-"Tile Based Rendering" is used (almost every mobile hardware uses a
-variant of it), which divides the screen into a grid. Each cell keeps the
-list of triangles drawn to it and sorts them by depth to minimize
-*overdraw*. This technique improves performance and reduces power
-consumption, but takes a toll on vertex performance. As a result, fewer
-vertices and triangles can be processed for drawing.
-
-Generally, this is not so bad, but there is a corner case on mobile that
-must be avoided, which is to have small objects with a lot of geometry
-within a small portion of the screen. This forces mobile GPUs to put a
-lot of strain on a single screen cell, considerably decreasing
-performance (as all the other cells must wait for it to complete in
-order to display the frame).
-
-To make it short, do not worry about vertex count so much on mobile, but
-avoid concentration of vertices in small parts of the screen. If, for
-example, a character, NPC, vehicle, etc. is far away (so it looks tiny),
-use a smaller level of detail (LOD) model instead.
-
-An extra situation where vertex cost must be considered is objects that
-have extra processing per vertex, such as:
-
--  Skinning (skeletal animation)
--  Morphs (shape keys)
--  Vertex Lit Objects (common on mobile)
-
-Texture compression
--------------------
-
-Godot offers to compress textures of 3D models when imported (VRAM
-compression). Video RAM compression is not as efficient in size as PNG
-or JPG when stored, but increases performance enormously when drawing.
-
-This is because the main goal of texture compression is bandwidth
-reduction between memory and the GPU.
-
-In 3D, the shapes of objects depend more on the geometry than the
-texture, so compression is generally not noticeable. In 2D, compression
-depends more on shapes inside the textures, so the artifacts resulting
-from 2D compression are more noticeable.
-
-As a warning, most Android devices do not support texture compression of
-textures with transparency (only opaque), so keep this in mind.
-
-Transparent objects
--------------------
-
-As mentioned before, Godot sorts objects by material and shader to
-improve performance. This, however, can not be done on transparent
-objects. Transparent objects are rendered from back to front to make
-blending with what is behind work. As a result, please try to keep
-transparent objects to a minimum! If an object has a small section with
-transparency, try to make that section a separate material.
-
-Level of detail (LOD)
----------------------
-
-As also mentioned before, using objects with fewer vertices can improve
-performance in some cases. Godot has a simple system to change level
-of detail:
-:ref:`GeometryInstance <class_GeometryInstance>`
-based objects have a visibility range that can be defined. Having
-several GeometryInstance objects in different ranges works as LOD.
-
-Use instancing (MultiMesh)
---------------------------
-
-If several identical objects have to be drawn in the same place or
-nearby, try using :ref:`MultiMesh <class_MultiMesh>`
-instead. MultiMesh allows the drawing of dozens of thousands of objects at
-very little performance cost, making it ideal for flocks, grass,
-particles, etc.
-
-Bake lighting
--------------
-
-Small lights are usually not a performance issue. Shadows cost a little more.
-In general, if several lights need to affect a scene, it's ideal to bake
-it (:ref:`doc_baked_lightmaps`). Baking can also improve the scene quality by
-adding indirect light bounces.
-
-If working on mobile, baking to texture is recommended, since this
-method is even faster.

+ 277 - 0
tutorials/optimization/cpu_optimization.rst

@@ -0,0 +1,277 @@
+.. _doc_cpu_optimization:
+
+CPU optimization
+================
+
+Measuring performance
+=====================
+
+We have to know where the "bottlenecks" are to know how to speed up our program.
+Bottlenecks are the slowest parts of the program that limit the rate at which
+everything can progress. Focusing on bottlenecks allows us to concentrate our
+efforts on optimizing the areas which will give us the greatest speed
+improvement, instead of spending a lot of time optimizing functions that will
+lead to small performance improvements.
+
+For the CPU, the easiest way to identify bottlenecks is to use a profiler.
+
+CPU profilers
+=============
+
+Profilers run alongside your program and take timing measurements to work out
+what proportion of time is spent in each function.
+
+The Godot IDE conveniently has a built-in profiler. It does not run every time
+you start your project: it must be manually started and stopped. This is
+because, like most profilers, recording these timing measurements can
+slow down your project significantly.
+
+After profiling, you can look back at the results for a frame.
+
+.. figure:: img/godot_profiler.png
+   :alt: Screenshot of the Godot profiler
+
+   Results of a profile of one of the demo projects.
+
+.. note:: We can see the cost of built-in processes such as physics and audio,
+          as well as seeing the cost of our own scripting functions at the
+          bottom.
+
+          Time spent waiting for various built-in servers may not be counted in
+          the profilers. This is a known bug.
+
+When a project is running slowly, you will often see an obvious function or
+process taking a lot more time than others. This is your primary bottleneck, and
+you can usually increase speed by optimizing this area.
+
+For more info about using Godot's built-in profiler, see
+:ref:`doc_debugger_panel`.
+
+External profilers
+~~~~~~~~~~~~~~~~~~
+
+Although the Godot IDE profiler is very convenient and useful, sometimes you
+need more power, and the ability to profile the Godot engine source code itself.
+
+You can use a number of third-party profilers to do this, including
+`Valgrind <https://www.valgrind.org/>`__,
+`VerySleepy <http://www.codersnotes.com/sleepy/>`__,
+`HotSpot <https://github.com/KDAB/hotspot>`__,
+`Visual Studio <https://visualstudio.microsoft.com/>`__ and
+`Intel VTune <https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html>`__.
+
+.. note:: You will need to compile Godot from source to use a third-party profiler.
+          This is required to obtain debugging symbols. You can also use a debug
+          build. Note, however, that the results of profiling a debug build will
+          differ from those of a release build, because debug builds are less
+          optimized. Bottlenecks are often in a different place in debug builds,
+          so you should profile release builds whenever possible.
+
+.. figure:: img/valgrind.png
+   :alt: Screenshot of Callgrind
+
+   Example results from Callgrind, which is part of Valgrind.
+
+From the left, Callgrind is listing the percentage of time within a function and
+its children (Inclusive), the percentage of time spent within the function
+itself, excluding child functions (Self), the number of times the function is
+called, the function name, and the file or module.
+
+In this example, we can see nearly all time is spent under the
+``Main::iteration()`` function. This is the master function in the Godot source
+code that is called repeatedly. It causes frames to be drawn, physics ticks to
+be simulated, and nodes and scripts to be updated. A large proportion of the
+time is spent in the functions to render a canvas (66%), because this example
+uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
+outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver).
+This tells us that a large proportion of CPU time is being spent in the
+graphics driver.
+
+This is actually an excellent example because, in an ideal world, only a very
+small proportion of time would be spent in the graphics driver. This is an
+indication that there is a problem with too much communication and work being
+done in the graphics API. This specific profiling led to the development of 2D
+batching, which greatly speeds up 2D rendering by reducing bottlenecks in this
+area.
+
+Manually timing functions
+=========================
+
+Another handy technique, especially once you have identified the bottleneck
+using a profiler, is to manually time the function or area under test.
+The specifics vary depending on the language, but in GDScript, you would do
+the following:
+
+::
+
+    var time_start = OS.get_ticks_usec()
+
+    # Your function you want to time
+    update_enemies()
+
+    var time_end = OS.get_ticks_usec()
+    print("update_enemies() took %d microseconds" % (time_end - time_start))
+
+When manually timing functions, it is usually a good idea to run the function
+many times (1,000 or more times), instead of just once (unless it is a very slow
+function). The reason for doing this is that timers often have limited accuracy.
+Moreover, CPUs will schedule processes in a haphazard manner. Therefore, an
+average over a series of runs is more accurate than a single measurement.
+
+As you attempt to optimize functions, be sure to either repeatedly profile or
+time them as you go. This will give you crucial feedback as to whether the
+optimization is working (or not).
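
The averaging advice above can be sketched outside the engine. The following is a minimal illustration in Python, chosen only so that it runs standalone; inside a project you would use the engine's tick counter as in the snippet above. The names `average_time_usec` and `update_enemies` are hypothetical and exist only for this example.

```python
import time


def average_time_usec(func, runs=1000):
    """Return the mean wall-clock time of func() in microseconds over `runs` calls."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    start_ns = time.perf_counter_ns()
    for _ in range(runs):
        func()
    elapsed_ns = time.perf_counter_ns() - start_ns
    # Dividing by the run count smooths out timer resolution and scheduling noise.
    return elapsed_ns / runs / 1000.0


def update_enemies():
    # Hypothetical stand-in for the function under test.
    return sum(i * i for i in range(200))


if __name__ == "__main__":
    average = average_time_usec(update_enemies)
    print("update_enemies() took %.2f microseconds on average" % average)
```

Note that the measured time includes the small overhead of the loop itself, so this approach suits comparing two versions of a function better than reading absolute numbers.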
+
+Caches
+======
+
+CPU caches are something else to be particularly aware of, especially when
+comparing timing results of two different versions of a function. The results
+can be highly dependent on whether the data is in the CPU cache or not. CPUs
+don't load data directly from the system RAM, even though it's huge in
+comparison to the CPU cache (several gigabytes instead of a few megabytes). This
+is because system RAM is very slow to access. Instead, CPUs load data from a
+smaller, faster bank of memory called cache. Loading data from cache is very
+fast, but every time you try to load a memory address that is not stored in
+cache, the cache must make a trip to main memory and slowly load in some data.
+This delay can result in the CPU sitting around idle for a long time, and is
+referred to as a "cache miss".
+
+This means that the first time you run a function, it may run slowly because the
+data is not in the CPU cache. The second and later times, it may run much faster
+because the data is in the cache. Due to this, always use averages when timing,
+and be aware of the effects of cache.
+
+Understanding caching is also crucial to CPU optimization. If you have an
+algorithm (routine) that loads small bits of data from randomly spread out areas
+of main memory, this can result in a lot of cache misses. Much of the time, the
+CPU will be waiting around for data instead of doing any work. Instead, if you
+can make your data accesses localized, or even better, access memory in a linear
+fashion (like a continuous list), then the cache will work optimally and the CPU
+will be able to work as fast as possible.
+
+Godot usually takes care of such low-level details for you. For example, the
+Server APIs make sure data is optimized for caching already for things like
+rendering and physics. Still, you should be especially aware of caching when
+using :ref:`GDNative <toc-tutorials-gdnative>`.
+
+Languages
+=========
+
+Godot supports a number of different languages, and it is worth bearing in mind
+that there are trade-offs involved. Some languages are designed for ease of use
+at the cost of speed, and others are faster but more difficult to work with.
+
+Built-in engine functions run at the same speed regardless of the scripting
+language you choose. If your project is making a lot of calculations in its own
+code, consider moving those calculations to a faster language.
+
+GDScript
+~~~~~~~~
+
+:ref:`GDScript <toc-learn-scripting-gdscript>` is designed to be easy to use and iterate,
+and is ideal for making many types of games. However, in this language, ease of
+use is considered more important than performance. If you need to make heavy
+calculations, consider moving some of your project to one of the other
+languages.
+
+C#
+~~
+
+:ref:`C# <toc-learn-scripting-C#>` is popular and has first-class support in Godot. It
+offers a good compromise between speed and ease of use. Beware of possible
+garbage collection pauses and leaks that can occur during gameplay, though. A
+common approach to work around issues with garbage collection is to use *object
+pooling*, which is outside the scope of this guide.
+
+Other languages
+~~~~~~~~~~~~~~~
+
+Third parties provide support for several other languages, including `Rust
+<https://github.com/godot-rust/godot-rust>`_ and `JavaScript
+<https://github.com/GodotExplorer/ECMAScript>`_.
+
+C++
+~~~
+
+Godot is written in C++. Using C++ will usually result in the fastest code.
+However, on a practical level, it is the most difficult to deploy to end users'
+machines on different platforms. Options for using C++ include
+:ref:`GDNative <toc-tutorials-gdnative>` and
+:ref:`custom modules <doc_custom_modules_in_c++>`.
+
+Threads
+=======
+
+Consider using threads when making a lot of calculations that can run in
+parallel to each other. Modern CPUs have multiple cores, each one capable of
+doing a limited amount of work. By spreading work over multiple threads, you can
+move further towards peak CPU efficiency.
+
+The disadvantage of threads is that you have to be incredibly careful. As each
+CPU core operates independently, they can end up trying to access the same
+memory at the same time. One thread can be reading from a variable while another
+is writing: this is called a *race condition*. Before you use threads, make sure
+you understand the dangers and how to prevent these race conditions.
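
As a language-neutral illustration of guarding shared data, here is a minimal sketch in Python. It is not Godot code: the `Counter` class and `run_workers` helper are hypothetical names invented for this example. The idea is that the read-modify-write on the shared value happens only while a lock is held, so two threads cannot interleave in the middle of an update.

```python
import threading


class Counter:
    """Hypothetical shared counter whose updates are protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        # Only one thread at a time may run this read-modify-write.
        with self._lock:
            self.value += amount


def run_workers(counter, thread_count=4, increments=10_000):
    """Increment the counter from several threads and return the final value."""

    def work():
        for _ in range(increments):
            counter.add(1)

    threads = [threading.Thread(target=work) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers(Counter()))
```

Locking has its own cost, so the usual design goal is to keep the protected section as short as possible, or to give each thread its own data and merge the results at the end.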
+
+Threads can also make debugging considerably more difficult. The GDScript
+debugger doesn't support setting up breakpoints in threads yet.
+
+For more information on threads, see :ref:`doc_using_multiple_threads`.
+
+SceneTree
+=========
+
+Although Nodes are an incredibly powerful and versatile concept, be aware that
+every node has a cost. Built-in functions such as ``_process()`` and
+``_physics_process()`` propagate through the tree. This housekeeping can reduce
+performance when you have very large numbers of nodes (usually in the thousands).
+
+Each node is handled individually in the Godot renderer. Therefore, using fewer
+nodes that each contain more content can lead to better performance.
+
+One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
+get much better performance by removing nodes from the SceneTree, rather than by
+pausing or hiding them. You don't have to delete a detached node. You can, for
+example, keep a reference to a node, detach it from the scene tree using
+:ref:`Node.remove_child(node) <class_Node_method_remove_child>`, then reattach
+it later using :ref:`Node.add_child(node) <class_Node_method_add_child>`.
+This can be very useful for adding and removing areas from a game, for example.
+
+You can avoid the SceneTree altogether by using Server APIs. For more
+information, see :ref:`doc_using_servers`.
+
+Physics
+=======
+
+In some situations, physics can end up becoming a bottleneck. This is
+particularly the case with complex worlds and large numbers of physics objects.
+
+Here are some techniques to speed up physics:
+
+- Try using simplified versions of your rendered geometry for collision shapes.
+  Often, this won't be noticeable for end users, but can greatly increase
+  performance.
+- Try removing objects from physics when they are out of view / outside the
+  current area, or reusing physics objects (maybe you allow 8 monsters per area,
+  for example, and reuse these).
+
+Another crucial aspect to physics is the physics tick rate. In some games, you
+can greatly reduce the tick rate. For example, instead of updating physics
+60 times per second, you may update it only 30 or even 20 times per second.
+This can greatly reduce the CPU load.
+
+The downside of changing physics tick rate is you can get jerky movement or
+jitter when the physics update rate does not match the frames per second
+rendered. Also, decreasing the physics tick rate will increase input lag.
+It's recommended to stick to the default physics tick rate (60 Hz) in most games
+that feature real-time player movement.
+
+The solution to jitter is to use *fixed timestep interpolation*, which involves
+smoothing the rendered positions and rotations over multiple frames to match the
+physics. You can either implement this yourself or use a
+`third-party addon <https://github.com/lawnjelly/smoothing-addon>`__.
+Performance-wise, interpolation is a very cheap operation compared to running a
+physics tick. It's orders of magnitude faster, so this can be a significant
+performance win while also reducing jitter.
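
The idea behind fixed timestep interpolation can be sketched as follows. This is a minimal, hypothetical illustration in Python using a single scalar position. It does not describe how the linked addon is implemented, and the `InterpolatedPosition` class and its method names are assumptions made for this example.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


class InterpolatedPosition:
    """Keeps the last two physics positions and blends between them for rendering."""

    def __init__(self, position, tick_seconds=1.0 / 20.0):
        self.previous = position
        self.current = position
        self.tick_seconds = tick_seconds
        self.time_since_tick = 0.0

    def physics_tick(self, new_position):
        # Called at the (low) physics rate, for example 20 times per second.
        self.previous = self.current
        self.current = new_position
        self.time_since_tick = 0.0

    def render_position(self, frame_delta):
        # Called every rendered frame; returns a point between the last two ticks.
        self.time_since_tick += frame_delta
        fraction = min(self.time_since_tick / self.tick_seconds, 1.0)
        return lerp(self.previous, self.current, fraction)
```

With this scheme the rendered position always trails the physics state by up to one tick, which is the price paid for smooth motion at a low tick rate.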

+ 297 - 0
tutorials/optimization/general_optimization.rst

@@ -0,0 +1,297 @@
+.. _doc_general_optimization:
+
+General optimization tips
+=========================
+
+Introduction
+~~~~~~~~~~~~
+
+In an ideal world, computers would run at infinite speed. The only limit to
+what we could achieve would be our imagination. However, in the real world, it's
+all too easy to produce software that will bring even the fastest computer to
+its knees.
+
+Thus, designing games and other software is a compromise between what we would
+like to be possible, and what we can realistically achieve while maintaining
+good performance.
+
+To achieve the best results, we have two approaches:
+
+- Work faster.
+- Work smarter.
+
+And preferably, we will use a blend of the two.
+
+Smoke and mirrors
+^^^^^^^^^^^^^^^^^
+
+Part of working smarter is recognizing that, in games, we can often get the
+player to believe they're in a world that is far more complex, interactive, and
+graphically exciting than it really is. A good programmer is a magician, and
+should strive to learn the tricks of the trade while trying to invent new ones.
+
+The nature of slowness
+^^^^^^^^^^^^^^^^^^^^^^
+
+To the outside observer, performance problems are often lumped together.
+But in reality, there are several different kinds of performance problems:
+
+- A slow process that occurs every frame, leading to a continuously low frame
+  rate.
+- An intermittent process that causes "spikes" of slowness, leading to
+  stalls.
+- A slow process that occurs outside of normal gameplay, for instance,
+  when loading a level.
+
+Each of these is annoying to the user, but in different ways.
+
+Measuring performance
+=====================
+
+Probably the most important tool for optimization is the ability to measure
+performance - to identify where bottlenecks are, and to measure the success of
+our attempts to speed them up.
+
+There are several methods of measuring performance, including:
+
+- Putting a start/stop timer around code of interest.
+- Using the Godot profiler.
+- Using external third-party CPU profilers.
+- Using GPU profilers/debuggers such as
+  `NVIDIA Nsight Graphics <https://developer.nvidia.com/nsight-graphics>`__
+  or `apitrace <https://apitrace.github.io/>`__.
+- Checking the frame rate (with V-Sync disabled).
+
+Be very aware that the relative performance of different areas can vary on
+different hardware. It's often a good idea to measure timings on more than one
+device. This is especially the case if you're targeting mobile devices.
+
+Limitations
+~~~~~~~~~~~
+
+CPU profilers are often the go-to method for measuring performance. However,
+they don't always tell the whole story.
+
+- Bottlenecks are often on the GPU, "as a result" of instructions given by the
+  CPU.
+- Spikes can occur in the operating system processes (outside of Godot) "as a
+  result" of instructions used in Godot (for example, dynamic memory allocation).
+- You may not always be able to profile specific devices like a mobile phone
+  due to the initial setup required.
+- You may have to solve performance problems that occur on hardware you don't
+  have access to.
+
+As a result of these limitations, you often need to use detective work to find
+out where bottlenecks are.
+
+Detective work
+~~~~~~~~~~~~~~
+
+Detective work is a crucial skill for developers (both in terms of performance,
+and also in terms of bug fixing). This can include hypothesis testing, and
+binary search.
+
+Hypothesis testing
+^^^^^^^^^^^^^^^^^^
+
+Say, for example, that you believe sprites are slowing down your game.
+You can test this hypothesis by:
+
+- Measuring the performance when you add more sprites, or take some away.
+
+This may lead to a further hypothesis: does the size of the sprite determine
+the performance drop?
+
+- You can test this by keeping everything the same, but changing the sprite
+  size, and measuring performance.
+
+Binary search
+^^^^^^^^^^^^^
+
+If you know that frames are taking much longer than they should, but you're
+not sure where the bottleneck lies, you could begin by commenting out
+approximately half the routines that occur on a normal frame. Has the
+performance improved more or less than expected?
+
+Once you know which of the two halves contains the bottleneck, you can
+repeat this process until you've pinned down the problematic area.
+
+Profilers
+=========
+
+Profilers allow you to time your program while running it. Profilers then
+provide results telling you what percentage of time was spent in different
+functions and areas, and how often functions were called.
+
+This can be very useful both to identify bottlenecks and to measure the results
+of your improvements. Sometimes, attempts to improve performance can backfire
+and lead to slower performance.
+**Always use profiling and timing to guide your efforts.**
+
+For more info about using Godot's built-in profiler, see :ref:`doc_debugger_panel`.
+
+Principles
+==========
+
+`Donald Knuth <https://en.wikipedia.org/wiki/Donald_Knuth>`__ said:
+
+    *Programmers waste enormous amounts of time thinking about, or worrying
+    about, the speed of noncritical parts of their programs, and these attempts
+    at efficiency actually have a strong negative impact when debugging and
+    maintenance are considered. We should forget about small efficiencies, say
+    about 97% of the time: premature optimization is the root of all evil. Yet
+    we should not pass up our opportunities in that critical 3%.*
+
+The messages are very important:
+
+- Developer time is limited. Instead of blindly trying to speed up
+  all aspects of a program, we should concentrate our efforts on the aspects
+  that really matter.
+- Efforts at optimization often end up with code that is harder to read and
+  debug than non-optimized code. It is in our interests to limit this to areas
+  that will really benefit.
+
+Just because we *can* optimize a particular bit of code, it doesn't necessarily
+mean that we *should*. Knowing when and when not to optimize is a great skill to
+develop.
+
+One misleading aspect of the quote is that people tend to focus on the subquote
+*"premature optimization is the root of all evil"*. While *premature*
+optimization is (by definition) undesirable, performant software is the result
+of performant design.
+
+Performant design
+~~~~~~~~~~~~~~~~~
+
+The danger with encouraging people to ignore optimization until necessary is
+that it conveniently ignores that the most important time to consider
+performance is at the design stage, before a key has even hit a keyboard. If the
+design or algorithms of a program are inefficient, then no amount of polishing
+the details later will make it run fast. It may run *faster*, but it will never
+run as fast as a program designed for performance.
+
+This tends to be far more important in game or graphics programming than in
+general programming. A performant design, even without low-level optimization,
+will often run many times faster than a mediocre design with low-level
+optimization.
+
+Incremental design
+~~~~~~~~~~~~~~~~~~
+
+Of course, in practice, unless you have prior knowledge, you are unlikely to
+come up with the best design the first time. Instead, you'll often make a series
+of versions of a particular area of code, each taking a different approach to
+the problem, until you come to a satisfactory solution. It's important not to
+spend too much time on the details at this stage until you have finalized the
+overall design. Otherwise, much of your work will be thrown out.
+
+It's difficult to give general guidelines for performant design because this is
+so dependent on the problem. One point worth mentioning though, on the CPU side,
+is that modern CPUs are nearly always limited by memory bandwidth. This has led
+to a resurgence in data-oriented design, which involves designing data
+structures and algorithms for *cache locality* of data and linear access, rather
+than jumping around in memory.
+
+The optimization process
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Assuming we have a reasonable design, and taking our lessons from Knuth, our
+first step in optimization should be to identify the biggest bottlenecks - the
+slowest functions, the low-hanging fruit.
+
+Once we've successfully improved the speed of the slowest area, it may no
+longer be the bottleneck. So we should test/profile again and find the next
+bottleneck on which to focus.
+
+The process is thus:
+
+1. Profile / Identify bottleneck.
+2. Optimize bottleneck.
+3. Return to step 1.
+
+Optimizing bottlenecks
+~~~~~~~~~~~~~~~~~~~~~~
+
+Some profilers will even tell you which parts of a function (which data accesses,
+calculations) are slowing things down.
+
+As with design, you should concentrate your efforts first on making sure the
+algorithms and data structures are the best they can be. Data access should be
+local (to make best use of CPU cache), and it can often be better to use compact
+storage of data (again, always profile to test results). Often, you can
+precalculate heavy computations ahead of time. This can be done by performing
+the computation when loading a level, by loading a file containing precalculated
+data, or simply by storing the results of complex calculations into a script
+constant and reading its value.
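
As a small illustration of precalculation, the sketch below builds a lookup table once and reads from it afterwards. It is written in Python so that it runs standalone. `SIN_TABLE` and `fast_sin_degrees` are hypothetical names, and the one-degree resolution is an arbitrary assumption for this example.

```python
import math

TABLE_SIZE = 360

# Computed once, for example while a level is loading.
SIN_TABLE = [math.sin(math.radians(degree)) for degree in range(TABLE_SIZE)]


def fast_sin_degrees(degrees):
    """Look up the sine of an angle, truncated to whole degrees."""
    return SIN_TABLE[int(degrees) % TABLE_SIZE]
```

Whether a table like this is actually faster than calling the math function directly depends on the language and hardware, so it should be profiled like any other change.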
+
+Once algorithms and data are good, you can often make small changes in routines
+which improve performance. For instance, you can move some calculations outside
+of loops or transform nested ``for`` loops into non-nested loops.
+(This should be feasible if you know a 2D array's width or height in advance.)
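
The two small changes mentioned above, moving a calculation out of a loop and replacing nested loops with a single loop, can be sketched like this. The example is in Python and assumes the grid is stored as a flat, row-major list with a known width and height; both function names are hypothetical.

```python
def sum_scaled_nested(grid, width, height, scale_a, scale_b):
    """Baseline: nested loops, with the scale recomputed for every cell."""
    total = 0.0
    for y in range(height):
        for x in range(width):
            total += grid[y * width + x] * (scale_a * scale_b)
    return total


def sum_scaled_flat(grid, width, height, scale_a, scale_b):
    """Same result: the invariant is hoisted and the loops are flattened."""
    scale = scale_a * scale_b  # Computed once, outside the loop.
    total = 0.0
    for index in range(width * height):
        total += grid[index] * scale
    return total


if __name__ == "__main__":
    cells = [float(i) for i in range(12)]  # A 4 x 3 grid stored row by row.
    print(sum_scaled_nested(cells, 4, 3, 2.0, 0.5))
    print(sum_scaled_flat(cells, 4, 3, 2.0, 0.5))
```

Both functions visit the same cells in the same order, so the flat version is a drop-in replacement here. As the surrounding text says, time both versions before keeping the change.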
+
+Always retest your timing/bottlenecks after making each change. Some changes
+will increase speed, others may have a negative effect. Sometimes, a small
+positive effect will be outweighed by the negatives of more complex code, and
+you may choose to leave out that optimization.
+
+Appendix
+========
+
+Bottleneck math
+~~~~~~~~~~~~~~~
+
+The proverb *"a chain is only as strong as its weakest link"* applies directly to
+performance optimization. If your project is spending 90% of the time in
+function ``A``, then optimizing ``A`` can have a massive effect on performance.
+
+.. code-block:: none
+
+    A: 9 ms
+    Everything else: 1 ms
+    Total frame time: 10 ms
+
+.. code-block:: none
+
+    A: 1 ms
+    Everything else: 1 ms
+    Total frame time: 2 ms
+
+In this example, making the bottleneck ``A`` 9× faster decreases overall frame
+time by 5× and therefore increases frames per second by 5×.
+
+However, if something else is running slowly and also bottlenecking your
+project, then the same improvement can lead to less dramatic gains:
+
+.. code-block:: none
+
+    A: 9 ms
+    Everything else: 50 ms
+    Total frame time: 59 ms
+
+.. code-block:: none
+
+    A: 1 ms
+    Everything else: 50 ms
+    Total frame time: 51 ms
+
+In this example, even though we have hugely optimized function ``A``, the
+actual gain in terms of frame rate is quite small: frame time only drops from
+59 ms to 51 ms, an improvement of about 14%.
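+
+The arithmetic behind both examples can be checked with a few lines of
+Python. The helper names below are hypothetical, and the model simply adds
+the time spent in ``A`` to the time spent everywhere else.
+
```python
def frames_per_second(frame_time_ms):
    """Convert a frame time in milliseconds to frames per second."""
    return 1000.0 / frame_time_ms


def speedup(before_ms, after_ms):
    """How many times faster the whole frame became."""
    return before_ms / after_ms


# A shrinks from 9 ms to 1 ms in both cases.
first = speedup(9 + 1, 1 + 1)     # A dominates the frame.
second = speedup(9 + 50, 1 + 50)  # A is a small part of the frame.
```
+
+Here ``first`` is 5, while ``second`` is only about 1.16, even though ``A``
+itself became 9× faster in both cases.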
+
+In games, things become even more complicated because the CPU and GPU run
+independently of one another. Your total frame time is determined by the slower
+of the two.
+
+.. code-block:: none
+
+    CPU: 9 ms
+    GPU: 50 ms
+    Total frame time: 50 ms
+
+.. code-block:: none
+
+    CPU: 1 ms
+    GPU: 50 ms
+    Total frame time: 50 ms
+
+In this example, we optimized the CPU hugely again, but the frame time didn't
+improve because we are GPU-bottlenecked.
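+
+A simplified model of this situation, assuming CPU and GPU work overlap
+completely, takes the maximum of the two times rather than their sum. The
+function names are hypothetical and the sketch is in Python for brevity.
+
```python
def frame_time_ms(cpu_ms, gpu_ms):
    """Frame time when CPU and GPU run fully in parallel (simplified model)."""
    return max(cpu_ms, gpu_ms)


def bottleneck(cpu_ms, gpu_ms):
    """Name the side that currently limits the frame time."""
    return "CPU" if cpu_ms > gpu_ms else "GPU"
```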

+ 280 - 0
tutorials/optimization/gpu_optimization.rst

@@ -0,0 +1,280 @@
+.. _doc_gpu_optimization:
+
+GPU optimization
+================
+
+Introduction
+~~~~~~~~~~~~
+
+The demand for new graphics features and progress almost guarantees that you
+will encounter graphics bottlenecks. Some of these can be on the CPU side, for
+instance in calculations inside the Godot engine to prepare objects for
+rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
+sorts instructions to pass to the GPU, and in the transfer of these
+instructions. And finally, bottlenecks also occur on the GPU itself.
+
+Where bottlenecks occur in rendering is highly hardware-specific.
+Mobile GPUs in particular may struggle with scenes that run easily on desktop.
+
+Understanding and investigating GPU bottlenecks is slightly different to the
+situation on the CPU. This is because, often, you can only change performance
+indirectly by changing the instructions you give to the GPU. Also, it may be
+more difficult to take measurements. In many cases, the only way of measuring
+performance is by examining changes in the time spent rendering each frame.
+
+Draw calls, state changes, and APIs
+===================================
+
+.. note:: The following section is not relevant to end-users, but is useful to
+          provide background information that is relevant in later sections.
+
+Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or
+Vulkan). The communication and driver activity involved can be quite costly,
+especially in OpenGL and OpenGL ES. If we can provide these instructions in a
+way that is preferred by the driver and GPU, we can greatly increase
+performance.
+
+Nearly every API command in OpenGL requires a certain amount of validation to
+make sure the GPU is in the correct state. Even seemingly simple commands can
+lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
+reduce these instructions to a bare minimum and group together similar objects
+as much as possible so they can be rendered together, or with the minimum number
+of these expensive state changes.
+
+2D batching
+~~~~~~~~~~~
+
+In 2D, the costs of treating each item individually can be prohibitively high -
+there can easily be thousands of them on the screen. This is why 2D *batching*
+is used. Multiple similar items are grouped together and rendered in a batch,
+via a single draw call, rather than making a separate draw call for each item.
+In addition, this means state changes, material and texture changes can be kept
+to a minimum.
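+
+The effect of batching on the number of draw calls can be sketched with a
+deliberately simplified model. It assumes that consecutive items sharing a
+texture can be merged into one draw call and that draw order must otherwise
+be preserved. Godot's actual batching rules are more involved, and
+``count_draw_calls`` is a hypothetical name.
+
```python
from itertools import groupby


def count_draw_calls(items):
    """Count batches: each run of consecutive items with one texture is a batch."""
    return sum(1 for _ in groupby(items, key=lambda item: item["texture"]))


items = [
    {"name": "coin_1", "texture": "atlas_a"},
    {"name": "coin_2", "texture": "atlas_a"},
    {"name": "enemy", "texture": "atlas_b"},
    {"name": "coin_3", "texture": "atlas_a"},
]
```
+
+With the list above, the model yields three draw calls instead of four. If
+all four items shared one atlas, it would yield a single draw call.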
+
+3D batching
+~~~~~~~~~~~
+
+In 3D, we still aim to minimize draw calls and state changes. However, it can be
+more difficult to batch together several objects into a single draw call. 3D
+meshes tend to comprise hundreds or thousands of triangles, and combining large
+meshes in real time is prohibitively expensive. The cost of joining them quickly
+exceeds any benefits as the number of triangles per mesh grows. A much better
+alternative is to **join meshes ahead of time**, which works for meshes that are
+static in relation to each other. This can be done either by artists or
+programmatically within Godot.
+
+There is also a cost to batching together objects in 3D. Several objects
+rendered as one cannot be individually culled. An entire city that is off-screen
+will still be rendered if it is joined to a single blade of grass that is on
+screen. Thus, you should always take objects' location and culling into account
+when attempting to batch 3D objects together. Despite this, the benefits of
+joining static objects often outweigh other considerations, especially for large
+numbers of distant or low-poly objects.
+
+For more information on 3D specific optimizations, see
+:ref:`doc_optimizing_3d_performance`.
+
+Reuse shaders and materials
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Godot renderer is a little different to what is out there. It's designed to
+minimize GPU state changes as much as possible. :ref:`StandardMaterial3D
+<class_StandardMaterial3D>` does a good job at reusing materials that need similar
+shaders. If custom shaders are used, make sure to reuse them as much as
+possible. Godot's priorities are:
+
+-  **Reusing Materials:** The fewer different materials in the
+   scene, the faster the rendering will be. If a scene has a huge number
+   of objects (in the hundreds or thousands), try reusing the materials.
+   In the worst case, use atlases to decrease the number of texture changes.
+-  **Reusing Shaders:** If materials can't be reused, at least try to
+   re-use shaders (or StandardMaterial3Ds with different parameters but the same
+   configuration).
+
+If a scene has, for example, ``20,000`` objects that each use a different
+material (``20,000`` materials in total), rendering will be slow. If the same
+scene has ``20,000`` objects, but only uses ``100`` materials, rendering will be
+much faster.
+
+Pixel cost versus vertex cost
+=============================
+
+You may have heard that the lower the number of polygons in a model, the faster
+it will be rendered. This is *really* relative and depends on many factors.
+
+On a modern PC and console, vertex cost is low. GPUs originally only rendered
+triangles. This meant that every frame:
+
+1. All vertices had to be transformed by the CPU (including clipping).
+2. All vertices had to be sent to the GPU memory from the main RAM.
+
+Nowadays, all this is handled inside the GPU, greatly increasing performance.
+3D artists often have a mistaken impression of polycount performance because 3D
+DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory for it to
+be edited, reducing actual performance. Game engines rely on the GPU more, so
+they can render many triangles much more efficiently.
+
+On mobile devices, the story is different. PC and console GPUs are
+brute-force monsters that can pull as much electricity as they need from
+the power grid. Mobile GPUs are limited to a tiny battery, so they need
+to be a lot more power efficient.
+
+To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
+when the same pixel on the screen is being rendered more than once. Imagine a
+town with several buildings. GPUs don't know what is visible and what is hidden
+until they draw it. For example, a house might be drawn and then another house
+in front of it (which means rendering happened twice for the same pixel). PC
+GPUs normally don't care much about this and just throw more pixel processors at
+the problem to increase performance (which also increases power consumption).
+
+Using more power is not an option on mobile, so mobile devices use a technique
+called *tile-based rendering* which divides the screen into a grid. Each cell
+keeps the list of triangles drawn to it and sorts them by depth to minimize
+*overdraw*. This technique improves performance and reduces power consumption,
+but takes a toll on vertex performance. As a result, fewer vertices and
+triangles can be processed for drawing.
+
+Additionally, tile-based rendering struggles when there are small objects with a
+lot of geometry within a small portion of the screen. This forces mobile GPUs to
+put a lot of strain on a single screen tile, which considerably decreases
+performance as all the other cells must wait for it to complete before
+displaying the frame.
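+
+To illustrate why concentrated geometry hurts, the sketch below bins
+screen-space bounding boxes into tiles and counts how many land in each
+tile. It is a toy model in Python, not how any particular GPU is
+implemented; ``bin_into_tiles`` and the tile size of 32 pixels are
+assumptions made for the example.
+
```python
def bin_into_tiles(boxes, tile_size):
    """Count the boxes overlapping each tile.

    Each box is (min_x, min_y, max_x, max_y) in pixels, bounds inclusive.
    Returns {(tile_x, tile_y): number_of_boxes}.
    """
    counts = {}
    for min_x, min_y, max_x, max_y in boxes:
        for ty in range(min_y // tile_size, max_y // tile_size + 1):
            for tx in range(min_x // tile_size, max_x // tile_size + 1):
                counts[(tx, ty)] = counts.get((tx, ty), 0) + 1
    return counts


spread_out = [(0, 0, 10, 10), (40, 0, 50, 10), (0, 40, 10, 50)]
clustered = [(0, 0, 10, 10), (2, 2, 12, 12), (4, 4, 14, 14)]

# The busiest tile decides how long the slowest tile takes in this model.
worst_spread = max(bin_into_tiles(spread_out, 32).values())
worst_clustered = max(bin_into_tiles(clustered, 32).values())
```
+
+Both lists contain three boxes, but the clustered case puts all of them in
+one tile, while the spread-out case puts at most one in any tile.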
+
+To summarize, don't worry about vertex count on mobile, but
+**avoid concentration of vertices in small parts of the screen**.
+If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
+a lower level of detail (LOD) model. Even on desktop GPUs, it's preferable to
+avoid having triangles smaller than the size of a pixel on screen.
+
+Pay attention to the additional vertex processing required when using:
+
+-  Skinning (skeletal animation)
+-  Morphs (shape keys)
+-  Vertex-lit objects (common on mobile)
+
+Pixel/fragment shaders and fill rate
+====================================
+
+In contrast to vertex processing, the costs of fragment (per-pixel) shading have
+increased dramatically over the years. Screen resolutions have increased: a 4K
+screen has 8,294,400 pixels, versus 307,200 for an old 640×480 VGA screen, which
+is 27× the area. The complexity of fragment shaders has also exploded.
+Physically-based rendering requires complex calculations for each fragment.
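+
+The figures above can be reproduced with a short calculation, assuming that
+"4K" means 3840×2160 (which matches the 8,294,400 figure). The 60 frames
+per second used at the end is a hypothetical target, and overdraw is
+ignored.
+
```python
def pixel_count(width, height):
    """Number of pixels on a screen of the given resolution."""
    return width * height


uhd = pixel_count(3840, 2160)
vga = pixel_count(640, 480)
area_ratio = uhd / vga

# Minimum fragment shader invocations per second at 60 FPS with no overdraw.
fragments_per_second = uhd * 60
```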
+
+You can test whether a project is fill rate-limited quite easily. Turn off
+V-Sync to prevent capping the frames per second, then compare the frames per
+second when running with a large window, to running with a very small window.
+You may also benefit from similarly reducing your shadow map size if using
+shadows. Usually, you will find the FPS increases quite a bit using a small
+window, which indicates you are to some extent fill rate-limited. On the other
+hand, if there is little to no increase in FPS, then your bottleneck lies
+elsewhere.
+
+You can increase performance in a fill rate-limited project by reducing the
+amount of work the GPU has to do. You can do this by simplifying the shader
+(perhaps turn off expensive options if you are using a :ref:`StandardMaterial3D
+<class_StandardMaterial3D>`), or reducing the number and size of textures used.
+
+**When targeting mobile devices, consider using the simplest possible shaders
+you can reasonably afford to use.**
+
+Reading textures
+~~~~~~~~~~~~~~~~
+
+The other factor in fragment shaders is the cost of reading textures. Reading
+textures is an expensive operation, especially when reading from several
+textures in a single fragment shader. Also, consider that filtering may slow it
+down further (trilinear filtering between mipmaps, and averaging). Reading
+textures is also expensive in terms of power usage, which is a big issue on
+mobile devices.
+
+**If you use third-party shaders or write your own shaders, try to use
+algorithms that require as few texture reads as possible.**
+
+Texture compression
+~~~~~~~~~~~~~~~~~~~
+
+By default, Godot compresses textures of 3D models when imported using video RAM
+(VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
+JPG when stored, but increases performance enormously when drawing large enough
+textures.
+
+This is because the main goal of texture compression is bandwidth reduction
+between memory and the GPU.
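+
+A back-of-the-envelope sketch of the bandwidth argument: the text does not
+say which formats are involved, so the numbers below are assumptions. They
+compare an uncompressed 32 bits-per-pixel RGBA texture with a
+block-compressed format at 4 bits per pixel (the rate of formats such as
+BC1/DXT1), and ignore mipmaps.
+
```python
def texture_bytes(width, height, bits_per_pixel):
    """Size in bytes of a single texture level at the given bit rate."""
    return width * height * bits_per_pixel // 8


uncompressed = texture_bytes(1024, 1024, 32)  # Assumed RGBA8.
compressed = texture_bytes(1024, 1024, 4)     # Assumed 4 bpp block format.
reduction = uncompressed / compressed
```
+
+Under these assumptions, the GPU has to fetch 8× less data to read the same
+number of texels.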
+
+In 3D, the shapes of objects depend more on the geometry than the texture, so
+compression is generally not noticeable. In 2D, compression depends more on
+shapes inside the textures, so the artifacts resulting from 2D compression are
+more noticeable.
+
+As a warning, most Android devices do not support texture compression of
+textures with transparency (only opaque), so keep this in mind.
+
+.. note::
+
+   Even in 3D, "pixel art" textures should have VRAM compression disabled as it
+   will negatively affect their appearance, without improving performance
+   significantly due to their low resolution.
+
+
+Post-processing and shadows
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Post-processing effects and shadows can also be expensive in terms of fragment
+shading activity. Always test the impact of these on different hardware.
+
+**Reducing the size of shadow maps can increase performance**, both in terms of
+writing and reading the shadow maps. On top of that, the best way to improve
+performance of shadows is to turn shadows off for as many lights and objects as
+possible. Smaller or distant OmniLights/SpotLights can often have their shadows
+disabled with only a small visual impact.
+
+Transparency and blending
+=========================
+
+Transparent objects present particular problems for rendering efficiency. Opaque
+objects (especially in 3D) can be essentially rendered in any order and the
+Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
+blended objects are different. In most cases, they cannot rely on the Z-buffer
+and must be rendered in "painter's order" (i.e. from back to front) to look
+correct.
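+
+Painter's order amounts to sorting by distance from the camera, farthest
+first. The sketch below is a minimal Python illustration with a hypothetical
+``painters_order`` helper. It sorts by each object's origin, which is a
+simplification that can still give wrong results for large or intersecting
+objects.
+
```python
def painters_order(objects, camera_position):
    """Sort (name, position) pairs from farthest to nearest the camera."""
    def distance_squared(obj):
        _, position = obj
        return sum((p - c) ** 2 for p, c in zip(position, camera_position))

    return sorted(objects, key=distance_squared, reverse=True)


windows = [
    ("near", (0.0, 0.0, -1.0)),
    ("far", (0.0, 0.0, -10.0)),
    ("mid", (0.0, 0.0, -5.0)),
]
draw_order = [name for name, _ in painters_order(windows, (0.0, 0.0, 0.0))]
```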
+
+Transparent objects are also particularly bad for fill rate, because every item
+has to be drawn even if other transparent objects will be drawn on top
+later on.
+
+Opaque objects don't have to do this. They can usually take advantage of the
+Z-buffer by first writing only to the Z-buffer, then running the fragment shader
+only on the "winning" fragment: the one belonging to the object that is at the
+front at a particular pixel.
+
+Transparency is particularly expensive where multiple transparent objects
+overlap. It is usually better to use transparent areas as small as possible to
+minimize these fill rate requirements, especially on mobile, where fill rate is
+very expensive. Indeed, in many situations, rendering more complex opaque
+geometry can end up being faster than using transparency to "cheat".
+
+Multi-platform advice
+=====================
+
+If you are aiming to release on multiple platforms, test *early* and test
+*often* on all your platforms, especially mobile. Developing a game on desktop
+but attempting to port it to mobile at the last minute is a recipe for disaster.
+
+In general, you should design your game for the lowest common denominator, then
+add optional enhancements for more powerful platforms. For example, you may want
+to use the GLES2 backend for both desktop and mobile platforms where you target
+both.
+
+Mobile/tiled renderers
+======================
+
+As described above, GPUs on mobile devices work in dramatically different ways
+from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
+split up the screen into regular-sized tiles that fit into super fast cache
+memory, which reduces the number of read/write operations to the main memory.
+
+There are some downsides though. Tiled rendering can make certain techniques
+much more complicated and expensive to perform. Tiles that rely on the results
+of rendering in different tiles or on the results of earlier operations being
+preserved can be very slow. Be very careful to test the performance of shaders,
+viewport textures and post-processing.

BIN
tutorials/optimization/img/godot_profiler.png


BIN
tutorials/optimization/img/lights_overlap.png


BIN
tutorials/optimization/img/lights_separate.png


BIN
tutorials/optimization/img/overlap1.png


BIN
tutorials/optimization/img/overlap2.png


BIN
tutorials/optimization/img/scissoring.png


BIN
tutorials/optimization/img/valgrind.png


+ 68 - 1
tutorials/optimization/index.rst

@@ -1,9 +1,76 @@
 Optimization
 =============
 
+Introduction
+------------
+
+Godot follows a balanced performance philosophy. In the performance world,
+there are always trade-offs, which consist of trading speed for usability
+and flexibility. Some practical examples of this are:
+
+-  Rendering large amounts of objects efficiently is easy, but when a
+   large scene must be rendered, it can become inefficient. To solve this,
+   visibility computation must be added to the rendering. This makes rendering
+   less efficient, but at the same time, fewer objects are rendered. Therefore,
+   the overall rendering efficiency is improved.
+
+-  Configuring the properties of every material for every object that
+   needs to be rendered is also slow. To solve this, objects are sorted by
+   material to reduce the costs. At the same time, sorting has a cost.
+
+-  In 3D physics, a similar situation happens. The best algorithms to
+   handle large amounts of physics objects (such as SAP) are slow at
+   insertion/removal of objects and raycasting. Algorithms that allow faster
+   insertion and removal, as well as raycasting, will not be able to handle as
+   many active objects.
+
+And there are many more examples of this! Game engines strive to be
+general-purpose in nature. Balanced algorithms are always favored over
+algorithms that might be fast in some situations and slow in others, or
+algorithms that are fast but are more difficult to use.
+
+Godot is not an exception to this. While it is designed to have backends
+swappable for different algorithms, the default backends prioritize balance and
+flexibility over performance.
+
+With this clear, the aim of this tutorial section is to explain how to get the
+maximum performance out of Godot. While the tutorials can be read in any order,
+it is a good idea to start from :ref:`doc_general_optimization`.
+
+Common
+------
+
 .. toctree::
    :maxdepth: 1
-   :name: toc-learn-features-optimization
+   :name: toc-learn-features-general-optimization
 
+   general_optimization
    using_servers
+
+CPU
+---
+
+.. toctree::
+   :maxdepth: 1
+   :name: toc-learn-features-cpu-optimization
+
+   cpu_optimization
+
+GPU
+---
+
+.. toctree::
+   :maxdepth: 1
+   :name: toc-learn-features-gpu-optimization
+
+   gpu_optimization
    using_multimesh
+
+3D
+--
+
+.. toctree::
+   :maxdepth: 1
+   :name: toc-learn-features-3d-optimization
+
+   optimizing_3d_performance

+ 152 - 0
tutorials/optimization/optimizing_3d_performance.rst

@@ -0,0 +1,152 @@
+.. meta::
+    :keywords: optimization
+
+.. _doc_optimizing_3d_performance:
+
+Optimizing 3D performance
+=========================
+
+Culling
+=======
+
+Godot will automatically perform view frustum culling in order to prevent
+rendering objects that are outside the viewport. This works well for games that
+take place in a small area. However, things can quickly become problematic in
+larger levels.
+
+Occlusion culling
+~~~~~~~~~~~~~~~~~
+
+Walking around a town, for example, you may only be able to see a few buildings
+in the street you are in, as well as the sky and a few birds flying overhead. As
+far as a naive renderer is concerned, however, you can still see the entire town.
+It won't just render the buildings in front of you; it will also render the street
+behind that, with the people on that street, and the buildings behind that. You
+quickly end up in situations where you are attempting to render 10× or 100× more
+than what is visible.
+
+Things aren't quite as bad as they seem, because the Z-buffer usually allows the
+GPU to only fully shade the objects that are at the front. This is called *depth
+prepass* and is enabled by default in Godot when using the GLES3 renderer.
+However, unneeded objects are still reducing performance.
+
+One way we can potentially reduce the amount to be rendered is to take advantage
+of occlusion. As of Godot 3.2.2, there is no built-in support for occlusion in
+Godot. However, with careful design you can still get many of the advantages.
+
+For instance, in our city street scenario, you may be able to work out in advance
+that you can only see two other streets, ``B`` and ``C``, from street ``A``.
+Streets ``D`` to ``Z`` are hidden. In order to take advantage of occlusion, all
+you have to do is work out when your viewer is in street ``A`` (perhaps using
+Godot Areas), then you can hide the other streets.
+
+This is a manual version of what is known as a "potentially visible set". It is
+a very powerful technique for speeding up rendering. You can also use it to
+restrict physics or AI to the local area, and speed these up as well as
+rendering.
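+
+The street example can be sketched as a hand-authored lookup table. This is
+plain Python rather than Godot code, and ``VISIBLE_FROM`` and
+``visibility_for`` are hypothetical names. Detecting which street the viewer
+is in (for instance with Areas) is assumed to happen elsewhere.
+
```python
# Hand-authored table: which streets can be seen from each street.
VISIBLE_FROM = {
    "A": {"A", "B", "C"},
    "B": {"A", "B", "D"},
}


def visibility_for(current_street, all_streets):
    """Return {street: should_be_visible} for the viewer's current street."""
    # Streets missing from the table fall back to showing everything,
    # which is slower but never hides something that should be seen.
    visible = VISIBLE_FROM.get(current_street, set(all_streets))
    return {street: street in visible for street in all_streets}
```
+
+In Godot, the resulting flags would then be applied to each street's root
+node, for example through its ``visible`` property.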
+
+.. note::
+
+    In some cases, you may have to adapt your level design to add more occlusion
+    opportunities. For example, you may have to add more walls to prevent the player
+    from seeing too far away, which would decrease performance due to the lost
+    opportunities for occlusion culling.
+
+Other occlusion techniques
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are other occlusion techniques such as portals, automatic PVS, and
+raster-based occlusion culling. Some of these may be available through add-ons
+and may be available in core Godot in the future.
+
+Transparent objects
+~~~~~~~~~~~~~~~~~~~
+
+Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
+<class_Shader>` to improve performance. This, however, cannot be done with
+transparent objects. Transparent objects are rendered from back to front to make
+blending with what is behind work. As a result,
+**try to use as few transparent objects as possible**. If an object has a
+small section with transparency, try to make that section a separate surface
+with its own material.
+
+For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
+doc.
+
+Level of detail (LOD)
+=====================
+
+In some situations, particularly at a distance, it can be a good idea to
+**replace complex geometry with simpler versions**. The end user will probably
+not be able to see much difference. Consider looking at a large number of trees
+in the far distance. There are several strategies for replacing models at
+varying distance. You could use lower poly models, or use transparency to
+simulate more complex geometry.
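+
+Choosing which model to show usually comes down to comparing the distance to
+the viewer against a list of thresholds. The sketch below is a minimal
+Python illustration. The switch distances in ``LOD_DISTANCES`` are made-up
+values and ``select_lod`` is a hypothetical helper.
+
```python
from bisect import bisect_right

# Hypothetical switch distances in world units, in ascending order.
LOD_DISTANCES = [10.0, 50.0, 200.0]


def select_lod(distance, thresholds=LOD_DISTANCES):
    """Return 0 for the most detailed model, higher values for simpler ones."""
    return bisect_right(thresholds, distance)
```
+
+With three thresholds there are four levels: index 0 is the full-detail
+model, and index 3 could be a billboard or nothing at all.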
+
+Billboards and imposters
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The simplest version of using transparency to deal with LOD is billboards. For
+example, you can use a single transparent quad to represent a tree at distance.
+This can be very cheap to render, unless, of course, there are many trees in
+front of each other, in which case transparency may start eating into fill rate
+(for more information on fill rate, see :ref:`doc_gpu_optimization`).
+
+An alternative is to render not just one tree, but a number of trees together as
+a group. This can be especially effective if you can see an area but cannot
+physically approach it in a game.
+
+You can make imposters by pre-rendering views of an object at different angles.
+Or you can even go one step further, and periodically re-render a view of an
+object onto a texture to be used as an imposter. At a distance, you need to move
+the viewer a considerable distance for the angle of view to change
+significantly. This can be complex to get working, but may be worth it depending
+on the type of project you are making.
+
+Use instancing (MultiMesh)
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If several identical objects have to be drawn in the same place or nearby, try
+using :ref:`MultiMesh <class_MultiMesh>` instead. MultiMesh allows the drawing
+of many thousands of objects at very little performance cost, making it ideal
+for flocks, grass, particles, and anything else where you have thousands of
+identical objects.
+
+Also see the :ref:`Using MultiMesh <doc_using_multimesh>` doc.
+
+Bake lighting
+=============
+
+Lighting objects is one of the most costly rendering operations. Realtime
+lighting, shadows (especially multiple lights), and GI are especially expensive.
+They may simply be too much for lower-power mobile devices to handle.
+
+**Consider using baked lighting**, especially for mobile. This can look fantastic,
+but has the downside that it will not be dynamic. Sometimes, this is a trade-off
+worth making.
+
+In general, if several lights need to affect a scene, it's best to use
+:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
+indirect light bounces.
+
+Animation and skinning
+======================
+
+Animation and vertex animation such as skinning and morphing can be very
+expensive on some platforms. You may need to lower the polycount considerably
+for animated models or limit the number of them on screen at any one time.
+
+Large worlds
+============
+
+If you are making large worlds, there are different considerations than what you
+may be familiar with from smaller games.
+
+Large worlds may need to be built in tiles that can be loaded on demand as you
+move around the world. This can prevent memory use from getting out of hand, and
+also limit the processing needed to the local area.
+
+There may also be rendering and physics glitches due to floating point error in
+large worlds. You may be able to use techniques such as orienting the world
+around the player (rather than the other way around), or shifting the origin
+periodically to keep things centred around ``Vector3(0, 0, 0)``.
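+
+The sketch below illustrates both the problem and the origin-shifting
+remedy, assuming positions are stored as 32-bit floats. It is written in
+Python, whose own floats are 64-bit, so ``to_float32`` emulates the rounding.
+Both helper names are hypothetical.
+
```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def shift_origin(positions, new_origin):
    """Re-express every position relative to new_origin."""
    return [
        tuple(coord - origin for coord, origin in zip(position, new_origin))
        for position in positions
    ]


# The same small offset is stored far less accurately far from the origin.
far_error = abs(to_float32(100000.01) - 100000.01)
near_error = abs(to_float32(0.01) - 0.01)
```
+
+Shifting the origin to the player keeps the coordinates that matter most
+small, which is where 32-bit floats are most precise.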