
Merge pull request #3777 from clayjohn/3.2-optimizations

Overhaul optimization tutorials
Rémi Verschelde 5 years ago
parent
commit
693d2586cd

+ 0 - 1
tutorials/3d/index.rst

@@ -7,7 +7,6 @@
 
    introduction_to_3d
    using_transforms
-   optimizing_3d_performance
    3d_rendering_limitations
    spatial_material
    lights_and_shadows

+ 0 - 192
tutorials/3d/optimizing_3d_performance.rst

@@ -1,192 +0,0 @@
-.. meta::
-    :keywords: optimization
-
-.. _doc_optimizing_3d_performance:
-
-Optimizing 3D performance
-=========================
-
-Introduction
-~~~~~~~~~~~~
-
-Godot follows a balanced performance philosophy. In the performance world,
-there are always trade-offs, which consist of trading speed for
-usability and flexibility. Some practical examples of this are:
-
--  Rendering objects efficiently in high amounts is easy, but when a
-   large scene must be rendered, it can become inefficient. To solve
-   this, visibility computation must be added to the rendering, which
-   makes rendering less efficient, but, at the same time, fewer objects are
-   rendered, so efficiency overall improves.
--  Configuring the properties of every material for every object that
-   needs to be rendered is also slow. To solve this, objects are sorted
-   by material to reduce the costs, but at the same time sorting has a
-   cost.
--  In 3D physics a similar situation happens. The best algorithms to
-   handle large amounts of physics objects (such as SAP) are slow
-   at insertion/removal of objects and ray-casting. Algorithms that
-   allow faster insertion and removal, as well as ray-casting, will not
-   be able to handle as many active objects.
-
-And there are many more examples of this! Game engines strive to be
-general purpose in nature, so balanced algorithms are always favored
-over algorithms that might be fast in some situations and slow in
-others.. or algorithms that are fast but make usability more difficult.
-
-Godot is not an exception and, while it is designed to have backends
-swappable for different algorithms, the default ones (or more like, the
-only ones that are there for now) prioritize balance and flexibility
-over performance.
-
-With this clear, the aim of this tutorial is to explain how to get the
-maximum performance out of Godot.
-
-Rendering
-~~~~~~~~~
-
-3D rendering is one of the most difficult areas to get performance from,
-so this section will have a list of tips.
-
-Reuse shaders and materials
----------------------------
-
-The Godot renderer is a little different to what is out there. It's designed
-to minimize GPU state changes as much as possible.
-:ref:`class_SpatialMaterial`
-does a good job at reusing materials that need similar shaders but, if
-custom shaders are used, make sure to reuse them as much as possible.
-Godot's priorities will be like this:
-
--  **Reusing Materials**: The fewer different materials in the
-   scene, the faster the rendering will be. If a scene has a huge amount
-   of objects (in the hundreds or thousands) try reusing the materials
-   or in the worst case use atlases.
--  **Reusing Shaders**: If materials can't be reused, at least try to
-   re-use shaders (or SpatialMaterials with different parameters but the same
-   configuration).
-
-If a scene has, for example, 20.000 objects with 20.000 different
-materials each, rendering will be slow. If the same scene has
-20.000 objects, but only uses 100 materials, rendering will be blazingly
-fast.
-
-Pixel cost vs vertex cost
--------------------------
-
-It is a common thought that the lower the number of polygons in a model, the
-faster it will be rendered. This is *really* relative and depends on
-many factors.
-
-On a modern PC and console, vertex cost is low. GPUs
-originally only rendered triangles, so all the vertices:
-
-1. Had to be transformed by the CPU (including clipping).
-
-2. Had to be sent to the GPU memory from the main RAM.
-
-Nowadays, all this is handled inside the GPU, so the performance is
-extremely high. 3D artists usually have the wrong feeling about
-polycount performance because 3D DCCs (such as Blender, Max, etc.) need
-to keep geometry in CPU memory in order for it to be edited, reducing
-actual performance. Truth is, a model rendered by a 3D engine is much
-more optimal than how 3D DCCs display them.
-
-On mobile devices, the story is different. PC and Console GPUs are
-brute-force monsters that can pull as much electricity as they need from
-the power grid. Mobile GPUs are limited to a tiny battery, so they need
-to be a lot more power efficient.
-
-To be more efficient, mobile GPUs attempt to avoid *overdraw*. This
-means, the same pixel on the screen being rendered (as in, with lighting
-calculation, etc.) more than once. Imagine a town with several buildings,
-GPUs don't know what is visible and what is hidden until they
-draw it. A house might be drawn and then another house in front of it
-(rendering happened twice for the same pixel!). PC GPUs normally don't
-care much about this and just throw more pixel processors to the
-hardware to increase performance (but this also increases power
-consumption).
-
-On mobile, pulling more power is not an option, so a technique called
-"Tile Based Rendering" is used (almost every mobile hardware uses a
-variant of it), which divides the screen into a grid. Each cell keeps the
-list of triangles drawn to it and sorts them by depth to minimize
-*overdraw*. This technique improves performance and reduces power
-consumption, but takes a toll on vertex performance. As a result, fewer
-vertices and triangles can be processed for drawing.
-
-Generally, this is not so bad, but there is a corner case on mobile that
-must be avoided, which is to have small objects with a lot of geometry
-within a small portion of the screen. This forces mobile GPUs to put a
-lot of strain on a single screen cell, considerably decreasing
-performance (as all the other cells must wait for it to complete in
-order to display the frame).
-
-To make it short, do not worry about vertex count so much on mobile, but
-avoid concentration of vertices in small parts of the screen. If, for
-example, a character, NPC, vehicle, etc. is far away (so it looks tiny),
-use a smaller level of detail (LOD) model instead.
-
-An extra situation where vertex cost must be considered is objects that
-have extra processing per vertex, such as:
-
--  Skinning (skeletal animation)
--  Morphs (shape keys)
--  Vertex Lit Objects (common on mobile)
-
-Texture compression
--------------------
-
-Godot offers to compress textures of 3D models when imported (VRAM
-compression). Video RAM compression is not as efficient in size as PNG
-or JPG when stored, but increases performance enormously when drawing.
-
-This is because the main goal of texture compression is bandwidth
-reduction between memory and the GPU.
-
-In 3D, the shapes of objects depend more on the geometry than the
-texture, so compression is generally not noticeable. In 2D, compression
-depends more on shapes inside the textures, so the artifacts resulting
-from 2D compression are more noticeable.
-
-As a warning, most Android devices do not support texture compression of
-textures with transparency (only opaque), so keep this in mind.
-
-Transparent objects
--------------------
-
-As mentioned before, Godot sorts objects by material and shader to
-improve performance. This, however, can not be done on transparent
-objects. Transparent objects are rendered from back to front to make
-blending with what is behind work. As a result, please try to keep
-transparent objects to a minimum! If an object has a small section with
-transparency, try to make that section a separate material.
-
-Level of detail (LOD)
----------------------
-
-As also mentioned before, using objects with fewer vertices can improve
-performance in some cases. Godot has a simple system to change level
-of detail,
-:ref:`GeometryInstance <class_GeometryInstance>`
-based objects have a visibility range that can be defined. Having
-several GeometryInstance objects in different ranges works as LOD.
-
-Use instancing (MultiMesh)
---------------------------
-
-If several identical objects have to be drawn in the same place or
-nearby, try using :ref:`MultiMesh <class_MultiMesh>`
-instead. MultiMesh allows the drawing of dozens of thousands of objects at
-very little performance cost, making it ideal for flocks, grass,
-particles, etc.
-
-Bake lighting
--------------
-
-Small lights are usually not a performance issue. Shadows a little more.
-In general, if several lights need to affect a scene, it's ideal to bake
-it (:ref:`doc_baked_lightmaps`). Baking can also improve the scene quality by
-adding indirect light bounces.
-
-If working on mobile, baking to texture is recommended, since this
-method is even faster.

+ 549 - 0
tutorials/optimization/batching.rst

@@ -0,0 +1,549 @@
+.. _doc_batching:
+
+Optimization using batching
+===========================
+
+Introduction
+~~~~~~~~~~~~
+
+Game engines have to send a set of instructions to the GPU in order to tell the
+GPU what and where to draw. These instructions are sent using common
+instructions, called APIs (Application Programming Interfaces), examples of
+which are OpenGL, OpenGL ES, and Vulkan.
+
+Different APIs incur different costs when drawing objects. OpenGL handles a lot
+of work for the user in the GPU driver at the cost of more expensive draw calls.
+As a result, applications can often be sped up by reducing the number of draw
+calls.
+
+Draw calls
+^^^^^^^^^^
+
+In 2D, we need to tell the GPU to render a series of primitives (rectangles,
+lines, polygons etc). The most obvious technique is to tell the GPU to render
+one primitive at a time, telling it some information such as the texture used,
+the material, the position, size, etc. then saying "Draw!" (this is called a
+draw call).
+
+It turns out that while this is conceptually simple from the engine side, GPUs
+operate very slowly when used in this manner. GPUs work much more efficiently
+if, instead of telling them to draw a single primitive, you tell them to draw a
+number of similar primitives all in one draw call, which we will call a "batch".
+
+And it turns out that they don't just work a bit faster when used in this
+manner, they work a *lot* faster.
+
+As Godot is designed to be a general purpose engine, the primitives coming into
+the Godot renderer can be in any order, sometimes similar, and sometimes
+dissimilar. In order to match the general purpose nature of Godot with the
+batching preferences of GPUs, Godot features an intermediate layer which can
+automatically group together primitives wherever possible, and send these
+batches on to the GPU. This can give an increase in rendering performance while
+requiring few, if any, changes to your Godot project.
+
+How it works
+~~~~~~~~~~~~
+
+Instructions come into the renderer from your game in the form of a series of
+items, each of which can contain one or more commands. The items correspond to
+Nodes in the scene tree, and the commands correspond to primitives such as
+rectangles or polygons. Some items, such as tilemaps, and text, can contain a
+large number of commands (tiles and letters respectively). Others, such as
+sprites, may only contain a single command (rectangle).
+
+The batcher uses two main techniques to group together primitives:
+
+* Consecutive items can be joined together
+* Consecutive commands within an item can be joined to form a batch
+
+Breaking batching
+^^^^^^^^^^^^^^^^^
+
+Batching can only take place if the items or commands are similar enough to be
+rendered in one draw call. Certain changes (or techniques), by necessity, prevent
+the formation of a contiguous batch; this is referred to as 'breaking batching'.
+
+Batching will be broken by (amongst other things):
+
+* Change of texture
+* Change of material
+* Change of primitive type (say going from rectangles to lines)
+
+.. note:: 
+	
+	If for example, you draw a series of sprites each with a different texture,
+	there is no way they can be batched.
+
+Render order
+^^^^^^^^^^^^
+
+The question arises: if only similar items can be drawn together in a batch, why
+don't we look through all the items in a scene, group together all the similar
+items, and draw them together?
+
+In 3D, this is often exactly how engines work. However, in Godot 2D, items are
+drawn in 'painter's order', from back to front. This ensures that items at the
+front are drawn on top of earlier items, when they overlap.
+
+This also means that if we try to draw objects in order of, for example,
+texture, then this painter's order may break and objects will be drawn in the
+wrong order.
+
+In Godot, this back-to-front order is determined by:
+
+* The order of objects in the scene tree
+* The Z index of objects
+* The canvas layer
+* Y sort nodes
+
+.. note::
+	
+	You can group similar objects together for easier batching. While doing so
+	is not a requirement on your part, think of it as an optional approach that
+	can improve performance in some cases. See the diagnostics section in order
+	to help you make this decision.
+
+A trick
+^^^^^^^
+
+And now a sleight of hand. Although the idea of painter's order is that objects
+are rendered from back to front, consider 3 objects A, B and C, which use 2
+different textures: grass and wood.
+
+.. image:: img/overlap1.png
+
+In painter's order they are ordered:
+
+::
+
+	A - wood
+	B - grass
+	C - wood
+
+Because the texture changes, they cannot be batched, and will be rendered in 3
+draw calls.
+
+However, painter's order is only needed on the assumption that they will be
+drawn *on top* of each other. If we relax that assumption, i.e. if none of these
+3 objects are overlapping, there is *no need* to preserve painter's order. The
+rendered result will be the same. What if we could take advantage of this?
+
+Item reordering
+^^^^^^^^^^^^^^^
+
+.. image:: img/overlap2.png
+
+It turns out that we can reorder items. However, we can only do this if the
+items satisfy the conditions of an overlap test, to ensure that the end result
+will be the same as if they were not reordered. The overlap test is very cheap
+in performance terms, but not absolutely free, so there is a slight cost to
+looking ahead to decide whether items can be reordered. The number of items to
+lookahead for reordering can be set in project settings (see below), in order to
+balance the costs and benefits in your project.
+
+::
+
+	A - wood
+	C - wood
+	B - grass
+	
+Because the texture only changes once, we can render the above in only 2
+draw calls.
+
+Lights
+~~~~~~
+
+Although the job for the batching system is normally quite straightforward, it
+becomes considerably more complex when 2D lights are used, because lights are
+drawn using extra passes, one for each light affecting the primitive. Consider 2
+sprites A and B, with identical texture and material. Without lights they would
+be batched together and drawn in one draw call. But with 3 lights, they would be
+drawn as follows, each line a draw call:
+
+.. image:: img/lights_overlap.png
+
+::
+
+	A
+	A - light 1
+	A - light 2
+	A - light 3
+	B
+	B - light 1
+	B - light 2
+	B - light 3
+
+That is a lot of draw calls: 8 for only 2 sprites. Now consider we are drawing
+1000 sprites; the number of draw calls quickly becomes astronomical and
+performance suffers. This is partly why lights have the potential to drastically
+slow down 2D rendering.
+
+However, if you remember our magician's trick from item reordering, it turns out
+we can use the same trick to get around painter's order for lights!
+
+If A and B are not overlapping, we can render them together in a batch, so the
+draw process is as follows:
+
+.. image:: img/lights_separate.png
+
+::
+
+	AB
+	AB - light 1
+	AB - light 2
+	AB - light 3
+
+
+That is 4 draw calls. Not bad: that is a 50% improvement. However, consider that
+in a real game, you might be drawing closer to 1000 sprites.
+
+- Before: 1000 * 4 = 4000 draw calls.
+- After: 1 * 4 = 4 draw calls.
+
+That is a 1000x decrease in draw calls, which should give a huge increase in
+performance.
+
+Overlap test
+^^^^^^^^^^^^
+
+However, as with item reordering, things are not that simple: we must first
+perform the overlap test to determine whether we can join these primitives, and
+the overlap test has a small cost. So, again, you can choose the number of
+primitives to look ahead in the overlap test to balance the benefits against the
+cost. Usually, with lights, the benefits far outweigh the costs.
+
+Also consider that depending on the arrangement of primitives in the viewport,
+the overlap test will sometimes fail (because the primitives overlap and thus
+should not be joined). So in practice the decrease in draw calls may be less
+dramatic than the perfect situation of no overlap. However performance is
+usually far higher than without this lighting optimization.
+
+Light Scissoring
+~~~~~~~~~~~~~~~~
+
+Batching can make it more difficult to cull out objects that are unaffected or
+only partially affected by a light. This can increase the fill rate requirements
+quite a bit and slow rendering. Fill rate is the rate at which pixels are
+colored; it is another potential bottleneck, unrelated to draw calls.
+
+In order to counter this problem (and also to speed up lighting in general),
+batching introduces light scissoring. This enables the use of the OpenGL command
+``glScissor()``, which identifies an area outside of which the GPU will not
+render any pixels. We can thus greatly optimize fill rate by identifying the
+intersection area between a light and a primitive, and limiting rendering of the
+light to *that area only*.
+
+Light scissoring is controlled with the :ref:`scissor_area_threshold
+<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+project setting. This value is between 1.0 and 0.0, with 1.0 being off (no
+scissoring), and 0.0 being scissoring in every circumstance. The reason for the
+setting is that there may be some small cost to scissoring on some hardware.
+Generally though, when you are using lighting, it should result in some
+performance gains.
+
+The relationship between the threshold and whether a scissor operation takes
+place is not altogether straightforward, but generally it represents the pixel
+area that is potentially 'saved' by a scissor operation (i.e. the fill rate
+saved). At 1.0, the entire screen's pixels would need to be saved, which rarely,
+if ever, happens, so scissoring is switched off. In practice, the useful values
+are bunched towards zero, as only a small percentage of pixels need to be saved
+for the operation to be useful.
+
+The exact relationship is probably not something most users need to worry about,
+but it is included in the appendix out of interest.
+
+.. image:: img/scissoring.png
+
+*Bottom right is a light; the red area shows the pixels saved by the scissoring
+operation. Only the intersection needs to be rendered.*
+
+Vertex baking
+~~~~~~~~~~~~~
+
+The GPU shader receives instructions on what to draw in 2 main ways:
+
+* Shader uniforms (e.g. modulate color, item transform)
+* Vertex attributes (vertex color, local transform)
+
+However, within a single draw call (batch), we cannot change uniforms. This
+means that, naively, we would not be able to batch together items or commands
+that change ``final_modulate`` or the item transform. Unfortunately, that covers
+an awful lot of cases: sprites, for instance, are typically individual nodes
+with their own item transform, and they may have their own color modulate as
+well.
+
+To get around this problem, the batching can "bake" some of the uniforms into
+the vertex attributes.
+
+* The item transform can be combined with the local transform and sent in a
+  vertex attribute.
+
+* The final modulate color can be combined with the vertex colors, and sent in a
+  vertex attribute.
+
+In most cases this works fine, but this shortcut breaks down if a shader expects
+these values to be available individually, rather than combined. This can happen
+in custom shaders.
+
+Custom Shaders
+^^^^^^^^^^^^^^
+
+As a result, certain operations in custom shaders will prevent baking and thus
+decrease the potential for batching. While we are working to reduce these cases,
+currently the following conditions apply:
+
+* Reading or writing ``COLOR`` or ``MODULATE`` - disables vertex color baking
+* Reading ``VERTEX`` - disables vertex position baking
+
+Project Settings
+~~~~~~~~~~~~~~~~
+
+In order to fine-tune batching, a number of project settings are available. You
+can usually leave these at their defaults during development, but it is a good
+idea to experiment to ensure you are getting maximum performance. Spending a
+little time tweaking parameters can often give considerable performance gains
+for very little effort. See the tooltips in the project settings for more info.
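+
+Although these settings are normally edited in the Project Settings dialog, they
+can also be read from a script. As a rough, hypothetical sketch (assuming the
+setting paths documented below), you could log the active batching configuration
+while profiling:
+
+::
+
+    func _ready():
+        # Print the current batching configuration for reference while profiling.
+        for setting in [
+                "rendering/batching/options/use_batching",
+                "rendering/batching/parameters/max_join_item_commands",
+                "rendering/batching/lights/scissor_area_threshold"]:
+            print(setting, ": ", ProjectSettings.get_setting(setting))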
+
+rendering/batching/options
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`use_batching
+  <class_ProjectSettings_property_rendering/batching/options/use_batching>` -
+  Turns batching on and off
+
+* :ref:`use_batching_in_editor
+  <class_ProjectSettings_property_rendering/batching/options/use_batching_in_editor>`
+
+* :ref:`single_rect_fallback
+  <class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
+  - This is a faster way of drawing unbatchable rectangles; however, it may
+  lead to flicker on some hardware, so it is not recommended
+
+rendering/batching/parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`max_join_item_commands <class_ProjectSettings_property_rendering/batching/parameters/max_join_item_commands>` -
+  One of the most important ways of achieving
+  batching is to join suitable adjacent items (nodes) together; however, they
+  can only be joined if the commands they contain are compatible. The system
+  must therefore do a lookahead through the commands in an item to determine
+  whether it can be joined. This has a small cost per command, and items with a
+  large number of commands are not worth joining, so the best value may be
+  project dependent.
+
+* :ref:`colored_vertex_format_threshold
+  <class_ProjectSettings_property_rendering/batching/parameters/colored_vertex_format_threshold>` - Baking colors into
+  vertices results in a larger vertex format. This is not necessarily worth
+  doing unless there are a lot of color changes going on within a joined item.
+  This parameter represents the proportion of commands containing color changes
+  relative to the total number of commands; above this proportion, the renderer
+  switches to baked colors.
+
+* :ref:`batch_buffer_size
+  <class_ProjectSettings_property_rendering/batching/parameters/batch_buffer_size>`
+  - This determines the maximum size of a batch; it doesn't have a huge effect
+  on performance, but can be worth decreasing for mobile if RAM is at a premium.
+
+* :ref:`item_reordering_lookahead
+  <class_ProjectSettings_property_rendering/batching/parameters/item_reordering_lookahead>`
+  - Item reordering can help especially with
+  interleaved sprites using different textures. The lookahead for the overlap
+  test has a small cost, so the best value may change per project.
+
+rendering/batching/lights
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`scissor_area_threshold
+  <class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+  - See light scissoring.
+
+* :ref:`max_join_items
+  <class_ProjectSettings_property_rendering/batching/lights/max_join_items>` -
+  Joining items before lighting can significantly increase performance. This
+  requires an overlap test, which has a small cost, so the costs and benefits,
+  and hence the best value to use here, may be project dependent.
+
+rendering/batching/debug
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`flash_batching
+  <class_ProjectSettings_property_rendering/batching/debug/flash_batching>`  -
+  This is purely a debugging feature to identify regressions between the
+  batching and legacy renderer. When it is switched on, the batching and legacy
+  renderer are used alternately on each frame. This will decrease performance,
+  and should not be used for your final export, only for testing.
+
+* :ref:`diagnose_frame
+  <class_ProjectSettings_property_rendering/batching/debug/diagnose_frame>`  -
+  This will periodically print a diagnostic batching log to
+  the Godot IDE / console.
+
+rendering/batching/precision
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`uv_contract
+  <class_ProjectSettings_property_rendering/batching/precision/uv_contract>` -
+  On some hardware (notably some Android devices) there have been reports of
+  tilemap tiles drawing slightly outside their UV range, leading to edge
+  artifacts such as lines around tiles. If you see this problem, try enabling uv
+  contract. This makes a small contraction in the UV coordinates to compensate
+  for precision errors on devices.
+
+* :ref:`uv_contract_amount
+  <class_ProjectSettings_property_rendering/batching/precision/uv_contract_amount>`
+  - Hopefully the default amount should cure artifacts on most devices, but just
+  in case, this value is editable.
+
+Diagnostics
+~~~~~~~~~~~
+
+Although you can change parameters and examine the effect on frame rate, this
+can feel like working blindly, with no idea of what is going on under the hood.
+To help with this, batching offers a diagnostic mode, which will periodically
+print out (to the IDE or console) a list of the batches that are being
+processed. This can help pinpoint situations where batching is not occurring as
+intended, and help you to fix them, in order to get the best possible
+performance.
+
+Reading a diagnostic
+^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: cpp
+
+	canvas_begin FRAME 2604
+	items
+		joined_item 1 refs
+				batch D 0-0 
+				batch D 0-2 n n
+				batch R 0-1 [0 - 0] {255 255 255 255 }
+		joined_item 1 refs
+				batch D 0-0 
+				batch R 0-1 [0 - 146] {255 255 255 255 }
+				batch D 0-0 
+				batch R 0-1 [0 - 146] {255 255 255 255 }
+		joined_item 1 refs
+				batch D 0-0 
+				batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
+				batch D 0-0 
+				batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
+				batch D 0-0 
+				batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
+	canvas_end
+
+
+This is a typical diagnostic.
+
+* **joined_item** - A joined item can contain 1 or
+  more references to items (nodes). Generally, a joined_item containing many
+  references is preferable to many joined_items each containing a single
+  reference. Whether items can be joined is determined by their contents and
+  compatibility with the previous item.
+* **batch R** - a batch containing rectangles. The second number is the number
+  of rects. The second number in square brackets is the Godot texture ID, and
+  the numbers in curly braces are the color. If the batch contains more than one
+  rect, MULTI is added to the line to make it easy to identify. Seeing MULTI is
+  good, because it indicates successful batching.
+* **batch D** - a default batch, containing everything else that is not
+  currently batched.
+
+Default Batches
+^^^^^^^^^^^^^^^
+
+The second number following default batches is the number of commands in the
+batch, and it is followed by a brief summary of the contents:
+
+::
+
+	l - line
+	PL - polyline
+	r - rect
+	n - ninepatch
+	PR - primitive
+	p - polygon
+	m - mesh
+	MM - multimesh
+	PA - particles
+	c - circle
+	t - transform
+	CI - clip_ignore
+
+You may see "dummy" default batches containing no commands, you can ignore
+these.
+
+FAQ
+~~~
+
+I don't get a large performance increase from switching on batching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Try the diagnostics, see how much batching is occurring, and whether it can be
+  improved
+* Try changing parameters
+* Consider that batching may not be your bottleneck (see bottlenecks)
+
+I get a decrease in performance with batching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Try the steps given above to increase batching
+* Try switching :ref:`single_rect_fallback
+  <class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
+  to on
+* The single rect fallback method is the default used without batching, and it
+  is approximately twice as fast; however, it can result in flicker on some
+  hardware, so its use is discouraged
+* After trying the above, if your scene is still performing worse, consider
+  turning off batching.
+
+I use custom shaders and the items are not batching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Custom shaders can be problematic for batching, see the custom shaders section
+
+I am seeing line artifacts appear on certain hardware
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* See the :ref:`uv_contract
+  <class_ProjectSettings_property_rendering/batching/precision/uv_contract>`
+  project setting which can be used to solve this problem.
+
+I use a large number of textures, so few items are being batched
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Consider the use of texture atlases. As well as allowing batching, these
+  reduce the need for state changes associated with changing texture.
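+
+As a rough sketch (the node names, atlas image, and regions below are
+hypothetical), two sprites can share one atlas image via ``AtlasTexture``, so
+that they reference the same underlying texture:
+
+::
+
+    func _ready():
+        var atlas = preload("res://atlas.png")  # hypothetical atlas image
+
+        var tex_a = AtlasTexture.new()
+        tex_a.atlas = atlas
+        tex_a.region = Rect2(0, 0, 64, 64)
+        $SpriteA.texture = tex_a
+
+        var tex_b = AtlasTexture.new()
+        tex_b.atlas = atlas
+        tex_b.region = Rect2(64, 0, 64, 64)
+        $SpriteB.texture = tex_b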
+
+Appendix
+~~~~~~~~
+
+Light scissoring threshold calculation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The actual proportion of screen pixel area used as the threshold is the
+:ref:`scissor_area_threshold
+<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+value to the power of 4.
+
+For example, on a screen size ``1920 x 1080`` there are ``2,073,600`` pixels.
+
+To find the threshold value corresponding to saving ``1000`` pixels, the
+calculation would be:
+
+::
+
+	1000 / 2073600 = 0.00048225
+	0.00048225 ^ 0.25 = 0.14819
+
+.. note:: The power of 0.25 is the inverse of the power of 4.
+
+So a :ref:`scissor_area_threshold
+<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+of 0.15 would be a reasonable value to try.
+
+Going the other way, for instance with a :ref:`scissor_area_threshold
+<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+of ``0.5``:
+
+::
+
+	0.5 ^ 4 = 0.0625
+	0.0625 * 2073600 = 129600 pixels
+
+If the number of pixels saved is more than this threshold, the scissor is
+activated.
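+
+For convenience, the relationship above can be restated as a pair of
+hypothetical GDScript helpers (these are not part of the engine API, just the
+same math expressed in code):
+
+::
+
+    # Threshold value corresponding to saving a given number of pixels.
+    # e.g. threshold_for_pixels_saved(1000, 1920 * 1080) is roughly 0.148
+    func threshold_for_pixels_saved(pixels_saved, screen_pixels):
+        return pow(float(pixels_saved) / float(screen_pixels), 0.25)
+
+    # Number of pixels that must be saved before the scissor activates,
+    # for a given threshold value.
+    # e.g. pixels_saved_for_threshold(0.5, 1920 * 1080) is 129600
+    func pixels_saved_for_threshold(threshold, screen_pixels):
+        return pow(threshold, 4.0) * float(screen_pixels)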

+ 258 - 0
tutorials/optimization/cpu_optimization.rst

@@ -0,0 +1,258 @@
+.. _doc_cpu_optimization:
+
+CPU Optimizations
+=================
+
+Measuring performance
+=====================
+
+To know how to speed up our program, we have to know where the "bottlenecks"
+are. Bottlenecks are the slowest parts of the program, and they limit the rate
+at which everything else can progress. Identifying them allows us to concentrate
+our efforts on optimizing the areas which will give us the greatest speed
+improvement, instead of spending a lot of time optimizing functions that will
+lead to only small performance improvements.
+
+For the CPU, the easiest way to identify bottlenecks is to use a profiler.
+
+CPU profilers
+=============
+
+Profilers run alongside your program and take timing measurements to work out
+what proportion of time is spent in each function.
+
+The Godot IDE conveniently has a built-in profiler. It does not run every time
+you start your project; it must be manually started and stopped. This is
+because, in common with most profilers, recording these timing measurements can
+slow down your project significantly.
+
+After profiling, you can look back at the results for a frame.
+
+.. image:: img/godot_profiler.png
+
+`These are the results of a profile of one of the demo projects.`
+
+.. note:: We can see the cost of built-in processes such as physics and audio,
+          as well as seeing the cost of our own scripting functions at the
+          bottom.
+
+When a project is running slowly, you will often see an obvious function or
+process taking a lot more time than others. This is your primary bottleneck, and
+you can usually increase speed by optimizing this area.
+
+For more info about using the profiler within Godot see
+:ref:`doc_debugger_panel`.
+
+External profilers
+~~~~~~~~~~~~~~~~~~
+
+Although the Godot IDE profiler is very convenient and useful, sometimes you
+need more power, and the ability to profile the Godot engine source code itself.
+
+You can use a number of third-party profilers to do this, including Valgrind,
+VerySleepy, Visual Studio and Intel VTune.
+
+.. note:: You may need to compile Godot from source in order to use a third
+          party profiler so that you have program database information
+          available. You can also use a debug build, however, note that the
+          results of profiling a debug build will be different to a release
+          build, because debug builds are less optimized. Bottlenecks are often
+          in a different place in debug builds, so you should profile release
+          builds wherever possible.
+
+.. image:: img/valgrind.png
+
+`These are example results from Callgrind, part of Valgrind, on Linux.`
+
+From the left, Callgrind is listing the percentage of time within a function and
+its children (Inclusive), the percentage of time spent within the function
+itself, excluding child functions (Self), the number of times the function is
+called, the function name, and the file or module.
+
+In this example we can see nearly all time is spent under the
+``Main::iteration()`` function. This is the master function in the Godot source
+code that is called repeatedly, and it causes frames to be drawn, physics ticks
+to be simulated, and nodes and scripts to be updated. A large proportion of the
+time is spent in the functions to render a canvas (66%), because this example
+uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
+outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver). This
+tells us that a large proportion of CPU time is being spent in the graphics
+driver.
+
+This is actually an excellent example because, in an ideal world, only a very
+small proportion of time would be spent in the graphics driver; this is an
+indication that there is a problem with too much communication and work being
+done in the graphics API. This profiling led to the development of 2D batching,
+which greatly speeds up 2D rendering by reducing the bottlenecks in this area.
+
+Manually timing functions
+=========================
+
+Another handy technique, especially once you have identified the bottleneck
+using a profiler, is to manually time the function or area under test. The
+specifics vary according to language, but in GDScript, you would do the
+following:
+
+::
+
+    var time_start = OS.get_system_time_msecs()
+    
+    # The function you want to time
+    update_enemies()
+
+    var time_end = OS.get_system_time_msecs()
+    print("Function took: " + str(time_end - time_start)) 
+
+
+You may want to use other timing functions if another time unit is more
+suitable, for example :ref:`OS.get_system_time_secs
+<class_OS_method_get_system_time_secs>` if the function takes many seconds to
+run.
+
+When manually timing functions, it is usually a good idea to run the function
+many times (say ``1000`` or more times), instead of just once (unless it is a
+very slow function). A large part of the reason for this is that timers often
+have limited accuracy, and CPUs will schedule processes in a haphazard manner,
+so an average over a series of runs is more accurate than a single measurement.
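+
+As a rough sketch (reusing the hypothetical ``update_enemies()`` function from
+above), averaging over many runs could look like this:
+
+::
+
+    func average_timing():
+        var runs = 1000
+        var time_start = OS.get_system_time_msecs()
+
+        for i in range(runs):
+            update_enemies()
+
+        var time_end = OS.get_system_time_msecs()
+        # Average time per call, in milliseconds.
+        print("Average: " + str(float(time_end - time_start) / runs))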
+
+As you attempt to optimize functions, be sure to either repeatedly profile or
+time them as you go. This will give you crucial feedback as to whether the
+optimization is working (or not).
+
+Caches
+======
+
+Something else to be particularly aware of, especially when comparing timing
+results of two different versions of a function, is that the results can be
+highly dependent on whether the data is in the CPU cache or not. CPUs don't load
+data directly from main memory, because although main memory can be huge (many
+GBs), it is very slow to access. Instead, CPUs load data from a smaller,
+higher-speed bank of memory called the cache. Loading data from the cache is
+very fast, but every time you try to load a memory address that is not stored
+in the cache, the data must be fetched from main memory, which is slow. This
+delay can leave the CPU sitting around idle for a long time, and is referred to
+as a "cache miss".
+
+This means that the first time you run a function, it may run slowly, because
+the data is not in cache. The second and later times, it may run much faster
+because the data is in cache. So always use averages when timing, and be aware
+of the effects of cache.
+
+Understanding caching is also crucial to CPU optimization. If you have an
+algorithm (routine) that loads small bits of data from randomly spread out areas
+of main memory, this can result in a lot of cache misses, and much of the time
+the CPU will be waiting around for data instead of doing any work. Instead, if
+you can make your data accesses localized, or even better, access memory in a
+linear fashion (like a continuous list), then the cache will work optimally and
+the CPU will be able to work as fast as possible.
+
+Godot usually takes care of such low-level details for you. For example, the
+Server APIs make sure data is optimized for caching already for things like
+rendering and physics. But you should be especially aware of caching when using
+GDNative.
+
+Languages
+=========
+
+Godot supports a number of different languages, and it is worth bearing in mind
+that there are trade-offs involved - some languages are designed for ease of
+use, at the cost of speed, and others are faster but more difficult to work
+with.
+
+Built-in engine functions run at the same speed regardless of the scripting
+language you choose. If your project is making a lot of calculations in its own
+code, consider moving those calculations to a faster language.
+
+GDScript
+~~~~~~~~
+
+GDScript is designed to be easy to use and iterate, and is ideal for making many
+types of games. However, ease of use is considered more important than
+performance, so if you need to make heavy calculations, consider moving some of
+your project to one of the other languages.
+
+C#
+~~
+
+C# is popular and has first class support in Godot. It offers a good compromise
+between speed and ease of use.
+
+Other languages
+~~~~~~~~~~~~~~~
+
+Third parties provide support for several other languages, including `Rust
+<https://github.com/godot-rust/godot-rust>`_ and `Javascript
+<https://github.com/GodotExplorer/ECMAScript>`_.
+
+C++
+~~~
+
+Godot is written in C++. Using C++ will usually result in the fastest code;
+however, on a practical level, it is the most difficult to deploy to end users'
+machines on different platforms. Options for using C++ include GDNative and
+custom modules.
+
+Threads
+=======
+
+Consider using threads when making a lot of calculations that can run parallel
+to one another. Modern CPUs have multiple cores, each one capable of doing a
+limited amount of work. By spreading work over multiple threads you can move
+further towards peak CPU efficiency.
+
+The disadvantage of threads is that you have to be incredibly careful. As each
+CPU core operates independently, they can end up trying to access the same
+memory at the same time. One thread can be reading from a variable while another
+is writing to it. Before you use threads, make sure you understand the dangers
+and how to try to prevent these race conditions.
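+
+As a minimal sketch (the worker function and data here are hypothetical), a
+thread can be started with ``Thread`` and shared data protected with a
+``Mutex``:
+
+::
+
+    var _thread = Thread.new()
+    var _mutex = Mutex.new()
+    var _results = []
+
+    func start_work():
+        _thread.start(self, "_heavy_work", 1000)
+
+    func _heavy_work(count):
+        for i in range(count):
+            var value = i * i  # stand-in for an expensive calculation
+            _mutex.lock()
+            _results.append(value)
+            _mutex.unlock()
+
+    func finish_work():
+        _thread.wait_to_finish()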
+
+For more information on threads see :ref:`doc_using_multiple_threads`.
+
+SceneTree
+=========
+
+Although Nodes are an incredibly powerful and versatile concept, be aware that
+every node has a cost. Built-in functions such as ``_process()`` and
+``_physics_process()`` propagate through the tree. This housekeeping can reduce
+performance when you have very large numbers of nodes.
+
+Each node is handled individually in the Godot renderer, so a smaller number of
+nodes with more content in each can sometimes lead to better performance.
+
+One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
+get much better performance by removing nodes from the SceneTree rather than
+by pausing or hiding them. You don't have to delete a detached node. You can,
+for example, keep a reference to a node, detach it from the scene tree, and
+reattach it later. This can be very useful for adding and removing areas from a
+game, as in the sketch below.
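+
+As a rough sketch (``DungeonArea`` is a hypothetical child node), detaching and
+reattaching an area could look like this:
+
+::
+
+    onready var _area = $DungeonArea
+
+    func unload_area():
+        # Keep the reference; just take the node out of the tree so it is no
+        # longer processed or rendered.
+        remove_child(_area)
+
+    func reload_area():
+        add_child(_area)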
+
+You can avoid the SceneTree altogether by using Server APIs. For more
+information, see :ref:`doc_using_servers`.
+
+Physics
+=======
+
+In some situations physics can end up becoming a bottleneck, particularly with
+complex worlds, and large numbers of physics objects.
+
+Some techniques to speed up physics:
+
+* Try using simplified versions of your rendered geometry for physics. Often
+  this won't be noticeable for end users, but can greatly increase performance.
+* Try removing objects from physics when they are out of view / outside the
+  current area, or reusing physics objects (maybe you allow 8 monsters per area,
+  for example, and reuse these).
+
+Another crucial aspect of physics is the physics tick rate. In some games, you
+can greatly reduce the tick rate; instead of updating physics 60 times per
+second, you may update it at only 20 or even 10 ticks per second. This can
+greatly reduce the CPU load.
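+
+The tick rate is set in the project settings under
+``physics/common/physics_fps`` and can also be changed at runtime. As a minimal
+sketch:
+
+::
+
+    func enable_low_cost_physics():
+        # Run physics at 30 ticks per second instead of the default 60.
+        Engine.iterations_per_second = 30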
+
+The downside of changing the physics tick rate is that you can get jerky
+movement or jitter when the physics update rate does not match the rendered
+frame rate.
+
+The solution to this problem is 'fixed timestep interpolation', which involves
+smoothing the rendered positions and rotations over multiple frames to match the
+physics. You can either implement this yourself or use a third-party addon.
+Performance-wise, interpolation is a very cheap operation compared to running a
+physics tick (orders of magnitude faster), so this can be a significant win, as
+well as reducing jitter.
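+
+As a very rough sketch of the idea (``$Body`` is a hypothetical physics body
+node, and this script sits on a purely visual node that follows it):
+
+::
+
+    var _prev_pos = Vector2()
+    var _curr_pos = Vector2()
+
+    func _physics_process(_delta):
+        # Record the positions of the last two physics ticks.
+        _prev_pos = _curr_pos
+        _curr_pos = $Body.position
+
+    func _process(_delta):
+        # Blend between the two physics positions based on how far we are
+        # through the current physics tick.
+        var fraction = Engine.get_physics_interpolation_fraction()
+        position = _prev_pos.linear_interpolate(_curr_pos, fraction)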

+ 291 - 0
tutorials/optimization/general_optimization.rst

@@ -0,0 +1,291 @@
+.. _doc_general_optimization:
+
+General optimization tips
+=========================
+
+Introduction
+~~~~~~~~~~~~
+
+In an ideal world, computers would run at infinite speed, and the only limit to
+what we could achieve would be our imagination. In the real world, however, it
+is all too easy to produce software that will bring even the fastest computer to
+its knees.
+
+Designing games and other software is thus a compromise between what we would
+like to be possible, and what we can realistically achieve while maintaining
+good performance.
+
+To achieve the best results, we have two approaches:
+
+* Work faster
+* Work smarter
+
+And preferably, we will use a blend of the two.
+
+Smoke and Mirrors
+^^^^^^^^^^^^^^^^^
+
+Part of working smarter is recognizing that, especially in games, we can often
+get the player to believe they are in a world that is far more complex, 
+interactive, and graphically exciting than it really is. A good programmer is a
+magician, and should strive to learn the tricks of the trade, and try to invent
+new ones.
+
+The nature of slowness
+^^^^^^^^^^^^^^^^^^^^^^
+
+To the outside observer, performance problems are often lumped together. But in
+reality, there are several different kinds of performance problem:
+
+* A slow process that occurs every frame, leading to a continuously low frame
+  rate 
+* An intermittent process that causes 'spikes' of slowness, leading to
+  stalls 
+* A slow process that occurs outside of normal gameplay, for instance, on
+  level load
+
+Each of these is annoying to the user, but in different ways.
+
+Measuring Performance
+=====================
+
+Probably the most important tool for optimization is the ability to measure
+performance - to identify where bottlenecks are, and to measure the success of
+our attempts to speed them up.
+
+There are several methods of measuring performance, including:
+
+* Putting a start / stop timer around code of interest
+* Using the Godot profiler
+* Using external third party profilers
+* Using GPU profilers / debuggers
+* Checking the frame rate (with vsync disabled)
+
+Be very aware that the relative performance of different areas can vary on
+different hardware. Often it is a good idea to make timings on more than one
+device, especially including mobile as well as desktop, if you are targeting
+mobile.
+
+Limitations
+~~~~~~~~~~~
+
+CPU profilers are often the 'go to' method for measuring performance; however,
+they don't always tell the whole story.
+
+- Bottlenecks are often on the GPU, *as a result* of instructions given by the
+  CPU
+- Spikes can occur in the Operating System processes (outside of Godot) *as a
+  result* of instructions used in Godot (for example dynamic memory allocation)
+- You may not be able to run a profiler on some devices (e.g. a mobile phone)
+- You may have to solve performance problems that occur on hardware you don't
+  have access to
+
+As a result of these limitations, you often need to use detective work to find
+out where bottlenecks are.
+
+Detective work
+~~~~~~~~~~~~~~
+
+Detective work is a crucial skill for developers (both in terms of performance,
+and also in terms of bug fixing). This can include hypothesis testing, and
+binary search.
+
+Hypothesis testing
+^^^^^^^^^^^^^^^^^^
+
+Say, for example, you believe that sprites are slowing down your game. You can
+test this hypothesis by:
+
+* Measuring the performance when you add more sprites, or take some away.
+
+This may lead to a further hypothesis - does the size of the sprite determine
+the performance drop?
+
+* You can test this by keeping everything the same, but changing the sprite
+  size, and measuring performance
+
+Binary search
+^^^^^^^^^^^^^
+
+Say you know that frames are taking much longer than they should, but you are
+not sure where the bottleneck lies. You could begin by commenting out
+approximately half the routines that occur on a normal frame. Has the
+performance improved more or less than expected?
+
+Once you know which of the two halves contains the bottleneck, you can then
+repeat this process, until you have pinned down the problematic area.
+
+Profilers
+=========
+
+Profilers allow you to time your program while it is running. They then provide
+results telling you what percentage of time was spent in different functions and
+areas, and how often functions were called.
+
+This can be very useful both to identify bottlenecks and to measure the results
+of your improvements. Sometimes attempts to improve performance can backfire and
+lead to slower performance, so always use profiling and timing to guide your
+efforts.
+
+For more info about using the profiler within Godot see
+:ref:`doc_debugger_panel`.
+
+Principles
+==========
+
+In the words of Donald Knuth:
+
+    *Programmers waste enormous amounts of time thinking about, or worrying
+    about, the speed of noncritical parts of their programs, and these attempts
+    at efficiency actually have a strong negative impact when debugging and
+    maintenance are considered. We should forget about small efficiencies, say
+    about 97% of the time: premature optimization is the root of all evil. Yet
+    we should not pass up our opportunities in that critical 3%.*
+
+The messages are very important:
+
+* Programmer / Developer time is limited. Instead of blindly trying to speed up
+  all aspects of a program we should concentrate our efforts on the aspects that
+  really matter.
+* Efforts at optimization often end up with code that is harder to read and
+  debug than non-optimized code. It is in our interests to limit this to areas
+  that will really benefit.
+
+Just because we *can* optimize a particular bit of code, it doesn't necessarily
+mean that we should. Knowing when, and when not, to optimize is a great skill to
+develop.
+
+One misleading aspect of the quote is that people tend to focus on the subquote
+"premature optimization is the root of all evil". While *premature* optimization
+is (by definition) undesirable, performant software is the result of performant
+design.
+
+Performant design
+~~~~~~~~~~~~~~~~~
+
+The danger with encouraging people to ignore optimization until necessary is
+that it conveniently ignores that the most important time to consider
+performance is at the design stage, before a single key has been pressed. If the
+design / algorithms of a program are inefficient, then no amount of polishing the
+details later will make it run fast. It may run *faster*, but it will never run
+as fast as a program designed for performance.
+
+This tends to be far more important in game / graphics programming than in
+general programming. A performant design, even without low level optimization,
+will often run many times faster than a mediocre design with low level
+optimization.
+
+Incremental design
+~~~~~~~~~~~~~~~~~~
+
+Of course, in practice, unless you have prior knowledge, you are unlikely to
+come up with the best design first time. So you will often make a series of
+versions of a particular area of code, each taking a different approach to the
+problem, until you come to a satisfactory solution. It is important not to spend
+too much time on the details at this stage until you have finalized the overall
+design, otherwise much of your work will be thrown out.
+
+It is difficult to give general guidelines for performant design because this is
+so dependent on the problem. One point worth mentioning, though, on the CPU
+side, is that modern CPUs are nearly always limited by memory bandwidth. This
+has led to a resurgence in data-oriented design, which involves designing data
+structures and algorithms for locality of data and linear access, rather than
+jumping around in memory.
+
+The optimization process
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Assuming we have a reasonable design, and taking our lessons from Knuth, our
+first step in optimization should be to identify the biggest bottlenecks - the
+slowest functions, the low hanging fruit.
+
+Once we have successfully improved the speed of the slowest area, it may no
+longer be the bottleneck. So we should test / profile again, and find the next
+bottleneck on which to focus.
+
+The process is thus:
+
+1. Profile / Identify bottleneck
+2. Optimize bottleneck
+3. Return to step 1
+
+Optimizing bottlenecks
+~~~~~~~~~~~~~~~~~~~~~~
+
+Some profilers will even tell you which parts of a function (which data accesses
+or calculations) are slowing things down.
+
+As with design, you should concentrate your efforts first on making sure the
+algorithms and data structures are the best they can be. Data access should be
+local (to make best use of the CPU cache), and it can often be better to use
+compact storage of data (again, always profile to test results). You can often
+precalculate heavy computations ahead of time (e.g. at level load, or by loading
+precalculated data files).
+
+Once algorithms and data are good, you can often make small changes in routines
+which improve performance, such as moving calculations outside of loops.
+
+Always retest your timing / bottlenecks after making each change. Some changes
+will increase speed, others may have a negative effect. Sometimes a small
+positive effect will be outweighed by the negatives of more complex code, and
+you may choose to leave out that optimization.
+
+Appendix
+========
+
+Bottleneck math
+~~~~~~~~~~~~~~~
+
+The proverb "a chain is only as strong as its weakest link" applies directly to
+performance optimization. If your project is spending 90% of the time in
+function 'A', then optimizing A can have a massive effect on performance.
+
+.. code-block:: none
+
+    A: 9 ms
+    Everything else: 1 ms
+    Total frame time: 10 ms
+
+.. code-block:: none
+
+    A: 1 ms 
+    Everything else: 1ms 
+    Total frame time: 2 ms
+
+So in this example, improving bottleneck A by a factor of 9x decreases the
+overall frame time by 5x and increases frames per second by 5x.
+
+If however, something else is running slowly and also bottlenecking your
+project, then the same improvement can lead to less dramatic gains:
+
+.. code-block:: none
+
+    A: 9 ms
+    Everything else: 50 ms
+    Total frame time: 59 ms
+
+.. code-block:: none
+
+    A: 1 ms
+    Everything else: 50 ms
+    Total frame time: 51 ms
+
+So in this example, even though we have hugely optimized function A, the
+actual gain in terms of frame rate is quite small.
+
+In games, things become even more complicated because the CPU and GPU run
+independently of one another. Your total frame time is determined by the slower
+of the two.
+
+.. code-block:: none
+
+    CPU: 9 ms
+    GPU: 50 ms
+    Total frame time: 50 ms
+
+.. code-block:: none
+
+    CPU: 1 ms
+    GPU: 50 ms
+    Total frame time: 50 ms
+
+In this example, we optimized the CPU hugely again, but the frame time did not
+improve, because we are GPU-bottlenecked.

+ 263 - 0
tutorials/optimization/gpu_optimization.rst

@@ -0,0 +1,263 @@
+.. _doc_gpu_optimization:
+
+GPU Optimizations
+=================
+
+Introduction
+~~~~~~~~~~~~
+
+The demand for new graphics features and progress almost guarantees that you
+will encounter graphics bottlenecks. Some of these can be on the CPU side, for
+instance in calculations inside the Godot engine to prepare objects for
+rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
+sorts instructions to pass to the GPU, and in the transfer of these
+instructions. Finally, bottlenecks can also occur on the GPU itself.
+
+Where bottlenecks occur in rendering is highly hardware specific. Mobile GPUs in
+particular may struggle with scenes that run easily on desktop.
+
+Understanding and investigating GPU bottlenecks is slightly different from the
+situation on the CPU, because you can often only change performance indirectly,
+by changing the instructions you give to the GPU, and it may be more difficult
+to take measurements. Often, the only way of measuring performance is by
+examining changes in frame rate.
+
+Draw calls, state changes, and APIs
+===================================
+
+.. note:: The following section is not relevant to end-users, but is useful to
+          provide background information that is relevant in later sections.
+
+Godot sends instructions to the GPU via a graphics API (OpenGL, GLES2, GLES3,
+Vulkan). The communication and driver activity involved can be quite costly,
+especially in OpenGL. If we can provide these instructions in a way that is
+preferred by the driver and GPU, we can greatly increase performance.
+
+Nearly every API command in OpenGL requires a certain amount of validation to
+make sure the GPU is in the correct state. Even seemingly simple commands can
+lead to a flurry of behind-the-scenes housekeeping. Therefore, the name of the
+game is to reduce these instructions to a bare minimum and to group together
+similar objects as much as possible, so they can be rendered together or with
+the minimum number of these expensive state changes.
+
+2D batching
+~~~~~~~~~~~
+
+In 2D, the costs of treating each item individually can be prohibitively high -
+there can easily be thousands of items on screen. This is why 2D batching is
+used - multiple similar items are grouped together and rendered in a batch, via
+a single draw call, rather than making a separate draw call for each item. In
+addition, this means that state changes and material and texture changes can be
+kept to a minimum.
+
+For more information on 2D batching see :ref:`doc_batching`.
+
+3D batching
+~~~~~~~~~~~
+
+In 3D, we still aim to minimize draw calls and state changes; however, it can be
+more difficult to batch together several objects into a single draw call. 3D
+meshes tend to comprise hundreds or thousands of triangles, and combining large
+meshes at runtime is prohibitively expensive. The cost of joining them quickly
+exceeds any benefit as the number of triangles per mesh grows. A much better
+alternative is to join meshes ahead of time (meshes that are static in relation
+to each other). This can be done either by artists or programmatically within
+Godot.
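+
+As a rough sketch (``mesh_a``, ``mesh_b``, and ``xform_b`` below are
+hypothetical, and a single surface sharing the same material is assumed), meshes
+can be joined programmatically using ``SurfaceTool``:
+
+::
+
+    func join_meshes(mesh_a, mesh_b, xform_b):
+        var st = SurfaceTool.new()
+        # Append both source meshes into one surface; xform_b places the
+        # second mesh relative to the first.
+        st.append_from(mesh_a, 0, Transform())
+        st.append_from(mesh_b, 0, xform_b)
+        return st.commit()  # Returns a combined ArrayMesh.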
+
+There is also a cost to batching together objects in 3D. Several objects
+rendered as one cannot be individually culled: an entire city that is off screen
+will still be rendered if it is joined to a single blade of grass that is on
+screen. So, attempting to batch together 3D objects should take into account
+their location and effect on culling. Despite this, the benefits of joining
+static objects often outweigh other considerations, especially for large numbers
+of low-poly objects.
+
+For more information on 3D specific optimizations, see
+:ref:`doc_optimizing_3d_performance`.
+
+Reuse Shaders and Materials
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Godot renderer is a little different from other renderers out there. It's
+designed to minimize GPU state changes as much as possible. :ref:`SpatialMaterial
+<class_SpatialMaterial>` does a good job at reusing materials that need similar
+shaders, but if custom shaders are used, make sure to reuse them as much as
+possible. Godot's priorities are:
+
+-  **Reusing materials**: The fewer different materials in the
+   scene, the faster the rendering will be. If a scene has a huge number
+   of objects (in the hundreds or thousands), try reusing the materials
+   or, in the worst case, use atlases.
+-  **Reusing shaders**: If materials can't be reused, at least try to
+   reuse shaders (or SpatialMaterials with different parameters but the same
+   configuration).
+
+If a scene has, for example, ``20,000`` objects each with its own different
+material, rendering will be slow. If the same scene has ``20,000`` objects, but
+only uses ``100`` materials, rendering will be much faster.
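+
+As a minimal sketch, sharing a single mesh and material resource across many
+instances (rather than creating a new material per object) might look like
+this::
+
+    func _ready():
+        # Create the mesh and material once and share them across every instance.
+        var shared_mesh = CubeMesh.new()
+        var shared_material = SpatialMaterial.new()
+        shared_material.albedo_color = Color(0.8, 0.2, 0.2)
+
+        for i in range(100):
+            var mi = MeshInstance.new()
+            mi.mesh = shared_mesh
+            mi.material_override = shared_material
+            mi.translation = Vector3(i * 2.0, 0.0, 0.0)
+            add_child(mi)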
+
+Pixel cost vs vertex cost
+=========================
+
+You may have heard that the lower the number of polygons in a model, the faster
+it will be rendered. This is *really* relative and depends on many factors.
+
+On a modern PC and console, vertex cost is low. Early GPUs only rasterized
+triangles, so every frame, all the vertices:
+
+1. Had to be transformed by the CPU (including clipping).
+
+2. Had to be sent to GPU memory from the main RAM.
+
+Nowadays, all this is handled inside the GPU, so the performance is much
+higher. 3D artists often misjudge polycount performance because 3D DCCs (such
+as Blender, Max, etc.) need to keep geometry in CPU memory in order for it to
+be edited, which reduces actual performance. Game engines rely on the GPU more,
+so they can render many triangles much more efficiently.
+
+On mobile devices, the story is different. PC and console GPUs are brute-force
+monsters that can pull as much electricity as they need from the power grid.
+Mobile GPUs are limited to a tiny battery, so they need to be a lot more power
+efficient.
+
+To be more efficient, mobile GPUs attempt to avoid *overdraw*: the same pixel
+on the screen being rendered more than once. Imagine a town with several
+buildings. GPUs don't know what is visible and what is hidden until they draw
+it. A house might be drawn, and then another house in front of it, so rendering
+happened twice for the same pixel! PC GPUs normally don't care much about this
+and just add more pixel processors to the hardware to increase performance
+(which also increases power consumption).
+
+Using more power is not an option on mobile, so mobile GPUs use a technique
+called *tile-based rendering*, which divides the screen into a grid. Each cell
+keeps the list of triangles drawn to it and sorts them by depth to minimize
+*overdraw*. This technique improves performance and reduces power consumption,
+but takes a toll on vertex performance. As a result, fewer vertices and
+triangles can be processed for drawing.
+
+Additionally, tile-based rendering struggles when there are small objects with
+a lot of geometry within a small portion of the screen. This forces mobile GPUs
+to put a lot of strain on a single screen tile, which considerably decreases
+performance, as all the other cells must wait for it to complete in order to
+display the frame.
+
+In summary, do not worry about vertex count on mobile, but avoid concentration
+of vertices in small parts of the screen. If a character, NPC, vehicle, etc. is
+far away (so it looks tiny), use a smaller level of detail (LOD) model.
+
+Pay attention to the additional vertex processing required when using:
+
+-  Skinning (skeletal animation)
+-  Morphs (shape keys)
+-  Vertex-lit objects (common on mobile)
+
+Pixel / fragment shaders - fill rate
+====================================
+
+In contrast to vertex processing, the cost of fragment shading has increased
+dramatically over the years. Screen resolutions have increased (the area of a
+4K screen is ``8,294,400`` pixels, versus ``307,200`` for an old ``640x480``
+VGA screen - that is 27 times the area), but the complexity of fragment shaders
+has also exploded. Physically based rendering requires complex calculations for
+each fragment.
+
+You can test whether a project is fill rate limited quite easily. Turn off
+vsync to prevent capping the frames per second, then compare the frames per
+second when running with a large window to running with a postage stamp sized
+window (you may also benefit from similarly reducing your shadow map size if
+using shadows). Usually, you will find that the fps increases quite a bit when
+using a small window, which indicates you are, to some extent, fill rate
+limited. If, on the other hand, there is little to no increase in fps, then
+your bottleneck lies elsewhere.
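+
+As a quick sketch, you can do this from a script by disabling vsync and
+watching the frame rate while you resize the window::
+
+    func _ready():
+        # Uncap the frame rate so differences show up clearly.
+        OS.vsync_enabled = false
+
+    func _process(_delta):
+        # Watch how this value changes as you resize the window.
+        print(Engine.get_frames_per_second())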
+
+You can increase performance in a fill rate limited project by reducing the
+amount of work the GPU has to do. You can do this by simplifying the shader
+(perhaps by turning off expensive options if you are using a
+:ref:`SpatialMaterial <class_SpatialMaterial>`), or by reducing the number and
+size of the textures used.
+
+Consider shipping simpler shaders for mobile.
+
+Reading textures
+~~~~~~~~~~~~~~~~
+
+The other factor in fragment shaders is the cost of reading textures. Reading
+textures is an expensive operation, especially when reading from several in a
+single fragment shader, and filtering adds further expense (trilinear filtering
+between mipmaps, and averaging). Reading textures is also expensive in terms of
+power, which is a big issue on mobile devices.
+
+Texture compression
+~~~~~~~~~~~~~~~~~~~
+
+Godot compresses textures of 3D models when imported (VRAM compression) by
+default. Video RAM compression is not as efficient in size as PNG or JPG when
+stored, but increases performance enormously when drawing.
+
+This is because the main goal of texture compression is bandwidth reduction
+between memory and the GPU.
+
+In 3D, the shapes of objects depend more on the geometry than the texture, so
+compression is generally not noticeable. In 2D, compression depends more on
+shapes inside the textures, so the artifacts resulting from 2D compression are
+more noticeable.
+
+As a warning, most Android devices do not support compression of textures with
+transparency (only opaque textures are compressed), so keep this in mind.
+
+Post processing / shadows
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Post processing effects and shadows can also be expensive in terms of fragment
+shading activity. Always test the impact of these on different hardware.
+
+Reducing the size of shadow maps can increase performance, both in terms of
+writing and reading the maps.
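+
+For example, the shadow atlas used by omni and spot lights can be shrunk per
+viewport (directional shadows are controlled separately by the
+``rendering/quality/directional_shadow/size`` project setting). The following
+is only a sketch; the right size depends on your project::
+
+    func _ready():
+        # Halve the shadow atlas (the project default is usually 4096).
+        get_viewport().shadow_atlas_size = 2048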
+
+Transparency / blending
+=======================
+
+Transparent items present particular problems for rendering efficiency. Opaque
+items (especially in 3D) can essentially be rendered in any order, and the
+Z-buffer will ensure that only the front-most objects get shaded. Transparent
+or blended objects are different - in most cases, they cannot rely on the
+Z-buffer and must be rendered in "painter's order" (i.e. from back to front)
+to look correct.
+
+Transparent items are also particularly bad for fill rate, because every item
+has to be drawn, even if later transparent items will be drawn on top.
+
+Opaque items don't have to do this. They can usually take advantage of the
+Z-buffer by writing to the Z-buffer first, then only running the fragment
+shader on the 'winning' fragment - the item that is at the front at a
+particular pixel.
+
+Transparency is particularly expensive where multiple transparent items overlap.
+It is usually better to use as small a transparent area as possible in order to
+minimize these fill rate requirements, especially on mobile, where fill rate is
+very expensive. Indeed, in many situations, rendering more complex opaque
+geometry can end up being faster than using transparency to "cheat".
+
+Multi-platform advice
+=====================
+
+If you are aiming to release on multiple platforms, test *early* and test
+*often* on all your platforms, especially mobile. Developing a game on desktop
+but attempting to port it to mobile at the last minute is a recipe for
+disaster.
+
+In general, you should design your game for the lowest common denominator, then
+add optional enhancements for more powerful platforms. For example, you may
+want to use the GLES2 backend for both desktop and mobile platforms where you
+target both.
+
+Mobile / tile renderers
+=======================
+
+GPUs on mobile devices work in dramatically different ways from GPUs on
+desktop. Most mobile devices use tile renderers. Tile renderers split the
+screen into regular-sized tiles that fit into super-fast cache memory, which
+reduces the reads and writes to main memory.
+
+There are some downsides, though: tile renderers can make certain techniques
+much more complicated and expensive to perform. Tiles that rely on the results
+of rendering in different tiles, or on the results of earlier operations being
+preserved, can be very slow. Be very careful to test the performance of
+shaders, viewport textures, and post processing.

BIN
tutorials/optimization/img/godot_profiler.png


BIN
tutorials/optimization/img/lights_overlap.png


BIN
tutorials/optimization/img/lights_separate.png


BIN
tutorials/optimization/img/overlap1.png


BIN
tutorials/optimization/img/overlap2.png


BIN
tutorials/optimization/img/scissoring.png


BIN
tutorials/optimization/img/valgrind.png


+ 67 - 1
tutorials/optimization/index.rst

@@ -1,9 +1,75 @@
 Optimization
 =============
 
+Introduction
+~~~~~~~~~~~~
+
+Godot follows a balanced performance philosophy. In the performance world, there
+are always trade-offs, which consist of trading speed for usability and
+flexibility. Some practical examples of this are:
+
+-  Rendering objects efficiently in high amounts is easy, but when a
+   large scene must be rendered, it can become inefficient. To solve this,
+   visibility computation must be added to the rendering, which makes rendering
+   less efficient, but, at the same time, fewer objects are rendered, so
+   efficiency overall improves.
+
+-  Configuring the properties of every material for every object that
+   needs to be rendered is also slow. To solve this, objects are sorted by
+   material to reduce the costs, but at the same time sorting has a cost.
+
+-  In 3D physics a similar situation happens. The best algorithms to
+   handle large amounts of physics objects (such as SAP) are slow at
+   insertion/removal of objects and ray-casting. Algorithms that allow faster
+   insertion and removal, as well as ray-casting, will not be able to handle as
+   many active objects.
+
+And there are many more examples of this! Game engines strive to be general
+purpose in nature, so balanced algorithms are always favored over algorithms
+that might be fast in some situations and slow in others, or algorithms that
+are fast but make usability more difficult.
+
+Godot is not an exception and, while it is designed to have backends swappable
+for different algorithms, the default ones prioritize balance and flexibility
+over performance.
+
+With this clear, the aim of this tutorial section is to explain how to get the
+maximum performance out of Godot. While the tutorials can be read in any order,
+it is a good idea to start from :ref:`doc_general_optimization`.
+
 .. toctree::
    :maxdepth: 1
-   :name: toc-learn-features-optimization
+   :caption: Common
+   :name: toc-learn-features-general-optimization
 
+   general_optimization
    using_servers
+
+.. toctree::
+   :maxdepth: 1
+   :caption: CPU
+   :name: toc-learn-features-cpu-optimization
+
+   cpu_optimization
+
+.. toctree::
+   :maxdepth: 1
+   :caption: GPU
+   :name: toc-learn-features-gpu-optimization
+
+   gpu_optimization
    using_multimesh
+
+.. toctree::
+   :maxdepth: 1
+   :caption: 2D
+   :name: toc-learn-features-2d-optimization
+
+   batching
+
+.. toctree::
+   :maxdepth: 1
+   :caption: 3D
+   :name: toc-learn-features-3d-optimization
+
+   optimizing_3d_performance

+ 143 - 0
tutorials/optimization/optimizing_3d_performance.rst

@@ -0,0 +1,143 @@
+.. meta::
+    :keywords: optimization
+
+.. _doc_optimizing_3d_performance:
+
+Optimizing 3D performance
+=========================
+
+Culling
+=======
+
+Godot will automatically perform view frustum culling in order to prevent
+rendering objects that are outside the viewport. This works well for games that
+take place in a small area; however, things can quickly become problematic in
+larger levels.
+
+Occlusion culling
+~~~~~~~~~~~~~~~~~
+
+Walking around a town, for example, you may only be able to see a few buildings
+in the street you are in, as well as the sky and a few birds flying overhead.
+As far as a naive renderer is concerned, however, you can still see the entire
+town. It won't just render the buildings in front of you; it will render the
+street behind that, the people on that street, and the buildings behind them.
+You quickly end up in situations where you are attempting to render 10x or 100x
+more than what is visible.
+
+Things aren't quite as bad as they seem, because the Z-buffer usually allows the
+GPU to only fully shade the objects that are at the front. However, unneeded
+objects are still reducing performance.
+
+One way we can potentially reduce the amount to be rendered is to take
+advantage of occlusion. As of version 3.2.2, there is no built-in support for
+occlusion in Godot; however, with careful design you can still get many of the
+advantages.
+
+For instance, in our city street scenario, you may be able to work out in
+advance that you can only see two other streets, ``B`` and ``C``, from street
+``A``. Streets ``D`` to ``Z`` are hidden. In order to take advantage of
+occlusion, all you have to do is work out when your viewer is in street ``A``
+(perhaps using Godot :ref:`Areas <class_Area>`), then hide the other streets.
+
+This is a manual version of what is known as a 'potentially visible set'. It is
+a very powerful technique for speeding up rendering. You can also use it to
+restrict physics or AI to the local area, speeding these up as well as
+rendering.
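+
+A minimal sketch of this idea, assuming each street is grouped under its own
+Spatial node and an :ref:`Area <class_Area>` named ``StreetATrigger`` covers
+street ``A`` (all node names here are hypothetical)::
+
+    func _ready():
+        $StreetATrigger.connect("body_entered", self, "_on_street_a_body_entered")
+
+    func _on_street_a_body_entered(body):
+        if not body.is_in_group("player"):
+            return
+        # From street A, only streets B and C can ever be seen.
+        for street in [$StreetD, $StreetE, $StreetF]:
+            street.visible = false
+        $StreetB.visible = true
+        $StreetC.visible = true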
+
+Other occlusion techniques
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are other occlusion techniques such as portals, automatic PVS, and
+raster-based occlusion culling. Some of these may be available through add-ons,
+and some may be added to core Godot in the future.
+
+Transparent objects
+~~~~~~~~~~~~~~~~~~~
+
+Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
+<class_Shader>` to improve performance. This, however, cannot be done with
+transparent objects. Transparent objects are rendered from back to front to
+make blending with what is behind them work. As a result, try to use as few
+transparent objects as possible. If an object has a small section with
+transparency, try to make that section a separate surface with its own
+Material.
+
+For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
+doc.
+
+Level of detail (LOD)
+=====================
+
+In some situations, particularly at a distance, it can be a good idea to
+replace complex geometry with simpler versions - the end user will probably not
+be able to see much difference. Consider looking at a large number of trees in
+the far distance. There are several strategies for replacing models at varying
+distances. You could use lower-poly models, or use transparency to simulate
+more complex geometry.
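+
+A hand-rolled distance check is often enough. As a sketch, a script on a
+Spatial with hypothetical ``HighPoly`` and ``LowPoly`` child meshes could swap
+between them like this::
+
+    const LOD_DISTANCE = 50.0
+
+    func _process(_delta):
+        var camera = get_viewport().get_camera()
+        if camera == null:
+            return
+        var distance = global_transform.origin.distance_to(camera.global_transform.origin)
+        # Show the detailed model up close and the cheap one far away.
+        $HighPoly.visible = distance < LOD_DISTANCE
+        $LowPoly.visible = distance >= LOD_DISTANCE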
+
+Billboards and imposters
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The simplest version of using transparency to deal with LOD is billboards. For
+example, you can use a single transparent quad to represent a tree at a
+distance. This can be very cheap to render, unless, of course, there are many
+trees in front of each other, in which case transparency may start eating into
+fill rate (for more information on fill rate, see :ref:`doc_gpu_optimization`).
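+
+As a rough sketch, a billboarded quad can be set up from code like this (the
+texture path is hypothetical)::
+
+    func add_tree_billboard(position):
+        var material = SpatialMaterial.new()
+        material.flags_transparent = true
+        material.params_billboard_mode = SpatialMaterial.BILLBOARD_ENABLED
+        material.albedo_texture = load("res://trees/tree_billboard.png")
+
+        var quad = MeshInstance.new()
+        quad.mesh = QuadMesh.new()
+        quad.material_override = material
+        quad.translation = position
+        add_child(quad)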
+
+An alternative is to render not just one tree, but a number of trees together as
+a group. This can be especially effective if you can see an area but cannot
+physically approach it in a game.
+
+You can make imposters by pre-rendering views of an object at different angles.
+Or you can even go one step further, and periodically re-render a view of an
+object onto a texture to be used as an imposter. At a distance, you need to move
+the viewer a considerable distance for the angle of view to change
+significantly. This can be complex to get working, but may be worth it depending
+on the type of project you are making.
+
+Use instancing (MultiMesh)
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If several identical objects have to be drawn in the same place or nearby, try
+using :ref:`MultiMesh <class_MultiMesh>` instead. MultiMesh allows the drawing
+of many thousands of objects at very little performance cost, making it ideal
+for flocks, grass, particles, and anything else where you have thousands of
+identical objects.
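+
+A minimal sketch of filling a MultiMesh from code (the mesh and counts are
+placeholders)::
+
+    func _ready():
+        var multimesh = MultiMesh.new()
+        # The transform format must be set before instance_count.
+        multimesh.transform_format = MultiMesh.TRANSFORM_3D
+        multimesh.mesh = CubeMesh.new() # Replace with your own low-poly mesh.
+        multimesh.instance_count = 10000
+        for i in range(multimesh.instance_count):
+            var position = Vector3(randf() * 100.0, 0.0, randf() * 100.0)
+            multimesh.set_instance_transform(i, Transform(Basis(), position))
+
+        var mmi = MultiMeshInstance.new()
+        mmi.multimesh = multimesh
+        add_child(mmi)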
+
+Also see the :ref:`Using MultiMesh <doc_using_multimesh>` doc.
+
+Bake lighting
+=============
+
+Lighting objects is one of the most costly rendering operations. Realtime
+lighting, shadows (especially from multiple lights), and GI are especially
+expensive. They may simply be too much for lower-powered mobile devices to
+handle.
+
+Consider using baked lighting, especially for mobile. This can look fantastic,
+but has the downside that it will not be dynamic. Sometimes, this is a
+trade-off worth making.
+
+In general, if several lights need to affect a scene, it's best to use
+:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
+indirect light bounces.
+
+Animation / Skinning
+====================
+
+Animation, and particularly vertex animation such as skinning and morphing, can
+be very expensive on some platforms. You may need to lower the poly count
+considerably for animated models, or limit the number of them on screen at any
+one time.
+
+Large worlds
+============
+
+If you are making large worlds, there are different considerations than what you
+may be familiar with from smaller games.
+
+Large worlds may need to be built in tiles that can be loaded on demand as you
+move around the world. This can prevent memory use from getting out of hand, and
+also limit the processing needed to the local area.
+
+There may be glitches due to floating-point error in large worlds. You may be
+able to use techniques such as orienting the world around the player (rather
+than the other way around), or shifting the origin periodically to keep things
+centered around (0, 0, 0).
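+
+A very rough sketch of periodic origin shifting, assuming the player and all
+world chunks are direct children of the node running this script (the node
+name and threshold are arbitrary)::
+
+    const ORIGIN_SHIFT_THRESHOLD = 5000.0
+
+    func _physics_process(_delta):
+        var offset = $Player.global_transform.origin
+        if offset.length() < ORIGIN_SHIFT_THRESHOLD:
+            return
+        # Move every world chunk (and the player) back towards the origin.
+        # Physics and networking state will usually need extra handling.
+        for child in get_children():
+            if child is Spatial:
+                child.translation -= offset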