
Merge pull request #3777 from clayjohn/3.2-optimizations

Overhaul optimization tutorials
Rémi Verschelde 5 years ago
parent
commit
693d2586cd

+ 0 - 1
tutorials/3d/index.rst

@@ -7,7 +7,6 @@
 
    introduction_to_3d
    using_transforms
-   optimizing_3d_performance
    3d_rendering_limitations
    spatial_material
    lights_and_shadows

+ 0 - 192
tutorials/3d/optimizing_3d_performance.rst

@@ -1,192 +0,0 @@
-.. meta::
-    :keywords: optimization
-
-.. _doc_optimizing_3d_performance:
-
-Optimizing 3D performance
-=========================
-
-Introduction
-~~~~~~~~~~~~
-
-Godot follows a balanced performance philosophy. In the performance world,
-there are always trade-offs, which consist of trading speed for
-usability and flexibility. Some practical examples of this are:
-
--  Rendering objects efficiently in high amounts is easy, but when a
-   large scene must be rendered, it can become inefficient. To solve
-   this, visibility computation must be added to the rendering, which
-   makes rendering less efficient, but, at the same time, fewer objects are
-   rendered, so efficiency overall improves.
--  Configuring the properties of every material for every object that
-   needs to be rendered is also slow. To solve this, objects are sorted
-   by material to reduce the costs, but at the same time sorting has a
-   cost.
--  In 3D physics a similar situation happens. The best algorithms to
-   handle large amounts of physics objects (such as SAP) are slow
-   at insertion/removal of objects and ray-casting. Algorithms that
-   allow faster insertion and removal, as well as ray-casting, will not
-   be able to handle as many active objects.
-
-And there are many more examples of this! Game engines strive to be
-general purpose in nature, so balanced algorithms are always favored
-over algorithms that might be fast in some situations and slow in
-others.. or algorithms that are fast but make usability more difficult.
-
-Godot is not an exception and, while it is designed to have backends
-swappable for different algorithms, the default ones (or more like, the
-only ones that are there for now) prioritize balance and flexibility
-over performance.
-
-With this clear, the aim of this tutorial is to explain how to get the
-maximum performance out of Godot.
-
-Rendering
-~~~~~~~~~
-
-3D rendering is one of the most difficult areas to get performance from,
-so this section will have a list of tips.
-
-Reuse shaders and materials
----------------------------
-
-The Godot renderer is a little different to what is out there. It's designed
-to minimize GPU state changes as much as possible.
-:ref:`class_SpatialMaterial`
-does a good job at reusing materials that need similar shaders but, if
-custom shaders are used, make sure to reuse them as much as possible.
-Godot's priorities will be like this:
-
--  **Reusing Materials**: The fewer different materials in the
-   scene, the faster the rendering will be. If a scene has a huge amount
-   of objects (in the hundreds or thousands) try reusing the materials
-   or in the worst case use atlases.
--  **Reusing Shaders**: If materials can't be reused, at least try to
-   re-use shaders (or SpatialMaterials with different parameters but the same
-   configuration).
-
-If a scene has, for example, 20.000 objects with 20.000 different
-materials each, rendering will be slow. If the same scene has
-20.000 objects, but only uses 100 materials, rendering will be blazingly
-fast.
-
-Pixel cost vs vertex cost
--------------------------
-
-It is a common thought that the lower the number of polygons in a model, the
-faster it will be rendered. This is *really* relative and depends on
-many factors.
-
-On a modern PC and console, vertex cost is low. GPUs
-originally only rendered triangles, so all the vertices:
-
-1. Had to be transformed by the CPU (including clipping).
-
-2. Had to be sent to the GPU memory from the main RAM.
-
-Nowadays, all this is handled inside the GPU, so the performance is
-extremely high. 3D artists usually have the wrong feeling about
-polycount performance because 3D DCCs (such as Blender, Max, etc.) need
-to keep geometry in CPU memory in order for it to be edited, reducing
-actual performance. Truth is, a model rendered by a 3D engine is much
-more optimal than how 3D DCCs display them.
-
-On mobile devices, the story is different. PC and Console GPUs are
-brute-force monsters that can pull as much electricity as they need from
-the power grid. Mobile GPUs are limited to a tiny battery, so they need
-to be a lot more power efficient.
-
-To be more efficient, mobile GPUs attempt to avoid *overdraw*. This
-means, the same pixel on the screen being rendered (as in, with lighting
-calculation, etc.) more than once. Imagine a town with several buildings,
-GPUs don't know what is visible and what is hidden until they
-draw it. A house might be drawn and then another house in front of it
-(rendering happened twice for the same pixel!). PC GPUs normally don't
-care much about this and just throw more pixel processors to the
-hardware to increase performance (but this also increases power
-consumption).
-
-On mobile, pulling more power is not an option, so a technique called
-"Tile Based Rendering" is used (almost every mobile hardware uses a
-variant of it), which divides the screen into a grid. Each cell keeps the
-list of triangles drawn to it and sorts them by depth to minimize
-*overdraw*. This technique improves performance and reduces power
-consumption, but takes a toll on vertex performance. As a result, fewer
-vertices and triangles can be processed for drawing.
-
-Generally, this is not so bad, but there is a corner case on mobile that
-must be avoided, which is to have small objects with a lot of geometry
-within a small portion of the screen. This forces mobile GPUs to put a
-lot of strain on a single screen cell, considerably decreasing
-performance (as all the other cells must wait for it to complete in
-order to display the frame).
-
-To make it short, do not worry about vertex count so much on mobile, but
-avoid concentration of vertices in small parts of the screen. If, for
-example, a character, NPC, vehicle, etc. is far away (so it looks tiny),
-use a smaller level of detail (LOD) model instead.
-
-An extra situation where vertex cost must be considered is objects that
-have extra processing per vertex, such as:
-
--  Skinning (skeletal animation)
--  Morphs (shape keys)
--  Vertex Lit Objects (common on mobile)
-
-Texture compression
--------------------
-
-Godot offers to compress textures of 3D models when imported (VRAM
-compression). Video RAM compression is not as efficient in size as PNG
-or JPG when stored, but increases performance enormously when drawing.
-
-This is because the main goal of texture compression is bandwidth
-reduction between memory and the GPU.
-
-In 3D, the shapes of objects depend more on the geometry than the
-texture, so compression is generally not noticeable. In 2D, compression
-depends more on shapes inside the textures, so the artifacts resulting
-from 2D compression are more noticeable.
-
-As a warning, most Android devices do not support texture compression of
-textures with transparency (only opaque), so keep this in mind.
-
-Transparent objects
--------------------
-
-As mentioned before, Godot sorts objects by material and shader to
-improve performance. This, however, can not be done on transparent
-objects. Transparent objects are rendered from back to front to make
-blending with what is behind work. As a result, please try to keep
-transparent objects to a minimum! If an object has a small section with
-transparency, try to make that section a separate material.
-
-Level of detail (LOD)
----------------------
-
-As also mentioned before, using objects with fewer vertices can improve
-performance in some cases. Godot has a simple system to change level
-of detail,
-:ref:`GeometryInstance <class_GeometryInstance>`
-based objects have a visibility range that can be defined. Having
-several GeometryInstance objects in different ranges works as LOD.
-
-Use instancing (MultiMesh)
---------------------------
-
-If several identical objects have to be drawn in the same place or
-nearby, try using :ref:`MultiMesh <class_MultiMesh>`
-instead. MultiMesh allows the drawing of dozens of thousands of objects at
-very little performance cost, making it ideal for flocks, grass,
-particles, etc.
-
-Bake lighting
--------------
-
-Small lights are usually not a performance issue. Shadows a little more.
-In general, if several lights need to affect a scene, it's ideal to bake
-it (:ref:`doc_baked_lightmaps`). Baking can also improve the scene quality by
-adding indirect light bounces.
-
-If working on mobile, baking to texture is recommended, since this
-method is even faster.

+ 549 - 0
tutorials/optimization/batching.rst

@@ -0,0 +1,549 @@
+.. _doc_batching:
+
+Optimization using batching
+===========================
+
+Introduction
+~~~~~~~~~~~~
+
+Game engines have to send a set of instructions to the GPU in order to tell the
+GPU what and where to draw. These instructions are sent using common
+instructions, called APIs (Application Programming Interfaces), examples of
+which are OpenGL, OpenGL ES, and Vulkan.
+
+Different APIs incur different costs when drawing objects. OpenGL handles a lot
+of work for the user in the GPU driver at the cost of more expensive draw calls.
+As a result, applications can often be sped up by reducing the number of draw
+calls.
+
+Draw calls
+^^^^^^^^^^
+
+In 2D, we need to tell the GPU to render a series of primitives (rectangles,
+lines, polygons etc). The most obvious technique is to tell the GPU to render
+one primitive at a time, telling it some information such as the texture used,
+the material, the position, size, etc. then saying "Draw!" (this is called a
+draw call).
+
+It turns out that while this is conceptually simple from the engine side, GPUs
+operate very slowly when used in this manner. GPUs work much more efficiently
+if, instead of telling them to draw a single primitive, you tell them to draw a
+number of similar primitives all in one draw call, which we will call a "batch".
+
+And it turns out that they don't just work a bit faster when used in this
+manner, they work a *lot* faster.
+
+As Godot is designed to be a general purpose engine, the primitives coming into
+the Godot renderer can be in any order, sometimes similar, and sometimes
+dissimilar. In order to match the general purpose nature of Godot with the
+batching preferences of GPUs, Godot features an intermediate layer which can
+automatically group together primitives wherever possible, and send these
+batches on to the GPU. This can give an increase in rendering performance while
+requiring few, if any, changes to your Godot project.
+
+How it works
+~~~~~~~~~~~~
+
+Instructions come into the renderer from your game in the form of a series of
+items, each of which can contain one or more commands. The items correspond to
+Nodes in the scene tree, and the commands correspond to primitives such as
+rectangles or polygons. Some items, such as tilemaps, and text, can contain a
+large number of commands (tiles and letters respectively). Others, such as
+sprites, may only contain a single command (rectangle).
+
+The batcher uses two main techniques to group together primitives:
+
+* Consecutive items can be joined together
+* Consecutive commands within an item can be joined to form a batch
+
+Breaking batching
+^^^^^^^^^^^^^^^^^
+
+Batching can only take place if the items or commands are similar enough to be
+rendered in one draw call. Certain changes (or techniques), by necessity, prevent
+the formation of a contiguous batch; this is referred to as 'breaking batching'.
+
+Batching will be broken by (amongst other things):
+
+* Change of texture
+* Change of material
+* Change of primitive type (say going from rectangles to lines)
+
+.. note:: 
+	
+	If for example, you draw a series of sprites each with a different texture,
+	there is no way they can be batched.
+
+Render order
+^^^^^^^^^^^^
+
+The question arises: if only similar items can be drawn together in a batch, why
+don't we look through all the items in a scene, group together all the similar
+items, and draw them together?
+
+In 3D, this is often exactly how engines work. However, in Godot 2D, items are
+drawn in 'painter's order', from back to front. This ensures that items at the
+front are drawn on top of earlier items, when they overlap.
+
+This also means that if we try to draw objects in order of, for example,
+texture, then this painter's order may break and objects will be drawn in the
+wrong order.
+
+In Godot, this back-to-front order is determined by:
+
+* The order of objects in the scene tree
+* The Z index of objects
+* The canvas layer
+* Y sort nodes
+
+.. note::
+	
+	You can group similar objects together for easier batching. While doing so
+	is not a requirement on your part, think of it as an optional approach that
+	can improve performance in some cases. See the diagnostics section in order
+	to help you make this decision.
+
+A trick
+^^^^^^^
+
+And now a sleight of hand. Although the idea of painter's order is that objects
+are rendered from back to front, consider 3 objects A, B and C, which use 2
+different textures: grass and wood.
+
+.. image:: img/overlap1.png
+
+In painter's order they are ordered:
+
+::
+
+	A - wood
+	B - grass
+	C - wood
+
+Because the texture changes, they cannot be batched, and will be rendered in 3
+draw calls.
+
+However, painter's order is only needed on the assumption that they will be
+drawn *on top* of each other. If we relax that assumption, i.e. if none of these
+3 objects are overlapping, there is *no need* to preserve painter's order. The
+rendered result will be the same. What if we could take advantage of this?
+
+Item reordering
+^^^^^^^^^^^^^^^
+
+.. image:: img/overlap2.png
+
+It turns out that we can reorder items. However, we can only do this if the
+items satisfy the conditions of an overlap test, to ensure that the end result
+will be the same as if they were not reordered. The overlap test is very cheap
+in performance terms, but not absolutely free, so there is a slight cost to
+looking ahead to decide whether items can be reordered. The number of items to
+lookahead for reordering can be set in project settings (see below), in order to
+balance the costs and benefits in your project.
+
+::
+
+	A - wood
+	C - wood
+	B - grass
+	
+Because the texture only changes once, we can render the above in only 2
+draw calls.
+
+Lights
+~~~~~~
+
+Although the job for the batching system is normally quite straightforward, it
+becomes considerably more complex when 2D lights are used, because lights are
+drawn using extra passes, one for each light affecting the primitive. Consider 2
+sprites A and B, with identical texture and material. Without lights they would
+be batched together and drawn in one draw call. But with 3 lights, they would be
+drawn as follows, each line a draw call:
+
+.. image:: img/lights_overlap.png
+
+::
+
+	A
+	A - light 1
+	A - light 2
+	A - light 3
+	B
+	B - light 1
+	B - light 2
+	B - light 3
+
+That is a lot of draw calls: 8 for only 2 sprites. Now consider we are drawing
+1000 sprites; the number of draw calls quickly becomes astronomical and
+performance suffers. This is partly why lights have the potential to drastically
+slow down 2D rendering.
+
+However, if you remember our magician's trick from item reordering, it turns out
+we can use the same trick to get around painter's order for lights!
+
+If A and B are not overlapping, we can render them together in a batch, so the
+draw process is as follows:
+
+.. image:: img/lights_separate.png
+
+::
+
+	AB
+	AB - light 1
+	AB - light 2
+	AB - light 3
+
+
+That is 4 draw calls. Not bad: that is a 50% improvement. However, consider that
+in a real game, you might be drawing closer to 1000 sprites.
+
+- Before: 1000 * 4 = 4000 draw calls.
+- After: 1 * 4 = 4 draw calls.
+
+That is a 1000x decrease in draw calls, which should give a huge increase in
+performance.
+
+Overlap test
+^^^^^^^^^^^^
+
+However, as with item reordering, things are not that simple: we must first
+perform the overlap test to determine whether we can join these primitives, and
+the overlap test has a small cost. So, again, you can choose the number of
+primitives to look ahead in the overlap test to balance the benefits against the
+cost. Usually, with lights, the benefits far outweigh the costs.
+
+Also consider that depending on the arrangement of primitives in the viewport,
+the overlap test will sometimes fail (because the primitives overlap and thus
+should not be joined). So in practice the decrease in draw calls may be less
+dramatic than the perfect situation of no overlap. However performance is
+usually far higher than without this lighting optimization.
+
+Light Scissoring
+~~~~~~~~~~~~~~~~
+
+Batching can make it more difficult to cull out objects that are unaffected or
+only partially affected by a light. This can increase the fill rate requirements
+quite a bit and slow rendering. Fill rate is the rate at which pixels are
+colored; it is another potential bottleneck, unrelated to draw calls.
+
+In order to counter this problem (and also to speed up lighting in general),
+batching introduces light scissoring. This enables the use of the OpenGL command
+``glScissor()``, which identifies an area outside of which the GPU will not
+render any pixels. We can thus greatly optimize fill rate by identifying the
+intersection area between a light and a primitive, and limiting rendering of the
+light to *that area only*.
+
+Light scissoring is controlled with the :ref:`scissor_area_threshold
+<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+project setting. This value is between 1.0 and 0.0, with 1.0 being off (no
+scissoring), and 0.0 being scissoring in every circumstance. The reason for the
+setting is that there may be some small cost to scissoring on some hardware.
+Generally though, when you are using lighting, it should result in some
+performance gains.
+
+The relationship between the threshold and whether a scissor operation takes
+place is not altogether straightforward, but generally it represents the pixel
+area that is potentially 'saved' by a scissor operation (i.e. the fill rate
+saved). At 1.0, the entire screen's pixels would need to be saved, which rarely,
+if ever, happens, so scissoring is switched off. In practice, the useful values
+are bunched towards zero, as only a small percentage of pixels need to be saved
+for the operation to be useful.
+
+The exact relationship is probably not something most users need to worry about,
+but it is included in the appendix out of interest.
+
+.. image:: img/scissoring.png
+
+*Bottom right is a light; the red area shows the pixels saved by the scissoring
+operation. Only the intersection needs to be rendered.*
+
+Vertex baking
+~~~~~~~~~~~~~
+
+The GPU shader receives instructions on what to draw in 2 main ways:
+
+* Shader uniforms (e.g. modulate color, item transform)
+* Vertex attributes (vertex color, local transform)
+
+However, within a single draw call (batch), we cannot change uniforms. This
+means that, naively, we would not be able to batch together items or commands
+that change ``final_modulate`` or the item transform. Unfortunately, that covers
+an awful lot of cases: sprites, for instance, are typically individual nodes
+with their own item transform, and they may have their own color modulate as
+well.
+
+To get around this problem, the batching can "bake" some of the uniforms into
+the vertex attributes.
+
+* The item transform can be combined with the local transform and sent in a
+  vertex attribute.
+
+* The final modulate color can be combined with the vertex colors, and sent in a
+  vertex attribute.
+
+In most cases this works fine, but this shortcut breaks down if a shader expects
+these values to be available individually, rather than combined. This can happen
+in custom shaders.
+
+Custom Shaders
+^^^^^^^^^^^^^^
+
+As a result, certain operations in custom shaders will prevent baking and thus
+decrease the potential for batching. While we are working to reduce these cases,
+currently the following conditions apply:
+
+* Reading or writing ``COLOR`` or ``MODULATE`` - disables vertex color baking
+* Reading ``VERTEX`` - disables vertex position baking
+
+Project Settings
+~~~~~~~~~~~~~~~~
+
+In order to fine-tune batching, a number of project settings are available. You
+can usually leave these at their defaults during development, but it is a good
+idea to experiment to ensure you are getting maximum performance. Spending a
+little time tweaking parameters can often give considerable performance gains
+for very little effort. See the tooltips in the project settings for more info.
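+
+Although these settings are normally edited in the Project Settings dialog, they
+can also be read from a script. As a rough, hypothetical sketch (assuming the
+setting paths documented below), you could log the active batching configuration
+while profiling:
+
+::
+
+    func _ready():
+        # Print the current batching configuration for reference while profiling.
+        for setting in [
+                "rendering/batching/options/use_batching",
+                "rendering/batching/parameters/max_join_item_commands",
+                "rendering/batching/lights/scissor_area_threshold"]:
+            print(setting, ": ", ProjectSettings.get_setting(setting))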
+
+rendering/batching/options
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`use_batching
+  <class_ProjectSettings_property_rendering/batching/options/use_batching>` -
+  Turns batching on and off
+
+* :ref:`use_batching_in_editor
+  <class_ProjectSettings_property_rendering/batching/options/use_batching_in_editor>`
+
+* :ref:`single_rect_fallback
+  <class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
+  - This is a faster way of drawing unbatchable rectangles; however, it may
+  lead to flicker on some hardware, so it is not recommended
+
+rendering/batching/parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`max_join_item_commands <class_ProjectSettings_property_rendering/batching/parameters/max_join_item_commands>` -
+  One of the most important ways of achieving
+  batching is to join suitable adjacent items (nodes) together; however, they
+  can only be joined if the commands they contain are compatible. The system
+  must therefore do a lookahead through the commands in an item to determine
+  whether it can be joined. This has a small cost per command, and items with a
+  large number of commands are not worth joining, so the best value may be
+  project dependent.
+
+* :ref:`colored_vertex_format_threshold
+  <class_ProjectSettings_property_rendering/batching/parameters/colored_vertex_format_threshold>` - Baking colors into
+  vertices results in a larger vertex format. This is not necessarily worth
+  doing unless there are a lot of color changes going on within a joined item.
+  This parameter represents the proportion of commands containing color changes
+  relative to the total number of commands; above this proportion, the renderer
+  switches to baked colors.
+
+* :ref:`batch_buffer_size
+  <class_ProjectSettings_property_rendering/batching/parameters/batch_buffer_size>`
+  - This determines the maximum size of a batch; it doesn't have a huge effect
+  on performance, but can be worth decreasing for mobile if RAM is at a premium.
+
+* :ref:`item_reordering_lookahead
+  <class_ProjectSettings_property_rendering/batching/parameters/item_reordering_lookahead>`
+  - Item reordering can help especially with
+  interleaved sprites using different textures. The lookahead for the overlap
+  test has a small cost, so the best value may change per project.
+
+rendering/batching/lights
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`scissor_area_threshold
+  <class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+  - See light scissoring.
+
+* :ref:`max_join_items
+  <class_ProjectSettings_property_rendering/batching/lights/max_join_items>` -
+  Joining items before lighting can significantly increase performance. This
+  requires an overlap test, which has a small cost, so the costs and benefits,
+  and hence the best value to use here, may be project dependent.
+
+rendering/batching/debug
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`flash_batching
+  <class_ProjectSettings_property_rendering/batching/debug/flash_batching>`  -
+  This is purely a debugging feature to identify regressions between the
+  batching and legacy renderer. When it is switched on, the batching and legacy
+  renderer are used alternately on each frame. This will decrease performance,
+  and should not be used for your final export, only for testing.
+
+* :ref:`diagnose_frame
+  <class_ProjectSettings_property_rendering/batching/debug/diagnose_frame>`  -
+  This will periodically print a diagnostic batching log to
+  the Godot IDE / console.
+
+rendering/batching/precision
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* :ref:`uv_contract
+  <class_ProjectSettings_property_rendering/batching/precision/uv_contract>` -
+  On some hardware (notably some Android devices) there have been reports of
+  tilemap tiles drawing slightly outside their UV range, leading to edge
+  artifacts such as lines around tiles. If you see this problem, try enabling uv
+  contract. This makes a small contraction in the UV coordinates to compensate
+  for precision errors on devices.
+
+* :ref:`uv_contract_amount
+  <class_ProjectSettings_property_rendering/batching/precision/uv_contract_amount>`
+  - Hopefully the default amount should cure artifacts on most devices, but just
+  in case, this value is editable.
+
+Diagnostics
+~~~~~~~~~~~
+
+Although you can change parameters and examine the effect on frame rate, this
+can feel like working blindly, with no idea of what is going on under the hood.
+To help with this, batching offers a diagnostic mode, which will periodically
+print out (to the IDE or console) a list of the batches that are being
+processed. This can help pinpoint situations where batching is not occurring as
+intended, and help you to fix them, in order to get the best possible
+performance.
+
+Reading a diagnostic
+^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: cpp
+
+	canvas_begin FRAME 2604
+	items
+		joined_item 1 refs
+				batch D 0-0 
+				batch D 0-2 n n
+				batch R 0-1 [0 - 0] {255 255 255 255 }
+		joined_item 1 refs
+				batch D 0-0 
+				batch R 0-1 [0 - 146] {255 255 255 255 }
+				batch D 0-0 
+				batch R 0-1 [0 - 146] {255 255 255 255 }
+		joined_item 1 refs
+				batch D 0-0 
+				batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
+				batch D 0-0 
+				batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
+				batch D 0-0 
+				batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
+	canvas_end
+
+
+This is a typical diagnostic.
+
+* **joined_item** - A joined item can contain 1 or
+  more references to items (nodes). Generally, a joined_item containing many
+  references is preferable to many joined_items each containing a single
+  reference. Whether items can be joined is determined by their contents and
+  compatibility with the previous item.
+* **batch R** - a batch containing rectangles. The second number is the number
+  of rects. The second number in square brackets is the Godot texture ID, and
+  the numbers in curly braces are the color. If the batch contains more than one
+  rect, MULTI is added to the line to make it easy to identify. Seeing MULTI is
+  good, because it indicates successful batching.
+* **batch D** - a default batch, containing everything else that is not
+  currently batched.
+
+Default Batches
+^^^^^^^^^^^^^^^
+
+The second number following default batches is the number of commands in the
+batch, and it is followed by a brief summary of the contents:
+
+::
+
+	l - line
+	PL - polyline
+	r - rect
+	n - ninepatch
+	PR - primitive
+	p - polygon
+	m - mesh
+	MM - multimesh
+	PA - particles
+	c - circle
+	t - transform
+	CI - clip_ignore
+
+You may see "dummy" default batches containing no commands, you can ignore
+these.
+
+FAQ
+~~~
+
+I don't get a large performance increase from switching on batching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Try the diagnostics, see how much batching is occurring, and whether it can be
+  improved
+* Try changing parameters
+* Consider that batching may not be your bottleneck (see bottlenecks)
+
+I get a decrease in performance with batching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Try the steps given above to increase batching
+* Try switching :ref:`single_rect_fallback
+  <class_ProjectSettings_property_rendering/batching/options/single_rect_fallback>`
+  to on
+* The single rect fallback method is the default used without batching, and it
+  is approximately twice as fast; however, it can result in flicker on some
+  hardware, so its use is discouraged
+* After trying the above, if your scene is still performing worse, consider
+  turning off batching.
+
+I use custom shaders and the items are not batching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Custom shaders can be problematic for batching, see the custom shaders section
+
+I am seeing line artifacts appear on certain hardware
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* See the :ref:`uv_contract
+  <class_ProjectSettings_property_rendering/batching/precision/uv_contract>`
+  project setting which can be used to solve this problem.
+
+I use a large number of textures, so few items are being batched
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Consider the use of texture atlases. As well as allowing batching, these
+  reduce the need for state changes associated with changing texture.
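+
+As a rough sketch (the node names, atlas image, and regions below are
+hypothetical), two sprites can share one atlas image via ``AtlasTexture``, so
+that they reference the same underlying texture:
+
+::
+
+    func _ready():
+        var atlas = preload("res://atlas.png")  # hypothetical atlas image
+
+        var tex_a = AtlasTexture.new()
+        tex_a.atlas = atlas
+        tex_a.region = Rect2(0, 0, 64, 64)
+        $SpriteA.texture = tex_a
+
+        var tex_b = AtlasTexture.new()
+        tex_b.atlas = atlas
+        tex_b.region = Rect2(64, 0, 64, 64)
+        $SpriteB.texture = tex_b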
+
+Appendix
+~~~~~~~~
+
+Light scissoring threshold calculation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The actual proportion of screen pixel area used as the threshold is the
+:ref:`scissor_area_threshold
+<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+value to the power of 4.
+
+For example, on a screen size ``1920 x 1080`` there are ``2,073,600`` pixels.
+
+To find the threshold value corresponding to saving ``1000`` pixels, the
+calculation would be:
+
+::
+
+	1000 / 2073600 = 0.00048225
+	0.00048225 ^ 0.25 = 0.14819
+
+.. note:: The power of 0.25 is the inverse of the power of 4.
+
+So a :ref:`scissor_area_threshold
+<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+of 0.15 would be a reasonable value to try.
+
+Going the other way, for instance with a :ref:`scissor_area_threshold
+<class_ProjectSettings_property_rendering/batching/lights/scissor_area_threshold>`
+of ``0.5``:
+
+::
+
+	0.5 ^ 4 = 0.0625
+	0.0625 * 2073600 = 129600 pixels
+
+If the number of pixels saved is more than this threshold, the scissor is
+activated.
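+
+For convenience, the relationship above can be restated as a pair of
+hypothetical GDScript helpers (these are not part of the engine API, just the
+same math expressed in code):
+
+::
+
+    # Threshold value corresponding to saving a given number of pixels.
+    # e.g. threshold_for_pixels_saved(1000, 1920 * 1080) is roughly 0.148
+    func threshold_for_pixels_saved(pixels_saved, screen_pixels):
+        return pow(float(pixels_saved) / float(screen_pixels), 0.25)
+
+    # Number of pixels that must be saved before the scissor activates,
+    # for a given threshold value.
+    # e.g. pixels_saved_for_threshold(0.5, 1920 * 1080) is 129600
+    func pixels_saved_for_threshold(threshold, screen_pixels):
+        return pow(threshold, 4.0) * float(screen_pixels)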

+ 258 - 0
tutorials/optimization/cpu_optimization.rst

@@ -0,0 +1,258 @@
+.. _doc_cpu_optimization:
+
+CPU Optimizations
+=================
+
+Measuring performance
+=====================
+
+To know how to speed up our program, we have to know where the "bottlenecks"
+are. Bottlenecks are the slowest parts of the program, and they limit the rate
+at which everything else can progress. Identifying them allows us to concentrate
+our efforts on optimizing the areas which will give us the greatest speed
+improvement, instead of spending a lot of time optimizing functions that will
+lead to only small performance improvements.
+
+For the CPU, the easiest way to identify bottlenecks is to use a profiler.
+
+CPU profilers
+=============
+
+Profilers run alongside your program and take timing measurements to work out
+what proportion of time is spent in each function.
+
+The Godot IDE conveniently has a built-in profiler. It does not run every time
+you start your project; it must be manually started and stopped. This is
+because, in common with most profilers, recording these timing measurements can
+slow down your project significantly.
+
+After profiling, you can look back at the results for a frame.
+
+.. image:: img/godot_profiler.png
+
+`These are the results of a profile of one of the demo projects.`
+
+.. note:: We can see the cost of built-in processes such as physics and audio,
+          as well as seeing the cost of our own scripting functions at the
+          bottom.
+
+When a project is running slowly, you will often see an obvious function or
+process taking a lot more time than others. This is your primary bottleneck, and
+you can usually increase speed by optimizing this area.
+
+For more info about using the profiler within Godot see
+:ref:`doc_debugger_panel`.
+
+External profilers
+~~~~~~~~~~~~~~~~~~
+
+Although the Godot IDE profiler is very convenient and useful, sometimes you
+need more power, and the ability to profile the Godot engine source code itself.
+
+You can use a number of third-party profilers to do this, including Valgrind,
+VerySleepy, Visual Studio and Intel VTune.
+
+.. note:: You may need to compile Godot from source in order to use a third
+          party profiler so that you have program database information
+          available. You can also use a debug build, however, note that the
+          results of profiling a debug build will be different to a release
+          build, because debug builds are less optimized. Bottlenecks are often
+          in a different place in debug builds, so you should profile release
+          builds wherever possible.
+
+.. image:: img/valgrind.png
+
+`These are example results from Callgrind, part of Valgrind, on Linux.`
+
+From the left, Callgrind is listing the percentage of time within a function and
+its children (Inclusive), the percentage of time spent within the function
+itself, excluding child functions (Self), the number of times the function is
+called, the function name, and the file or module.
+
+In this example we can see nearly all time is spent under the
+``Main::iteration()`` function. This is the master function in the Godot source
+code that is called repeatedly, and it causes frames to be drawn, physics ticks
+to be simulated, and nodes and scripts to be updated. A large proportion of the
+time is spent in the functions to render a canvas (66%), because this example
+uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
+outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver). This
+tells us that a large proportion of CPU time is being spent in the graphics
+driver.
+
+This is actually an excellent example because, in an ideal world, only a very
+small proportion of time would be spent in the graphics driver; this is an
+indication that there is a problem with too much communication and work being
+done in the graphics API. This profiling led to the development of 2D batching,
+which greatly speeds up 2D rendering by reducing the bottlenecks in this area.
+
+Manually timing functions
+=========================
+
+Another handy technique, especially once you have identified the bottleneck
+using a profiler, is to manually time the function or area under test. The
+specifics vary according to language, but in GDScript, you would do the
+following:
+
+::
+
+    var time_start = OS.get_system_time_msecs()
+    
+    # The function you want to time
+    update_enemies()
+
+    var time_end = OS.get_system_time_msecs()
+    print("Function took: " + str(time_end - time_start)) 
+
+
+You may want to use other timing functions if another time unit is more
+suitable, for example :ref:`OS.get_system_time_secs
+<class_OS_method_get_system_time_secs>` if the function takes many seconds to
+run.
+
+When manually timing functions, it is usually a good idea to run the function
+many times (say ``1000`` or more times), instead of just once (unless it is a
+very slow function). A large part of the reason for this is that timers often
+have limited accuracy, and CPUs will schedule processes in a haphazard manner,
+so an average over a series of runs is more accurate than a single measurement.
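+
+As a rough sketch (reusing the hypothetical ``update_enemies()`` function from
+above), averaging over many runs could look like this:
+
+::
+
+    func average_timing():
+        var runs = 1000
+        var time_start = OS.get_system_time_msecs()
+
+        for i in range(runs):
+            update_enemies()
+
+        var time_end = OS.get_system_time_msecs()
+        # Average time per call, in milliseconds.
+        print("Average: " + str(float(time_end - time_start) / runs))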
+
+As you attempt to optimize functions, be sure to either repeatedly profile or
+time them as you go. This will give you crucial feedback as to whether the
+optimization is working (or not).
+
+Caches
+======
+
+Something else to be particularly aware of, especially when comparing timing
+results of two different versions of a function, is that the results can be
+highly dependent on whether the data is in the CPU cache or not. CPUs don't load
+data directly from main memory, because although main memory can be huge (many
+GBs), it is very slow to access. Instead, CPUs load data from a smaller,
+higher-speed bank of memory called the cache. Loading data from the cache is
+very fast, but every time you try to load a memory address that is not stored
+in the cache, the data must be fetched from main memory, which is slow. This
+delay can leave the CPU sitting around idle for a long time, and is referred to
+as a "cache miss".
+
+This means that the first time you run a function, it may run slowly, because
+the data is not in cache. The second and later times, it may run much faster
+because the data is in cache. So always use averages when timing, and be aware
+of the effects of cache.
+
+Understanding caching is also crucial to CPU optimization. If you have an
+algorithm (routine) that loads small bits of data from randomly spread out areas
+of main memory, this can result in a lot of cache misses, and much of the time
+the CPU will be waiting around for data instead of doing any work. Instead, if
+you can make your data accesses localized, or even better, access memory in a
+linear fashion (like a continuous list), then the cache will work optimally and
+the CPU will be able to work as fast as possible.
+
+Godot usually takes care of such low-level details for you. For example, the
+Server APIs make sure data is optimized for caching already for things like
+rendering and physics. But you should be especially aware of caching when using
+GDNative.
+
+Languages
+=========
+
+Godot supports a number of different languages, and it is worth bearing in mind
+that there are trade-offs involved - some languages are designed for ease of
+use, at the cost of speed, and others are faster but more difficult to work
+with.
+
+Built-in engine functions run at the same speed regardless of the scripting
+language you choose. If your project is making a lot of calculations in its own
+code, consider moving those calculations to a faster language.
+
+GDScript
+~~~~~~~~
+
+GDScript is designed to be easy to use and iterate, and is ideal for making many
+types of games. However, ease of use is considered more important than
+performance, so if you need to make heavy calculations, consider moving some of
+your project to one of the other languages.
+
+C#
+~~
+
+C# is popular and has first class support in Godot. It offers a good compromise
+between speed and ease of use.
+
+Other languages
+~~~~~~~~~~~~~~~
+
+Third parties provide support for several other languages, including `Rust
+<https://github.com/godot-rust/godot-rust>`_ and `Javascript
+<https://github.com/GodotExplorer/ECMAScript>`_.
+
+C++
+~~~
+
+Godot is written in C++. Using C++ will usually result in the fastest code;
+however, on a practical level, it is the most difficult to deploy to end users'
+machines on different platforms. Options for using C++ include GDNative and
+custom modules.
+
+Threads
+=======
+
+Consider using threads when making a lot of calculations that can run parallel
+to one another. Modern CPUs have multiple cores, each one capable of doing a
+limited amount of work. By spreading work over multiple threads you can move
+further towards peak CPU efficiency.
+
+The disadvantage of threads is that you have to be incredibly careful. As each
+CPU core operates independently, they can end up trying to access the same
+memory at the same time. One thread can be reading from a variable while another
+is writing to it. Before you use threads, make sure you understand the dangers
+and how to try to prevent these race conditions.
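+
+As a minimal sketch (the worker function and data here are hypothetical), a
+thread can be started with ``Thread`` and shared data protected with a
+``Mutex``:
+
+::
+
+    var _thread = Thread.new()
+    var _mutex = Mutex.new()
+    var _results = []
+
+    func start_work():
+        _thread.start(self, "_heavy_work", 1000)
+
+    func _heavy_work(count):
+        for i in range(count):
+            var value = i * i  # stand-in for an expensive calculation
+            _mutex.lock()
+            _results.append(value)
+            _mutex.unlock()
+
+    func finish_work():
+        _thread.wait_to_finish()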
+
+For more information on threads see :ref:`doc_using_multiple_threads`.
+
+SceneTree
+=========
+
+Although Nodes are an incredibly powerful and versatile concept, be aware that
+every node has a cost. Built-in functions such as ``_process()`` and
+``_physics_process()`` propagate through the tree. This housekeeping can reduce
+performance when you have very large numbers of nodes.
+
+Each node is handled individually in the Godot renderer, so a smaller number of
+nodes with more content in each can sometimes lead to better performance.
+
+One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
+get much better performance by removing nodes from the SceneTree rather than
+by pausing or hiding them. You don't have to delete a detached node. You can,
+for example, keep a reference to a node, detach it from the scene tree, and
+reattach it later. This can be very useful for adding and removing areas from a
+game, as in the sketch below.
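+
+As a rough sketch (``DungeonArea`` is a hypothetical child node), detaching and
+reattaching an area could look like this:
+
+::
+
+    onready var _area = $DungeonArea
+
+    func unload_area():
+        # Keep the reference; just take the node out of the tree so it is no
+        # longer processed or rendered.
+        remove_child(_area)
+
+    func reload_area():
+        add_child(_area)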
+
+You can avoid the SceneTree altogether by using Server APIs. For more
+information, see :ref:`doc_using_servers`.
+
+Physics
+=======
+
+In some situations physics can end up becoming a bottleneck, particularly with
+complex worlds, and large numbers of physics objects.
+
+Some techniques to speed up physics:
+
+* Try using simplified versions of your rendered geometry for physics. Often
+  this won't be noticeable for end users, but can greatly increase performance.
+* Try removing objects from physics when they are out of view / outside the
+  current area, or reusing physics objects (maybe you allow 8 monsters per area,
+  for example, and reuse these).
+
+Another crucial aspect of physics is the physics tick rate. In some games, you
+can greatly reduce the tick rate; instead of updating physics 60 times per
+second, you may update it at only 20 or even 10 ticks per second. This can
+greatly reduce the CPU load.
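+
+The tick rate is set in the project settings under
+``physics/common/physics_fps`` and can also be changed at runtime. As a minimal
+sketch:
+
+::
+
+    func enable_low_cost_physics():
+        # Run physics at 30 ticks per second instead of the default 60.
+        Engine.iterations_per_second = 30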
+
+The downside of changing the physics tick rate is that you can get jerky
+movement or jitter when the physics update rate does not match the rendered
+frame rate.
+
+The solution to this problem is 'fixed timestep interpolation', which involves
+smoothing the rendered positions and rotations over multiple frames to match the
+physics. You can either implement this yourself or use a third-party addon.
+Performance-wise, interpolation is a very cheap operation compared to running a
+physics tick (orders of magnitude faster), so this can be a significant win, as
+well as reducing jitter.
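+
+As a very rough sketch of the idea (``$Body`` is a hypothetical physics body
+node, and this script sits on a purely visual node that follows it):
+
+::
+
+    var _prev_pos = Vector2()
+    var _curr_pos = Vector2()
+
+    func _physics_process(_delta):
+        # Record the positions of the last two physics ticks.
+        _prev_pos = _curr_pos
+        _curr_pos = $Body.position
+
+    func _process(_delta):
+        # Blend between the two physics positions based on how far we are
+        # through the current physics tick.
+        var fraction = Engine.get_physics_interpolation_fraction()
+        position = _prev_pos.linear_interpolate(_curr_pos, fraction)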

+ 291 - 0
tutorials/optimization/general_optimization.rst

@@ -0,0 +1,291 @@
+.. _doc_general_optimization:
+
+General optimization tips
+=========================
+
+Introduction
+~~~~~~~~~~~~
+
+In an ideal world, computers would run at infinite speed, and the only limit to
+what we could achieve would be our imagination. In the real world, however, it
+is all too easy to produce software that will bring even the fastest computer to
+its knees.
+
+Designing games and other software is thus a compromise between what we would
+like to be possible, and what we can realistically achieve while maintaining
+good performance.
+
+To achieve the best results, we have two approaches:
+
+* Work faster
+* Work smarter
+
+And preferably, we will use a blend of the two.
+
+Smoke and Mirrors
+^^^^^^^^^^^^^^^^^
+
+Part of working smarter is recognizing that, especially in games, we can often
+get the player to believe they are in a world that is far more complex, 
+interactive, and graphically exciting than it really is. A good programmer is a
+magician, and should strive to learn the tricks of the trade, and try to invent
+new ones.
+
+The nature of slowness
+^^^^^^^^^^^^^^^^^^^^^^
+
+To the outside observer, performance problems are often lumped together. But in
+reality, there are several different kinds of performance problem:
+
+* A slow process that occurs every frame, leading to a continuously low frame
+  rate 
+* An intermittent process that causes 'spikes' of slowness, leading to
+  stalls 
+* A slow process that occurs outside of normal gameplay, for instance, on
+  level load
+
+Each of these is annoying to the user, but in different ways.
+
+Measuring Performance
+=====================
+
+Probably the most important tool for optimization is the ability to measure
+performance - to identify where bottlenecks are, and to measure the success of
+our attempts to speed them up.
+
+There are several methods of measuring performance, including:
+
+* Putting a start / stop timer around code of interest
+* Using the Godot profiler
+* Using external third party profilers
+* Using GPU profilers / debuggers
+* Checking the frame rate (with vsync disabled)
+
+Be very aware that the relative performance of different areas can vary on
+different hardware. Often it is a good idea to make timings on more than one
+device, especially including mobile as well as desktop, if you are targeting
+mobile.
+
+Limitations
+~~~~~~~~~~~
+
+CPU profilers are often the 'go to' method for measuring performance; however,
+they don't always tell the whole story.
+
+- Bottlenecks are often on the GPU, *as a result* of instructions given by the
+  CPU
+- Spikes can occur in the Operating System processes (outside of Godot) *as a
+  result* of instructions used in Godot (for example dynamic memory allocation)
+- You may not be able to run a profiler on some devices (e.g. a mobile phone)
+- You may have to solve performance problems that occur on hardware you don't
+  have access to
+
+As a result of these limitations, you often need to use detective work to find
+out where bottlenecks are.
+
+Detective work
+~~~~~~~~~~~~~~
+
+Detective work is a crucial skill for developers (both in terms of performance,
+and also in terms of bug fixing). This can include hypothesis testing, and
+binary search.
+
+Hypothesis testing
+^^^^^^^^^^^^^^^^^^
+
+Say, for example, you believe that sprites are slowing down your game. You can
+test this hypothesis by:
+
+* Measuring the performance when you add more sprites, or take some away.
+
+This may lead to a further hypothesis - does the size of the sprite determine
+the performance drop?
+
+* You can test this by keeping everything the same, but changing the sprite
+  size, and measuring performance
+
+Binary search
+^^^^^^^^^^^^^
+
+Say you know that frames are taking much longer than they should, but you are
+not sure where the bottleneck lies. You could begin by commenting out
+approximately half the routines that occur on a normal frame. Has the
+performance improved more or less than expected?
+
+Once you know which of the two halves contains the bottleneck, you can then
+repeat this process, until you have pinned down the problematic area.
+
+Profilers
+=========
+
+Profilers allow you to time your program while it is running. They then provide
+results telling you what percentage of time was spent in different functions and
+areas, and how often functions were called.
+
+This can be very useful both to identify bottlenecks and to measure the results
+of your improvements. Sometimes attempts to improve performance can backfire and
+lead to slower performance, so always use profiling and timing to guide your
+efforts.
+
+For more info about using the profiler within Godot see
+:ref:`doc_debugger_panel`.
+
+Principles
+==========
+
+In the words of Donald Knuth:
+
+    *Programmers waste enormous amounts of time thinking about, or worrying
+    about, the speed of noncritical parts of their programs, and these attempts
+    at efficiency actually have a strong negative impact when debugging and
+    maintenance are considered. We should forget about small efficiencies, say
+    about 97% of the time: premature optimization is the root of all evil. Yet
+    we should not pass up our opportunities in that critical 3%.*
+
+The messages are very important:
+
+* Programmer / Developer time is limited. Instead of blindly trying to speed up
+  all aspects of a program we should concentrate our efforts on the aspects that
+  really matter.
+* Efforts at optimization often end up with code that is harder to read and
+  debug than non-optimized code. It is in our interests to limit this to areas
+  that will really benefit.
+
+Just because we *can* optimize a particular bit of code, it doesn't necessarily
+mean that we should. Knowing when, and when not, to optimize is a great skill to
+develop.
+
+One misleading aspect of the quote is that people tend to focus on the subquote
+"premature optimization is the root of all evil". While *premature* optimization
+is (by definition) undesirable, performant software is the result of performant
+design.
+
+Performant design
+~~~~~~~~~~~~~~~~~
+
+The danger with encouraging people to ignore optimization until necessary is
+that it conveniently ignores that the most important time to consider
+performance is at the design stage, before a single key has been pressed. If the
+design / algorithms of a program are inefficient, then no amount of polishing the
+details later will make it run fast. It may run *faster*, but it will never run
+as fast as a program designed for performance.
+
+This tends to be far more important in game / graphics programming than in
+general programming. A performant design, even without low level optimization,
+will often run many times faster than a mediocre design with low level
+optimization.
+
+Incremental design
+~~~~~~~~~~~~~~~~~~
+
+Of course, in practice, unless you have prior knowledge, you are unlikely to
+come up with the best design first time. So you will often make a series of
+versions of a particular area of code, each taking a different approach to the
+problem, until you come to a satisfactory solution. It is important not to spend
+too much time on the details at this stage until you have finalized the overall
+design, otherwise much of your work will be thrown out.
+
+It is difficult to give general guidelines for performant design because this is
+so dependent on the problem. One point worth mentioning, though, on the CPU
+side, is that modern CPUs are nearly always limited by memory bandwidth. This
+has led to a resurgence in data-oriented design, which involves designing data
+structures and algorithms for locality of data and linear access, rather than
+jumping around in memory.
+
+The optimization process
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Assuming we have a reasonable design, and taking our lessons from Knuth, our
+first step in optimization should be to identify the biggest bottlenecks - the
+slowest functions, the low hanging fruit.
+
+Once we have successfully improved the speed of the slowest area, it may no
+longer be the bottleneck. So we should test / profile again, and find the next
+bottleneck on which to focus.
+
+The process is thus:
+
+1. Profile / Identify bottleneck
+2. Optimize bottleneck
+3. Return to step 1
+
+Optimizing bottlenecks
+~~~~~~~~~~~~~~~~~~~~~~
+
+Some profilers will even tell you which parts of a function (which data accesses
+or calculations) are slowing things down.
+
+As with design, you should concentrate your efforts first on making sure the
+algorithms and data structures are the best they can be. Data access should be
+local (to make best use of the CPU cache), and it can often be better to use
+compact storage of data (again, always profile to test results). You can often
+precalculate heavy computations ahead of time (e.g. at level load, or by loading
+precalculated data files).
+
+Once algorithms and data are good, you can often make small changes in routines
+which improve performance, such as moving calculations outside of loops.
+
+Always retest your timing / bottlenecks after making each change. Some changes
+will increase speed, others may have a negative effect. Sometimes a small
+positive effect will be outweighed by the negatives of more complex code, and
+you may choose to leave out that optimization.
+
+Appendix
+========
+
+Bottleneck math
+~~~~~~~~~~~~~~~
+
+The proverb "a chain is only as strong as its weakest link" applies directly to
+performance optimization. If your project is spending 90% of the time in
+function 'A', then optimizing A can have a massive effect on performance.
+
+.. code-block:: none
+
+    A: 9 ms
+    Everything else: 1 ms
+    Total frame time: 10 ms
+
+.. code-block:: none
+
+    A: 1 ms 
+    Everything else: 1ms 
+    Total frame time: 2 ms
+
+So in this example, improving bottleneck A by a factor of 9x decreases the
+overall frame time by 5x and increases frames per second by 5x.
+
+If however, something else is running slowly and also bottlenecking your
+project, then the same improvement can lead to less dramatic gains:
+
+.. code-block:: none
+
+    A: 9 ms
+    Everything else: 50 ms
+    Total frame time: 59 ms
+
+.. code-block:: none
+
+    A: 1 ms
+    Everything else: 50 ms
+    Total frame time: 51 ms
+
+So in this example, even though we have hugely optimized function A, the
+actual gain in terms of frame rate is quite small.
+
+In games, things become even more complicated because the CPU and GPU run
+independently of one another. Your total frame time is determined by the slower
+of the two.
+
+.. code-block:: none
+
+    CPU: 9 ms
+    GPU: 50 ms
+    Total frame time: 50 ms
+
+.. code-block:: none
+
+    CPU: 1 ms
+    GPU: 50 ms
+    Total frame time: 50 ms
+
+In this example, we optimized the CPU hugely again, but the frame time did not
+improve, because we are GPU-bottlenecked.

+ 263 - 0
tutorials/optimization/gpu_optimization.rst

@@ -0,0 +1,263 @@
+.. _doc_gpu_optimization:
+
+GPU Optimizations
+=================
+
+Introduction
+~~~~~~~~~~~~
+
+The demand for new graphics features and progress almost guarantees that you
+will encounter graphics bottlenecks. Some of these can be on the CPU side, for
+instance in calculations inside the Godot engine to prepare objects for
+rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
+sorts instructions to pass to the GPU, and in the transfer of these
+instructions. Finally, bottlenecks can also occur on the GPU itself.
+
+Where bottlenecks occur in rendering is highly hardware specific. Mobile GPUs in
+particular may struggle with scenes that run easily on desktop.
+
+Understanding and investigating GPU bottlenecks is slightly different from the
+situation on the CPU, because you can often only change performance indirectly,
+by changing the instructions you give to the GPU, and it may be more difficult
+to take measurements. Often, the only way of measuring performance is by
+examining changes in frame rate.
+
+Draw calls, state changes, and APIs
+===================================
+
+.. note:: The following section is not relevant to end-users, but is useful to
+          provide background information that is relevant in later sections.
+
+Godot sends instructions to the GPU via a graphics API (OpenGL, GLES2, GLES3,
+Vulkan). The communication and driver activity involved can be quite costly,
+especially in OpenGL. If we can provide these instructions in a way that is
+preferred by the driver and GPU, we can greatly increase performance.
+
+Nearly every API command in OpenGL requires a certain amount of validation to
+make sure the GPU is in the correct state. Even seemingly simple commands can
+lead to a flurry of behind-the-scenes housekeeping. Therefore, the name of the
+game is to reduce these instructions to a bare minimum and to group together
+similar objects as much as possible, so they can be rendered together or with
+the minimum number of these expensive state changes.
+
+2D batching
+~~~~~~~~~~~
+
+In 2D, the costs of treating each item individually can be prohibitively high -
+there can easily be thousands of items on screen. This is why 2D batching is
+used - multiple similar items are grouped together and rendered in a batch, via
+a single draw call, rather than making a separate draw call for each item. In
+addition, this means that state changes and material and texture changes can be
+kept to a minimum.
+
+For more information on 2D batching see :ref:`doc_batching`.
+
+3D batching
+~~~~~~~~~~~
+
+In 3D, we still aim to minimize draw calls and state changes; however, it can be
+more difficult to batch together several objects into a single draw call. 3D
+meshes tend to comprise hundreds or thousands of triangles, and combining large
+meshes at runtime is prohibitively expensive. The cost of joining them quickly
+exceeds any benefit as the number of triangles per mesh grows. A much better
+alternative is to join meshes ahead of time (meshes that are static in relation
+to each other). This can be done either by artists or programmatically within
+Godot.
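+
+As a rough sketch (``mesh_a``, ``mesh_b``, and ``xform_b`` below are
+hypothetical, and a single surface sharing the same material is assumed), meshes
+can be joined programmatically using ``SurfaceTool``:
+
+::
+
+    func join_meshes(mesh_a, mesh_b, xform_b):
+        var st = SurfaceTool.new()
+        # Append both source meshes into one surface; xform_b places the
+        # second mesh relative to the first.
+        st.append_from(mesh_a, 0, Transform())
+        st.append_from(mesh_b, 0, xform_b)
+        return st.commit()  # Returns a combined ArrayMesh.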
+
+There is also a cost to batching together objects in 3D. Several objects
+rendered as one cannot be individually culled: an entire city that is off screen
+will still be rendered if it is joined to a single blade of grass that is on
+screen. So, attempting to batch together 3D objects should take into account
+their location and effect on culling. Despite this, the benefits of joining
+static objects often outweigh other considerations, especially for large numbers
+of low-poly objects.
+
+For more information on 3D specific optimizations, see
+:ref:`doc_optimizing_3d_performance`.
+
+Reuse Shaders and Materials
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Godot renderer is a little different from other renderers out there. It's
+designed to minimize GPU state changes as much as possible. :ref:`SpatialMaterial
+<class_SpatialMaterial>` does a good job at reusing materials that need similar
+shaders, but if custom shaders are used, make sure to reuse them as much as
+possible. Godot's priorities are:
+
+-  **Reusing materials**: The fewer different materials in the
+   scene, the faster the rendering will be. If a scene has a huge number
+   of objects (in the hundreds or thousands), try reusing the materials
+   or, in the worst case, use atlases.
+-  **Reusing shaders**: If materials can't be reused, at least try to
+   reuse shaders (or SpatialMaterials with different parameters but the same
+   configuration).
+
+If a scene has, for example, ``20,000`` objects each with its own different
+material, rendering will be slow. If the same scene has ``20,000`` objects, but
+only uses ``100`` materials, rendering will be much faster.
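+
+As a minimal sketch, sharing a single mesh and material resource across many
+instances (rather than creating a new material per object) might look like
+this::
+
+    func _ready():
+        # Create the mesh and material once and share them across every instance.
+        var shared_mesh = CubeMesh.new()
+        var shared_material = SpatialMaterial.new()
+        shared_material.albedo_color = Color(0.8, 0.2, 0.2)
+
+        for i in range(100):
+            var mi = MeshInstance.new()
+            mi.mesh = shared_mesh
+            mi.material_override = shared_material
+            mi.translation = Vector3(i * 2.0, 0.0, 0.0)
+            add_child(mi)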
+
+Pixel cost vs vertex cost
+=========================
+
+You may have heard that the lower the number of polygons in a model, the faster
+it will be rendered. This is *really* relative and depends on many factors.
+
+On a modern PC and console, vertex cost is low. Early GPUs only rasterized
+triangles, so every frame, all the vertices:
+
+1. Had to be transformed by the CPU (including clipping).
+
+2. Had to be sent to GPU memory from the main RAM.
+
+Nowadays, all this is handled inside the GPU, so the performance is much
+higher. 3D artists often misjudge polycount performance because 3D DCCs (such
+as Blender, Max, etc.) need to keep geometry in CPU memory in order for it to
+be edited, which reduces actual performance. Game engines rely on the GPU more,
+so they can render many triangles much more efficiently.
+
+On mobile devices, the story is different. PC and console GPUs are brute-force
+monsters that can pull as much electricity as they need from the power grid.
+Mobile GPUs are limited to a tiny battery, so they need to be a lot more power
+efficient.
+
+To be more efficient, mobile GPUs attempt to avoid *overdraw*: the same pixel
+on the screen being rendered more than once. Imagine a town with several
+buildings. GPUs don't know what is visible and what is hidden until they draw
+it. A house might be drawn, and then another house in front of it, so rendering
+happened twice for the same pixel! PC GPUs normally don't care much about this
+and just add more pixel processors to the hardware to increase performance
+(which also increases power consumption).
+
+Using more power is not an option on mobile, so mobile GPUs use a technique
+called *tile-based rendering*, which divides the screen into a grid. Each cell
+keeps the list of triangles drawn to it and sorts them by depth to minimize
+*overdraw*. This technique improves performance and reduces power consumption,
+but takes a toll on vertex performance. As a result, fewer vertices and
+triangles can be processed for drawing.
+
+Additionally, tile-based rendering struggles when there are small objects with
+a lot of geometry within a small portion of the screen. This forces mobile GPUs
+to put a lot of strain on a single screen tile, which considerably decreases
+performance, as all the other cells must wait for it to complete in order to
+display the frame.
+
+In summary, do not worry about vertex count on mobile, but avoid concentration
+of vertices in small parts of the screen. If a character, NPC, vehicle, etc. is
+far away (so it looks tiny), use a smaller level of detail (LOD) model.
+
+Pay attention to the additional vertex processing required when using:
+
+-  Skinning (skeletal animation)
+-  Morphs (shape keys)
+-  Vertex-lit objects (common on mobile)
+
+Pixel / fragment shaders - fill rate
+====================================
+
+In contrast to vertex processing, the cost of fragment shading has increased
+dramatically over the years. Screen resolutions have increased (the area of a
+4K screen is ``8,294,400`` pixels, versus ``307,200`` for an old ``640x480``
+VGA screen - that is 27 times the area), but the complexity of fragment shaders
+has also exploded. Physically based rendering requires complex calculations for
+each fragment.
+
+You can test whether a project is fill rate limited quite easily. Turn off
+vsync to prevent capping the frames per second, then compare the frames per
+second when running with a large window to running with a postage stamp sized
+window (you may also benefit from similarly reducing your shadow map size if
+using shadows). Usually, you will find that the fps increases quite a bit when
+using a small window, which indicates you are, to some extent, fill rate
+limited. If, on the other hand, there is little to no increase in fps, then
+your bottleneck lies elsewhere.
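+
+As a quick sketch, you can do this from a script by disabling vsync and
+watching the frame rate while you resize the window::
+
+    func _ready():
+        # Uncap the frame rate so differences show up clearly.
+        OS.vsync_enabled = false
+
+    func _process(_delta):
+        # Watch how this value changes as you resize the window.
+        print(Engine.get_frames_per_second())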
+
+You can increase performance in a fill rate limited project by reducing the
+amount of work the GPU has to do. You can do this by simplifying the shader
+(perhaps by turning off expensive options if you are using a
+:ref:`SpatialMaterial <class_SpatialMaterial>`), or by reducing the number and
+size of the textures used.
+
+Consider shipping simpler shaders for mobile.
+
+Reading textures
+~~~~~~~~~~~~~~~~
+
+The other factor in fragment shaders is the cost of reading textures. Reading
+textures is an expensive operation, especially when reading from several in a
+single fragment shader, and filtering adds further expense (trilinear filtering
+between mipmaps, and averaging). Reading textures is also expensive in terms of
+power, which is a big issue on mobile devices.
+
+Texture compression
+~~~~~~~~~~~~~~~~~~~
+
+Godot compresses textures of 3D models when imported (VRAM compression) by
+default. Video RAM compression is not as efficient in size as PNG or JPG when
+stored, but increases performance enormously when drawing.
+
+This is because the main goal of texture compression is bandwidth reduction
+between memory and the GPU.
+
+In 3D, the shapes of objects depend more on the geometry than the texture, so
+compression is generally not noticeable. In 2D, compression depends more on
+shapes inside the textures, so the artifacts resulting from 2D compression are
+more noticeable.
+
+As a warning, most Android devices do not support compression of textures with
+transparency (only opaque textures are compressed), so keep this in mind.
+
+Post processing / shadows
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Post processing effects and shadows can also be expensive in terms of fragment
+shading activity. Always test the impact of these on different hardware.
+
+Reducing the size of shadow maps can increase performance, both in terms of
+writing and reading the maps.
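+
+For example, the shadow atlas used by omni and spot lights can be shrunk per
+viewport (directional shadows are controlled separately by the
+``rendering/quality/directional_shadow/size`` project setting). The following
+is only a sketch; the right size depends on your project::
+
+    func _ready():
+        # Halve the shadow atlas (the project default is usually 4096).
+        get_viewport().shadow_atlas_size = 2048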
+
+Transparency / blending
+=======================
+
+Transparent items present particular problems for rendering efficiency. Opaque
+items (especially in 3D) can essentially be rendered in any order, and the
+Z-buffer will ensure that only the front-most objects get shaded. Transparent
+or blended objects are different - in most cases, they cannot rely on the
+Z-buffer and must be rendered in "painter's order" (i.e. from back to front)
+to look correct.
+
+Transparent items are also particularly bad for fill rate, because every item
+has to be drawn, even if later transparent items will be drawn on top.
+
+Opaque items don't have to do this. They can usually take advantage of the
+Z-buffer by writing to the Z-buffer first, then only running the fragment
+shader on the 'winning' fragment - the item that is at the front at a
+particular pixel.
+
+Transparency is particularly expensive where multiple transparent items overlap.
+It is usually better to use as small a transparent area as possible in order to
+minimize these fill rate requirements, especially on mobile, where fill rate is
+very expensive. Indeed, in many situations, rendering more complex opaque
+geometry can end up being faster than using transparency to "cheat".
+
+Multi-platform advice
+=====================
+
+If you are aiming to release on multiple platforms, test *early* and test
+*often* on all your platforms, especially mobile. Developing a game on desktop
+but attempting to port it to mobile at the last minute is a recipe for
+disaster.
+
+In general, you should design your game for the lowest common denominator, then
+add optional enhancements for more powerful platforms. For example, you may
+want to use the GLES2 backend for both desktop and mobile platforms where you
+target both.
+
+Mobile / tile renderers
+=======================
+
+GPUs on mobile devices work in dramatically different ways from GPUs on
+desktop. Most mobile devices use tile renderers. Tile renderers split the
+screen into regular-sized tiles that fit into super-fast cache memory, which
+reduces the reads and writes to main memory.
+
+There are some downsides, though: tile renderers can make certain techniques
+much more complicated and expensive to perform. Tiles that rely on the results
+of rendering in different tiles, or on the results of earlier operations being
+preserved, can be very slow. Be very careful to test the performance of
+shaders, viewport textures, and post processing.

BIN
tutorials/optimization/img/godot_profiler.png


BIN
tutorials/optimization/img/lights_overlap.png


BIN
tutorials/optimization/img/lights_separate.png


BIN
tutorials/optimization/img/overlap1.png


BIN
tutorials/optimization/img/overlap2.png


BIN
tutorials/optimization/img/scissoring.png


BIN
tutorials/optimization/img/valgrind.png


+ 67 - 1
tutorials/optimization/index.rst

@@ -1,9 +1,75 @@
 Optimization
 =============
 
+Introduction
+~~~~~~~~~~~~
+
+Godot follows a balanced performance philosophy. In the performance world, there
+are always trade-offs, which consist of trading speed for usability and
+flexibility. Some practical examples of this are:
+
+-  Rendering objects efficiently in high amounts is easy, but when a
+   large scene must be rendered, it can become inefficient. To solve this,
+   visibility computation must be added to the rendering, which makes rendering
+   less efficient, but, at the same time, fewer objects are rendered, so
+   efficiency overall improves.
+
+-  Configuring the properties of every material for every object that
+   needs to be rendered is also slow. To solve this, objects are sorted by
+   material to reduce the costs, but at the same time sorting has a cost.
+
+-  In 3D physics a similar situation happens. The best algorithms to
+   handle large amounts of physics objects (such as SAP) are slow at
+   insertion/removal of objects and ray-casting. Algorithms that allow faster
+   insertion and removal, as well as ray-casting, will not be able to handle as
+   many active objects.
+
+And there are many more examples of this! Game engines strive to be general
+purpose in nature, so balanced algorithms are always favored over algorithms
+that might be fast in some situations and slow in others, or algorithms that
+are fast but make usability more difficult.
+
+Godot is not an exception and, while it is designed to have backends swappable
+for different algorithms, the default ones prioritize balance and flexibility
+over performance.
+
+With this clear, the aim of this tutorial section is to explain how to get the
+maximum performance out of Godot. While the tutorials can be read in any order,
+it is a good idea to start from :ref:`doc_general_optimization`.
+
 .. toctree::
    :maxdepth: 1
-   :name: toc-learn-features-optimization
+   :caption: Common
+   :name: toc-learn-features-general-optimization
 
+   general_optimization
    using_servers
+
+.. toctree::
+   :maxdepth: 1
+   :caption: CPU
+   :name: toc-learn-features-cpu-optimization
+
+   cpu_optimization
+
+.. toctree::
+   :maxdepth: 1
+   :caption: GPU
+   :name: toc-learn-features-gpu-optimization
+
+   gpu_optimization
    using_multimesh
+
+.. toctree::
+   :maxdepth: 1
+   :caption: 2D
+   :name: toc-learn-features-2d-optimization
+
+   batching
+
+.. toctree::
+   :maxdepth: 1
+   :caption: 3D
+   :name: toc-learn-features-3d-optimization
+
+   optimizing_3d_performance

+ 143 - 0
tutorials/optimization/optimizing_3d_performance.rst

@@ -0,0 +1,143 @@
+.. meta::
+    :keywords: optimization
+
+.. _doc_optimizing_3d_performance:
+
+Optimizing 3D performance
+=========================
+
+Culling
+=======
+
+Godot will automatically perform view frustum culling in order to prevent
+rendering objects that are outside the viewport. This works well for games that
+take place in a small area; however, things can quickly become problematic in
+larger levels.
+
+Occlusion culling
+~~~~~~~~~~~~~~~~~
+
+Walking around a town, for example, you may only be able to see a few buildings
+in the street you are in, as well as the sky and a few birds flying overhead.
+As far as a naive renderer is concerned, however, you can still see the entire
+town. It won't just render the buildings in front of you; it will render the
+street behind that, the people on that street, and the buildings behind them.
+You quickly end up in situations where you are attempting to render 10x or 100x
+more than what is visible.
+
+Things aren't quite as bad as they seem, because the Z-buffer usually allows the
+GPU to only fully shade the objects that are at the front. However, unneeded
+objects are still reducing performance.
+
+One way we can potentially reduce the amount to be rendered is to take
+advantage of occlusion. As of version 3.2.2, there is no built-in support for
+occlusion in Godot; however, with careful design you can still get many of the
+advantages.
+
+For instance, in our city street scenario, you may be able to work out in
+advance that you can only see two other streets, ``B`` and ``C``, from street
+``A``. Streets ``D`` to ``Z`` are hidden. In order to take advantage of
+occlusion, all you have to do is work out when your viewer is in street ``A``
+(perhaps using Godot :ref:`Areas <class_Area>`), then hide the other streets.
+
+This is a manual version of what is known as a 'potentially visible set'. It is
+a very powerful technique for speeding up rendering. You can also use it to
+restrict physics or AI to the local area, speeding these up as well as
+rendering.
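+
+A minimal sketch of this idea, assuming each street is grouped under its own
+Spatial node and an :ref:`Area <class_Area>` named ``StreetATrigger`` covers
+street ``A`` (all node names here are hypothetical)::
+
+    func _ready():
+        $StreetATrigger.connect("body_entered", self, "_on_street_a_body_entered")
+
+    func _on_street_a_body_entered(body):
+        if not body.is_in_group("player"):
+            return
+        # From street A, only streets B and C can ever be seen.
+        for street in [$StreetD, $StreetE, $StreetF]:
+            street.visible = false
+        $StreetB.visible = true
+        $StreetC.visible = true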
+
+Other occlusion techniques
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are other occlusion techniques such as portals, automatic PVS, and
+raster-based occlusion culling. Some of these may be available through add-ons,
+and some may be added to core Godot in the future.
+
+Transparent objects
+~~~~~~~~~~~~~~~~~~~
+
+Godot sorts objects by :ref:`Material <class_Material>` and :ref:`Shader
+<class_Shader>` to improve performance. This, however, cannot be done with
+transparent objects. Transparent objects are rendered from back to front to
+make blending with what is behind them work. As a result, try to use as few
+transparent objects as possible. If an object has a small section with
+transparency, try to make that section a separate surface with its own
+Material.
+
+For more information, see the :ref:`GPU optimizations <doc_gpu_optimization>`
+doc.
+
+Level of detail (LOD)
+=====================
+
+In some situations, particularly at a distance, it can be a good idea to
+replace complex geometry with simpler versions - the end user will probably not
+be able to see much difference. Consider looking at a large number of trees in
+the far distance. There are several strategies for replacing models at varying
+distances. You could use lower-poly models, or use transparency to simulate
+more complex geometry.
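+
+A hand-rolled distance check is often enough. As a sketch, a script on a
+Spatial with hypothetical ``HighPoly`` and ``LowPoly`` child meshes could swap
+between them like this::
+
+    const LOD_DISTANCE = 50.0
+
+    func _process(_delta):
+        var camera = get_viewport().get_camera()
+        if camera == null:
+            return
+        var distance = global_transform.origin.distance_to(camera.global_transform.origin)
+        # Show the detailed model up close and the cheap one far away.
+        $HighPoly.visible = distance < LOD_DISTANCE
+        $LowPoly.visible = distance >= LOD_DISTANCE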
+
+Billboards and imposters
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The simplest version of using transparency to deal with LOD is billboards. For
+example, you can use a single transparent quad to represent a tree at a
+distance. This can be very cheap to render, unless, of course, there are many
+trees in front of each other, in which case transparency may start eating into
+fill rate (for more information on fill rate, see :ref:`doc_gpu_optimization`).
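+
+As a rough sketch, a billboarded quad can be set up from code like this (the
+texture path is hypothetical)::
+
+    func add_tree_billboard(position):
+        var material = SpatialMaterial.new()
+        material.flags_transparent = true
+        material.params_billboard_mode = SpatialMaterial.BILLBOARD_ENABLED
+        material.albedo_texture = load("res://trees/tree_billboard.png")
+
+        var quad = MeshInstance.new()
+        quad.mesh = QuadMesh.new()
+        quad.material_override = material
+        quad.translation = position
+        add_child(quad)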
+
+An alternative is to render not just one tree, but a number of trees together as
+a group. This can be especially effective if you can see an area but cannot
+physically approach it in a game.
+
+You can make imposters by pre-rendering views of an object at different angles.
+Or you can even go one step further, and periodically re-render a view of an
+object onto a texture to be used as an imposter. At a distance, you need to move
+the viewer a considerable distance for the angle of view to change
+significantly. This can be complex to get working, but may be worth it depending
+on the type of project you are making.
+
+Use instancing (MultiMesh)
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If several identical objects have to be drawn in the same place or nearby, try
+using :ref:`MultiMesh <class_MultiMesh>` instead. MultiMesh allows the drawing
+of many thousands of objects at very little performance cost, making it ideal
+for flocks, grass, particles, and anything else where you have thousands of
+identical objects.
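+
+A minimal sketch of filling a MultiMesh from code (the mesh and counts are
+placeholders)::
+
+    func _ready():
+        var multimesh = MultiMesh.new()
+        # The transform format must be set before instance_count.
+        multimesh.transform_format = MultiMesh.TRANSFORM_3D
+        multimesh.mesh = CubeMesh.new() # Replace with your own low-poly mesh.
+        multimesh.instance_count = 10000
+        for i in range(multimesh.instance_count):
+            var position = Vector3(randf() * 100.0, 0.0, randf() * 100.0)
+            multimesh.set_instance_transform(i, Transform(Basis(), position))
+
+        var mmi = MultiMeshInstance.new()
+        mmi.multimesh = multimesh
+        add_child(mmi)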
+
+Also see the :ref:`Using MultiMesh <doc_using_multimesh>` doc.
+
+Bake lighting
+=============
+
+Lighting objects is one of the most costly rendering operations. Realtime
+lighting, shadows (especially from multiple lights), and GI are especially
+expensive. They may simply be too much for lower-powered mobile devices to
+handle.
+
+Consider using baked lighting, especially for mobile. This can look fantastic,
+but has the downside that it will not be dynamic. Sometimes, this is a
+trade-off worth making.
+
+In general, if several lights need to affect a scene, it's best to use
+:ref:`doc_baked_lightmaps`. Baking can also improve the scene quality by adding
+indirect light bounces.
+
+Animation / Skinning
+====================
+
+Animation, and particularly vertex animation such as skinning and morphing, can
+be very expensive on some platforms. You may need to lower the poly count
+considerably for animated models, or limit the number of them on screen at any
+one time.
+
+Large worlds
+============
+
+If you are making large worlds, there are different considerations than what you
+may be familiar with from smaller games.
+
+Large worlds may need to be built in tiles that can be loaded on demand as you
+move around the world. This can prevent memory use from getting out of hand, and
+also limit the processing needed to the local area.
+
+There may be glitches due to floating-point error in large worlds. You may be
+able to use techniques such as orienting the world around the player (rather
+than the other way around), or shifting the origin periodically to keep things
+centered around (0, 0, 0).
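+
+A very rough sketch of periodic origin shifting, assuming the player and all
+world chunks are direct children of the node running this script (the node
+name and threshold are arbitrary)::
+
+    const ORIGIN_SHIFT_THRESHOLD = 5000.0
+
+    func _physics_process(_delta):
+        var offset = $Player.global_transform.origin
+        if offset.length() < ORIGIN_SHIFT_THRESHOLD:
+            return
+        # Move every world chunk (and the player) back towards the origin.
+        # Physics and networking state will usually need extra handling.
+        for child in get_children():
+            if child is Spatial:
+                child.translation -= offset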