Optimization Guidelines

From Neos Wiki
Jump to navigation Jump to search
This page contains changes which are not marked for translation.
Other languages:
English • ‎日本語 • ‎한국어

This is a community-generated page to help other users discover how to optimize the performance of their creations in Neos. As with the whole Wiki, anyone is free to edit this page. However, since this page makes specific technical recommendations, it is best if information is backed up with an official source (e.g. a Discord comment from one of the Neos development team).

Rendering

Blendshapes

Every time a blendshape changes, the vertices have to be retransformed on the entire mesh. [1] If the majority of the mesh is not part of any blendshape, then that performance is wasted. Neos can automatically optimize this with the "Separate parts of mesh unaffected by blendshapes" found under the SkinnedMeshRenderer component. Whether this is worth it or not varies on a case-by-case basis, so you'll have to test your before/after performance while driving blendshapes to be sure.

Any blendshape not at zero has a performance cost proportional to the number of vertices, even if the blendshape is not changing. For example, a one million vertex mesh will have a significantly higher performance impact with a blendshape at 0.01 than with the blendshape at 0. If you have a mesh with non-driven blendshapes set to anything other than zero, consider baking them.

Materials

Some materials (notably the Fur material) are much more expensive than others. [2]

Alpha/Transparent/Additive/Multiply blend modes count as transparent materials and are slightly more expensive because things behind them have to be rendered and filtered through them. Transparent materials use the forward rendering pipeline, so they don't handle dynamic lights as consistently. Opaque and Cutout blend modes on PBS materials use the deferred rendering pipeline, and handle dynamic lights better.

Texture Dimensions

Square textures with pixel dimensions of a power of two (2, 4, 8, 16, 32, 64, 128, 512, 1024, 2048, 4096, etc) are more efficiently handled in VRAM.

Texture Atlasing

If you have a number of different materials of the same type, consider atlasing (combining multiple textures into one larger texture). [3] Even if different parts of your mesh use different settings, the addition of maps can let you combine many materials into one. Try to avoid large empty spaces in the resulting atlas, as they can waste VRAM. [4]

Places where atlasing doesn't help:

  • If you need a different material all together, e.g. a Fur part of a mostly PBS avatar.
  • If you need part of your avatar to have Alpha blend, but the majority is fine with Opaque or Cutout.

Procedural vs Static Assets

If you are not driving the parameters of a procedural mesh, then you can save performance by baking it into a static mesh. Procedural meshes and textures are per-world. This is because the procedural asset is duplicated with the item. Static meshes and textures are automatically instanced across worlds so there's only a single copy in memory at all times, and do not need to be saved on the item itself. [5]

GPU Mesh Instancing

If there are multiple instances of the same static mesh/material combination, they will be instanced (on most shaders). This can significantly improve performance when rendering multiple instances of the same object, e.g. having lots of trees in the environment [6]. Note that SkinnedMeshRenderers are not eligible for GPU instancing. [7]

Mirrors and Cameras

Mirrors and cameras can be quite expensive, especially at higher resolutions, as they require additional rendering passes. Mirrors are generally more expensive than cameras, as they require two additional passes (one per eye).

The performance of cameras can be improved by using appropriate near/far clip values and using the selective/exclusive render lists. Basically, avoid rendering what you don't need to.

It's good practice to localize mirrors and cameras with a ValueUserOverride so users can opt in if they're willing to sacrifice performance to see them.

Reflection Probes

Baked reflection probes are quite cheap, especially at the default resolution of 128x128. The only real cost is the VRAM used to store the cube map. [8]

Real-time reflection probes are extremely expensive, and are comparable to six cameras. [9]

Lighting

Light impact is proportional to how many pixels a light shines on. This is determined by the size of the visible light volume in the world, regardless of much geometry it affects. Short range or partially occluded lights are therefore cheaper.

Lights with shadows are much more expensive than lights without. In deferred shading, shadow-casting meshes still need to be rendered once or more for each shadow-casting light. Furthermore, the lighting shader that applies shadows has a higher rendering overhead than the one used when shadows are disabled. [10] [11]

Point lights with shadows are very expensive, as they render the surrounding scene six times. If you need shadows try to keep them restrained to a spot or directional light. [12]

It is possible to control whether a MeshRenderer or SkinnedMeshRenderer component casts shadows using the ShadowCastMode Enum value on the component. Changing this to 'Off' may be helpful if users wish to have some meshes cast shadows, but not all (and hence don't want to disable shadows on the relevant lights). Alternatively, there may be some performance benefits by turning off shadow casting for a highly detailed mesh and placing a similar, but lower detail, mesh at the same location with ShadowCastMost 'ShadowsOnly'. Comment from Geenz in response to Medra during Office Hours

Culling

'Culling' refers to not processing, or at least not rendering, specific parts of a world to reduce performance costs.

Frustrum culling

Neos automatically performs frustum culling, meaning objects outside of the field of view will not render (e.g. objects behind a user). With frustrum culling, there is some cost associated with calculating visibility, but it is generally constant for each mesh. The detection process relies on each objects's bounding box, which is essentially an axis-aligned box that fully wraps around the entire mesh. Therefore the mesh complexity is irrelevant (save for the initial calculation of its bounds or, in case of skinned meshes with some calculation modes, the number of bones). There are a few considerations to optimizing content to work best with this system [13]:

  • The cost of checking for whether objects should be culled by this system scales with the number separate active meshes. This means it can work together with user-made culling systems (see below) which can reduce the number of active meshes which must be tested.
  • If any part of a mesh is detemined to be visible due to the bounding box calculation, the entire mesh must be rendered - Neos cannot only render parts of a mesh.
    • As such, sometimes it makes sense to separate a large mesh into smaller pieces if the whole mesh would not normally be visible all at once. World terrain meshes may be good candidates for splitting into separate submeshes.
    • On the other hand, in some situations it is better to combine meshes which will generally be visible at the same time. Even though combining meshes with multiple materials does not directly save on rendering costs, it does save on testing for whether meshes should be culled. If multiple meshes are baked into a single mesh, the bounding box testing only needs to be performed once for that combined mesh. This is effectively Neos' version of the static mesh batching which occurs in Unity.

Note that SkinnedMeshRenderer components have multiple modes for bounds calculation which impose different performance costs. The calculation mode is indicated by the BoundsComputeMethod on each SkinnedMeshRenderer component [14]:

  • Static is a very cheap method based on the mesh alone. This does not require realtime computation, so ideally use this if possible.
  • FastDisjointRootApproximate first merges all bones into disjoint groups (any overlapping bones are merged into a single one) to reduce overall number of bones. It then uses those to approximate bounds in realtime. Fastest realtime method, recommended if parts of a mesh are being culled when using `Static`.
  • MediumPerBoneApproximate computes mesh bounds from bounds of every single bone. More accurate, but also much slower.
  • SlowRealtimeAccurate uses actual transformed geometry, requiring the skinned mesh to be processed all the time. Very heavy, but will respect things like blendshapes in addition to bones.
  • Proxy is slightly different from the others, but also potentially very cheap. It relies on the bounding box calculated for another SkinnedMeshRenderer referenced in the ProxyBoundsSource field. Useful in cases where you have a large main mesh and you need the visiblity of smaller meshes to be linked to it [15].

User-made culling systems

It is possible to create additional culling systems for your worlds by selectively deactivating slots and/or disabling components depending on e.g. a user's position. Very efficient culling solutions can be created with the ColliderUserTracker component. Note that this method is incompatible with NoClip locomotion. A simple demonstration of how to use this component for culling systems is in this tutorial by ProbablePrime:

Specific culling considerations for collider components

Avoid performing manual culling of colliders in such a way that they are activated/deactivated very often. Collider performance impact works differently than for renderered meshes; performance costs for colliders are already heavily optimized as they are only checked when they're relevant. Toggling them on and off regularly can disrupt this under-the-hood optimization process and may even be more expensive. [16]

Logix

Less Logix is not always better! It's more important to make your calculations do less work than it is to do them in a smaller space. Users should also not feel pressured to avoid using LogiX because they believe using 'only components' is somehow more optimal. There are many cases where specific components are a better solution than using several LogiX nodes and vice-versa. The decision mainly comes down to using the best tool for the job. Fundamentally, LogiX nodes are components anyway - see this video from ProbablePrime.

Writes and Drivers

Changing the value of a Sync will result in network traffic, as that change to the data model needs to be sent to the other users in the session. ValueUserOverride does not remove this network activity, as the overrides themselves are Syncs. [17]

Exceptions:

  • Drivers compute things locally for every user, and do not cause network traffic
  • "Self Driven values" (A ValueCopy with Writeback and the same source and target) are also locally calculated, even if you use the Write node to change the value.
  • If multiple writes to a value occur in the same update, only the last value will be replicated over the network.

Generally it's cheaper to perform computations locally and avoid network activity, but for more expensive computations it's better to have one user do it and sync the result.

Dynamic Variables and Impulses

Dynamic Variables are extremely efficient and can be used without concern for performance, however creating and destroying Dynamic Variable Spaces can be costly and should be done infrequently. [18]

Dynamic Impulses are also extremely efficient, especially if you target them at a slot close to their receiver. [19]

Frequent Impulses

High frequency updates from the Update node, Fire While True, etc. should be avoided if possible if the action results in network replication. Consider replacing them with Drivers.

Updating Relay Node

Updating Relays can be expensive, as they bypass the normal even-driven nature of Logix and force an evaluation every frame. Note while it may appear you need an updating relay due to a display not updating, often times that problem is specific to the display and is not needed for the finalized Logix. Use of this node should be avoided wherever possible, but sometimes there's no way around it.

Sequence Node

The Sequence node is not bad for performance, but its overuse can lead to poor coding practices. Chaining nodes prevents unexpected errors from propagating, and as Sequence will continue execution even on error it can lead to naive use putting your Logix into a bad state.

Cache Node

Don't worry about using the Cache node, as it is a very specific optimization that Neos will perform automatically in the future. [20]

Sample Color node

Sample Color is an inherently expensive node to use as it works by rendering a small and narrow view. Best to use this sparingly. Performance cost can be reduced by limiting the range which must be rendered by using the NearClip and FarClip inputs. [21]

Slot Count

Slot count and packed Logix nodes don't matter much performance-wise. Loading and saving do take a hit for complex setups but this hit is not eliminated by placing the Logix nodes on one slot. Neos still has to load and save the exact same number of components. [22]

Profiling

If you're working on a new item that might be expensive, consider profiling it:

  • The Debug menu you can find in Home tab of the Dash has many helpful timings
  • SteamVR has a "Display Performance Graph" that can show GPU frametimes. This can also be shown in-headset from a toggle in the developer settings (toggle "Advanced Settings" on in the settings menu)