Wednesday, November 11, 2015

Optimize Memory Usage and Performance when using Coroutines in Unity3d

Introduction

Hi!

I am Alejandro Santiago, co-founder of Tochas Studios and the main (and only) programmer on the team.
It took us three years to make SodaCity, our latest and biggest project so far. During that time we missed a few things, broke some others, and overall learned a few tricks here and there. In this blog I will try to elaborate on those findings so that anyone who stumbles upon these posts can learn a few things from us too and avoid the pitfalls.

As I am the 'tech' guy on the team, most of my posts will cover technical topics.

In this first post I will address a small trick I found regarding the use of coroutines in Unity3d.

Optimize Memory Usage and Performance when using Coroutines in Unity3d


This topic was discussed on the Unity forums in the thread "C# Coroutine WaitForSeconds Garbage Collection tip".

Coroutines in Unity offer great advantages for building complex behaviors, and their use can span from a couple of objects to thousands, as they can be used for AI agents, fade effects, motion controllers, bullets, particles, UI components, etc.

Guides on how to use coroutines and even nested coroutines are common, easy to find and digest.


But every reference I came across applied the same principle:

IEnumerator MyCoroutine (Transform target)
{
    while(Vector3.Distance(transform.position, target.position) > 0.05f)
    {
        transform.position = Vector3.Lerp(transform.position, target.position, smoothing * Time.deltaTime);

        yield return null;
    }

    print("Reached the target.");

    yield return new WaitForSeconds(3f);

    print("MyCoroutine is now finished.");
}

That is, a new YieldInstruction is created every time a delay or pause in the coroutine execution is needed.

As the coroutine can be executed every frame or multiple times per second, and the behavior can be attached to multiple objects (e.g. bullets or enemies), this can potentially cause hundreds or even thousands of YieldInstructions to be created each frame.

This poses a problem with the GC. As we all know, the GC can cause jitters and heavy frame drops each time it does its thing.

To avoid the GC problem it is recommended to pool objects. Almost every optimization guide I came across referred to this as GameObject pooling, but I never came across YieldInstruction pooling.

As the SodaCity code grew in size and complexity, each build was using more and more coroutines within more and more behaviors and objects. That had me worried, as this could cause problems in the long run, so I decided to try caching YieldInstruction objects and see what happened.

The results are as follows:


Yes! You can and should pool or cache your YieldInstruction objects.


None of the yielders provided by Unity expose members for changing the values of an already created object, so simply caching the references and reusing them at the instance level, as in the following snippet, won't do the trick in most cases.

WaitForSeconds shortWait = new WaitForSeconds(0.1f);
WaitForSeconds longWait = new WaitForSeconds(5.0f);
IEnumerator myEvenAwesomerCoroutine()
{
    while (true)
    {
        if (iNeedToDoStuffFast)
        {
            doAwesomeStuffReallyFast();
            yield return shortWait;
        }
        else{
            dontDoMuch();
            yield return longWait;
        }
    }
}

The solution I am proposing is to use a generic Dictionary within a static Yielders class, so that yielder objects can be cached and reused at the application/game level.

using UnityEngine;
using System.Collections;
using System.Collections.Generic;
 
public static class Yielders {
 
    static Dictionary<float, WaitForSeconds> _timeInterval = new Dictionary<float, WaitForSeconds>(100);
 
    static WaitForEndOfFrame _endOfFrame = new WaitForEndOfFrame();
    public static WaitForEndOfFrame EndOfFrame {
        get{ return _endOfFrame;}
    }
 
    static WaitForFixedUpdate _fixedUpdate = new WaitForFixedUpdate();
    public static WaitForFixedUpdate FixedUpdate{
        get{ return _fixedUpdate; }
    }
 
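    // Returns the cached WaitForSeconds for this interval, creating it on first use.
    public static WaitForSeconds Get(float seconds){
        WaitForSeconds wait;
        // TryGetValue needs a single lookup on a cache hit.
        if(!_timeInterval.TryGetValue(seconds, out wait)){
            wait = new WaitForSeconds(seconds);
            _timeInterval.Add(seconds, wait);
        }
        return wait;
    }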
   
}

This works because Coroutine objects seem to use YieldInstructions only as an exit condition or objective: each Coroutine object handles its current state independently of the YieldInstruction it yielded and of the GameObject the behavior is attached to.

This makes it possible to globally pre-cache a single WaitForEndOfFrame and a single WaitForFixedUpdate object, ready to be used by any coroutine in any object, even simultaneously.

In the case of WaitForSeconds we need the Dictionary, using the float number of seconds to wait as the key. This allows a WaitForSeconds to be reused for a specific time interval.
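
As a rough illustration of how a behaviour might consume the cache, here is a minimal sketch. The Blinker class and its interval field are hypothetical names made up for this example; only Yielders.Get and Yielders.EndOfFrame come from the class above.

```csharp
using UnityEngine;
using System.Collections;

// Hypothetical example behaviour; Blinker and interval are illustrative names.
public class Blinker : MonoBehaviour {

    public float interval = 0.5f;

    IEnumerator Start() {
        Renderer cachedRenderer = GetComponent<Renderer>();
        while (true) {
            // Reuses the cached WaitForSeconds for this interval instead of allocating a new one.
            yield return Yielders.Get(interval);
            cachedRenderer.enabled = !cachedRenderer.enabled;
            // The same shared WaitForEndOfFrame instance is yielded by every coroutine.
            yield return Yielders.EndOfFrame;
        }
    }
}
```

Every Blinker with the same interval ends up yielding the same WaitForSeconds object, so the only allocation happens the first time that interval is requested.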

To validate this theory and measure the real gains or losses for each method I made a small test project. It can be downloaded from here.

To see details on the tests performed, continue reading...



These tests were made with Unity 4.3.

The Setup


The scene is simple: it has a camera and 12 spawners. The spawners instantiate prefabs that use Coroutines, to increase the load and make alloc and CPU spikes easy to catch while profiling.

Each Spawner has a CoroutineUser prefab and will create 100 of them, for a total of 1200 CoroutineUser objects.


To add some variety to the test there are three prefabs.





Each CoroutineUser has a sprite component attached and will change its color every "Delay" seconds.

//#define USE_YIELDERS
//#define USE_WAITER
using UnityEngine;
using System.Collections;
 
public class CoroutineUser : MonoBehaviour {
 
    private SpriteRenderer Renderer;
 
    public float offset = 0.0f;
    public float delay = 5.0f;
 
    int idx = 0;
    private Color[] Colors = new Color[]{ Color.white, Color.red, Color.green, Color.blue};
 
    public Transform CurrentTransform {get; private set;}
 
    void Awake(){
        this.Renderer = this.GetComponent<SpriteRenderer>();
        this.CurrentTransform = this.transform;
    }
 
    IEnumerator Start(){
        // Braces keep the while loop below out of the if body when none of the #if branches is compiled in.
        if(this.offset > 0.0f){
            #if USE_YIELDERS && !USE_WAITER
            yield return Yielders.Get(this.offset);
            #endif
            #if !USE_YIELDERS && USE_WAITER
            yield return Waiter.Wait(this, this.offset);
            #endif
            #if !USE_YIELDERS && !USE_WAITER
            yield return new WaitForSeconds(this.offset);
            #endif
        }
        while(Application.isPlaying){
            #if USE_YIELDERS && !USE_WAITER
            yield return Yielders.Get(this.delay);
            #endif
            #if !USE_YIELDERS && USE_WAITER
            yield return Waiter.Wait(this, this.delay);
            #endif
            #if !USE_YIELDERS && !USE_WAITER
            yield return new WaitForSeconds(this.delay);
            #endif
            this.idx = (this.idx + 1) % this.Colors.Length;
            this.Renderer.color = this.Colors[this.idx];
        }
    }
}

Preprocessor symbols select which test is run, so only the lines corresponding to the "Wait" part of the process change.

The Test

When run, the screen will be "mostly" covered by the same sprite, all starting with a white tint, and every "x" seconds a row will switch colors.



For best results I encourage running a "deep profile".
Note: I recommend disabling VSync so the CPU spikes are easier to spot in the profiler.
Edit->Project Settings->Quality

The Results

new WaitForSeconds(this.delay);

Creating a new yielder allocates on the frame in which the instruction is created, and the object becomes garbage as soon as the yield statement has finished, which triggers a GC run every once in a while.


As can be seen in the profiler window, the 1200 calls to CoroutineUser.Start.Iterator.MoveNext() allocate 14.1KB, and 1200 WaitForSeconds..ctor() calls can be seen within the stack hierarchy.

Cache YieldInstruction objects

Caching the YieldInstructions within a static dictionary causes allocs only on the first frame a given yielder is needed; afterwards those objects are reused.
Holding a strong reference to those objects keeps them away from the GC, and the cache can be purged at will so that GC runs happen in safe zones, for example before or after Application.LoadLevelAsync.
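
The Yielders class shown earlier does not expose a way to purge the cache, so here is one possible sketch of that idea. PurgeableYielders and Clear are hypothetical names, not part of the test project, and the sketch assumes the purge is called at a point where dropping the cached references is acceptable.

```csharp
using UnityEngine;
using System.Collections.Generic;

// Hypothetical variant of the Yielders cache with an explicit purge method.
public static class PurgeableYielders {

    static Dictionary<float, WaitForSeconds> _timeInterval = new Dictionary<float, WaitForSeconds>(100);

    public static WaitForSeconds Get(float seconds) {
        WaitForSeconds wait;
        if (!_timeInterval.TryGetValue(seconds, out wait)) {
            wait = new WaitForSeconds(seconds);
            _timeInterval.Add(seconds, wait);
        }
        return wait;
    }

    // Releases the cache's references; objects not referenced elsewhere become collectable.
    public static void Clear() {
        _timeInterval.Clear();
    }
}
```

A call to PurgeableYielders.Clear() followed by System.GC.Collect() could then be placed right before a level load, where a short pause is less likely to be noticed.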


In the profiler window you can see the 1200 calls to CoroutineUser.Start.Iterator.MoveNext() with 0KB of allocs.
No .ctor() call appears within the stack hierarchy either.

Nested StartCoroutine

Starting a new "nested" Coroutine within a Coroutine (Coroutineception!), as powerful as it can be, is a double-edged sword and must be used carefully. It creates a new Coroutine object, with all the overhead Unity incurs to handle a new Coroutine.


As can be seen in this profiler window, each frame in which the 1200 CoroutineUser objects need to wait allocates 57.4KB, and Iterator<WaitRoutine>..ctor() can be seen within the stack hierarchy.
Due to the size of the allocs, this method will trigger GC runs more often.

Conclusion

The standard new WaitForSeconds() method is not so bad, causing about 0.01175KB (14.1KB / 1200, roughly 12 bytes) of garbage per YieldInstruction.

Caching YieldInstructions as described at the beginning of the post can help reduce the GC load.

Even in "not-so-extreme" environments, saving a few KB here and there can be helpful, so caching them globally is appropriate, as multiple objects can reuse the same YieldInstruction. This is helpful for custom particles and particle engines, spawned objects like bullets, explosions and enemies, and tweens.

Proper handling of the container used to cache them is required.
In this case a dictionary does not pose a problem: the key is a float, so it avoids the issues typically caused by string keys.
The dictionary size can be constrained and optimized further, for example by only allowing a certain float precision in the key, purging stale (unused) objects, and so on.
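
One way the precision constraint might look is sketched below. QuantizedYielders and the 1/100 second precision are assumptions chosen for illustration; the right precision depends on how exact the waits need to be.

```csharp
using UnityEngine;
using System.Collections.Generic;

// Hypothetical variant that limits the number of cached entries by rounding the key.
public static class QuantizedYielders {

    // Assumed precision: intervals are rounded to the nearest 1/100 of a second.
    const float StepsPerSecond = 100f;

    static Dictionary<int, WaitForSeconds> _cache = new Dictionary<int, WaitForSeconds>(100);

    public static WaitForSeconds Get(float seconds) {
        // 0.101f and 0.099f both map to key 10, so they share one cached object.
        int key = Mathf.RoundToInt(seconds * StepsPerSecond);
        WaitForSeconds wait;
        if (!_cache.TryGetValue(key, out wait)) {
            // The cached object waits for the rounded interval, not the exact value requested.
            wait = new WaitForSeconds(key / StepsPerSecond);
            _cache.Add(key, wait);
        }
        return wait;
    }
}
```

The trade-off is that callers get a wait rounded to the chosen step, which should be fine for effects and AI ticks but may not suit timing that has to be exact.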

I applied this technique to all the Coroutines used by the SodaCity code and it greatly improved the game's performance, allowing us to have more objects on screen and to be bolder with the refresh rate of some game sub-systems like AI, depth sorting and custom game particles.

Please feel free to download the test project, run your own tests, make changes and provide feedback.
Leave your findings, questions or doubts in the comments below.
