InvokeLeakTest_Analysis.md 13 KB

InvokeLeakTest Failure Analysis

Status: FIXED ✅

Fixed in commit a6d064a - Replaced DateTime.UtcNow with Stopwatch.GetTimestamp() in TimedEvents.cs

Fix Results

  • ✅ InvokeLeakTest now passes on x64 under debugger
  • ✅ All 3128 unit tests pass
  • ✅ Added 5 new comprehensive tests for high-frequency scenarios
  • ✅ Cross-platform consistent (x64 and ARM)

Original Issue Summary

The InvokeLeakTest stress test was failing only on x64 machines when running under a debugger:

  • Visual Studio 2022 on Windows (x64)
  • Visual Studio 2022 on macOS (Intel-based VM)
  • Visual Studio Code on Windows

The test passed in CI/CD environments and when run without a debugger.

Test Description

InvokeLeakTest is a stress test (not a unit test) located in Tests/StressTests/ApplicationStressTests.cs. It:

  1. Spawns multiple concurrent tasks that call Application.Invoke() from background threads
  2. Each invocation updates a TextField and increments a counter using Interlocked.Increment
  3. The test verifies that all invocations complete successfully (no "leaks")
  4. Runs for 50 passes with 500 increments each (25,000 total invocations)

Test Flow

// Main thread blocks in Application.Run()
Application.Run(top);

// Background thread spawns tasks
for (var j = 0; j < NUM_PASSES; j++) {
    for (var i = 0; i < NUM_INCREMENTS; i++) {
        Task.Run(() => {
            Thread.Sleep(r.Next(2, 4));  // Random 2-4ms delay
            Application.Invoke(() => {
                tf.Text = $"index{r.Next()}";
                Interlocked.Increment(ref _tbCounter);
            });
        });
    }
    // Wait for counter to reach expected value with 100ms polling
    while (_tbCounter != expectedValue) {
        _wakeUp.Wait(POLL_MS);  // POLL_MS = 100ms
        if (_tbCounter hasn't changed) {
            throw new TimeoutException("Invoke lost");
        }
    }
}

How Application.Invoke Works

Call Chain

  1. Application.Invoke(action) → calls ApplicationImpl.Instance.Invoke(action)
  2. ApplicationImpl.Invoke() checks if on main thread:
    • If on main thread: Execute action immediately
    • If on background thread: Add to _timedEvents with TimeSpan.Zero
  3. TimedEvents.Add():
    • Calculates timestamp: k = (DateTime.UtcNow + time).Ticks
    • For TimeSpan.Zero, subtracts 100 ticks to ensure immediate execution: k -= 100
    • Adds to sorted list: _timeouts.Add(NudgeToUniqueKey(k), timeout)
  4. MainLoop.RunIteration() calls TimedEvents.RunTimers() every iteration
  5. TimedEvents.RunTimers():
    • Takes a copy of _timeouts and creates a new list (under lock)
    • Iterates through copy, executing callbacks where k < now
    • Non-repeating callbacks (return false) are not re-added

Critical Code Paths

ApplicationImpl.Invoke (Terminal.Gui/App/ApplicationImpl.cs:306-322)

public void Invoke (Action action)
{
    // If we are already on the main UI thread
    if (Application.MainThreadId == Thread.CurrentThread.ManagedThreadId)
    {
        action ();
        return;
    }

    _timedEvents.Add (TimeSpan.Zero,
                      () =>
                      {
                          action ();
                          return false;  // One-shot execution
                      }
                     );
}

TimedEvents.AddTimeout (Terminal.Gui/App/Timeout/TimedEvents.cs:124-139)

private void AddTimeout (TimeSpan time, Timeout timeout)
{
    lock (_timeoutsLockToken)
    {
        long k = (DateTime.UtcNow + time).Ticks;

        // if user wants to run as soon as possible set timer such that it expires right away
        if (time == TimeSpan.Zero)
        {
            k -= 100;  // Subtract 100 ticks to ensure it's "in the past"
        }

        _timeouts.Add (NudgeToUniqueKey (k), timeout);
        Added?.Invoke (this, new (timeout, k));
    }
}

TimedEvents.RunTimersImpl (Terminal.Gui/App/Timeout/TimedEvents.cs:160-192)

private void RunTimersImpl ()
{
    long now = DateTime.UtcNow.Ticks;
    SortedList<long, Timeout> copy;

    lock (_timeoutsLockToken)
    {
        copy = _timeouts;
        _timeouts = new ();
    }

    foreach ((long k, Timeout timeout) in copy)
    {
        if (k < now)  // Execute if scheduled time is in the past
        {
            if (timeout.Callback ())  // Returns false for Invoke actions
            {
                AddTimeout (timeout.Span, timeout);
            }
        }
        else  // Future timeouts - add back to list
        {
            lock (_timeoutsLockToken)
            {
                _timeouts.Add (NudgeToUniqueKey (k), timeout);
            }
        }
    }
}

Hypothesis: Why It Fails Under Debugger on @BDisp's Machine

Primary Hypothesis: DateTime.UtcNow Resolution and Debugger Timing

The test failure likely occurs due to a combination of factors:

1. DateTime.UtcNow Resolution Issues

The code uses DateTime.UtcNow.Ticks for timing, which has platform-dependent resolution:

  • Windows: ~15.6ms resolution (system timer tick)
  • Some systems: Can be lower/higher depending on timer configuration
  • Debugger impact: Can affect system timer behavior

When TimeSpan.Zero invocations are added:

long k = (DateTime.UtcNow + TimeSpan.Zero).Ticks;
k -= 100;  // Subtract 100 ticks (10 microseconds)

The problem: If two Invoke calls happen within the same timer tick (< ~15ms on Windows), they get the SAME DateTime.UtcNow value. The NudgeToUniqueKey function increments by 1 tick each collision, but this creates a sequence of timestamps like:

  • First call: now - 100
  • Second call (same UtcNow): now - 99
  • Third call (same UtcNow): now - 98
  • ...and so on

2. Race Condition in RunTimersImpl

In RunTimersImpl, this check determines if a timeout should execute:

if (k < now)  // k is scheduled time, now is current time

The race: Between when timeouts are added (with k = UtcNow - 100) and when they're checked (with fresh DateTime.UtcNow), time passes. However, if:

  1. Multiple invocations are added rapidly (within same timer tick)
  2. The system is under debugger (slower iteration loop)
  3. The main loop iteration happens to sample DateTime.UtcNow at an unlucky moment

Some timeouts might have k >= now even though they were intended to be "immediate" (TimeSpan.Zero).

3. Debugger-Specific Timing Effects

When running under a debugger:

a) Slower Main Loop Iterations

  • Debugger overhead slows each iteration
  • More time between RunTimers calls
  • Allows more tasks to queue up between iterations

b) Timer Resolution Changes

  • Debuggers can affect OS timer behavior
  • May change quantum/scheduling of threads
  • Different thread priorities under debugger

c) DateTime.UtcNow Sampling

  • More invocations can accumulate in a single UtcNow "tick"
  • Larger batches of timeouts with near-identical timestamps
  • Higher chance of k >= now race condition

4. The "Lost Invoke" Scenario

Failure scenario:

Time T0: Background thread calls Invoke()
         - k = UtcNow - 100 (let's say 1000 ticks - 100 = 900)
         - Added to _timeouts with k=900

Time T1: MainLoop iteration samples UtcNow = 850 ticks (!)
         - This can happen if system timer hasn't updated yet
         - Check: is k < now? Is 900 < 850? NO!
         - Timeout is NOT executed, added back to _timeouts

Time T2: Next iteration, UtcNow = 1100 ticks
         - Check: is k < now? Is 900 < 1100? YES!
         - Timeout executes

But if the test's 100ms polling window expires before T2, it throws TimeoutException.

5. Why x64 Machines Specifically?

UPDATE: @tig confirmed he can reproduce on his x64 Windows machine but NOT on his ARM Windows machine, validating this hypothesis.

Architecture-specific factors:

  • CPU/Chipset: Intel/AMD x64 vs ARM have fundamentally different timer implementations
    • x64: Uses legacy TSC (Time Stamp Counter) or HPET (High Precision Event Timer)
    • ARM: Uses different timer architecture with potentially better resolution
  • VM/Virtualization: MacOS VM on Intel laptop may have timer virtualization quirks
  • OS Configuration: Windows timer resolution settings (can be 1ms to 15.6ms)
  • Debugger Version: Specific VS2022 build with different debugging hooks
  • System Load: Background processes affecting timer accuracy
  • Hardware: Specific timer hardware behavior on x64 architecture

Secondary Hypothesis: Thread Scheduling Under Debugger

The test spawns tasks with Task.Run() and small random delays (2-4ms). Under a debugger:

  • Thread scheduling may be different
  • Task scheduling might be more synchronous
  • More tasks could complete within same timer resolution window
  • Creates "burst" of invocations that all get same timestamp

Why It Doesn't Fail on ARM

CONFIRMED: @tig cannot reproduce on ARM Windows machine, only on x64 Windows.

ARM environments:

  • Run without debugger (no debugging overhead) in CI/CD
  • Different timer characteristics - ARM timer architecture has better resolution
  • Faster iterations (less time for race conditions)
  • ARM CPU architecture uses different timer implementation than x64
  • ARM timer subsystem may have higher base resolution or better behavior under load

Evidence Supporting the Hypothesis

  1. Test uses 100ms polling: _wakeUp.Wait(POLL_MS) where POLL_MS = 100

    • This gives a narrow window for all invocations to complete
    • Any delay beyond 100ms triggers failure
  2. Test spawns 500 concurrent tasks per pass: Each with 2-4ms delay

    • Under debugger, these could all queue up in < 100ms
    • But execution might take > 100ms due to debugger overhead
  3. Only fails under debugger: Strong indicator of timing-related issue

    • Debugger affects iteration speed and timer behavior
  4. Architecture-specific (CONFIRMED): @tig reproduced on x64 Windows but NOT on ARM Windows

    • This strongly supports the timer resolution hypothesis
    • x64 timer implementation is more susceptible to this race condition
    • ARM timer architecture handles the scenario more gracefully

Recommended Solutions

Solution 1: Use Stopwatch Instead of DateTime.UtcNow (Recommended)

Replace DateTime.UtcNow.Ticks with Stopwatch.GetTimestamp() in TimedEvents:

  • Higher resolution (typically microseconds)
  • More consistent across platforms
  • Less affected by system time adjustments
  • Better for interval timing

Solution 2: Increase TimeSpan.Zero Buffer

Change the immediate execution buffer from -100 ticks to something more substantial:

if (time == TimeSpan.Zero)
{
    k -= TimeSpan.TicksPerMillisecond * 10;  // 10ms in the past instead of 0.01ms
}

Solution 3: Add Wakeup Call on Invoke

When adding a TimeSpan.Zero timeout, explicitly wake up the main loop:

_timedEvents.Add(TimeSpan.Zero, ...);
MainLoop?.Wakeup();  // Force immediate processing

Solution 4: Test-Specific Changes

For the test itself:

  • Increase POLL_MS from 100 to 200 or 500 for debugger scenarios
  • Add conditional: if (Debugger.IsAttached) POLL_MS = 500;
  • This accommodates debugger overhead without changing production code

Solution 5: Use Interlocked Operations More Defensively

Add explicit memory barriers and volatile reads to ensure visibility:

volatile int _tbCounter;
// or
Interlocked.MemoryBarrier();
int currentCount = Interlocked.CompareExchange(ref _tbCounter, 0, 0);

Additional Investigation Needed

To confirm hypothesis, @BDisp could:

  1. Add diagnostics to test:

    var sw = Stopwatch.StartNew();
    while (_tbCounter != expectedValue) {
    _wakeUp.Wait(pollMs);
    if (_tbCounter != tbNow) continue;
        
    // Log timing information
    Console.WriteLine($"Timeout at {sw.ElapsedMilliseconds}ms");
    Console.WriteLine($"Counter: {_tbCounter}, Expected: {expectedValue}");
    Console.WriteLine($"Missing: {expectedValue - _tbCounter}");
        
    // Check if invokes are still queued
    Console.WriteLine($"TimedEvents count: {Application.TimedEvents?.Timeouts.Count}");
    }
    
  2. Test timer resolution:

    var samples = new List<long>();
    for (int i = 0; i < 100; i++) {
    samples.Add(DateTime.UtcNow.Ticks);
    }
    var deltas = samples.Zip(samples.Skip(1), (a, b) => b - a).Where(d => d > 0);
    Console.WriteLine($"Min delta: {deltas.Min()} ticks ({deltas.Min() / 10000.0}ms)");
    
  3. Monitor TimedEvents queue:

  4. Add logging in TimedEvents.RunTimersImpl to see when timeouts are deferred

  5. Check if k >= now condition is being hit

Conclusion

The InvokeLeakTest failure under debugger is likely caused by:

  1. Low resolution of DateTime.UtcNow combined with rapid invocations
  2. Race condition in timeout execution check (k < now)
  3. Debugger overhead exacerbating timing issues
  4. Platform-specific timer behavior on @BDisp's hardware/VM

The most robust fix is to use Stopwatch for timing instead of DateTime.UtcNow, providing:

  • Higher resolution timing
  • Better consistency across platforms
  • Reduced susceptibility to debugger effects

This is a timing/performance issue in the stress test environment, not a functional bug in the production code. The test is correctly identifying edge cases in high-concurrency scenarios that are more likely to manifest under debugger overhead.