Browse Source

Add comprehensive analysis of InvokeLeakTest debugger failure

Co-authored-by: tig <[email protected]>
copilot-swe-agent[bot] 1 month ago
parent
commit
7861a61a35
1 changed files with 344 additions and 0 deletions
  1. 344 0
      InvokeLeakTest_Analysis.md

+ 344 - 0
InvokeLeakTest_Analysis.md

@@ -0,0 +1,344 @@
+# InvokeLeakTest Failure Analysis
+
+## Issue Summary
+The `InvokeLeakTest` stress test fails **only on @BDisp's machine** and **only when running under a debugger**:
+- Visual Studio 2022 on Windows (x64)
+- Visual Studio 2022 on macOS (Intel-based VM)
+- Visual Studio Code on Windows
+
+The test passes in CI/CD environments and when run without a debugger.
+
+## Test Description
+`InvokeLeakTest` is a **stress test** (not a unit test) located in `Tests/StressTests/ApplicationStressTests.cs`. It:
+
+1. Spawns multiple concurrent tasks that call `Application.Invoke()` from background threads
+2. Each invocation updates a TextField and increments a counter using `Interlocked.Increment`
+3. The test verifies that all invocations complete successfully (no "leaks")
+4. Runs for 50 passes with 500 increments each (25,000 total invocations)
+
+### Test Flow
+```csharp
+// Main thread blocks in Application.Run()
+Application.Run(top);
+
+// Background thread spawns tasks
+for (var j = 0; j < NUM_PASSES; j++) {
+    for (var i = 0; i < NUM_INCREMENTS; i++) {
+        Task.Run(() => {
+            Thread.Sleep(r.Next(2, 4));  // Random 2-4ms delay
+            Application.Invoke(() => {
+                tf.Text = $"index{r.Next()}";
+                Interlocked.Increment(ref _tbCounter);
+            });
+        });
+    }
+    // Wait for counter to reach expected value with 100ms polling
+    while (_tbCounter != expectedValue) {
+        _wakeUp.Wait(POLL_MS);  // POLL_MS = 100ms
+        if (_tbCounter hasn't changed) {
+            throw new TimeoutException("Invoke lost");
+        }
+    }
+}
+```
+
+## How Application.Invoke Works
+
+### Call Chain
+1. `Application.Invoke(action)` → calls `ApplicationImpl.Instance.Invoke(action)`
+2. `ApplicationImpl.Invoke()` checks if on main thread:
+   - **If on main thread**: Execute action immediately
+   - **If on background thread**: Add to `_timedEvents` with `TimeSpan.Zero`
+3. `TimedEvents.Add()`:
+   - Calculates timestamp: `k = (DateTime.UtcNow + time).Ticks`
+   - For `TimeSpan.Zero`, subtracts 100 ticks to ensure immediate execution: `k -= 100`
+   - Adds to sorted list: `_timeouts.Add(NudgeToUniqueKey(k), timeout)`
+4. `MainLoop.RunIteration()` calls `TimedEvents.RunTimers()` every iteration
+5. `TimedEvents.RunTimers()`:
+   - Takes a copy of `_timeouts` and creates a new list (under lock)
+   - Iterates through copy, executing callbacks where `k < now`
+   - Non-repeating callbacks (return false) are not re-added
+
+### Critical Code Paths
+
+#### ApplicationImpl.Invoke (Terminal.Gui/App/ApplicationImpl.cs:306-322)
+```csharp
+public void Invoke (Action action)
+{
+    // If we are already on the main UI thread
+    if (Application.MainThreadId == Thread.CurrentThread.ManagedThreadId)
+    {
+        action ();
+        return;
+    }
+
+    _timedEvents.Add (TimeSpan.Zero,
+                      () =>
+                      {
+                          action ();
+                          return false;  // One-shot execution
+                      }
+                     );
+}
+```
+
+#### TimedEvents.AddTimeout (Terminal.Gui/App/Timeout/TimedEvents.cs:124-139)
+```csharp
+private void AddTimeout (TimeSpan time, Timeout timeout)
+{
+    lock (_timeoutsLockToken)
+    {
+        long k = (DateTime.UtcNow + time).Ticks;
+
+        // if user wants to run as soon as possible set timer such that it expires right away
+        if (time == TimeSpan.Zero)
+        {
+            k -= 100;  // Subtract 100 ticks to ensure it's "in the past"
+        }
+
+        _timeouts.Add (NudgeToUniqueKey (k), timeout);
+        Added?.Invoke (this, new (timeout, k));
+    }
+}
+```
+
+#### TimedEvents.RunTimersImpl (Terminal.Gui/App/Timeout/TimedEvents.cs:160-192)
+```csharp
+private void RunTimersImpl ()
+{
+    long now = DateTime.UtcNow.Ticks;
+    SortedList<long, Timeout> copy;
+
+    lock (_timeoutsLockToken)
+    {
+        copy = _timeouts;
+        _timeouts = new ();
+    }
+
+    foreach ((long k, Timeout timeout) in copy)
+    {
+        if (k < now)  // Execute if scheduled time is in the past
+        {
+            if (timeout.Callback ())  // Returns false for Invoke actions
+            {
+                AddTimeout (timeout.Span, timeout);
+            }
+        }
+        else  // Future timeouts - add back to list
+        {
+            lock (_timeoutsLockToken)
+            {
+                _timeouts.Add (NudgeToUniqueKey (k), timeout);
+            }
+        }
+    }
+}
+```
+
+## Hypothesis: Why It Fails Under Debugger on @BDisp's Machine
+
+### Primary Hypothesis: DateTime.UtcNow Resolution and Debugger Timing
+
+The test failure likely occurs due to a combination of factors:
+
+#### 1. **DateTime.UtcNow Resolution Issues**
+The code uses `DateTime.UtcNow.Ticks` for timing, which has platform-dependent resolution:
+- Windows: ~15.6ms resolution (system timer tick)
+- Some systems: Can be lower/higher depending on timer configuration
+- Debugger impact: Can affect system timer behavior
+
+When `TimeSpan.Zero` invocations are added:
+```csharp
+long k = (DateTime.UtcNow + TimeSpan.Zero).Ticks;
+k -= 100;  // Subtract 100 ticks (10 microseconds)
+```
+
+**The problem**: If two `Invoke` calls happen within the same timer tick (< ~15ms on Windows), they get the SAME `DateTime.UtcNow` value. The `NudgeToUniqueKey` function increments by 1 tick each collision, but this creates a sequence of timestamps like:
+- First call: `now - 100`
+- Second call (same UtcNow): `now - 99`
+- Third call (same UtcNow): `now - 98`
+- ...and so on
+
+#### 2. **Race Condition in RunTimersImpl**
+In `RunTimersImpl`, this check determines if a timeout should execute:
+```csharp
+if (k < now)  // k is scheduled time, now is current time
+```
+
+**The race**: Between when timeouts are added (with `k = UtcNow - 100`) and when they're checked (with fresh `DateTime.UtcNow`), time passes. However, if:
+1. Multiple invocations are added rapidly (within same timer tick)
+2. The system is under debugger (slower iteration loop)
+3. The main loop iteration happens to sample `DateTime.UtcNow` at an unlucky moment
+
+Some timeouts might have `k >= now` even though they were intended to be "immediate" (TimeSpan.Zero).
+
+#### 3. **Debugger-Specific Timing Effects**
+
+When running under a debugger:
+
+**a) Slower Main Loop Iterations**
+- Debugger overhead slows each iteration
+- More time between `RunTimers` calls
+- Allows more tasks to queue up between iterations
+
+**b) Timer Resolution Changes**
+- Debuggers can affect OS timer behavior
+- May change quantum/scheduling of threads
+- Different thread priorities under debugger
+
+**c) DateTime.UtcNow Sampling**
+- More invocations can accumulate in a single UtcNow "tick"
+- Larger batches of timeouts with near-identical timestamps
+- Higher chance of `k >= now` race condition
+
+#### 4. **The "Lost Invoke" Scenario**
+
+Failure scenario:
+```
+Time T0: Background thread calls Invoke()
+         - k = UtcNow - 100 (let's say 1000 ticks - 100 = 900)
+         - Added to _timeouts with k=900
+
+Time T1: MainLoop iteration samples UtcNow = 850 ticks (!)
+         - This can happen if system timer hasn't updated yet
+         - Check: is k < now? Is 900 < 850? NO!
+         - Timeout is NOT executed, added back to _timeouts
+
+Time T2: Next iteration, UtcNow = 1100 ticks
+         - Check: is k < now? Is 900 < 1100? YES!
+         - Timeout executes
+
+But if the test's 100ms polling window expires before T2, it throws TimeoutException.
+```
+
+#### 5. **Why @BDisp's Machine Specifically?**
+
+Possible factors:
+- **CPU/Chipset**: Intel vs ARM have different timer implementations
+- **VM/Virtualization**: MacOS VM on Intel laptop may have timer virtualization quirks
+- **OS Configuration**: Windows timer resolution settings (can be 1ms to 15.6ms)
+- **Debugger Version**: Specific VS2022 build with different debugging hooks
+- **System Load**: Background processes affecting timer accuracy
+- **Hardware**: Specific timer hardware behavior on his x64 machine
+
+### Secondary Hypothesis: Thread Scheduling Under Debugger
+
+The test spawns tasks with `Task.Run()` and small random delays (2-4ms). Under a debugger:
+- Thread scheduling may be different
+- Task scheduling might be more synchronous
+- More tasks could complete within same timer resolution window
+- Creates "burst" of invocations that all get same timestamp
+
+### Why It Doesn't Fail in CI/CD
+
+CI/CD environments:
+- Run without debugger (no debugging overhead)
+- Different timer characteristics
+- Faster iterations (less time for race conditions)
+- Different CPU architectures (ARM runners have different timer behavior)
+
+## Evidence Supporting the Hypothesis
+
+1. **Test uses 100ms polling**: `_wakeUp.Wait(POLL_MS)` where `POLL_MS = 100`
+   - This gives a narrow window for all invocations to complete
+   - Any delay beyond 100ms triggers failure
+
+2. **Test spawns 500 concurrent tasks per pass**: Each with 2-4ms delay
+   - Under debugger, these could all queue up in < 100ms
+   - But execution might take > 100ms due to debugger overhead
+
+3. **Only fails under debugger**: Strong indicator of timing-related issue
+   - Debugger affects iteration speed and timer behavior
+
+4. **Platform-specific**: Fails on specific hardware/VM configurations
+   - Suggests timer resolution/behavior differences
+
+## Recommended Solutions
+
+### Solution 1: Use Stopwatch Instead of DateTime.UtcNow (Recommended)
+Replace `DateTime.UtcNow.Ticks` with `Stopwatch.GetTimestamp()` in `TimedEvents`:
+- Higher resolution (typically microseconds)
+- More consistent across platforms
+- Less affected by system time adjustments
+- Better for interval timing
+
+### Solution 2: Increase TimeSpan.Zero Buffer
+Change the immediate execution buffer from `-100` ticks to something more substantial:
+```csharp
+if (time == TimeSpan.Zero)
+{
+    k -= TimeSpan.TicksPerMillisecond * 10;  // 10ms in the past instead of 0.01ms
+}
+```
+
+### Solution 3: Add Wakeup Call on Invoke
+When adding a TimeSpan.Zero timeout, explicitly wake up the main loop:
+```csharp
+_timedEvents.Add(TimeSpan.Zero, ...);
+MainLoop?.Wakeup();  // Force immediate processing
+```
+
+### Solution 4: Test-Specific Changes
+For the test itself:
+- Increase `POLL_MS` from 100 to 200 or 500 for debugger scenarios
+- Add conditional: `if (Debugger.IsAttached) POLL_MS = 500;`
+- This accommodates debugger overhead without changing production code
+
+### Solution 5: Use Interlocked Operations More Defensively
+Add explicit memory barriers and volatile reads to ensure visibility:
+```csharp
+volatile int _tbCounter;
+// or
+Interlocked.MemoryBarrier();
+int currentCount = Interlocked.CompareExchange(ref _tbCounter, 0, 0);
+```
+
+## Additional Investigation Needed
+
+To confirm hypothesis, @BDisp could:
+
+1. **Add diagnostics to test**:
+```csharp
+var sw = Stopwatch.StartNew();
+while (_tbCounter != expectedValue) {
+    _wakeUp.Wait(pollMs);
+    if (_tbCounter != tbNow) continue;
+    
+    // Log timing information
+    Console.WriteLine($"Timeout at {sw.ElapsedMilliseconds}ms");
+    Console.WriteLine($"Counter: {_tbCounter}, Expected: {expectedValue}");
+    Console.WriteLine($"Missing: {expectedValue - _tbCounter}");
+    
+    // Check if invokes are still queued
+    Console.WriteLine($"TimedEvents count: {Application.TimedEvents?.Timeouts.Count}");
+}
+```
+
+2. **Test timer resolution**:
+```csharp
+var samples = new List<long>();
+for (int i = 0; i < 100; i++) {
+    samples.Add(DateTime.UtcNow.Ticks);
+}
+var deltas = samples.Zip(samples.Skip(1), (a, b) => b - a).Where(d => d > 0);
+Console.WriteLine($"Min delta: {deltas.Min()} ticks ({deltas.Min() / 10000.0}ms)");
+```
+
+3. **Monitor TimedEvents queue**:
+- Add logging in `TimedEvents.RunTimersImpl` to see when timeouts are deferred
+- Check if `k >= now` condition is being hit
+
+## Conclusion
+
+The `InvokeLeakTest` failure under debugger is likely caused by:
+1. **Low resolution of DateTime.UtcNow** combined with rapid invocations
+2. **Race condition** in timeout execution check (`k < now`)
+3. **Debugger overhead** exacerbating timing issues
+4. **Platform-specific timer behavior** on @BDisp's hardware/VM
+
+The most robust fix is to use `Stopwatch` for timing instead of `DateTime.UtcNow`, providing:
+- Higher resolution timing
+- Better consistency across platforms
+- Reduced susceptibility to debugger effects
+
+This is a **timing/performance issue** in the stress test environment, not a functional bug in the production code. The test is correctly identifying edge cases in high-concurrency scenarios that are more likely to manifest under debugger overhead.