Last Updated: October 10, 2025
Status: Active Planning

Overview

This document outlines improvements, refactors, and optimizations for the yaze emulator core. These changes aim to enhance accuracy, performance, and code maintainability.

Items are presented in order of descending priority, from critical accuracy fixes to quality-of-life improvements.

Critical Priority: APU Timing Fix

Problem Statement

The emulator's Audio Processing Unit (APU) currently fails to load and play music. Analysis shows that the SPC700 processor gets "stuck" during the initial handshake sequence with the main CPU. This handshake is responsible for uploading the sound driver from ROM to APU RAM. The failure of this timing-sensitive process prevents the sound driver from running.

Root Cause: CPU-APU Handshake Timing

The process of starting the APU and loading a sound bank requires tightly synchronized communication between the main CPU (65816) and the APU's CPU (SPC700).

The Handshake Protocol

APU Ready: The SPC700 boots, initializes, and signals that it is ready by writing $AA to port $F4 and $BB to port $F5
CPU Waits: The main CPU waits in a tight loop, reading the combined 16-bit value from its side of these ports ($2140 and $2141) until it sees $BBAA
CPU Initiates: The CPU writes the command code $CC to the APU's input port
APU Acknowledges: The SPC700 sees $CC and prepares to receive a data block
Synchronized Byte Transfer: The CPU and APU enter a lock-step loop to transfer the sound driver byte by byte:
- The CPU sends a byte
- The CPU waits for the APU to read it and echo back a confirmation
- Only upon receiving the confirmation does the CPU send the next byte
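The sequence above can be made concrete with a toy model. This is a minimal sketch, not the yaze implementation: `Ports`, `ApuReceiver`, `CpuUploader`, and `RunHandshake` are hypothetical names, only two of the four ports are modeled, and details of the real boot ROM (destination address setup, end-of-transfer signaling) are omitted. The byte index is used as the echoed confirmation.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical mailbox model: each side writes its own latches and
// reads the other side's. Only two of the four ports are modeled.
struct Ports {
  uint8_t cpu_to_apu[2] = {0, 0};  // written by the CPU
  uint8_t apu_to_cpu[2] = {0, 0};  // written by the SPC700
};

// SPC700 side: signal ready, wait for $CC, then echo each byte index.
class ApuReceiver {
 public:
  void Tick(Ports& p) {
    if (state_ == State::kBoot) {
      p.apu_to_cpu[0] = 0xAA;
      p.apu_to_cpu[1] = 0xBB;
      state_ = State::kWaitCommand;
    } else if (state_ == State::kWaitCommand) {
      if (p.cpu_to_apu[0] == 0xCC) {
        p.apu_to_cpu[0] = 0xCC;  // acknowledge the command
        state_ = State::kReceive;
      }
    } else if (p.cpu_to_apu[0] == expected_) {
      ram.push_back(p.cpu_to_apu[1]);  // store the data byte
      p.apu_to_cpu[0] = expected_++;   // echo the index as confirmation
    }
  }
  std::vector<uint8_t> ram;

 private:
  enum class State { kBoot, kWaitCommand, kReceive };
  State state_ = State::kBoot;
  uint8_t expected_ = 0;
};

// Main CPU side: wait for $BBAA, send $CC, then upload in lock-step.
class CpuUploader {
 public:
  explicit CpuUploader(std::vector<uint8_t> data) : data_(std::move(data)) {}
  bool done() const { return state_ == State::kDone; }
  void Tick(Ports& p) {
    switch (state_) {
      case State::kWaitReady:
        if (p.apu_to_cpu[0] == 0xAA && p.apu_to_cpu[1] == 0xBB) {
          p.cpu_to_apu[0] = 0xCC;
          state_ = State::kWaitAck;
        }
        break;
      case State::kWaitAck:
        if (p.apu_to_cpu[0] == 0xCC)
          state_ = data_.empty() ? State::kDone : State::kSend;
        break;
      case State::kSend:
        p.cpu_to_apu[1] = data_[next_];                 // data byte first
        p.cpu_to_apu[0] = static_cast<uint8_t>(next_);  // then its index
        state_ = State::kWaitEcho;
        break;
      case State::kWaitEcho:
        if (p.apu_to_cpu[0] == static_cast<uint8_t>(next_))
          state_ = (++next_ == data_.size()) ? State::kDone : State::kSend;
        break;
      case State::kDone:
        break;
    }
  }

 private:
  enum class State { kWaitReady, kWaitAck, kSend, kWaitEcho, kDone };
  State state_ = State::kWaitReady;
  std::vector<uint8_t> data_;
  std::size_t next_ = 0;
};

// Interleaves both sides at a fixed tick ratio and returns what the
// APU stored. The guard bounds the loop if the handshake never ends.
inline std::vector<uint8_t> RunHandshake(std::vector<uint8_t> data,
                                         int cpu_ticks_per_apu_tick) {
  Ports ports;
  ApuReceiver apu;
  CpuUploader cpu(std::move(data));
  for (int guard = 0; guard < 1000000 && !cpu.done(); ++guard) {
    for (int i = 0; i < cpu_ticks_per_apu_tick; ++i) cpu.Tick(ports);
    apu.Tick(ports);
  }
  return apu.ram;
}
```

In this toy model the transfer completes for any tick ratio because both sides only poll. It is meant to make the expected sequence explicit, not to reproduce the stall seen in the emulator.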

Point of Failure

The "stuck" behavior occurs because one side fails to meet the other's expectation. Due to timing desynchronization:

The SPC700 is waiting for a byte that the CPU has not yet sent, or that the CPU sent before the SPC700 was ready to see it, OR
The CPU is waiting for an acknowledgment that the SPC700 has not yet sent, or that the SPC700 sent before the CPU began polling for it

The result is an infinite loop on the SPC700, detected by the watchdog timer in Apu::RunCycles.

Technical Analysis

The handshake's reliance on precise timing exposes inaccuracies in the current SPC700 emulation model.

Issue 1: Incomplete Opcode Timing

The emulator uses a static lookup table (spc700_cycles.h) for instruction cycle counts. This provides a base value but fails to account for:

Addressing Modes: Different addressing modes have different cycle costs
Page Boundaries: Memory accesses crossing 256-byte page boundaries take an extra cycle
Branching: Conditional branches take different cycle counts depending on whether branch is taken

While some of this is handled (e.g., DoBranch), it is not applied universally, leading to small, cumulative errors.

Issue 2: Fragile Multi-Step Execution Model

The step/bstep mechanism in Spc700::RunOpcode is a significant source of fragility. It attempts to model complex instructions by spreading execution across multiple calls. This means the full cycle cost of an instruction is not consumed atomically. An off-by-one error in any step corrupts the timing of the entire APU.

Issue 3: Floating-Point Precision

The use of double for the apuCyclesPerMaster ratio can introduce small floating-point rounding errors. Over the thousands of cycles the handshake requires, these errors can accumulate and contribute to timing drift between the CPU and APU.

Proposed Solution: Cycle-Accurate Refactoring

Step 1: Implement Cycle-Accurate Instruction Execution

The Spc700::RunOpcode function must be refactored to calculate and consume the exact cycle count for each instruction before execution.

Calculate Exact Cost: Before running an opcode, determine its precise cycle cost by analyzing opcode, addressing mode, and potential page-boundary penalties
Atomic Execution: Remove the bstep mechanism. An instruction, no matter how complex, should be fully executed within a single call to a new Spc700::Step() function
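As an illustration of atomic execution, here is a minimal sketch of a `Step()` that runs one whole instruction and returns its cost. `MiniSpc700` is a hypothetical, heavily reduced core with three opcodes. The opcode values and cycle counts follow commonly published SPC700 tables and should be checked against spc700_cycles.h before being relied on.

```cpp
#include <array>
#include <cstdint>

// Hypothetical reduced SPC700 core: three opcodes, no step/bstep state.
class MiniSpc700 {
 public:
  std::array<uint8_t, 0x10000> ram{};
  uint16_t pc = 0;
  uint8_t a = 0;
  bool zero = false;

  // Executes exactly one instruction and returns the cycles it consumed.
  int Step() {
    const uint8_t opcode = ram[pc++];
    switch (opcode) {
      case 0x00:  // NOP
        return 2;
      case 0xE8:  // MOV A, #imm
        a = ram[pc++];
        zero = (a == 0);
        return 2;
      case 0xD0: {  // BNE rel
        const auto offset = static_cast<int8_t>(ram[pc++]);
        if (!zero) {
          pc = static_cast<uint16_t>(pc + offset);
          return 4;  // base cost plus the taken-branch penalty
        }
        return 2;  // branch not taken
      }
      default:
        return 2;  // placeholder for opcodes this sketch does not model
    }
  }
};
```

Because the cost is returned from the same call that performs the work, the caller can never observe a half-executed instruction.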

Step 2: Centralize the APU Execution Loop

The main Apu::RunCycles loop should be the sole driver of APU time.

Cycle Budget: At the start of a frame, calculate the total "budget" of APU cycles needed
Cycle-by-Cycle Stepping: Loop, calling Spc700::Step() and Dsp::Cycle(), decrementing cycle budget until exhausted

Example of the new loop in Apu::RunCycles:

void Apu::RunCycles(uint64_t master_cycles) {
  // 1. Calculate the cycle budget for this frame
  const uint64_t target_apu_cycles = ...;

  // 2. Run the APU until the budget is met
  while (cycles_ < target_apu_cycles) {
    // 3. Execute one full SPC700 instruction and get its true cycle cost
    const int spc_cycles_consumed = spc700_.Step();

    // 4. Advance the DSP and timers once per cycle consumed
    for (int i = 0; i < spc_cycles_consumed; ++i) {
      Cycle();  // Ticks the DSP and timers; assumed to advance cycles_
    }
  }
}

Step 3: Use Integer-Based Cycle Ratios

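To eliminate floating-point errors, replace the double apuCyclesPerMaster ratio with an integer ratio (a numerator and a denominator) and carry the division remainder from one call to the next. Because no fraction is ever discarded, the conversion from master-clock cycles to APU cycles does not drift over long periods.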
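One way to do the integer conversion is to keep the division remainder between calls. The sketch below is an assumption about how this could look, not the existing code: `CycleConverter` is a hypothetical name, and the numerator and denominator must come from the emulator's actual clock constants.

```cpp
#include <cstdint>

// Converts master-clock cycles to APU cycles with an exact integer ratio.
// The fractional part is carried in remainder_, so nothing is lost
// between calls.
class CycleConverter {
 public:
  CycleConverter(uint64_t numerator, uint64_t denominator)
      : num_(numerator), den_(denominator) {}

  // Returns the whole APU cycles that elapse during `master_cycles`.
  // Assumes master_cycles * numerator fits in 64 bits, which holds for
  // per-frame cycle counts.
  uint64_t Advance(uint64_t master_cycles) {
    const uint64_t scaled = master_cycles * num_ + remainder_;
    remainder_ = scaled % den_;
    return scaled / den_;
  }

 private:
  uint64_t num_;
  uint64_t den_;
  uint64_t remainder_ = 0;
};
```

The invariant is that the APU cycles returned so far, times the denominator, plus the stored remainder, always equals the master cycles seen so far times the numerator. Splitting the same span into many small calls therefore yields the same total as one large call.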

High Priority: Core Architecture & Timing Model

CPU Cycle Counting

Issue: The main CPU loop in Snes::RunCycle() advances the master cycle counter by a fixed amount (+= 2). Real 65816 instructions have variable cycle counts. The current workaround of scattering callbacks_.idle() calls is error-prone and difficult to maintain.
Recommendation: Refactor Cpu::ExecuteInstruction to calculate and return the precise cycle cost of each instruction, including penalties for addressing modes and memory access speeds. The main Snes loop should then consume this exact value, centralizing timing logic and dramatically improving accuracy.

Main Synchronization Loop

Issue: The main loop in Snes::RunFrame() is state-driven based on the in_vblank_ flag. This can be fragile and makes it difficult to reason about component state at any given cycle.
Recommendation: Transition to a unified main loop driven by a single master cycle counter. In this model, each component (CPU, PPU, APU, DMA) is "ticked" forward based on the master clock. This is a more robust and modular architecture that simplifies component synchronization.
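A master-clock-driven loop might look like the sketch below. Everything here is hypothetical: `Component` stands in for the CPU, PPU, APU, and DMA, and each step has a fixed cost only to keep the example short. A real component would return the variable cost of the work it just performed.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical component: tracks the master-clock time of its next step.
struct Component {
  explicit Component(uint64_t cost) : cycles_per_step(cost) {}
  uint64_t next_due = 0;     // master-clock time at which it runs next
  uint64_t steps = 0;        // number of steps executed (for inspection)
  uint64_t cycles_per_step;  // fixed cost, a simplification for this sketch
  uint64_t Step() {
    ++steps;
    return cycles_per_step;
  }
};

// Runs all components until each has caught up with `target`.
// The component that is furthest behind always runs next, so no
// component gets ahead of another by more than one step.
inline void RunUntil(std::vector<Component>& parts, uint64_t target) {
  while (true) {
    Component* next = nullptr;
    for (Component& c : parts) {
      if (c.next_due < target && (!next || c.next_due < next->next_due)) {
        next = &c;
      }
    }
    if (!next) break;  // everyone has reached the target time
    next->next_due += next->Step();
  }
}
```

With this shape, frame boundaries and vblank become points on the master timeline rather than states that drive the loop.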

Medium Priority: PPU Performance

Rendering Approach Optimization

Issue: The PPU currently uses a "pixel-based" renderer (Ppu::RunLine calls HandlePixel for every pixel). This is highly accurate but can be slow due to high function call overhead and poor cache locality.
Optimization: Refactor the PPU to use a scanline-based renderer. Instead of processing one pixel at a time, process all active layers for an entire horizontal scanline, compose them into a temporary buffer, and then write the completed scanline to the framebuffer. This is a major architectural change but is a standard and highly effective optimization technique in SNES emulation.

Benefits:

Reduced function call overhead
Better cache locality
Easier to vectorize/SIMD
Standard approach in performance-oriented SNES emulators such as Snes9x
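The scanline approach can be sketched as follows. This is an illustrative assumption, not the yaze PPU: `Layer`, `ComposeLine`, and the `kTransparent` sentinel are hypothetical, and a real layer would decode tiles instead of filling a solid span.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int kWidth = 256;
using Line = std::array<uint16_t, kWidth>;
// Hypothetical sentinel: SNES colors are 15-bit, so 0xFFFF is never valid.
constexpr uint16_t kTransparent = 0xFFFF;

// Hypothetical layer that fills a whole scanline in one call.
struct Layer {
  uint16_t color;
  int first_x;
  int last_x;  // opaque span is [first_x, last_x]
  void RenderLine(int /*y*/, Line& out) const {
    out.fill(kTransparent);
    for (int x = first_x; x <= last_x; ++x) out[x] = color;
  }
};

// Composes the layers back to front into `dst`. Later layers in the
// vector overwrite earlier ones wherever they are opaque.
inline void ComposeLine(const std::vector<Layer>& layers, int y,
                        uint16_t backdrop, Line& dst) {
  dst.fill(backdrop);
  Line scratch;
  for (const Layer& layer : layers) {
    layer.RenderLine(y, scratch);  // one call per layer per line
    for (int x = 0; x < kWidth; ++x) {
      if (scratch[x] != kTransparent) dst[x] = scratch[x];
    }
  }
}
```

The inner loops run over contiguous arrays, which is what makes this layout friendlier to the cache and to vectorization than one function call per pixel.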

Low Priority: Code Quality & Refinements

APU Code Modernization

Issue: The code in dsp.cc and spc700.cc, inherited from other projects, is written in a very C-like style, using raw pointers, memset, and numerous "magic numbers."
Refactor: Gradually refactor this code to use modern C++ idioms:
- Replace raw arrays with std::array
- Use constructors with member initializers instead of memset
- Define constexpr variables or enum class types for hardware registers and flags
- Improve type safety, readability, and long-term maintainability
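A small sketch of what these idioms could look like together. The type and member names (`DspReg`, `Voice`, `DspState`) are hypothetical and do not mirror dsp.cc. The register addresses shown are the commonly documented S-DSP ones and should be verified against the existing code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Named registers instead of magic numbers.
enum class DspReg : uint8_t {
  kMainVolLeft = 0x0C,
  kMainVolRight = 0x1C,
  kKeyOn = 0x4C,
};

constexpr std::size_t kVoiceCount = 8;
constexpr std::size_t kRegisterCount = 128;

// Default member initializers replace memset in a constructor.
struct Voice {
  uint16_t pitch = 0;
  int16_t envelope = 0;
  bool key_on = false;
};

class DspState {
 public:
  uint8_t Read(DspReg reg) const {
    return regs_[static_cast<std::size_t>(reg)];
  }
  void Write(DspReg reg, uint8_t value) {
    regs_[static_cast<std::size_t>(reg)] = value;
  }
  // at() bounds-checks the index, unlike a raw array access.
  const Voice& voice(std::size_t index) const { return voices_.at(index); }

 private:
  std::array<uint8_t, kRegisterCount> regs_{};  // value-initialized to zero
  std::array<Voice, kVoiceCount> voices_{};
};
```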

Audio Subsystem & Buffering

Issue: The current implementation in Emulator::Run queues audio samples directly to the SDL audio device. If the emulator lags for even a few frames, the audio buffer can underrun, causing audible pops and stutters.
Improvement: Implement a lock-free ring buffer (or circular buffer) to act as an intermediary. The emulator thread would continuously write generated samples into this buffer, while the audio device (in its own thread) would continuously read from it. This decouples the emulation speed from the audio hardware, smoothing out performance fluctuations and preventing stutter.
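A minimal single-producer/single-consumer ring buffer could be sketched as below. `SampleRing` is a hypothetical name, and the sketch assumes exactly one writer thread (the emulator) and one reader thread (the audio callback). It is lock-free only where `std::atomic<std::size_t>` is itself lock-free, which is the case on common desktop platforms.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Single-producer/single-consumer ring buffer for audio samples.
// head_ and tail_ count total pushes and pops; masking maps them to slots.
template <std::size_t N>
class SampleRing {
  static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

 public:
  // Producer side (emulator thread). Returns false if the buffer is full.
  bool Push(int16_t sample) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == N) return false;
    data_[head & (N - 1)] = sample;
    head_.store(head + 1, std::memory_order_release);  // publish the slot
    return true;
  }

  // Consumer side (audio thread). Returns false if the buffer is empty.
  bool Pop(int16_t& out) {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (head == tail) return false;
    out = data_[tail & (N - 1)];
    tail_.store(tail + 1, std::memory_order_release);  // free the slot
    return true;
  }

 private:
  std::array<int16_t, N> data_{};
  std::atomic<std::size_t> head_{0};
  std::atomic<std::size_t> tail_{0};
};
```

When `Pop` returns false, the audio callback can output silence or repeat the last sample instead of blocking. When `Push` returns false, the emulator thread can decide whether to wait or drop.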

Debugger & Tooling Optimizations

DisassemblyViewer Data Structure

Issue: DisassemblyViewer uses a std::map to store instruction traces. For a tool that handles frequent insertions and lookups, this can be suboptimal.
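Optimization: Replace std::map with std::unordered_map, which offers O(1) average-case insertion and lookup instead of O(log n). Note that std::unordered_map does not keep keys sorted, so any code that iterates the trace in address order would need a separate sorted view.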

BreakpointManager Lookups

Issue: The ShouldBreakOn... functions perform a linear scan over a std::vector of all breakpoints. This is O(n) and could become a minor bottleneck if a very large number of breakpoints are set.
Optimization: For execution breakpoints, use a std::unordered_set<uint32_t> for O(1) average lookup time. This would make breakpoint checking near-instantaneous, regardless of how many are active.
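A minimal sketch of the set-based lookup follows. `ExecBreakpoints` is a hypothetical class, not the existing BreakpointManager interface, and masking addresses to 24 bits is an assumption based on the 65816 address width.

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical container for execution breakpoints only.
class ExecBreakpoints {
 public:
  void Add(uint32_t address) { addresses_.insert(address & kAddressMask); }
  void Remove(uint32_t address) { addresses_.erase(address & kAddressMask); }

  // O(1) on average, independent of how many breakpoints are set.
  bool ShouldBreakOnExecute(uint32_t pc) const {
    return addresses_.count(pc & kAddressMask) != 0;
  }

 private:
  static constexpr uint32_t kAddressMask = 0xFFFFFF;  // 24-bit bus (assumed)
  std::unordered_set<uint32_t> addresses_;
};
```

Breakpoints that carry conditions or cover address ranges would still need a secondary structure. The set only answers whether the program counter is worth a closer look.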

Completed Improvements

Audio System Fixes (v0.4.0)

Problem Statement

The SNES emulator experienced audio glitchiness and skips, particularly during the ALTTP title screen, with audible pops, crackling, and sample skipping during music playback.

Root Causes Fixed

Aggressive Sample Dropping: Audio buffering logic was dropping up to 50% of generated samples, creating discontinuities
Incorrect Resampling: Duplicate calculations in linear interpolation wasted CPU cycles
Missing Frame Synchronization: DSP's NewFrame() method was never called, causing timing drift
Missing Hermite Interpolation: Only Linear/Cosine/Cubic were available (Hermite is the interpolation used by bsnes and Snes9x)

Solutions Implemented

Never Drop Samples: Always queue all generated samples unless buffer critically full (>4 frames)
Fixed Resampling Code: Removed duplicate calculations and added bounds checking
Frame Boundary Synchronization: Added dsp.NewFrame() call before sample generation
Hermite Interpolation: New interpolation type matching bsnes/Snes9x standard

Interpolation options (src/app/emu/audio/dsp.cc):

Interpolation   Notes
Linear          Fastest; retains legacy behaviour.
Hermite         New default; balances quality and speed.
Cosine          Smoother than linear with moderate cost.
Cubic           Highest quality, heavier CPU cost.
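For reference, a common 4-point Hermite (Catmull-Rom style) interpolator is sketched below. This is an assumption about the general technique and not necessarily the exact form used in dsp.cc. `HermiteInterpolate` is a hypothetical name.

```cpp
// Interpolates between y1 and y2 using neighbors y0 and y3 to shape the
// curve. t is the fractional position in [0, 1]: t = 0 returns y1 and
// t = 1 returns y2.
inline float HermiteInterpolate(float y0, float y1, float y2, float y3,
                                float t) {
  const float c0 = y1;
  const float c1 = 0.5f * (y2 - y0);
  const float c2 = y0 - 2.5f * y1 + 2.0f * y2 - 0.5f * y3;
  const float c3 = 0.5f * (y3 - y0) + 1.5f * (y1 - y2);
  return ((c3 * t + c2) * t + c1) * t + c0;  // Horner evaluation
}
```

On evenly rising input the curve reduces to a straight line. On a peak it can overshoot slightly, so integer output should be clamped to the sample range.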

Result: In manual testing, the ALTTP title screen, overworld theme, dungeon ambience, and menu sounds no longer exhibit audible pops or skips. Continue to monitor regression tests after the APU timing refactor lands.

Implementation Priority

Critical (v0.4.0): APU timing fix - Required for music playback
High (v0.5.0): CPU cycle counting accuracy - Required for game compatibility
High (v0.5.0): Main synchronization loop refactor - Foundation for accuracy
Medium (v0.6.0): PPU scanline renderer - Performance optimization
Low (ongoing): Code quality improvements - Technical debt reduction

Success Metrics

APU Timing Fix Success

[ ] Music plays in all tested games
[ ] Sound effects work correctly
[ ] No audio glitches or stuttering
[ ] Handshake completes within expected cycle count

Overall Emulation Accuracy

[ ] CPU cycle accuracy within ±1 cycle per instruction
[ ] APU synchronized within ±1 cycle with CPU
[ ] PPU timing accurate to scanline level
[ ] All test ROMs pass

Performance Targets

[ ] 60 FPS on modest hardware (2015+ laptops)
[ ] PPU optimizations provide 20%+ speedup
[ ] Audio buffer never underruns in normal operation

	yaze 0.3.2 Link to the Past ROM Editor
