yaze 0.3.2
Link to the Past ROM Editor
 
APU Timing Fix - Technical Analysis

Branch: feature/apu-timing-fix
Date: October 10, 2025
Status: ✅ Implemented - Core Timing Fixed (Minor Audio Glitches Remain)


Implementation Status

✅ Completed:

  • Atomic Step() function for SPC700
  • Fixed-point cycle ratio (no floating-point drift)
  • Cycle budget model in APU
  • Removed bstep mechanism from instructions.cc
  • Cycle-accurate instruction implementations
  • Proper branch timing (+2 cycles when taken)
  • Dummy read/write cycles for MOV and RMW instructions

⚠️ Known Issues:

  • Some audio glitches/distortion during playback
  • Minor timing inconsistencies under investigation
  • Can be improved in future iterations

Note: The APU now executes correctly and music plays, but audio quality can be further refined.

Problem Summary

Before this fix, the APU failed to load and play music because the SPC700 got stuck during the initial CPU-APU handshake, which uploads the sound driver from ROM to APU RAM. Timing desynchronization left each processor waiting on the other, and the resulting infinite loops were caught by the watchdog timer.


Current Implementation Analysis

1. Cycle Counting System (spc700.cc)

Current Approach:

// In spc700.h line 87:
int last_opcode_cycles_ = 0;
// In RunOpcode() line 80:
last_opcode_cycles_ = spc700_cycles[opcode]; // Static lookup

Problem: The spc700_cycles[] array provides BASELINE cycle counts only. It does NOT account for:

  • Addressing mode variations
  • Page boundary crossings (+1 cycle)
  • Branch taken vs not taken (+2 cycles if taken)
  • Memory access penalties

2. The bstep Mechanism (spc700.cc)

What is bstep?

bstep is a "business step" counter used to spread complex multi-step instructions across multiple calls to RunOpcode().

Example from line 1108-1115 (opcode 0xCB - MOVSY dp):

case 0xcb: {  // movsy dp
  if (bstep == 0) {
    adr = dp();  // Save address for bstep=1
  }
  if (adr == 0x00F4 && bstep == 1) {
    LOG_DEBUG("SPC", "MOVSY writing Y=$%02X to F4 at PC=$%04X", Y, PC);
  }
  MOVSY(adr);  // Use saved address
  break;
}

The MOVSY() function internally increments bstep to track progress:

  • bstep=0: Call dp() to get address
  • bstep=1: Actually perform the write
  • bstep=2: Reset to 0, instruction complete

Why this is fragile:

  1. Non-atomic execution: An instruction takes 2-3 calls to RunOpcode() to complete
  2. State leakage: If bstep gets out of sync, all future instructions fail
  3. Cycle accounting errors: Cycles are consumed incrementally, not atomically
  4. Debugging nightmare: Hard to trace when an instruction "really" executes
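
To make the state-leakage risk concrete, here is a toy model contrasting an instruction split across calls with one that completes in a single call. SteppedCpu and AtomicCpu are hypothetical names and do not mirror the real Spc700 class; the 4-cycle cost is an assumed value for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model only: hypothetical types, not the yaze Spc700 implementation.
struct SteppedCpu {
  std::array<uint8_t, 256> ram{};
  uint8_t y = 0;
  int bstep = 0;     // progress marker that survives between calls
  uint16_t adr = 0;  // address latched during slice 0

  // Each call performs one slice of a "MOV dp, Y"-style store.
  // Returns true once the instruction has finished.
  bool RunSlice(uint8_t operand) {
    assert(bstep == 0 || bstep == 1);
    if (bstep == 0) {
      adr = operand;  // slice 0: latch the address only
      bstep = 1;
      return false;
    }
    ram[adr] = y;  // slice 1: write to whatever address was latched earlier
    bstep = 0;
    return true;
  }
};

struct AtomicCpu {
  std::array<uint8_t, 256> ram{};
  uint8_t y = 0;

  // One call performs the whole instruction and reports its cycle cost.
  int Step(uint8_t operand) {
    ram[operand] = y;
    return 4;  // assumed cost, for illustration
  }
};
```

If the caller loses track of bstep and passes a new operand on the second call, SteppedCpu still writes to the stale latched address. AtomicCpu has no intermediate state to get out of sync.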

3. APU Main Loop (apu.cc:73-143)

Current implementation:

void Apu::RunCycles(uint64_t master_cycles) {
  const double ratio =
      memory_.pal_timing() ? apuCyclesPerMasterPal : apuCyclesPerMaster;
  uint64_t master_delta = master_cycles - g_last_master_cycles;
  g_last_master_cycles = master_cycles;
  const uint64_t target_apu_cycles =
      cycles_ + static_cast<uint64_t>(master_delta * ratio);
  while (cycles_ < target_apu_cycles) {
    spc700_.RunOpcode();  // Variable cycles
    int spc_cycles = spc700_.GetLastOpcodeCycles();
    for (int i = 0; i < spc_cycles; ++i) {
      Cycle();  // Advance DSP/timers
    }
  }
}

Problems:

  1. Floating-point ratio: apuCyclesPerMaster is double (line 17), causing precision drift
  2. Opcode-level granularity: Advances by opcode, not by cycle
  3. No sub-cycle accuracy: Can't model instructions that span multiple cycles

4. Floating-Point Precision (apu.cc:17)

static const double apuCyclesPerMaster = (32040 * 32) / (1364 * 262 * 60.0);

Calculation:

  • Numerator: 32040 * 32 = 1,025,280
  • Denominator: 1364 * 262 * 60.0 = 21,442,080
  • Result: ~0.04782 (floating point)

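Problem: The ratio is not exactly representable as a double, and RunCycles() truncates the product to an integer on every call. Over thousands of calls these small errors accumulate, causing timing drift.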
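
A rough way to see the accumulation is to compare converting many small deltas one call at a time against converting their total once. The sketch below uses the ratio from this section; the function names and the fixed delta of 100 master cycles are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Ratio as written in this section.
constexpr double kRatio = (32040 * 32) / (1364 * 262 * 60.0);

// Mimics a per-call conversion: the product is truncated to an integer on
// every call, so the fractional APU cycle is discarded each time.
inline uint64_t AccumulatePerCall(uint64_t master_delta, int calls) {
  uint64_t apu_cycles = 0;
  for (int i = 0; i < calls; ++i) {
    apu_cycles += static_cast<uint64_t>(master_delta * kRatio);
  }
  return apu_cycles;
}

// Reference: convert the total number of master cycles in one step.
inline uint64_t ConvertTotal(uint64_t master_delta, int calls) {
  return static_cast<uint64_t>(master_delta * calls * kRatio);
}
```

With 1000 calls of 100 master cycles each, the per-call path yields 4000 APU cycles while the single conversion yields 4781, so most of the loss in this toy case comes from discarding the fraction on each call.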


Root Cause: Handshake Timing Failure

The Handshake Protocol

  1. APU Ready: SPC700 writes $AA to $F4, $BB to $F5
  2. CPU Waits: Main CPU polls for $BBAA
  3. CPU Initiates: Writes $CC to APU input port
  4. APU Acknowledges: SPC700 sees $CC, prepares to receive
  5. Byte Transfer Loop: CPU sends byte, waits for echo confirmation, sends next byte
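
The five steps above can be sketched as a simplified, single-threaded simulation. The Ports layout and the UploadBlock name are assumptions for illustration, and the polling loops of real hardware are replaced by assertions because both sides run in lockstep here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the four shared I/O ports.
struct Ports {
  std::array<uint8_t, 4> to_apu{};  // written by the CPU, read by the SPC700
  std::array<uint8_t, 4> to_cpu{};  // written by the SPC700, read by the CPU
};

// Walks through the handshake and returns what the APU side received.
inline std::vector<uint8_t> UploadBlock(const std::vector<uint8_t>& data) {
  assert(data.size() <= 256);  // the toy index is a single byte
  Ports p;
  std::vector<uint8_t> apu_ram;

  p.to_cpu[0] = 0xAA;  // 1. SPC700 signals ready
  p.to_cpu[1] = 0xBB;
  assert(p.to_cpu[0] == 0xAA && p.to_cpu[1] == 0xBB);  // 2. CPU sees $BBAA
  p.to_apu[0] = 0xCC;                                  // 3. CPU initiates
  p.to_cpu[0] = p.to_apu[0];                           // 4. SPC700 echoes $CC
  assert(p.to_cpu[0] == 0xCC);

  // 5. Byte transfer loop: data byte plus index, then wait for the echo.
  for (std::size_t i = 0; i < data.size(); ++i) {
    const uint8_t index = static_cast<uint8_t>(i);
    p.to_apu[1] = data[i];           // CPU: data byte
    p.to_apu[0] = index;             // CPU: index marks the byte as ready
    apu_ram.push_back(p.to_apu[1]);  // SPC700: stores the byte
    p.to_cpu[0] = p.to_apu[0];       // SPC700: echoes the index
    assert(p.to_cpu[0] == index);    // CPU: echo confirmed, send next byte
  }
  return apu_ram;
}
```

In an emulator the two sides advance on separate clocks, so each assertion here corresponds to a wait loop whose exit depends on the other side having run far enough.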

Where It Gets Stuck

The SPC700 enters an infinite loop because:

  • SPC700 is waiting for a byte from CPU (hasn't arrived yet)
  • CPU is waiting for acknowledgment from SPC700 (already sent, but missed)

This happens because cycle counts are off by 1-2 cycles per instruction, which accumulates over the ~500-1000 instructions in the handshake.


LakeSnes Comparison Analysis

What LakeSnes Does Right

1. Atomic Instruction Execution (spc.c:73-93)

void spc_runOpcode(Spc* spc) {
  if(spc->resetWanted) { /* handle reset */ return; }
  if(spc->stopped) { spc_idleWait(spc); return; }
  uint8_t opcode = spc_readOpcode(spc);
  spc_doOpcode(spc, opcode); // COMPLETE instruction in one call
}

Key insight: LakeSnes executes instructions atomically - no bstep, no step, no state leakage.

2. Cycle Tracking via Callbacks (spc.c:406-409)

static void spc_movsy(Spc* spc, uint16_t adr) {
  spc_read(spc, adr); // Calls apu_cycle()
  spc_write(spc, adr, spc->y); // Calls apu_cycle()
}

Every spc_read(), spc_write(), and spc_idle() call triggers apu_cycle(), which:

  • Advances APU cycle counter
  • Ticks DSP every 32 cycles
  • Updates timers
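
A minimal sketch of this callback pattern, assuming a hypothetical MiniApu type (not LakeSnes or yaze code): every bus access advances the clock by one cycle, and the DSP is ticked on every 32nd cycle.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the APU: owns RAM and the cycle counter.
struct MiniApu {
  std::array<uint8_t, 0x10000> ram{};
  uint64_t cycles = 0;
  uint64_t dsp_ticks = 0;

  void Cycle() {
    if ((cycles & 0x1F) == 0) ++dsp_ticks;  // DSP runs once per 32 cycles
    ++cycles;
  }
  uint8_t Read(uint16_t adr) {
    Cycle();
    return ram[adr];
  }
  void Write(uint16_t adr, uint8_t value) {
    Cycle();
    ram[adr] = value;
  }
  void Idle() { Cycle(); }
};

// Same shape as spc_movsy: a dummy read followed by the write.
inline void MovSY(MiniApu& apu, uint16_t adr, uint8_t y) {
  apu.Read(adr);
  apu.Write(adr, y);
}
```

Because the cost falls out of the accesses themselves, MovSY advances the clock by two cycles without consulting any table.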

3. Simple Addressing Mode Functions (spc.c:189-275)

static uint16_t spc_adrDp(Spc* spc) {
  return spc_readOpcode(spc) | (spc->p << 8);
}
static uint16_t spc_adrDpx(Spc* spc) {
  uint16_t res = ((spc_readOpcode(spc) + spc->x) & 0xff) | (spc->p << 8);
  spc_idle(spc); // Extra cycle for indexed addressing
  return res;
}

Each memory access and idle call automatically advances cycles.
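
As a worked example of the address arithmetic alone (cycle callbacks omitted), the following sketch uses hypothetical free functions AdrDp and AdrDpx that take the operand byte and the P flag as plain arguments.

```cpp
#include <cassert>
#include <cstdint>

// Direct page: operand byte in the low 8 bits, P flag selects page 0 or 1.
constexpr uint16_t AdrDp(uint8_t operand, bool p_flag) {
  return static_cast<uint16_t>(operand | (p_flag ? 0x100 : 0x000));
}

// Indexed direct page: the sum wraps inside the selected page.
constexpr uint16_t AdrDpx(uint8_t operand, uint8_t x, bool p_flag) {
  return static_cast<uint16_t>(((operand + x) & 0xFF) |
                               (p_flag ? 0x100 : 0x000));
}

static_assert(AdrDp(0xF4, false) == 0x00F4, "page 0 direct address");
static_assert(AdrDpx(0xF0, 0x20, true) == 0x0110, "index wraps within page 1");
```

The second assertion shows the wrap: 0xF0 + 0x20 overflows the low byte, so the result stays in page 1 at offset 0x10 rather than spilling into page 2.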

4. APU Main Loop (apu.c:73-82)

int apu_runCycles(Apu* apu, int wantedCycles) {
  int runCycles = 0;
  uint32_t startCycles = apu->cycles;
  while(runCycles < wantedCycles) {
    spc_runOpcode(apu->spc);
    runCycles += (uint32_t) (apu->cycles - startCycles);
    startCycles = apu->cycles;
  }
  return runCycles;
}

Note: This approach tracks cycles by delta, which works because every memory access calls apu_cycle().

Where LakeSnes Falls Short (And How We Can Do Better)

1. No Explicit Cycle Return

  • LakeSnes relies on tracking cycles delta after each opcode
  • Doesn't return precise cycle count from spc_runOpcode()
  • Makes it hard to validate cycle accuracy per instruction

Our improvement: Return exact cycle count from Step():

int Spc700::Step() {
  uint8_t opcode = ReadOpcode();
  int cycles = CalculatePreciseCycles(opcode);
  ExecuteInstructionAtomic(opcode);
  return cycles;  // EXPLICIT return
}

2. Implicit Cycle Counting

  • Cycles accumulated implicitly through callbacks
  • Hard to debug when cycles are wrong
  • No way to verify cycle accuracy per instruction

Our improvement: Explicit cycle budget model in Apu::RunCycles():

while (cycles_ < target_apu_cycles) {
  int spc_cycles = spc700_.Step();  // Explicit cycle count
  for (int i = 0; i < spc_cycles; ++i) {
    Cycle();  // Explicit cycle advancement
  }
}
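
One property of this loop worth spelling out is that it can overshoot the target by part of an instruction, and the overshoot is absorbed by the next call because the counter persists. The sketch below illustrates this with a hypothetical BudgetRunner that replays a scripted list of instruction costs instead of executing real opcodes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical runner: costs stand in for the values Step() would return.
struct BudgetRunner {
  std::vector<int> costs;  // per-instruction cycle counts, replayed in order
  std::size_t next = 0;
  uint64_t cycles = 0;  // persistent cycle counter, like cycles_

  int Step() {
    assert(!costs.empty());
    const int cost = costs[next];
    next = (next + 1) % costs.size();
    return cost;
  }

  // Runs whole instructions until the counter reaches the target.
  void RunUntil(uint64_t target) {
    while (cycles < target) {
      const int spent = Step();
      for (int i = 0; i < spent; ++i) {
        ++cycles;  // stands in for Cycle()
      }
    }
  }
};
```

With costs {4, 6, 2}, RunUntil(9) stops at 10 cycles (one cycle over budget), and a following RunUntil(12) executes only the 2-cycle instruction.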

3. No Fixed-Point Ratio

  • LakeSnes also uses floating-point (implicitly in SNES main loop)
  • Subject to same precision drift issues

Our improvement: Integer numerator/denominator for perfect precision.

What We're Adopting from LakeSnes

  • ✅ Atomic instruction execution - No bstep mechanism
  • ✅ Simple addressing mode functions - Return address, advance cycles via callbacks
  • ✅ Cycle advancement per memory access - Every read/write/idle advances cycles

What We're Improving Over LakeSnes

  • ✅ Explicit cycle counting - Step() returns exact cycles consumed
  • ✅ Cycle budget model - Clear loop with explicit cycle advancement
  • ✅ Fixed-point ratio - Integer arithmetic for perfect precision
  • ✅ Testability - Easy to verify cycle counts per instruction


Solution Design

Phase 1: Atomic Instruction Execution

Goal: Eliminate bstep mechanism entirely.

New Design:

// New function signature
int Spc700::Step() {
  if (reset_wanted_) { /* handle reset */ return 8; }
  if (stopped_) { /* handle stop */ return 2; }
  // Fetch opcode
  uint8_t opcode = ReadOpcode();
  // Calculate EXACT cycle cost upfront
  int cycles = CalculatePreciseCycles(opcode);
  // Execute instruction COMPLETELY
  ExecuteInstructionAtomic(opcode);
  return cycles;  // Return exact cycles consumed
}

Benefits:

  • One call = one complete instruction
  • Cycles calculated before execution
  • No state leakage between calls
  • Easier debugging

Phase 2: Precise Cycle Calculation

New function:

int Spc700::CalculatePreciseCycles(uint8_t opcode) {
  int base_cycles = spc700_cycles[opcode];
  // Account for addressing mode penalties
  switch (opcode) {
    case 0x10: case 0x30: /* ... branches ... */
      // Branches: +2 cycles if taken (handled in execution)
      break;
    case 0x15: case 0x16: /* ... abs+X, abs+Y ... */
      // Check if page boundary crossed (+1 cycle)
      if (will_cross_page_boundary(opcode)) {
        base_cycles += 1;
      }
      break;
    // ... more addressing mode checks ...
  }
  return base_cycles;
}
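
Because a branch's outcome is only known once its condition has been evaluated, one option is to let the execution path report the penalty along with the new program counter. This is a sketch under stated assumptions: ExecuteBranch, BranchResult, and the base cost of 2 cycles are illustrative and not taken from the yaze sources.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kBranchBaseCycles = 2;    // assumed cost when not taken
constexpr int kBranchTakenPenalty = 2;  // +2 cycles when taken

struct BranchResult {
  uint16_t pc;  // program counter after the branch
  int cycles;   // cycles consumed by the branch
};

// pc is the address of the instruction following the branch; offset is the
// signed relative displacement read from the operand byte.
inline BranchResult ExecuteBranch(uint16_t pc, int8_t offset, bool taken) {
  if (!taken) {
    return BranchResult{pc, kBranchBaseCycles};
  }
  // Arithmetic is done in int and wraps to 16 bits on the cast.
  const uint16_t target = static_cast<uint16_t>(pc + offset);
  return BranchResult{target, kBranchBaseCycles + kBranchTakenPenalty};
}
```

A Step() built this way would add the returned cycles to its total instead of trying to predict the penalty before execution.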

Phase 3: Refactor Apu::RunCycles to Cycle Budget Model

New implementation:

void Apu::RunCycles(uint64_t master_cycles) {
  // 1. Calculate target using FIXED-POINT ratio (Phase 4)
  uint64_t master_delta = master_cycles - g_last_master_cycles;
  g_last_master_cycles = master_cycles;
  // 2. Fixed-point conversion (avoiding floating point)
  uint64_t target_apu_cycles =
      cycles_ + (master_delta * kApuCyclesNumerator) / kApuCyclesDenominator;
  // 3. Run until budget exhausted
  while (cycles_ < target_apu_cycles) {
    // 4. Execute ONE instruction atomically
    int spc_cycles_consumed = spc700_.Step();
    // 5. Advance DSP/timers for each cycle
    for (int i = 0; i < spc_cycles_consumed; ++i) {
      Cycle();  // Ticks DSP, timers, increments cycles_
    }
  }
}

Phase 4: Fixed-Point Cycle Ratio

Replace floating-point with integer ratio:

// Old (apu.cc:17)
static const double apuCyclesPerMaster = (32040 * 32) / (1364 * 262 * 60.0);
// New
static constexpr uint64_t kApuCyclesNumerator = 32040 * 32; // 1,025,280
static constexpr uint64_t kApuCyclesDenominator = 1364 * 262 * 60; // 21,442,080

Conversion:

apu_cycles = (master_cycles * kApuCyclesNumerator) / kApuCyclesDenominator;

Benefits:

  • Exact ratio with no floating-point rounding (fully drift-free only if the division remainder is carried between calls)
  • Integer arithmetic is faster
  • Deterministic across platforms
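
Integer division still discards a fraction on each call, so a conversion that is applied per delta stays exact over time only if the remainder is carried forward. The sketch below shows one way to do that; CycleConverter is a hypothetical helper, not part of the described implementation, and it reuses the constant names from this section.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kApuCyclesNumerator = 32040ULL * 32;          // 1,025,280
constexpr uint64_t kApuCyclesDenominator = 1364ULL * 262 * 60;   // 21,442,080

// Hypothetical converter that keeps the division remainder between calls.
struct CycleConverter {
  uint64_t remainder = 0;  // always less than kApuCyclesDenominator

  // Converts a master-cycle delta to whole APU cycles. The leftover
  // fraction is stored and added to the next call's numerator.
  uint64_t Convert(uint64_t master_delta) {
    const uint64_t scaled = master_delta * kApuCyclesNumerator + remainder;
    remainder = scaled % kApuCyclesDenominator;
    return scaled / kApuCyclesDenominator;
  }
};
```

Summing Convert(100) over 1000 calls gives 4781 APU cycles, the same as converting 100,000 master cycles at once, whereas dropping the remainder would give 4000.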

Implementation Plan

Step 1: Add Spc700::Step() Function

  • Add new Step() method to spc700.h
  • Implement atomic instruction execution
  • Keep RunOpcode() temporarily for compatibility

Step 2: Implement Precise Cycle Calculation

  • Create CalculatePreciseCycles() helper
  • Handle branch penalties
  • Handle page boundary crossings
  • Add tests to verify against known SPC700 timings

Step 3: Eliminate bstep Mechanism

  • Refactor all multi-step instructions (0xCB, 0xD0, 0xD7, etc.)
  • Remove bstep variable
  • Remove step variable
  • Verify all 256 opcodes work atomically

Step 4: Refactor Apu::RunCycles

  • Switch to cycle budget model
  • Use Step() instead of RunOpcode()
  • Add cycle budget logging for debugging

Step 5: Convert to Fixed-Point Ratio

  • Replace apuCyclesPerMaster double
  • Use integer numerator/denominator
  • Add constants for PAL timing too

Step 6: Testing

  • Test with vanilla Zelda3 ROM
  • Verify handshake completes
  • Verify music plays
  • Check for watchdog timeouts
  • Measure timing accuracy

Files to Modify

  1. src/app/emu/audio/spc700.h
    • Add int Step() method
    • Add int CalculatePreciseCycles(uint8_t opcode)
    • Remove bstep and step variables
  2. src/app/emu/audio/spc700.cc
    • Implement Step()
    • Implement CalculatePreciseCycles()
    • Refactor ExecuteInstructions() to be atomic
    • Remove all bstep logic
  3. src/app/emu/audio/apu.h
    • Update cycle ratio constants
  4. src/app/emu/audio/apu.cc
    • Refactor RunCycles() to use Step()
    • Convert to fixed-point ratio
    • Remove floating-point arithmetic
  5. test/unit/spc700_timing_test.cc (new)
    • Test cycle accuracy for all opcodes
    • Test handshake simulation
    • Verify no regressions

Success Criteria

  • [x] All SPC700 instructions execute atomically (one Step() call)
  • [x] Cycle counts accurate to ±1 cycle per instruction
  • [x] APU handshake completes without watchdog timeout
  • [x] Music loads and plays in vanilla Zelda3
  • [x] No floating-point drift over long emulation sessions
  • [ ] Unit tests pass for all 256 opcodes (future work)
  • [ ] Audio quality refined (minor glitches remain)

Implementation Completed

  1. ✅ Create feature branch
  2. ✅ Analyze current implementation
  3. ✅ Implement Spc700::Step() function
  4. ✅ Add precise cycle calculation
  5. ✅ Refactor Apu::RunCycles
  6. ✅ Convert to fixed-point ratio
  7. ✅ Refactor instructions.cc to be atomic and cycle-accurate
  8. ✅ Test with Zelda3 ROM
  9. ⏳ Write unit tests (future work)
  10. ⏳ Fine-tune audio quality (future work)

References: