Monday 19 January 2015

A New Game Console Project - Part 2

...continued


Take 4.1 - The PIC32 and PAL-based Approach (manual pixel push)

As I said in my previous post, after reading about the PIC32-based Maximite, and having been interested in MIPS ever since learning about it at university, I bought a PIC32 dev board called a UBW32. The UBW32 is the red board below. The other board is my AD724 TV encoder board.

 A UBW32 (PIC32) + AD724 TV Encoder

The processor on the UBW32 is a PIC32MX795, which runs at 80MHz and has 512K Flash and 128K RAM. It can execute from RAM and has an instruction and data cache that hides flash read latencies. It has a built-in DMA peripheral that looked perfect for my project, as I could stream pixels out of it without executing any code other than the DMA setup and the sync signal. That would leave more time for game logic. The Parallel Master Port peripheral also looked perfect for supporting an external frame buffer down the track, as it supports programmable wait states for connecting all kinds of memory. The tool chain is GCC-based as well, which suits the way I like to work and fits in well with the other tools I like to use.

I spent months with this board. I drew inspiration from a Microchip Application Note called LCC graphics, or Low Cost Controller-less graphics. After reading that I thought this chip had the DMA capabilities I needed for my application. I'm not saying it doesn't - but I certainly couldn't work it out. More on that later. I still think the PIC32 is an excellent micro-controller and perhaps it could be made to work for this application; I just don't know how.

Anyway, the first experiment I tried was to see if I could get two boxes, one black and one white (one left, one right), to appear on my TV. I started researching how to generate a black and white PAL signal in software. I came across this page from Martin Hinner. This pretty much explains how to do it. The following image from that page has guided me through much of this project:



This image explains the sequence of sync pulses required to generate a non-interlaced 50Hz (well 50.08Hz) PAL signal.  Since my microcontroller ran at 80MHz, each of these durations can be converted to an exact number of CPU cycles.

For example, the short sync is 2us, which is 160 cycles at 80MHz: one cycle takes 1/80 = 0.0125us, and 2/0.0125 = 160 cycles.
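
The same arithmetic can be written as a small C helper. This is only a sketch: the names CPU_HZ, us_to_cycles() and frame_rate_mhz() are mine and do not appear in the project code. The 312-line frame used in the note below is an assumption, chosen because it matches the 50.08Hz figure with a 64us line.

```c
#include <stdint.h>

/* Assumed clock: 80 MHz, i.e. 80 CPU cycles per microsecond. */
#define CPU_HZ        80000000u
#define CYCLES_PER_US (CPU_HZ / 1000000u)

/* Convert a duration in whole microseconds to CPU cycles. */
static inline uint32_t us_to_cycles(uint32_t us)
{
    return us * CYCLES_PER_US;
}

/* Frame rate in millihertz for a frame of `lines` lines,
 * each lasting `line_us` microseconds (integer division truncates). */
static inline uint32_t frame_rate_mhz(uint32_t lines, uint32_t line_us)
{
    return 1000000000u / (lines * line_us);
}
```

With these, us_to_cycles(2) gives the 160 cycles of a short sync and us_to_cycles(64) gives the 5120 cycles of a full line, which is the 320 + 4800 that appears in the code further down. frame_rate_mhz(312, 64) comes out at 50080, i.e. 50.08Hz.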

You may be thinking at this point that I'm crazy and that I can't count on the CPU executing one instruction per cycle. You'd be right, and I had foreseen this. That is why I wanted to use DMA instead. More on that later. This CPU has a very small cache, a pipeline and shared buses. All of these things contribute to an execution rate of less than 80 million instructions/second. But that didn't bother me because this was just a test.

My "design" used a timer interrupt and a simple state machine to take me through the sync pulses and visible lines. On the PIC32 there are a few timers; I picked a 32-bit timer and set it to run at 80MHz, synchronous with the CPU clock. This timer keeps counting forever and I just set the next wake-up time based on how long the sleep is. I have included some of the code below.

 #define BLACK()     mPORTEClearBits(BIT_7|BIT_6|BIT_5|BIT_4|BIT_3|BIT_2|BIT_1|BIT_0);  
 #define SYNC_ACTIVE()  mPORTCClearBits(BIT_1);NOP3();  
 #define SYNC_INACTIVE() mPORTCSetBits(BIT_1);NOP3();  
 #define SYNC_TOGGLE()  mPORTCToggleBits(BIT_1);NOP3();  
 #define FB_WIDTH  416  
 #define FB_HEIGHT  234  
 extern void renderLine(uint8_t *data, uint8_t *palette);  
 // current_next_sync  
 #define VISIBLE   0  
 #define SHORT_LONG_SYNC  1  
 #define SHORT_SHORT_SYNC 2  
 #define LONG_LONG_SYNC 3  
 #define LONG_SHORT_SYNC 4  
 #define SHORT_VISIBLE_SYNC 5  
 volatile uint32_t syncSequence[17] = {  
   SHORT_SHORT_SYNC, SHORT_SHORT_SYNC,  
   SHORT_SHORT_SYNC, SHORT_SHORT_SYNC,  
   SHORT_SHORT_SYNC, SHORT_LONG_SYNC,  
   LONG_LONG_SYNC, LONG_LONG_SYNC,  
   LONG_LONG_SYNC, LONG_LONG_SYNC,  
   LONG_SHORT_SYNC, SHORT_SHORT_SYNC,  
   SHORT_SHORT_SYNC, SHORT_SHORT_SYNC,  
   SHORT_SHORT_SYNC, SHORT_VISIBLE_SYNC,  
   VISIBLE  
 };  
 volatile uint32_t *currentNextSyncType;  
 volatile uint32_t frameCounter = 0;  
 volatile uint32_t line = 304;  
 void __ISR(_TIMER_23_VECTOR, IPL7SRS) timerInt(void) {  
   // This first bit wants to activate at exactly the same time so we use computed gotos - a GCC feature  
   const static void *dispatchTable[] = {  
     && visible, && syncShortLong, && syncShortShort, && syncLongLong, && syncLongShort, && syncShortVisible  
   };  
   // We have woken up  
   register uint32_t actualTime = ReadTimer45();  
   register uint32_t cnst = *currentNextSyncType;  
   register uint32_t nextSleep;  
   goto *dispatchTable[cnst];  
   do {  
 visible:  
     {  
       SYNC_INACTIVE(); //(sim 25892, 31012, 36132)  
 #define START  259  
 #define STOP  ((START-FB_HEIGHT)+1)  
       if (line > START || line < STOP) {  
         // blank lines off screen  
         delay10XCycles(478);  
         NOP2();  
       } else {  
         // delay for back porch 8uS  
         delay10XCycles(59);  
         NOP8();  
         renderLine(address, palette);  
         address += FB_WIDTH / 2;  
         BLACK();  
         // Front porch  
         delay10XCycles(9);  
         NOP4();  
       }  
       SYNC_ACTIVE();  
       line--;  
       nextSleep = 320 + 4800;  
       if (line == 0) {  
         // delay for the 2us of the short sync after last visible line  
         delay10XCycles(13);  
         NOP8();  
         // toggle sync  
         SYNC_INACTIVE();  
         // set next sync type = [0]  
         currentNextSyncType = &syncSequence[0];  
         // account for the delay above and the next sleep  
         nextSleep = 2400 + 160;  
         line = 304;  
       }  
       // go back to sleep  
       break;  
     }  
 syncShortLong:  
     {  
       SYNC_ACTIVE();  
       // Logging was here  
       address = frameBuffer;  
       // set next sync type  
       currentNextSyncType++;  
       // We expect the global clock will be at this value next interrupt  
       nextSleep = 2400;  
       // go back to sleep  
       break;  
     }  
 syncShortShort:  
     {  
       SYNC_ACTIVE();  
       // delay for the 2us  
       delay10XCycles(15);  
       NOP3();  
       // toggle sync  
       SYNC_INACTIVE();  
       // set next sync type  
       currentNextSyncType++;  
       // account for the delay above and the next sleep  
       nextSleep = 2400 + 160;  
       // go back to sleep  
       break;  
     }  
 syncLongLong:  
     {  
       SYNC_INACTIVE();  
       // delay for the 2us  
       delay10XCycles(15);  
       NOP2();  
       // toggle sync  
       SYNC_ACTIVE();  
       // set next sync type  
       currentNextSyncType++;  
       // account for the delay above and the next sleep  
       nextSleep = 2400 + 160;  
       // go back to sleep  
       break;  
     }  
 syncLongShort:  
     {  
       SYNC_INACTIVE();  
       // extra bit for long to short transitions  
       // delay for the 2us  
       delay10XCycles(15);  
       NOP2();  
       // toggle sync  
       SYNC_ACTIVE();  
       // delay for the 2us  
       delay10XCycles(15);  
       NOP2();  
       // toggle sync  
       SYNC_INACTIVE();  
       // set next sync type  
       currentNextSyncType++;  
       // account for the delay above and the next sleep  
       nextSleep = 2400 + 320;  
       // go back to sleep  
       break;  
     }  
 syncShortVisible:  
     {  
       SYNC_ACTIVE();  
       currentNextSyncType++;  
       // account for the delay above  
       nextSleep = 320;  
       // go back to sleep  
       break;  
     }  
   } while (0);  
   PR2 = nextSleep-1;  
   mT23ClearIntFlag();  
 }  

This implementation has a couple of interesting features. Firstly, because it uses a timer interrupt to wake up a couple of times per line, the timing is pretty sharp - despite using NOPs to pad events out. Next, I'm using a computed goto. This allows me to wake up and jump to the correct state handler in the same number of cycles every time. A switch or if-else block doesn't guarantee this property, as the compiler can end up testing each case in turn. Finally, the actual rendering function renderLine() is in a separate assembler file. I hand rolled the assembly for this to achieve the 8MHz pixel clock. This all worked after about a billion iterations of tweaking the timing and produced the desired display on my TV. There was noise all over the picture though, so I bit the bullet and made a PCB with an AD724 on it. I won't cover the circuit because I stuck largely to the reference design in the datasheet.
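
For anyone who hasn't met computed gotos before, here is a stripped-down sketch of the technique on its own. It is not the ISR above: the function name dispatch(), the three states and the bounds check are made up for illustration, and it relies on the GCC "labels as values" extension (also supported by Clang), so it is not standard C.

```c
#include <stdint.h>

enum { STATE_A, STATE_B, STATE_C, STATE_COUNT };

/* Jump straight to the handler for `state` and return the number of
 * cycles to sleep. Every state takes the same path to its handler:
 * one table load and one indirect jump. */
uint32_t dispatch(uint32_t state)
{
    /* GCC extension: &&label yields the address of a label. */
    static void *const table[STATE_COUNT] = {
        &&state_a, &&state_b, &&state_c
    };
    uint32_t next_sleep;

    if (state >= STATE_COUNT)   /* guard added for the sketch only */
        return 0;

    goto *table[state];

state_a:
    next_sleep = 2400;          /* hypothetical handler bodies */
    goto done;
state_b:
    next_sleep = 2400 + 160;
    goto done;
state_c:
    next_sleep = 320;
    goto done;

done:
    return next_sleep;
}
```

The point is that the cost of reaching a handler does not depend on which handler it is, which is what you want when the first thing each handler does is change the sync line.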

The colour space is IIBBGGRR.  I'll cover that in another post.  But for now it is 2 bits each for Red, Green, Blue and Intensity.
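
As a rough illustration, packing a colour into that byte might look like the sketch below. It assumes IIBBGGRR is read from the most significant bit down (intensity in bits 7-6, red in bits 1-0); the helper name pack_iibbggrr() is hypothetical and is not part of the project code.

```c
#include <stdint.h>

/* Pack four 2-bit components (0..3 each) into one IIBBGGRR byte.
 * Assumed layout: II = bits 7-6, BB = bits 5-4, GG = bits 3-2, RR = bits 1-0.
 * Out-of-range inputs are masked down to 2 bits. */
static inline uint8_t pack_iibbggrr(uint8_t i, uint8_t b, uint8_t g, uint8_t r)
{
    return (uint8_t)(((i & 3u) << 6) |
                     ((b & 3u) << 4) |
                     ((g & 3u) << 2) |
                      (r & 3u));
}
```

Under that assumption, full-intensity white is pack_iibbggrr(3, 3, 3, 3) = 0xFF and pure red with zero intensity bits is pack_iibbggrr(0, 0, 0, 3) = 0x03.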

Here are some example images that I was able to generate with this configuration:




Can you spot the issue in the first couple of lines of the checkerboard image above?  They are skewed to the right.  I assume this to be due to the cache in the CPU "warming up" to the drawing code.  I'm not 100% sure about that though.



The third image was a true colour image I converted to 16 colours. At 4 bits per pixel, two pixels fit in each byte, so a 416x234 framebuffer consumes only 416 x 234 / 2 = 48,672 bytes of RAM, which let me actually have one. I wrote the renderLine() function to unpack the pixels and look up a palette to get the right colour. Notice the colour of the sand? Hmm - not quite right.
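
To make the unpack-and-lookup step concrete, here is a plain C sketch of the idea. The real renderLine() is hand-written assembly that pushes pixels to the port with exact timing, so this is not that code: unpack_line() is a hypothetical name, it writes to a buffer rather than a port, and the choice of the high nibble as the left-hand pixel is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define FB_WIDTH   416
#define FB_HEIGHT  234
#define FB_BYTES   ((FB_WIDTH * FB_HEIGHT) / 2)   /* two 4-bit pixels per byte */

/* Expand one packed line of 4-bit palette indices into one colour byte
 * per pixel. `width` is in pixels and must be even.
 * Assumption: the high nibble holds the left-hand pixel of each pair. */
void unpack_line(const uint8_t *src, const uint8_t palette[16],
                 uint8_t *dst, size_t width)
{
    for (size_t x = 0; x < width; x += 2) {
        uint8_t pair = src[x / 2];
        dst[x]     = palette[pair >> 4];     /* left pixel  */
        dst[x + 1] = palette[pair & 0x0F];   /* right pixel */
    }
}
```

Each line consumes FB_WIDTH / 2 bytes of the framebuffer, which matches the address += FB_WIDTH / 2 step in the interrupt handler above.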

So if you got this far you might be wondering how this failed. Actually it didn't really fail. I just couldn't bring myself to weave a graphics engine amongst the CPU instructions in renderLine(). If you look at the Uzebox source, they cleverly weave sprite reading instructions amongst the gaps in the pixel pushing code. That is fine on an AVR, where the instruction rate is actually constant. When I tried this on the PIC32 I failed. Since the instruction rate is not constant (due to the cache, pipeline and shared buses mentioned above) you can NOT do this deterministically. Non-determinism is not normally a software engineer's friend, so this approach was dead to me.



Next stop: DMA.  To be continued.