So many ideas, so little time...: May 2011

Tuesday 31 May 2011

Video - Update 2

A little more progress made. I have routed the R2R DACs and the HSYNC and VSYNC signals to a 4x2 pin header. This will go out to a chassis-mounted VGA port.

I think I have mostly worked out how I'll coordinate between the two GPUs. I have introduced a D-Type Flip-flop to perform the role of a hardware mutex. When the GPU AVRs start up, they will read the value on their PORTD5 pin. The left-hand GPU's PORTD5 pin will be connected to the Q output of the Flip-flop, and PORTD5 on the other GPU will be connected to Q'. Whichever GPU reads a high value is going to render while the other performs the RAMDAC role. When the Flip-flop flips, the roles will reverse.

The RAMDAC GPU will be responsible for handing off to the other. The hand off process will look like this:

At the end of the video frame the RAMDAC GPU will drive the Flip-flop clock pin high causing:

The Flip-flop will toggle.
The VSYNC will go high (or low depending on how I wire it)
The Serial ports will be connected to the new Render GPU.
The SRAM data line buffers will toggle their enabled state.
The SRAM address line buffers will toggle their enabled state.

The new RAMDAC will wait for the VSYNC period and then drive the Flip-flop clock pin low.
The new Render GPU will set appropriate pins to High-Z and then can get on with rendering.
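
The hand-off above can be modelled on a desktop machine before any of it is wired up. What follows is a minimal host-side C++ sketch, assuming the Flip-flop is wired so that each rising clock edge toggles Q (for example with Q' fed back to D, which is not stated above) and that the left GPU reads Q while the right GPU reads Q'. The names `Handoff`, `Role`, `left_role_after` and `roles_always_differ` are made up for illustration, and the power-up state chosen here is arbitrary.

```cpp
// Hypothetical model of the D-type flip-flop used as a hardware mutex.
enum class Role { Render, Ramdac };

struct Handoff {
    bool q = true;  // Q output; Q' is simply !q (power-up state is arbitrary here)

    // Assumption: Q' is fed back to D, so each rising clock edge toggles Q.
    constexpr void clock_rising_edge() { q = !q; }

    // A GPU that reads a high level on PORTD5 renders; the other scans out.
    constexpr Role left_role() const { return q ? Role::Render : Role::Ramdac; }
    constexpr Role right_role() const { return !q ? Role::Render : Role::Ramdac; }
};

// Role of the left GPU after a given number of completed frames.
constexpr Role left_role_after(unsigned frames) {
    Handoff h{};
    for (unsigned i = 0; i < frames; ++i) h.clock_rising_edge();
    return h.left_role();
}

// Checks that the two GPUs never hold the same role across the given frames.
constexpr bool roles_always_differ(unsigned frames) {
    Handoff h{};
    for (unsigned i = 0; i <= frames; ++i) {
        if (h.left_role() == h.right_role()) return false;
        h.clock_rising_edge();
    }
    return true;
}
```

Because both roles derive from the single bit Q, the two GPUs cannot end up in the same mode in this model, which is the property the mutex is there to provide.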

Things left to do on the video board:

Ground plane
Decoupling capacitors

Saturday 28 May 2011

Video - Update

I have completed a bit more of the routing work on the Video board. I still need to add decoupling capacitors and fix up a few other things. I don't have a ground plane yet; that's next, once I figure out how to do it properly in Eagle.

I changed plan slightly with the clear screen counter. In the end I chose an 8-bit counter. That means I can clear a contiguous run of 256 pixels without intervention from the AVR. There is also a jumper selectable 2x or 4x clock multiplier on the counter. Each AVR will need to increment address bits [8..16] on its own. That should be fine at 4x since the address will only increment every 64 AVR cycles; that is plenty of time to increment a variable and load a port.
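
The split between the counter and the AVR can be written out as arithmetic. The following is a rough host-side C++ sketch, assuming a 17-bit page address whose low 8 bits come from the counter and whose upper 9 bits, [8..16], come from the AVR, and assuming the counter writes one pixel per multiplied clock cycle. The function names are invented for illustration.

```cpp
#include <cstdint>

// The 8-bit counter supplies address bits [0..7], so one run is 256 pixels.
constexpr std::uint32_t kPixelsPerRun = 256;

// AVR clock cycles that pass while the counter clears one 256-pixel run,
// given the jumper-selected clock multiplier (2 or 4).
constexpr std::uint32_t avr_cycles_per_run(std::uint32_t multiplier) {
    return kPixelsPerRun / multiplier;
}

// Full address seen by the SRAM: the AVR drives bits [8..16] and the
// counter drives bits [0..7].
constexpr std::uint32_t sram_address(std::uint32_t avr_high_bits, std::uint32_t counter) {
    return ((avr_high_bits & 0x1FF) << 8) | (counter & 0xFF);
}

// Number of 256-pixel runs needed to clear a page of the given size.
constexpr std::uint32_t runs_per_page(std::uint32_t page_bytes) {
    return (page_bytes + kPixelsPerRun - 1) / kPixelsPerRun;
}
```

With the 4x multiplier, `avr_cycles_per_run(4)` gives the 64 AVR cycles mentioned above; the 2x setting would give 128.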

Tuesday 17 May 2011

VMS - Getting somewhere

I have thought a lot about the Video board since the last post. Building a switch for this kind of thing out of discrete logic chips is not practical. I'm limited in my board size in Eagle. I'm limited in my ability to route the thing anyway. I have an alternative design in mind. Instead of having a separate GPU and RAMDAC, I'll just have two GPUs which can also act as RAMDACs:

This removes the switching logic and now each GPU has a dedicated SRAM bank. Each GPU will be either in Render Mode or Scan-out Mode. They will never be in the same mode. A third AVR controls the two GPUs and tells them what mode to be in. This coordinator chip also controls which GPU is "writing" to the R2R DAC.

I have also included a clock multiplier and 24-bit counter which will be able to quickly clear the framebuffer. This will free up the GPU for useful drawing instead of spending 80%-90% of the time clearing the screen. A clever GPU firmware might be able to sync on this and use the counter for drawing pixel spans of solid colours - yum.

Here is the routing so far...

Sunday 8 May 2011

VMS - Back to the drawing board

Here is the result of 20 minutes work in Eagle on the VMS board. I have placed the 14 74HC157 ICs required to form the switching logic. I have also placed a header for connection to the GPU IC. I still have to place a header for each bank of RAM and a header for the RAMDAC IC.

It doesn't take a Rocket Scientist to work out that this is not going to work. I'm already at the full board size that the free version of Eagle supports and I don't have any room for routing traces.

I need to re-think this. My options are:

Abandon the switch and RAMDAC and have the GPU do it all.
Come up with a better switch.
Use dual-port SRAM.

The first option doesn't appeal to me greatly. If I lose the switch then I lose the ability to render and scan-out at the same time. This is a big deal. Based on the figures from here, active video accounts for approximately 91% of a full second of video. That leaves only 9% of the AVR CPU time to actually render. My previous design allowed rendering 100% of the time.
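
As a rough cross-check of that 91% figure, the share of scanlines left over for rendering can be estimated as below. This assumes the figure comes from standard 640x480 at 60 Hz VGA timing (525 total lines per frame, 480 of them visible); those constants are an assumption and are not taken from the linked page.

```cpp
// Assumed 640x480@60Hz VGA vertical timing.
constexpr int kTotalLines = 525;
constexpr int kVisibleLines = 480;

// Percentage of scanlines spent on active video, rounded to nearest.
constexpr int active_percent() {
    return (kVisibleLines * 100 + kTotalLines / 2) / kTotalLines;
}

// Percentage of scanlines left over (vertical blanking) for rendering.
constexpr int render_percent() {
    return 100 - active_percent();
}
```

Under those assumptions the split comes out at 91% active and 9% blanking, in line with the numbers quoted above.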

The second option could be solved with something like an FPGA or CPLD. I need to look into the cost of these. An initial look reveals this one has 120 GPIO pins and is $31. The other cool thing about an FPGA is that I could move the CLEAR SCREEN command to the FPGA and free up the GPU too.

The third option appears infeasible at this point. I can't find dual-port SRAM at a good price. That said, I could de-solder some from an old video card. Hmmm, might look into this. This is the ultimate because it means I can put the GPU and RAMDAC on opposite sides of a single RAM IC. This is likely to be a dead end though, as I'm pretty sure almost all video cards use a derivative of DRAM, and I'm not getting into that.

Saturday 7 May 2011

VMS - Video Memory Switcher

I drew out the VMS below:

Goodness me! This looks like a PCB routing nightmare.

Friday 6 May 2011

Block Diagram

A basic architecture is emerging for this machine. This is basically what I want. The CPU can address any of the following devices:

GPU - Graphics Processing Unit
APU - Audio Processing Unit
PPU - Peripheral Processing Unit
RAM - A 512KBytes SRAM.

The interesting thing is that the CPU can't directly access the frame buffer. The reason I have done this is the result of the testing I did in my previous post. It appears that the GPU is going to monopolise the bus just to clear the screen. I worked out that if the GPU was in fact a 16MHz ATMega328 (same as the Arduino Duemilanove), then just clearing the screen at 60 FPS would consume 97% of the CPU time and, as a consequence, a similar proportion of the bus bandwidth. Now, my GPU will run at 20MHz and I'll probably refresh at 50Hz, but the situation is still pretty dire.

So, the direction I'll take will be to separate these buses. The CPU will just write commands to the GPU over the Main Address/Data bus and the GPU will render to the VRAM. This means filling single pixels is going to be much slower but other operations like solid fills should be much better.

There is a potential problem here. I probably don't have enough pins on the GPU for two address and data buses. I'll need to work something out here. I have an idea involving overlapping address spaces and tri-state data buses. More to follow.

Finally, VMS stands for Video Memory Switcher and is a fancy cross-bar switch. I got the idea from another project: Lazarus-64

Thursday 5 May 2011

Ok, I have worked out the problem from earlier. The Arduino IDE uses the following compiler flags:

avr-g++ -c -g -Os -w -fno-exceptions -ffunction-sections -fdata-sections -mmcu=atmega328p -DF_CPU=16000000L -DARDUINO=21

This caused my loop to be optimised out!

I worked out how to build and upload outside the Arduino IDE and using an optimisation flag of -O0 I now get 39.6 seconds with the single pixel version. Writing 8 pixels per loop takes 3.9 seconds. 16 pixels is dispatched in 1.9 seconds and 32 pixels takes 0.97 seconds. At last now we have some real data! Here is a graph of those points.

We have certainly reached the point of diminishing returns at 32 pixels. We might be able to squeeze out a bit more by increasing the pixels per loop, but I tried 64 pixels and I think the Arduino ran out of flash. So if all we are doing is clearing the screen at 60 FPS, we have a 97% duty cycle. At 20 MHz this becomes 78%. That leaves some time for drawing other things.
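
The duty-cycle figures follow from scaling the measured time by the clock ratio. Here is a small host-side C++ sketch of that arithmetic, assuming run time scales inversely with clock frequency and nothing else changes. The constant and function names are made up for this example.

```cpp
// Measured time, in milliseconds, to do one second's worth of clears
// (60 frames) at 16 MHz with 32 pixels per loop iteration.
constexpr long kMeasuredMs = 970;
constexpr long kMeasuredMHz = 16;

// Percentage of each second spent clearing at another clock speed,
// rounded to the nearest percent.
// busy_ms = kMeasuredMs * kMeasuredMHz / target_mhz; duty = busy_ms / 1000 ms.
constexpr long clear_duty_percent(long target_mhz) {
    return (kMeasuredMs * kMeasuredMHz * 100 + (target_mhz * 1000) / 2)
           / (target_mhz * 1000);
}
```

`clear_duty_percent(16)` gives 97 and `clear_duty_percent(20)` gives 78, matching the figures quoted above.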

Further improvement could be made. For example, in most games we won't need to clear the entire screen if we track the dirty regions. Finally, we can always over-clock ;-)

Wednesday 4 May 2011

Video Modes

I have been thinking about video modes.

I want to be able to offer both 4:3 modes and 16:9 modes (since most TVs are widescreen now). So I did up a little spreadsheet that computes the Video RAM and pixel clock requirements given that I want the following features:

Double buffering (2 frame buffers in VRAM with page flipping).
Clear entire screen at 60 fps.
8 bits per pixel (3 bits Red, 3 bits Green and 2 bits Blue)
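
For illustration, packing a colour into that 3-3-2 format could look like the sketch below. The bit order is an assumption (red in the top three bits, then green, then blue in the bottom two, i.e. RRRGGGBB), and `pack_rgb332` is a hypothetical helper name.

```cpp
#include <cstdint>

// Pack an 8-bit-per-channel colour into one 3-3-2 byte.
// Assumed layout: RRRGGGBB (the order the bits reach the R2R DACs may differ).
constexpr std::uint8_t pack_rgb332(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint8_t>((r & 0xE0) | ((g & 0xE0) >> 3) | (b >> 6));
}
```

Only the top three bits of red and green and the top two bits of blue survive, which gives 8 x 8 x 4 = 256 distinct colours.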

Having 8 bits per pixel arranged in this way yields a palette like this:

This should look pretty tasty indeed. Here is the spreadsheet:

width	height	bytes/page	rounded page	bytes total	SRAM chip	total waste	address pins/page	total pins	pixel clock	mode
104	78	8,112	8,192	16,384	(16K x 8)	160	13	23	486,720	104 x 78		4:3 resolutions
144	108	15,552	16,384	32,768	(32K x 8)	1,664	14	24	933,120	144 x 108
208	156	32,448	32,768	65,536	(64K x 8)	640	15	25	1,946,880	208 x 156
288	216	62,208	65,536	131,072	(128K x 8)	6,656	16	26	3,732,480	288 x 216
320	240	76,800	131,072	262,144	(256K x 8)	108,544	17	27	4,608,000	320 x 240
360	270	97,200	131,072	262,144	(256K x 8)	67,744	17	27	5,832,000	360 x 270	1080 mode
416	312	129,792	131,072	262,144	(256K x 8)	2,560	17	27	7,787,520	416 x 312
480	360	172,800	262,144	524,288	(512K x 8)	178,688	18	28	10,368,000	480 x 360	720 mode
512	384	196,608	262,144	524,288	(512K x 8)	131,072	18	28	11,796,480	512 x 384	768 mode
584	438	255,792	262,144	524,288	(512K x 8)	12,704	18	28	15,347,520	584 x 438
720	540	388,800	524,288	1,048,576	(1024K x 8)	270,976	19	29	23,328,000	720 x 540	1080 mode
160	90	14,400	16,384	32,768	(32K x 8)	3,968	14	24	864,000	160 x 90		16:9 resolutions
224	126	28,224	32,768	65,536	(64K x 8)	9,088	15	25	1,693,440	224 x 126
320	180	57,600	65,536	131,072	(128K x 8)	15,872	16	26	3,456,000	320 x 180
480	270	129,600	131,072	262,144	(256K x 8)	2,944	17	27	7,776,000	480 x 270	1080 mode
640	360	230,400	262,144	524,288	(512K x 8)	63,488	18	28	13,824,000	640 x 360	720 mode
960	540	518,400	524,288	1,048,576	(1024K x 8)	11,776	19	29	31,104,000	960 x 540	1080 mode

I have computed the number of GPIO pins required to address one page of the framebuffer. For example, a video mode of 480x270 (widescreen) consumes 129,600 bytes per page. Rounding up to the nearest power of 2 and multiplying by 2 (for 2 pages) requires a 256 KByte VRAM. In order to clear the screen at 60 fps, I need to be able to write 7,776,000 bytes per second. Now that is a lot of bandwidth for a 20MHz 8-bit AVR.

Can it be done? Well, at that resolution I would need 17 address pins per page + 8 pins for the data + 1 pin for SRAM Write/Enable + 1 pin for the page select. So I'd need 27 GPIO pins in total. That rules out the ATMega328 (Arduino). An ATMega164A might do the trick (digikey.com.au) as it has 32 GPIO pins. So I have enough pins, but can I write to the memory fast enough to clear the screen?
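
The spreadsheet columns can be reproduced with a few lines of host-side C++. This is a sketch of the arithmetic only, assuming 1 byte per pixel, two pages, 60 fps and the pin breakdown given above (address pins + 8 data + 1 write enable + 1 page select). `ModeCost` and the function names are invented for this example.

```cpp
#include <cstdint>

// Smallest power of two that is >= n.
constexpr std::uint32_t round_up_pow2(std::uint32_t n) {
    std::uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Address pins needed to reach every byte of a power-of-two page.
constexpr std::uint32_t address_pins(std::uint32_t page) {
    std::uint32_t bits = 0;
    while ((1u << bits) < page) ++bits;
    return bits;
}

struct ModeCost {
    std::uint32_t bytes_per_page;    // width * height at 8 bits per pixel
    std::uint32_t rounded_page;      // page size rounded up to a power of two
    std::uint32_t vram_bytes;        // two pages for double buffering
    std::uint32_t total_pins;        // address + 8 data + write enable + page select
    std::uint32_t bytes_per_second;  // one full-page write per frame at 60 fps
};

constexpr ModeCost mode_cost(std::uint32_t width, std::uint32_t height) {
    const std::uint32_t bytes = width * height;
    const std::uint32_t page = round_up_pow2(bytes);
    return ModeCost{bytes, page, 2 * page, address_pins(page) + 8 + 1 + 1, bytes * 60};
}
```

For example, `mode_cost(480, 270)` yields 129,600 bytes per page, a 262,144-byte VRAM, 27 pins and 7,776,000 bytes per second, the same as the 480 x 270 row of the table.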

In order to test this I dug out my trusty Arduino and ran some tests. Here is some sample code:

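unsigned long time;  // elapsed time in milliseconds
unsigned long addr;  // simulated SRAM address; needs 32 bits to reach 7,776,000

void setup() {
  // Baud rate is an assumption; the original sketch's setup() is not shown.
  Serial.begin(9600);
}

void loop() {
  time = millis();
  addr = 0;
  do {
    // Stand-in for driving three address ports: low, middle and high byte.
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
  }
  while (addr < 7776000);
  time = millis() - time;
  Serial.print("Time: ");
  Serial.println(time);
}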

This is supposed to simulate walking through an 18-bit address space (17 bits per page + page select). The idea is that this is basically what is required to clear a 480x270 area of an SRAM chip 60 times. This runs in 11.2 seconds. Now the chip is running at 16MHz in this case, so on a 20MHz setup the time would be more like 8.9 seconds, but I'll stick to 16MHz for now.
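
To put the 11.2 seconds in perspective, it helps to convert it into CPU cycles per address and compare that with the cycle budget for a one-second run. The host-side C++ sketch below does that arithmetic, assuming run time scales inversely with clock speed. The function names are made up for illustration.

```cpp
// Addresses walked per second of video: 480 x 270 x 60.
constexpr long long kAddresses = 7776000;

// Average CPU cycles spent per address, rounded down.
constexpr long long cycles_per_address(long long elapsed_ms, long long clock_hz) {
    return elapsed_ms * (clock_hz / 1000) / kAddresses;
}

// Whole cycles available per address if the walk must fit in one second.
constexpr long long cycle_budget_per_address(long long clock_hz) {
    return clock_hz / kAddresses;
}

// Run time in milliseconds after moving to a different clock speed.
constexpr long long scaled_ms(long long elapsed_ms, long long from_mhz, long long to_mhz) {
    return elapsed_ms * from_mhz / to_mhz;
}
```

At 16MHz the measured loop works out to about 23 cycles per address, against a budget of only about 2, and `scaled_ms(11200, 16, 20)` gives 8,960 ms for the 20MHz estimate.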

11.2 seconds is far too slow. This needs to be less than 1 second in order to meet the constraint I set, and ideally much less in order to allow some time to draw some other shapes! There are some things I can do here. I could unroll the loop a bit. Given that I'm clearing the screen to one colour I can safely unroll it as much as I like. This is a speed/space trade-off though. Unrolling a loop uses more Flash, but my whole Arduino sketch is only 2990 bytes right now and the target chip has 16Kbytes of Flash, so I'm pretty safe. This is what the code looks like if I unroll the loop by a factor of 8:

void loop() {
  time = millis();
  addr = 0;
  do {
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
    PORTC = (byte)(addr);
    PORTC = (byte)(addr>>8);
    PORTC = (byte)(addr>>16);
    addr++;
  }
  while (addr < 7776000/8);
  time = millis() - time;
  Serial.print("Time: ");
  Serial.println(time);
}

40: sts 0x0000, r24
44: sts 0x0000, r25
48: ldi r18, 0x00
4a: ldi r19, 0x00
4c: ldi r20, 0x00
4e: ldi r21, 0x00
50: ldi r24, 0x01
52: ldi r25, 0x00
54: ldi r26, 0x00
56: ldi r27, 0x00
58: std Y+9, r24
5a: std Y+10, r25
5c: std Y+11, r26
5e: std Y+12, r27
60: ldi r24, 0x02
62: ldi r25, 0x00
64: ldi r26, 0x00
66: ldi r27, 0x00
68: std Y+5, r24
6a: std Y+6, r25
6c: std Y+7, r26
6e: std Y+8, r27
70: ldi r24, 0x03
72: ldi r25, 0x00
74: ldi r26, 0x00
76: ldi r27, 0x00
78: std Y+1, r24
7a: std Y+2, r25
7c: std Y+3, r26
7e: std Y+4, r27
80: ldi r16, 0x04
82: mov r2, r16
84: mov r3, r1
86: mov r4, r1
88: mov r5, r1
8a: ldi r17, 0x05
8c: mov r6, r17
8e: mov r7, r1
90: mov r8, r1
92: mov r9, r1
94: ldi r27, 0x06
96: mov r10, r27
98: mov r11, r1
9a: mov r12, r1
9c: mov r13, r1
9e: ldi r26, 0x07
a0: mov r14, r26
a2: mov r15, r1
a4: mov r16, r1
a6: mov r17, r1
a8: out 0x08, r18
aa: eor r27, r27
ac: mov r26, r21
ae: mov r25, r20
b0: mov r24, r19
b2: out 0x08, r24
b4: movw r24, r20
b6: eor r26, r26
b8: eor r27, r27
ba: out 0x08, r24
bc: ldd r25, Y+9
be: out 0x08, r25
c0: ldd r24, Y+9
c2: ldd r25, Y+10
c4: ldd r26, Y+11
c6: ldd r27, Y+12
c8: mov r24, r25
ca: mov r25, r26
cc: mov r26, r27
ce: eor r27, r27
d0: out 0x08, r24
d2: ldd r24, Y+9
d4: ldd r25, Y+10
d6: ldd r26, Y+11
d8: ldd r27, Y+12
da: movw r24, r26

dc: eor r26, r26
de: eor r27, r27
e0: out 0x08, r24
e2: ldd r25, Y+5
e4: out 0x08, r25
e6: ldd r24, Y+5
e8: ldd r25, Y+6
ea: ldd r26, Y+7
ec: ldd r27, Y+8
ee: mov r24, r25
f0: mov r25, r26
f2: mov r26, r27
f4: eor r27, r27
f6: out 0x08, r24
f8: ldd r24, Y+5
fa: ldd r25, Y+6
fc: ldd r26, Y+7
fe: ldd r27, Y+8
100: movw r24, r26
102: eor r26, r26
104: eor r27, r27
106: out 0x08, r24
108: ldd r25, Y+1
10a: out 0x08, r25
10c: ldd r24, Y+1
10e: ldd r25, Y+2
110: ldd r26, Y+3
112: ldd r27, Y+4
114: mov r24, r25
116: mov r25, r26
118: mov r26, r27
11a: eor r27, r27
11c: out 0x08, r24
11e: ldd r24, Y+1
120: ldd r25, Y+2
122: ldd r26, Y+3
124: ldd r27, Y+4
126: movw r24, r26
128: eor r26, r26
12a: eor r27, r27
12c: std Y+13, r24
12e: std Y+14, r25
130: std Y+15, r26
132: std Y+16, r27
134: out 0x08, r24
136: out 0x08, r2
138: eor r27, r27
13a: mov r26, r5
13c: mov r25, r4
13e: mov r24, r3
140: out 0x08, r24
142: movw r24, r4
144: eor r26, r26
146: eor r27, r27
148: out 0x08, r24
14a: out 0x08, r6
14c: eor r27, r27
14e: mov r26, r9
150: mov r25, r8
152: mov r24, r7
154: out 0x08, r24
156: movw r24, r8
158: eor r26, r26
15a: eor r27, r27
15c: out 0x08, r24
15e: out 0x08, r10
160: eor r27, r27
162: mov r26, r13
164: mov r25, r12
166: mov r24, r11
168: out 0x08, r24
16a: movw r24, r12
16c: eor r26, r26
16e: eor r27, r27
170: out 0x08, r24
172: out 0x08, r14

174: eor r27, r27
176: mov r26, r17
178: mov r25, r16
17a: mov r24, r15
17c: out 0x08, r24
17e: movw r24, r16
180: eor r26, r26
182: eor r27, r27
184: out 0x08, r24
186: subi r18, 0xF8
188: sbci r19, 0xFF
18a: sbci r20, 0xFF
18c: sbci r21, 0xFF
18e: ldd r24, Y+9
190: ldd r25, Y+10
192: ldd r26, Y+11
194: ldd r27, Y+12
196: adiw r24, 0x08
198: adc r26, r1
19a: adc r27, r1
19c: std Y+9, r24
19e: std Y+10, r25
1a0: std Y+11, r26
1a2: std Y+12, r27
1a4: ldd r24, Y+5
1a6: ldd r25, Y+6
1a8: ldd r26, Y+7
1aa: ldd r27, Y+8
1ac: adiw r24, 0x08
1ae: adc r26, r1
1b0: adc r27, r1
1b2: std Y+5, r24
1b4: std Y+6, r25
1b6: std Y+7, r26
1b8: std Y+8, r27
1ba: ldd r24, Y+1
1bc: ldd r25, Y+2
1be: ldd r26, Y+3
1c0: ldd r27, Y+4
1c2: adiw r24, 0x08
1c4: adc r26, r1
1c6: adc r27, r1
1c8: std Y+1, r24
1ca: std Y+2, r25
1cc: std Y+3, r26
1ce: std Y+4, r27
1d0: ldi r24, 0x08
1d2: ldi r25, 0x00
1d4: ldi r26, 0x00
1d6: ldi r27, 0x00
1d8: add r2, r24
1da: adc r3, r25
1dc: adc r4, r26
1de: adc r5, r27
1e0: add r6, r24
1e2: adc r7, r25
1e4: adc r8, r26
1e6: adc r9, r27
1e8: add r10, r24
1ea: adc r11, r25
1ec: adc r12, r26
1ee: adc r13, r27
1f0: add r14, r24
1f2: adc r15, r25
1f4: adc r16, r26
1f6: adc r17, r27
1f8: cpi r18, 0xE0
1fa: ldi r25, 0xD4
1fc: cpc r19, r25
1fe: ldi r25, 0x0E
200: cpc r20, r25
202: ldi r25, 0x00
204: cpc r21, r25
206: brcc .+0
208: rjmp .+0

The loop time is now 1.8 seconds. Wow. That really helped. I have included the assembly output above, where you can see the effect of the unrolling. This assembly covers the do loop only. Unrolling to 16 pixels per loop yields 545 milliseconds. Now we are in business.

I need to check these calculations. It seems unreal that an 8-bit micro can write 7.7 million bytes in 0.5 seconds. That is a bandwidth of 15.4 MBytes/second on a 16MHz part. Something must be wrong. I have certainly made a mistake somewhere...
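
One candidate for the mistake is visible in the unrolled listing itself: `addr` advances by 8 on every pass, yet the loop bound was also divided by 8, and the disassembly compares against 0x0ED4E0, which is 972,000. The host-side C++ sketch below mirrors that do/while shape to count how many addresses actually get visited. It models the loop structure only and says nothing about timing; `addresses_visited` is a name made up for this example.

```cpp
// Mirrors: do { /* 'step' writes, addr advanced 'step' times */ } while (addr < bound);
constexpr unsigned long addresses_visited(unsigned long bound, unsigned long step) {
    unsigned long addr = 0;
    do {
        addr += step;  // one unrolled pass touches 'step' consecutive addresses
    } while (addr < bound);
    return addr;
}
```

With the bound as written, `addresses_visited(7776000UL / 8, 8)` comes to 972,000, one eighth of the intended 7,776,000. If that reading is right, the unrolled version did an eighth of the work, which would explain much of the apparent speed-up.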