CODING TUTORIAL

# -------------------------------------------------------------------
# CODING TUTORIAL                      (c) Copyright 1996 Nat! & KKP
# -------------------------------------------------------------------
# These are some of the results/guesses that Klaus and Nat! found
# out about the Jaguar with a few helpful hints by other people, 
# who'd prefer to remain anonymous. 
#
# Since we are not under NDA or anything from Atari we feel free to 
# give this to you for educational purposes only.
#
# Please note, that this is not official documentation from Atari
# or derived work thereof (both of us have never seen the Atari docs)
# and Atari isn't connected with this in any way.
#
# Please use this informationphile as a starting point for your own
# exploration and not as a reference. If you find anything inaccurate,
# missing, needing more explanation etc. by all means please write
# to us:
#    nat@zumdick.ruhr.de
# or
#    kp@eegholm.dk
#
# If you could do us a small favor, don't use this information for
# those lame flamewars on r.g.v.a or the mailing list.
#
# HTML soon ?
# -------------------------------------------------------------------
# $Id: coding.html,v 1.9 1997/11/16 18:14:40 nat Exp $
# -------------------------------------------------------------------

Coding Tips:
=-=-=-=-=-=

Just a few hints that might be useful when thinking about Jaguar 
performance and coding in general...

How to organize your bitmaps
Using the GPU as a blitter cache
For hasardeurs only...
Keeping everybody happy
Cutting the object list



How to setup your bitmaps:
=-=-=-=-=-=-=-=-=-=-=-=-=

A conventional setup with your bitmaps in contiguous memory blocks
like this:

#define N_SCREENS 3
#define WIDTH     320L
#define HEIGHT    100L
#define pixtype   word

   pixtype  screen[ N_SCREENS][ HEIGHT][ WIDTH];  

   screen[ 0]    screen[ 1]    screen[ 2]
   0 0 0 0 ... 0 1 1 1 1 ... 1 2 2 2 2 ... 2

is not very good when you want to do copying blits, because source and
destination usually lie far apart in memory and you will get a lot of
page misses. It is better to organize your bitmaps in an interleaved
fashion like this:

   pixtype  screen[ HEIGHT][ WIDTH][ N_SCREENS];  

   screen[ 0][ 0][ 0]
     screen[ 0][ 0][ 1]
       screen[ 0][ 0][ 2]
   0 1 2 0 1 2 0 1 2 0 1 2 ... ... ... 0 1 2

Of course this helps to speed up your blits only in cases where the source 
and destination coordinates are in really close vertical (and, to a lesser 
degree, horizontal) proximity.
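
To make the difference concrete, here is a minimal sketch that computes
the byte offset of a pixel in both layouts. The helper names are made up
for illustration, and pixtype is assumed to be a 16-bit word; whether two
offsets actually share a DRAM page depends on the page size, which is not
covered here.

```c
#include <stdint.h>

#define N_SCREENS 3
#define WIDTH     320L
#define HEIGHT    100L

typedef uint16_t pixtype;   /* assumption: one 16-bit word per pixel */

/* Byte offset of pixel (x, y) of screen s in the contiguous layout
 * pixtype screen[N_SCREENS][HEIGHT][WIDTH]. */
static long contiguous_offset(long s, long y, long x)
{
    return ((s * HEIGHT + y) * WIDTH + x) * (long)sizeof(pixtype);
}

/* Byte offset of the same pixel in the interleaved layout
 * pixtype screen[HEIGHT][WIDTH][N_SCREENS]. */
static long interleaved_offset(long s, long y, long x)
{
    return ((y * WIDTH + x) * N_SCREENS + s) * (long)sizeof(pixtype);
}
```

For the same (x, y), screen 0 and screen 1 are 64000 bytes apart in the
contiguous layout but only 2 bytes apart in the interleaved one. The
price is that horizontally adjacent pixels of one screen are now
N_SCREENS words apart.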



THESE ARE UNVERIFIED MUSINGS, COULD VERY WELL BE COMPLETE BULLSHIT!

Using the GPU as a blitter cache:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

A problem with blitting is the high ratio of page misses you're gonna get.
(See: Blittermodes). 
An idea to get around these page misses might be to use the GPU local RAM 
as an intermediate buffer for a two stage blit.

Unfortunately, in theory a straight copying blit shouldn't win or lose
anything: the single blit transfer would cost you 6 cycles per pixel,
whereas the transfers to and from GPU RAM would cost you 3 cycles each.
(Total gain or loss == 0)

But if you do downscaling blits (pixelwise) into the GPU buffer and then write 
the data out in phrasemode, there's the possibility for a win (theoretically).
You would also "win" some cycles in downscaling pixelblits, because you can 
save superfluous destination writes.

OK, purely theoretically, you would expect for a 128x128 CRY block
scaled down to 64x64

   128 * 128 * 6 cycles for a normal scaling blit (98304)

with the GPU buffer it should be:

   128 * 128 * 3 cycles   RAM -> scale -> GPU     (49152)
   64 * 64 * 3 cycles     GPU -> RAM (pixelmode)  (12288)
or
   64 * 64 / 4 * 3 cycles GPU -> RAM (phrasemode) (3072)

for a total of 61440 cycles in pixelmode or 52224 cycles in phrasemode,
a saving of ca. 38% in pixelmode or ca. 47% in phrasemode.
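
The same back-of-the-envelope estimate as a small sketch, so you can
plug in other block sizes. The function names are hypothetical, and the
cycle costs (6 per pixel for a direct blit, 3 per transfer to or from GPU
RAM, 4 CRY pixels per phrase) are simply the unverified figures used
above.

```c
#define CYCLES_DIRECT     6L  /* per pixel, RAM -> RAM blit (figure from the text) */
#define CYCLES_LOCAL      3L  /* per transfer to or from GPU RAM (figure from the text) */
#define PIXELS_PER_PHRASE 4L  /* 16-bit CRY pixels in a 64-bit phrase */

/* Cycles for a normal scaling blit that visits every source pixel. */
static long direct_cycles(long src_w, long src_h)
{
    return src_w * src_h * CYCLES_DIRECT;
}

/* Cycles for the two stage blit: scale into GPU RAM, then write out
 * either pixel by pixel or one phrase at a time. */
static long staged_cycles(long src_w, long src_h,
                          long dst_w, long dst_h, int phrase_mode)
{
    long read_in = src_w * src_h * CYCLES_LOCAL;
    long writes  = dst_w * dst_h;

    if (phrase_mode)
        writes /= PIXELS_PER_PHRASE;
    return read_in + writes * CYCLES_LOCAL;
}

/* Saving of the staged blit relative to the direct blit, in percent. */
static double saving_percent(long direct, long staged)
{
    return 100.0 * (double)(direct - staged) / (double)direct;
}
```

With staged_cycles(128, 128, 64, 64, 1) this gives 52224 cycles against
98304 from direct_cycles(128, 128).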



For hasardeurs only...:
=-=-=-=-=-=-=-=-=-=-=-=

Looking for that extra little bit of performance? Well, you can get yourself
a little bit more bandwidth by turning off the refresh. Whoa, that's stupid,
you say, my RAM will forget everything it knows in a matter of seconds!

Well, not necessarily. A refresh does nothing more than select a row (RAS)
from time to time, which has the immediate effect that all cells in that
row are refreshed. 

But whenever you're reading from a row of your RAM, you're automatically 
refreshing this row as well, so a lot of the time that the Jaguar hardware
spends on the DRAM refresh is really wasted, because the 68K, the Blitter,
the RISCs or the OP (most likely the OP) have already read (and refreshed)
that row very recently. So if you're totally sure that all the rows holding
memory you care about are accessed sufficiently often, you might as well
turn the refresh off (and you need to do the refresh manually for those
rows that aren't touched sufficiently often).
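
A manual refresh of the rarely touched rows could be as simple as reading
one word from each of them every so often. The sketch below only shows
that idea: touch_rows is a made-up helper, and the row size is passed in
as a parameter because the real DRAM row size is an assumption you have
to settle yourself. How to switch the hardware refresh off is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one word from every row in [base, base + bytes), assuming that
 * a row spans row_bytes bytes (hypothetical value, must be even).
 * Returns the number of rows touched. */
static size_t touch_rows(const volatile uint16_t *base,
                         size_t bytes, size_t row_bytes)
{
    size_t rows = 0;
    size_t off;

    if (row_bytes == 0)
        return 0;
    for (off = 0; off < bytes; off += row_bytes) {
        (void)base[off / sizeof(uint16_t)];  /* volatile read selects the row */
        rows++;
    }
    return rows;
}
```

You would call this often enough (from a timer or the vertical interrupt,
say) on the memory ranges that nobody else reads regularly.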



Keeping everybody happy:
=-=-=-=-=-=-=-=-=-=-=-=

A trap that you may fall into (aside from limited memory bandwidth) is
that one of the processors hogs the bus too much, which will in turn 
lead to problems elsewhere. Lemme illustrate with some examples:

1.    Your 68000 is processing a nice big object list at every
      refresh. It is tied into the VC interrupt. After you're done,
      you set the 68000 priority back to normal by writing to INT2.

      BUT in the meantime, while the 68000 is processing the list, 
      the GPU will be almost idle whenever it is waiting for main
      memory accesses, since it will get very few (if any at all).
      
      If you lower the 68000 priority too soon, you might see your  
      screen go weird: the GPU might now be hogging the bus, so the
      68000 takes too long to finish the list and the OP is not happy
      with the list it's got.
      The GPU usually has higher priority than the 68000, except when
      the 68000 is processing an IRQ.

2.    The DSP is used for MPEG decoding. It is hitting the bus
      massively to load compressed data and to write back the decoded
      screen data. In this situation you will likely see your screen
      flash (or crash) because the 68000 just doesn't have a chance
      of doing the object list properly. The DSP always has higher 
      priority than the 68000.

3.    You're clearing a screen with the Blitter and, since you want
      it done as quickly as possible, you set BUSHI. Bad idea: the OP
      has lower priority than the Blitter with BUSHI, and you'll see
      some screen ugliness. Using BUSHI is a no-no anyway.
      (See: Dangerous priorities)

And the moral of the story: think before you code. Don't use the 68000.
Don't use BUSHI and DMA modes. The sermon is over...



Cutting down the object list:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Probably nothing is as tempting as using the OP for every little
object on the screen. Why painfully draw something into screen memory
(another object, BTW) when the hardware can do this for you?
But the problem is that every object costs something. Because the OP
traverses the object list on every scanline, a simple-minded object list
with 50 sprites and no branch objects would already use 12% of the
machine's power for processing the object list alone!
The two branch objects in front of the object list, which guard the
visible area of the screen, already save you some processing time: in 
our example they reduce the load from 12% to 9%. We could save a few
more percent if we sort the objects vertically, so that we can
construct a tree separating the screen into separate zones.
If we, for example, separate the screen into eight zones and assume
that the sprites are evenly distributed, we'd be using only about 2% of
the system!

The math:

        525 lines * 30 Hz (NTSC) = 15750 lines/s
        50 bitmaps with 2 phrases each + 1 stop object = 101 phrases/line
        1 cycle/phrase OP speed
        15750 lines/s * 101 phrases/line * 1 cycle/phrase = 1590750 cycles/s
        = 11.96% of 13.3 million cycles/s

Reduce the scanline count by introducing two branch objects to guard the 
visible area of 200 scanlines:

        200 lines * 60 Hz = 12000 lines/s

        12000 lines/s with all objects = 12000 * 101 = 1212000 cycles/s
        (15750 - 12000) lines/s with 3 objects = 3750 * 3 = 11250 cycles/s

        together = 1223250 cycles/s = 9.2% of 13.3 million cycles/s
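
The two estimates above, written out as a small sketch so you can try
other sprite counts. The function names are hypothetical; the cost model
(2 phrases per bitmap, 1 stop object, 1 cycle per phrase, 13.3 million
cycles/s) is the one assumed in the math above.

```c
#define PHRASES_PER_BITMAP 2L
#define CLOCK_HZ           13300000.0  /* 13.3 million cycles/s, as in the text */

/* Phrases the OP fetches on a line that sees the whole list. */
static long phrases_per_line(long bitmaps)
{
    return bitmaps * PHRASES_PER_BITMAP + 1;  /* + 1 stop object */
}

/* Cycles/s when every line walks the whole list. */
static long op_cycles_unguarded(long lines_per_s, long bitmaps)
{
    return lines_per_s * phrases_per_line(bitmaps);
}

/* Cycles/s when lines outside the visible area only see the guard
 * objects (guard_phrases phrases per such line). */
static long op_cycles_guarded(long lines_per_s, long visible_lines_per_s,
                              long bitmaps, long guard_phrases)
{
    long hidden = lines_per_s - visible_lines_per_s;

    return visible_lines_per_s * phrases_per_line(bitmaps)
         + hidden * guard_phrases;
}

/* Share of the system clock, in percent. */
static double op_load_percent(long cycles_per_s)
{
    return 100.0 * (double)cycles_per_s / CLOCK_HZ;
}
```

op_cycles_unguarded(15750, 50) yields 1590750 cycles/s and
op_cycles_guarded(15750, 12000, 50, 3) yields 1223250 cycles/s.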


Now introduce a tree structure of three levels. 
This is what the list would look like. Objects that spatially belong to 
two zones must appear in each zone (and of course need to be adjusted
by the object list builder).

In this example the screen is a 525-line NTSC screen, with a visible area
of 400 lines from 65 to 465 (or 320x200 for all practical purposes).
        
        BRANCH  < 65    -> STOP
        BRANCH  < 465   -> START
        STOP

START:
        BRANCH  >= 265  -> ZONES_4567
        BRANCH  >= 165  -> ZONES_23
        BRANCH  >= 115  -> ZONE_1

ZONE_0:
        BITMAP
        ...
        ...
        STOP

ZONE_1:
        BITMAP
        ...
        ...
        STOP
        
        

ZONES_23:
        BRANCH  >= 215  -> ZONE_3

ZONE_2:
        BITMAP  
        ...

etc.
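
For the object list builder, the question is which zones an object has
to be entered into. A minimal sketch, assuming eight equally high zones
over the visible area 65 to 465 (which matches the branch values 115,
165, 215 and 265 above); zones_for_object is a made-up helper, not part
of any library.

```c
#define VISIBLE_TOP    65    /* first visible line (from the example) */
#define VISIBLE_BOTTOM 465   /* first line below the visible area */
#define N_ZONES        8
#define ZONE_HEIGHT    ((VISIBLE_BOTTOM - VISIBLE_TOP) / N_ZONES)  /* 50 lines */

/* For an object covering lines [top, bottom), store the index of the
 * first zone it touches in *first and return the number of consecutive
 * zones it must be listed in. Returns 0 (and leaves *first alone) if
 * the object is completely outside the visible area. */
static int zones_for_object(int top, int bottom, int *first)
{
    int last;

    if (top < VISIBLE_TOP)
        top = VISIBLE_TOP;
    if (bottom > VISIBLE_BOTTOM)
        bottom = VISIBLE_BOTTOM;
    if (top >= bottom)
        return 0;

    *first = (top - VISIBLE_TOP) / ZONE_HEIGHT;
    last   = (bottom - 1 - VISIBLE_TOP) / ZONE_HEIGHT;
    return last - *first + 1;
}
```

An object from line 100 to 130 straddles the 115 boundary, so it has to
go into ZONE_0 and ZONE_1; one from 70 to 100 only goes into ZONE_0.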

Let's assume that we have 20% duplication of bitmap data across the
zones. The average pure object list time would then reduce to:

        50 + 50 * 1/5 = 60 bitmaps

        60 bitmaps / 8 zones = ca. 8 bitmaps per zone on average

        + 6 phrases overhead, because of branches:

        8 bitmaps * 2 phrases/bitmap + 6 overhead = 22 phrases per zone

        and therefore 22 cycles per visible line, yielding

        12000 lines/s * 22 cycles = 264000 cycles/s

        + 11250 cycles/s branch overhead in invisible lines

        = 275250 cycles/s = 2.07% of 13.3 million cycles/s
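
The zoned estimate can be recomputed straight from its inputs with a
sketch like the following. The names are hypothetical; the duplication
rate, the 6 phrases of branch overhead per visible line and the 3 phrases
per invisible line are the assumptions stated above, and the bitmap count
per zone is rounded up.

```c
/* Average number of bitmaps per zone, rounded up, when dup_percent of
 * the bitmaps have to be listed a second time. */
static long bitmaps_per_zone(long bitmaps, long dup_percent, long zones)
{
    long total = bitmaps + bitmaps * dup_percent / 100;

    return (total + zones - 1) / zones;
}

/* OP cycles/s with a zoned list: every visible line walks one zone
 * plus the branch overhead, every invisible line only the guards. */
static long zoned_cycles(long visible_lines_per_s, long hidden_lines_per_s,
                         long per_zone, long overhead_phrases,
                         long guard_phrases)
{
    long per_line = per_zone * 2 + overhead_phrases;  /* 2 phrases per bitmap */

    return visible_lines_per_s * per_line
         + hidden_lines_per_s * guard_phrases;
}
```

Dividing the result of zoned_cycles by 13.3 million gives the share of
the system that the OP spends on the list.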

Nat! (nat@zumdick.ruhr.de)

Klaus (kp@eegholm.dk)
