Forum archive
polytex p.asm
- hi, i've been working on the polytex sources for a while to make them c-only, so i can make it cross-platform and ultimately turn it into a running constant-z texture mapper example for reference.
so, i have a few questions:
* what was the purpose of the setup_m*line, m*line functions? i removed them and the engine is running fine!
* i tried to convert setup_dline and dline to c, but it didn't work. can anybody help me? ken?
Re: polytex p.asm
The 'm' stands for masked walls. The 'm' routines are exactly the same as the non-'m' routines, except they don't draw any pixels that have color 255. The included map, BOARDS.MAP, happens to have no masked walls.
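As a rough illustration of the masked versus non-masked distinction described above, here is a minimal C sketch. The function names draw_span and draw_span_masked are hypothetical and do not come from P.ASM; only the "skip color 255" rule is taken from the post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant; the post only says color 255 is not drawn. */
#define MASK_COLOR 255

/* Non-masked variant: every texel is written to the destination. */
void draw_span(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;
    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Masked variant: texels equal to MASK_COLOR leave the destination as-is. */
void draw_span_masked(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;
    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++)
        if (src[i] != MASK_COLOR)
            dst[i] = src[i];
}
```

The extra compare per pixel is the only difference, which fits the observation that removing the masked routines changes nothing on a map without masked walls.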
Sorry, I haven't much interest in revisiting old assembler code. You should know that constant-Z mapping is really a lousy algorithm. You still end up with nasty rendering artifacts - just they look different from the usual interpolation methods. Also, you lose speed whenever you render in a direction that isn't horizontal.
If you're going for maximum speed, then I would suggest using 2D bilinear interpolation, where you do 1 divide per 8x8 or 16x16 box. If you use a power of 2 for your box size, the inner loop can be reduced to simple ALU instructions without any multiplies or divides.
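The box interpolation described above could look roughly like the following C sketch. It assumes the four corner values of one texture coordinate were already computed with the perspective divide (neighboring boxes share corners, so this averages out to about one divide per box for each coordinate) and are stored in 16.16 fixed point. The names interp_box, BOX_SHIFT and BOX_SIZE are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define BOX_SHIFT 3                 /* 8x8 box; a power of 2, so no divides */
#define BOX_SIZE  (1 << BOX_SHIFT)

/* Fills an 8x8 grid with one texture coordinate (16.16 fixed point),
   bilinearly interpolated between the four corner values:
   u00 = top-left, u10 = top-right, u01 = bottom-left, u11 = bottom-right.
   Assumes >> on negative values is an arithmetic shift, as on common
   compilers. */
void interp_box(long u00, long u10, long u01, long u11,
                long out[BOX_SIZE][BOX_SIZE])
{
    long left = u00, right = u10;
    long dleft  = (u01 - u00) >> BOX_SHIFT;   /* step down the left edge  */
    long dright = (u11 - u10) >> BOX_SHIFT;   /* step down the right edge */
    int x, y;

    assert(out != NULL);
    for (y = 0; y < BOX_SIZE; y++) {
        long u  = left;
        long du = (right - left) >> BOX_SHIFT; /* step across this row */
        for (x = 0; x < BOX_SIZE; x++) {
            out[y][x] = u;                     /* inner loop: one add */
            u += du;
        }
        left  += dleft;
        right += dright;
    }
}
```

Calling it once for u and once for v gives the texel address for every pixel of the box, with only adds and shifts inside the loops.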
If you're going for accuracy, then you might want to consider hyperbolic/Bresenham texture mapping (I'm not sure about the name since there's not much written about it). This method gives a perfect mapping without any multiplies or divides in the inner loop. As you scan from left to right, you use a 'while' loop to determine whether u should increment (or perhaps decrement depending on the polygon orientation). You do the same thing for v. This method works best when the source texture resolution is low, and branches are fast on your CPU. My KUBE program uses hyperbolic mapping by default. Unfortunately, you're not going to find the code any easier to read than P.ASM.
- well, actually i have a 32-bit c-only software renderer engine that has flat/affine/subdiv mappers. i can switch mappers on the fly and see how the fps changes. i wanted to add all known texture mapping methods (i need to be done with polytex first) and see how well they perform side by side. constant-z was the last method i was aware of, but there was one more...
talking about old assembly code, it is quite interesting to decode really optimized asm code into c. but i reached my limit, i just can't figure out why my code doesn't run while looking quite the same. probably i'm missing some little but important detail. it would be nice if you could have a quick look at it, but no problem if not... i installed an old amd 386dx40 with 5mb ram to follow michael abrash's gpbb :)
anyway, thank you very much
- Doing a speed test in pure C code is only sufficient for comparing the artifacts of each method. To be fair, you really need to write each method in assembler and optimize each one as much as possible.
If you post your buggy dline() function, I would be willing to look through it and see if I can spot the problem.
- yes, speed testing in c isn't precise and is far from showing what can be achieved with assembly. i'm quite aware of it. but i'm only competent in c.
it is very kind of you to look at my c code and find the problem even though you're not interested. thank you very much.
talking about c code, what i understand from p.asm is that parameters are passed to the code by patching the code itself. like this:
mov dword ptr ds:[hadd+2], eax
...
hadd: add edi, dword ptr [esi+88888888h]
the 'mov' line actually replaces 88888888h with the value of eax.
so i wrote global variables like 'long hadd=0x88888888;' and set some of them in the setupdline() function. the remaining dline() code is below.
// #pragma aux dline parm [eax][ebx][ecx][edx][esi][edi];
// a=eax, b=ebx, c=ecx, d=edx, e=esi, f=edi
long dline(long a, long b, long c, long d, long e, long f)
{
long long ebp64, tmp64, e64;
short dpart;
if (c != 0)
goto startlongdline;
// shx3:
// d = 0; // edx assumed to start < 65536
rol (&a, shx3);
d &= 0xffff00ff; // mov dh, al
d |= (a & 0x000000ff) << 8;
// shy3:
rol (&b, shy3);
d &= 0xffffff00; // mov dl, bl
d |= b & 0x000000ff;
a = asm3;
// mach1b:
c &= 0xffffff00;
c |= *((char*)(d+mach1b)); // picoffs
a &= 0xffffff00;
a |= *((char*)(c+asm3)); // shadeoffs
*((char*)f) = a & 0x000000ff;
return a;
startlongdline:
e = d;
// shx1:
// d = 0; // edx assumed to start < 65536
rol (&a, shx1);
d |= (a & 0x000000ff) << 8;
a &= 0xffff0000;
// shy1:
rol (&b, shy1);
d |= a & 0x000000ff;
b &= 0xffff0000;
ebp64 = a;
c *= 5; // address
a = haddtable+c+e; // address
hadd = a; // patch the code
e = 0;
e -= c;
c = 0;
a = asm3;
mach6a = a;
a = asm1;
// shx2
rol (&a, shx2);
mach3a = a & 0x000000ff;
a &= 0xffff0000;
mach2a = a;
a = asm2;
// shy2
rol (&a, shy2);
mach5a = a & 0x000000ff;
a &= 0xffff0000;
mach4a = a;
goto begdline;
begdline:
// mach2a:
ebp64 += mach2a; // its carry will be used
// mach1a:
c &= 0xffffff00;
c |= *((char*)(d+mach1a)); // picoffs
// mach3a:
dpart = (d & 0x0000ff00) >> 8; // obtain dh
dpart += mach3a + (ebp64>>32);
d &= 0xffff00ff;
d |= (dpart << 8);
// mach4a:
tmp64 = b;
tmp64 += mach4a;
b = tmp64;
// mach5a:
dpart = (d & 0x000000ff); // obtain dl
dpart += mach5a + (tmp64>>32);
d &= 0xffffff00;
d |= dpart;
// mach6a:
a &= 0xffffff00; // mov al, ...
a |= *((char*)(c+mach6a)); // shadeoffs
*((char*)f) = a & 0x000000ff;
// mask1:
d &= mask1; // mask
// hadd
f += *((long*)(e + hadd)); // haddtable
e64 = e;
e64 += 4;
e = e64;
if (e64 >> 32) // jump not carry
goto begdline;
return a;
}
- * Change all of your variables to be unsigned.
* Change "d |= (a & 0x000000ff) << 8;" to "d = (d & 0xffff00ff) | ((a & 0xff) << 8);"
* Change "d |= a & 0x000000ff;" to "d = (d & 0xffffff00) | (b & 0xff);"
* Change "c *= 5;" to "c = c * 4 + 1;"
* Insert "ebp64 &= 0xffffffff;" immediately before "ebp64 += mach2a;"
* Change "if (e64 >> 32)" to "if (!(e64 >> 32))"
Here's a suggestion: you don't need 64-bit integers to simulate the carry flag. Here's an example of how it can work with 32-bit integers:
Assembler code:
add eax, ebx
adc ecx, edx
Equivalent C code:
unsigned long a, b, c, d;
a += b;
c += d + (a < b);
- well... i saw the errors and patched all but one: ebp64 is for detecting carry, so anding it with (2^32)-1 just clears the carry information.
but it still didn't work. i think there is some address miscalculation (or several). you can be sure that i'll pursue it. thank you very much again.
by the way, carry in c is a cool one ;)
- i think i spotted the problem. there are two problems actually.
the first is that i can change the video address from the dline caller function in polytex.c but not in the dline function in pasm.c. it simply crashes. as a solution i made some global variables and use them instead of dline() parameters. like this:
vidaddr = p;
dline(bx,by,x2-x1,x1<<2,p);
and in dline()
// mach1b:
c = (c & 0xffffff00) | *((unsigned char*)(d+mach1b)); // picoffs
a = (a & 0xffffff00) | *((unsigned char*)(c+asm3)); // shadeoffs
*(unsigned char*)vidaddr = a & 0x000000ff;
return a;
i used vidaddr instead of the passed p (f in dline) and the problem is gone.
polytex only compiles properly with register-based calling. i think that leads to this problem.
the second is that in my rol() function there is a check against shifting by more than 32 bits, but lots of calls hit it. removing the check fixes it.
- well... it's been quite some time. but i finally managed to convert the dline() asm code to c :)
my first intention was to convert polytex fully to c code and port it to the sdl library, but now i'm not sure i'll complete it. i simply don't have enough motivation and time. so i'm giving away the code. maybe somebody will find it useful.
the code package has openwatcom ide dos32 project files. it builds polytex.exe and copies it to the D:\DOS\polytex\ dir. you can remove the copy operation by right-clicking the project window, choosing Target Options/Execute After, and clearing its contents.
i made lots of changes, but the most important ones are removing the single-pixel special case from the dline asm function and converting it to c. it slowed down a lot and sky rendering is wrong, but i don't get the random crashes that appeared in the original version.
as i wrote, i'm not sure i'll complete the cross-platform version. but if there are any significant advances, i'll post them here too.
the license is Ken's original polytex license.
put dos4gw.exe in the run directory and execute polytex.exe.
- http://leventyavas.freehostia.com/polytex-20081022-1232.zip
- At last... cross-platform polytex! It runs on dos, win32, and macosx. It's a good example of a constant-z renderer. You can study, compile, and run it on virtually any platform.
Thank you very much Ken. I'm quite curious about software renderers and I'm very happy to have made this. And I have a question: Is there any difference between the cubes5 and polytex renderers?
Executables and data files:
leventyavas.freehostia.com/polytex-20081120.zip
Sources:
leventyavas.freehostia.com/polytex-src-20081120.zip
- "Is there any difference between cubes5 and polytex renderers?"
Yes. Cubes5 is a slightly updated version of a constant-z renderer. As it draws each diagonal line, it has 2 sets of U/V increments, depending on the last direction (horizontal or vertical) moved. Using different increments cleans up the rough sawtooth edges that are normally seen with constant-z rendering. When viewing polygons at sharp angles, the algo fails, in which case I switch to using the old method.
You know, it's possible to do perfect 6dof texture mapping without any divides or multiplies per pixel. I'll show you how to transform the inner loop of a brute-force texture mapper:
Note that I am calculating only for U to simplify things. 'iu' is the integer index that addresses the texture map. Now see if you can follow along:
iu = u/d; u += ui; d += di;
while (u/d > iu) { iu++; } u += ui; d += di;
while (u > iu*d) { iu++; } u += ui; d += di;
while (u > iud) { iu++; iud += d; } u += ui; d += di; iud += iu*di;
while (u > iud) { iu++; iudi += di; iud += d; } u += ui; d += di; iud += iudi;
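The last line of the derivation above can be wrapped into a small C routine. This is only a sketch under stated assumptions, not code from KUBE: the name hyperbolic_span and its parameters are invented here, u starts at zero or above, d stays positive over the whole span, and u/d never decreases, so only the increment case is handled. Because the loop condition is `u > iud`, the resulting iu is the smallest integer with iu*d >= u.

```c
#include <assert.h>
#include <stddef.h>

/* Writes one texel index per pixel into out[0..n-1] without any
   multiply or divide in the loop.
   Invariants kept at all times: iud == iu * d and iudi == iu * di. */
void hyperbolic_span(long u, long d, long ui, long di, int n, long *out)
{
    long iu = 0, iud = 0, iudi = 0;
    int x;

    assert(out != NULL && u >= 0 && d > 0);
    for (x = 0; x < n; x++) {
        /* Catch up: step iu until iu * d reaches u. */
        while (u > iud) {
            iu++;
            iudi += di;   /* iu grew by 1, so iu * di grows by di */
            iud  += d;    /* iu grew by 1, so iu * d  grows by d  */
        }
        out[x] = iu;

        /* Advance to the next pixel. */
        u   += ui;
        d   += di;
        iud += iudi;      /* d grew by di, so iu * d grows by iu * di */
    }
}
```

The first pixel pays for the initial catch-up from iu = 0; a real renderer would more likely seed iu with one divide per scanline, which matches the "one divide per scanline" remark later in the thread.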
If you're looking for a demo of this in action, check out KUBE.EXE. Note that the algo performs best when the texture is low resolution. Higher resolution requires more iterations of the while() loops.
- As I understand it, you use a while() loop to calculate the u,v texture coordinates instead of a divide! That's an incredibly crazy and fun idea ;D
I just plugged this algorithm into my engine and tried it. The engine allows changing renderers on the fly with a single key press. The algorithm is not correct at this stage and gives pretty bad images, but it should give some insight into how it will perform:
on 1.6ghz intel atom: one divide per 8 pixels got 135 fps, one divide per pixel got 52 fps, one while() per pixel got 102 fps, one while() per 8 pixels got 127 fps.
on amd athlon 2500+: one divide per 8 pixels got 155 fps, one while() per 8 pixels got 162 fps.
on intel p200mmx: one divide per 8 pixels got 10.8 fps, one while() per 8 pixels got 10.0 fps.
The scores are pretty close :) Because it has only one divide per scanline, I think it can easily become the fastest rendering method with loop unrolling and some assembly optimization.