John Carmack, creator of Doom and Quake

Thursday, September 9th, 2004

This blog is dedicated to John Carmack, the creator of games like Doom and Quake. It consists of his .plan file updates starting from 1997, converted into blog format.

Machinima Music Video

Friday, February 7th, 2003

The machinima music video that Fountainhead Entertainment (my wife’s company)
produced with Quake based tools is available for viewing and voting on at:
http://www.mtv.com/music/viewers_pick/ (“In the waiting line”)

I thought they did an excellent job of catering to the strengths of the
medium, and not attempting to make a game engine compete (poorly) as a
general purpose renderer. In watching the video, I did beat myself up a
bit over the visible popping artifacts on the environment mapping, which are
a direct result of the normal vector quantization in the md3 format. While it
isn’t the same issue (normals are full floating point already in Doom), it was
the final factor that pushed me to do the per-pixel environment mapping for
the new cards in the current engine.
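
As a rough illustration of why quantized normals cause visible popping, the Python sketch below stores a normal as two 8-bit angles and decodes it again. The 8-bit latitude/longitude layout and all function names here are assumptions made for illustration, not necessarily the exact md3 encoding.

```python
import math

STEPS = 256  # assumption: 8 bits for each of the two angles

def encode_normal(n):
    """Quantize a unit normal to a pair of 8-bit angle codes."""
    x, y, z = n
    polar = math.acos(max(-1.0, min(1.0, z)))        # 0..pi
    azimuth = math.atan2(y, x) % (2.0 * math.pi)     # 0..2*pi
    scale = STEPS / (2.0 * math.pi)
    return (round(azimuth * scale) % STEPS, round(polar * scale) % STEPS)

def decode_normal(code):
    """Rebuild the unit normal that a pair of angle codes stands for."""
    azimuth = code[0] * 2.0 * math.pi / STEPS
    polar = code[1] * 2.0 * math.pi / STEPS
    return (math.cos(azimuth) * math.sin(polar),
            math.sin(azimuth) * math.sin(polar),
            math.cos(polar))

def angle_between_degrees(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))

def sweep_codes(start_deg, end_deg, samples):
    """Encode a normal that tilts smoothly from start_deg to end_deg."""
    codes = []
    for i in range(samples):
        tilt = math.radians(start_deg + (end_deg - start_deg) * i / (samples - 1))
        codes.append(encode_normal((math.sin(tilt), 0.0, math.cos(tilt))))
    return codes

# A smoothly tilting normal collapses onto a handful of stored values.
codes = sweep_codes(10.0, 20.0, 101)
print(len(set(codes)), "distinct encoded normals out of", len(codes))
```

With 256 steps per full turn, each step is about 1.4 degrees, so an environment map lookup driven by the decoded normal jumps from one value to the next instead of sliding smoothly as the surface turns.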

The neat thing about the machinima aspect of the video is that they also have
a little game you can play with the same media assets used to create the
video. Not sure when it will be made available publicly.

NV30 vs R300, current developments, etc

Wednesday, January 29th, 2003

At the moment, the NV30 is slightly faster on most scenes in Doom than the
R300, but I can still find some scenes where the R300 pulls a little bit
ahead. The issue is complicated because of the different ways the cards can
choose to run the game.

The R300 can run Doom in three different modes: ARB (minimum extensions, no
specular highlights, no vertex programs), R200 (full featured, almost always
single pass interaction rendering), ARB2 (floating point fragment shaders,
minor quality improvements, always single pass).

The NV30 can run Doom in five different modes: ARB, NV10 (full featured, five
rendering passes, no vertex programs), NV20 (full featured, two or three
rendering passes), NV30 (full featured, single pass), and ARB2.

The R200 path has a slight speed advantage over the ARB2 path on the R300, but
only by a small margin, so it defaults to using the ARB2 path for the quality
improvements. The NV30 runs the ARB2 path MUCH slower than the NV30 path.
Half the speed at the moment. This is unfortunate, because when you do an
exact, apples-to-apples comparison using exactly the same API, the R300 looks
twice as fast, but when you use the vendor-specific paths, the NV30 wins.
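
The path lists above can be modeled as a small lookup. This is a minimal Python sketch, not the engine's actual selection code: the function name, the `driver_paths` parameter, and the preference order are assumptions. Only the R300 default of ARB2 is stated in the text; the NV30 ordering is a guess based on the speed comparison.

```python
# Paths each chip can run, as listed in the text.
SUPPORTED_PATHS = {
    "R300": ("ARB", "R200", "ARB2"),
    "NV30": ("ARB", "NV10", "NV20", "NV30", "ARB2"),
}

# Hypothetical preference order, most preferred first.
PREFERRED_ORDER = {
    "R300": ("ARB2", "R200", "ARB"),
    "NV30": ("NV30", "ARB2", "NV20", "NV10", "ARB"),
}

def choose_render_path(chip, driver_paths=None):
    """Return the most preferred path that both the chip and the driver offer."""
    supported = set(SUPPORTED_PATHS[chip])
    if driver_paths is not None:
        # A driver may expose only some of the paths the chip could run.
        supported &= set(driver_paths)
    for path in PREFERRED_ORDER[chip]:
        if path in supported:
            return path
    raise ValueError("no usable render path for " + chip)

print(choose_render_path("R300"))
print(choose_render_path("NV30"))
print(choose_render_path("NV30", driver_paths={"ARB", "NV10"}))
```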

The reason for this is that ATI does everything at high precision all the
time, while Nvidia internally supports three different precisions with
different performances. To make it even more complicated, the exact
precision that ATI uses is in between the floating point precisions offered by
Nvidia, so when Nvidia runs fragment programs, they are at a higher precision
than ATI’s, which is some justification for the slower speed. Nvidia assures
me that there is a lot of room for improving the fragment program performance
with improved driver compiler technology.

The current NV30 cards do have some other disadvantages: They take up two
slots, and when the cooling fan fires up they are VERY LOUD. I’m not usually
one to care about fan noise, but the NV30 does annoy me.

I am using an NV30 in my primary work system now, largely so I can test more
of the rendering paths on one system, and because I feel Nvidia still has
somewhat better driver quality (ATI continues to improve, though). For a
typical consumer, I don’t think the decision is at all clear cut at the
moment.

For developers doing forward looking work, there is a different tradeoff –
the NV30 runs fragment programs much slower, but it has a huge maximum
instruction count. I have bumped into program limits on the R300 already.

As always, better cards are coming soon.

————-

Doom has dropped support for vendor-specific vertex programs
(NV_vertex_program and EXT_vertex_shader), in favor of using
ARB_vertex_program for all rendering paths. This has been a pleasant thing to
do, and both ATI and Nvidia supported the move. The standardization process
for ARB_vertex_program was pretty drawn out and arduous, but in the end, it is
a just-plain-better API than either of the vendor specific ones that it
replaced. I fretted for a while over whether I should leave in support for
the older APIs for broader driver compatibility, but the final decision was
that we are going to require a modern driver for the game to run in the
advanced modes. Older drivers can still fall back to either the ARB or NV10
paths.

The newly-ratified ARB_vertex_buffer_object extension will probably let me do
the same thing for NV_vertex_array_range and ATI_vertex_array_object.

Reasonable arguments can be made for and against the OpenGL or Direct-X style
of API evolution. With vendor extensions, you get immediate access to new
functionality, but then there is often a period of squabbling about exact
feature support from different vendors before an industry standard settles
down. With central planning, you can have “phasing problems” between
hardware and software releases, and there is a real danger of bad decisions
hampering the entire industry, but enforced commonality does make life easier
for developers. Trying to keep boneheaded-ideas-that-will-haunt-us-for-years
out of Direct-X is the primary reason I have been attending the Windows
Graphics Summit for the past three years, even though I still code for OpenGL.

The most significant functionality in the new crop of cards is the truly
flexible fragment programming, as exposed with ARB_fragment_program. Moving
from the “switches and dials” style of discrete functional graphics
programming to generally flexible programming with indirection and high
precision is what is going to enable the next major step in graphics engines.

It is going to require fairly deep, non-backwards-compatible modifications to
an engine to take real advantage of the new features, but working with
ARB_fragment_program is really a lot of fun, so I have added a few little
tweaks to the current codebase on the ARB2 path:

High dynamic color ranges are supported internally, rather than with
post-blending. This gives a few more bits of color precision in the final
image, but it isn’t something that you really notice.

Per-pixel environment mapping, rather than per-vertex. This fixes a pet-peeve
of mine, which is large panes of environment mapped glass that aren’t
tessellated enough, giving that awful warping-around-the-triangulation effect
as you move past them.

Light and view vectors normalized with math, rather than a cube map. On
future hardware this will likely be a performance improvement due to the
decrease in bandwidth, but current hardware has the computation and bandwidth
balanced such that it is pretty much a wash. What it does (in conjunction
with floating point math) give you is a perfectly smooth specular highlight,
instead of the pixelish blob that we get on older generations of cards.
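
The difference between the two normalization methods can be sketched numerically. The Python below models only one aspect of a normalization cube map, the 8-bit storage of each vector component, and ignores cube map resolution and filtering. The function names and the specular exponent are illustrative assumptions.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def quantize_rgb8(v):
    """Store a [-1, 1] vector in 8 bits per component and read it back."""
    return tuple(round((c * 0.5 + 0.5) * 255.0) / 255.0 * 2.0 - 1.0 for c in v)

def specular(n, h, exponent=32):
    n_dot_h = sum(a * b for a, b in zip(n, h))
    return max(0.0, n_dot_h) ** exponent

def highlight_levels(lookup, samples=201):
    """Count distinct highlight intensities as the half-angle tilts away."""
    normal = (0.0, 0.0, 1.0)
    levels = set()
    for i in range(samples):
        tilt = 0.2 * i / (samples - 1)
        h = lookup(normalize((tilt, 0.0, 1.0)))
        levels.add(specular(normal, h))
    return len(levels)

smooth_levels = highlight_levels(lambda v: v)      # normalized with math only
banded_levels = highlight_levels(quantize_rgb8)    # passed through 8-bit storage
print(smooth_levels, "levels with math,", banded_levels, "levels through 8-bit storage")
```

Across the brightest part of the highlight, the 8-bit version can only produce a few intensity steps, which is one plausible reading of the "pixelish blob" described above.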

There are some more things I am playing around with, that will probably remain
in the engine as novelties, but not supported features:

Per-pixel reflection vector calculations for specular, instead of an
interpolated half-angle. The only remaining effect that has any visual
dependency on the underlying geometry is the shape of the specular highlight.
Ideally, you want the same final image for a surface regardless of whether it
is two giant triangles, or a mesh of 1024 triangles. This will not be true if
any calculation done at a vertex involves anything other than linear math
operations. The specular half-angle calculation involves normalizations, so
the interpolation across triangles on a surface will be dependent on exactly
where the vertexes are located. The most visible end result of this is that
on large, flat, shiny surfaces where you expect a clean highlight circle
moving across it, you wind up with a highlight that distorts into an L shape
around the triangulation line.
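
A small numeric check of the triangulation dependence, written as a Python sketch. The scene (a flat surface facing +z, with the light and the eye both one unit above the origin) and the function names are made up for illustration. The interpolated half-angle is renormalized per pixel, as the math-normalized path would do.

```python
import math

def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def add(a, b):
    return tuple(p + q for p, q in zip(a, b))

def lerp(a, b, t):
    return tuple(p + (q - p) * t for p, q in zip(a, b))

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def half_angle(point, light, eye):
    to_light = normalize(sub(light, point))
    to_eye = normalize(sub(eye, point))
    return normalize(add(to_light, to_eye))

def n_dot_h_from_vertices(p0, p1, t, light, eye):
    """Half-angle computed at the two vertices, then interpolated."""
    h = normalize(lerp(half_angle(p0, light, eye), half_angle(p1, light, eye), t))
    return h[2]  # the surface normal is (0, 0, 1)

def n_dot_h_per_pixel(p0, p1, t, light, eye):
    """Half-angle computed from scratch at the shaded point."""
    return half_angle(lerp(p0, p1, t), light, eye)[2]

LIGHT = (0.0, 0.0, 1.0)
EYE = (0.0, 0.0, 1.0)

# The same surface point (the origin) reached on two different triangulations.
coarse = ((-1.0, 0.0, 0.0), (7.0, 0.0, 0.0), 1.0 / 8.0)
fine = ((-1.0, 0.0, 0.0), (0.5, 0.0, 0.0), 2.0 / 3.0)

for name, (p0, p1, t) in (("coarse", coarse), ("fine", fine)):
    print(name,
          round(n_dot_h_from_vertices(p0, p1, t, LIGHT, EYE), 3),
          round(n_dot_h_per_pixel(p0, p1, t, LIGHT, EYE), 3))
```

The per-pixel value at the origin is the same on both triangulations, while the interpolated value changes with where the vertices happen to sit.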

The extra instructions to implement this did have a noticeable performance
hit, and I was a little surprised to see that the highlights not only
stabilized in shape, but also sharpened up quite a bit, changing the scene
more than I expected. This probably isn’t a good tradeoff today for a gamer,
but it is nice for any kind of high-fidelity rendering.

Renormalization of surface normal map samples makes significant quality
improvements in magnified textures, turning tight, blurred corners into shiny,
smooth pockets, but it introduces a huge amount of aliasing on minimized
textures. Blending between the cases is possible with fragment programs, but
the performance overhead does start piling up, and it may require stashing
some information in the normal map alpha channel that varies with mip level.
Doing good filtering of a specularly lit normal map texture is a fairly
interesting problem, with lots of subtle issues.
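
To make the effect of renormalization concrete, here is a short Python sketch. It treats texture filtering as a plain average of unit normals, which is a simplification. The two sample normals, the exponent, and the function names are illustrative assumptions.

```python
import math

def length(v):
    return math.sqrt(sum(c * c for c in v))

def normalize(v):
    scale = 1.0 / length(v)
    return tuple(c * scale for c in v)

def filtered_sample(normals):
    """Average a set of unit normals, roughly as texture filtering would."""
    count = len(normals)
    return tuple(sum(n[i] for n in normals) / count for i in range(3))

def specular(n, h, exponent=16):
    return max(0.0, sum(a * b for a, b in zip(n, h))) ** exponent

tilt = math.radians(30.0)
left = (math.sin(tilt), 0.0, math.cos(tilt))
right = (-math.sin(tilt), 0.0, math.cos(tilt))
half_angle = (0.0, 0.0, 1.0)

raw = filtered_sample([left, right])   # shorter than unit length
fixed = normalize(raw)                 # renormalized before lighting

print("filtered length:", round(length(raw), 3))
print("specular, raw sample:", round(specular(raw, half_angle), 3))
print("specular, renormalized:", round(specular(fixed, half_angle), 3))
```

The filtered normal is shorter than unit length, so the raw highlight is strongly dimmed; renormalizing restores full intensity. Under magnification that gives the shinier result described above. Under minification the shortening acts as a kind of built-in damping, and removing it is one way to understand where the extra aliasing comes from.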

Bump mapped ambient lighting will give much better looking outdoor and
well-lit scenes. This only became possible with dependent texture reads, and
it requires new designer and tool-chain support to implement well, so it isn’t
easy to test globally with the current Doom datasets, but isolated demos are
promising.

The future is in floating point framebuffers. One of the most noticeable
things this will get you without fundamental algorithm changes is the ability
to use a correct display gamma ramp without destroying the dark color
precision. Unfortunately, using a floating point framebuffer on the current
generation of cards is pretty difficult, because no blending operations are
supported, and the primary thing we need to do is add light contributions
together in the framebuffer. The workaround is to copy the part of the
framebuffer you are going to reference to a texture, and have your fragment
program explicitly add that texture, instead of having the separate blend unit
do it. This is intrusive enough that I probably won’t hack up the current
codebase, instead playing around on a forked version.
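
The dark color precision problem can be shown with a little arithmetic. This Python sketch assumes a display gamma of 2.2 and an 8-bit output; both numbers, and the function names, are assumptions chosen for illustration.

```python
GAMMA = 2.2  # assumed display gamma

def gamma_ramp(linear):
    """Map a linear intensity in [0, 1] to a display value in [0, 1]."""
    return linear ** (1.0 / GAMMA)

def displayed_code(linear, bits=8):
    """Display code produced after applying the gamma ramp."""
    top = (1 << bits) - 1
    return round(gamma_ramp(linear) * top)

# 8-bit integer framebuffer: the darkest linear values it can hold.
integer_steps = [displayed_code(code / 255.0) for code in range(4)]

# Floating point framebuffer: dark values are not limited to multiples of 1/255.
float_steps = [displayed_code(i / 16.0 / 255.0) for i in range(17)]

print("8-bit framebuffer ->", integer_steps)
print("float framebuffer ->", float_steps)
```

With integer storage, the step from linear code 0 to linear code 1 skips over a large block of display codes, which shows up as banding in dark areas. A floating point framebuffer can hold the in-between linear values, so the ramp has something to map onto those codes.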

Floating point framebuffers and complex fragment shaders will also allow much
better volumetric effects, like volumetric illumination of fogged areas with
shadows and additive/subtractive eddy currents.

John Carmack

More graphics card notes:

Thursday, June 27th, 2002

I need to apologize to Matrox — their implementation of hardware displacement
mapping is NOT quad based. I was thinking about a certain other company’s
proposed approach. Matrox’s implementation actually looks quite good, so even
if we don’t use it because of the geometry amplification issues, I think it
will serve the noble purpose of killing dead any proposal to implement a quad
based solution.

I got a 3Dlabs P10 card in last week, and yesterday I put it through its
paces. Because my time is fairly overcommitted, first impressions often
determine how much work I devote to a given card. I didn’t speak to ATI for
months after they gave me a beta 8500 board last year with drivers that
rendered the console incorrectly. :-)

I was duly impressed when the P10 just popped right up with full functional
support for both the fallback ARB_ extension path (without specular
highlights), and the NV10 NVidia register combiners path. I only saw two
issues that were at all incorrect in any of our data, and one of them is
debatable. They don’t support NV_vertex_program_1_1, which I use for the NV20
path, and when I hacked my programs back to 1.0 support for testing, an
issue did show up, but still, this is the best showing from a new board from
any company other than Nvidia.

It is too early to tell what the performance is going to be like, because they
don’t yet support a vertex object extension, so the CPU is hand feeding all
the vertex data to the card at the moment. It was faster than I expected for
those circumstances.

Given the good first impression, I was willing to go ahead and write a new
back end that would let the card do the entire Doom interaction rendering in
a single pass. The most expedient sounding option was to just use the Nvidia
extensions that they implement, NV_vertex_program and NV_register_combiners,
with seven texture units instead of the four available on GF3/GF4. Instead, I
decided to try using the prototype OpenGL 2.0 extensions they provide.

The implementation went very smoothly, but I did run into the limits of their
current prototype compiler before the full feature set could be implemented.
I like it a lot. I am really looking forward to doing research work with this
programming model after the compiler matures a bit. While the shading
languages are the most critical aspects, and can be broken out as extensions
to current OpenGL, there are a lot of other subtle-but-important things that
are addressed in the full OpenGL 2.0 proposal.

I am now committed to supporting an OpenGL 2.0 renderer for Doom through all
the spec evolutions. If anything, I have been somewhat remiss in not pushing
the issues as hard as I could with all the vendors. Now really is the
critical time to start nailing things down, and the decisions may stay with
us for ten years.

A GL2 driver won’t give any theoretical advantage over the current back ends
optimized for cards with 7+ texture capability, but future research work will
almost certainly be moving away from the lower level coding practices, and if
some new vendor pops up (say, Rendition back from the dead) with a next-gen
card, I would strongly urge them to implement GL2 instead of proprietary
extensions.

I have not done a detailed comparison with Cg. There are a half dozen C-like
graphics languages floating around, and honestly, I don’t think there is a
hell of a lot of usability difference between them at the syntax level. They
are all a whole lot better than the current interfaces we are using, so I hope
syntax quibbles don’t get too religious. It won’t be too long before all real
work is done in one of these, and developers that stick with the lower level
interfaces will be regarded like people that write all-assembly PC
applications today. (I get some amusement from the all-assembly crowd, and it
can be impressive, but it is certainly not effective)

I do need to get up on a soapbox for a long discourse about why the upcoming
high level languages MUST NOT have fixed, queried resource limits if they are
going to reach their full potential. I will go into a lot of detail when I
get a chance, but drivers must have the right and responsibility to multipass
arbitrarily complex inputs to hardware with smaller limits. Get over it.
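
The idea of a driver multipassing a program that exceeds a hardware limit can be sketched in a few lines of Python. This is a deliberately simplified model: it assumes the program is a linear chain of operations on a single value, and the function names are invented. A real driver would also have to deal with live temporaries, texture fetches, and intermediate buffer precision.

```python
def split_into_passes(instructions, hardware_limit):
    """Cut a linear instruction list into chunks the hardware can run."""
    if hardware_limit < 1:
        raise ValueError("hardware_limit must be at least 1")
    return [instructions[i:i + hardware_limit]
            for i in range(0, len(instructions), hardware_limit)]

def run_multipass(instructions, hardware_limit, value):
    """Run every pass in order, carrying the result between passes."""
    passes = split_into_passes(instructions, hardware_limit)
    for chunk in passes:
        for op in chunk:
            value = op(value)
        # End of a pass: a real driver would write `value` to an
        # intermediate buffer here and read it back in the next pass.
    return value, len(passes)

# A toy "program" of ten instructions on hardware that accepts only four.
program = [lambda v, k=k: v * 0.5 + k for k in range(10)]
result, pass_count = run_multipass(program, 4, 1.0)
single_pass_result, _ = run_multipass(program, len(program), 1.0)
print(pass_count, "passes, same result:", result == single_pass_result)
```

The application only ever sees one program; how many passes it takes is the driver's business, which is the point being argued above.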

The Matrox Parhelia Report

Tuesday, June 25th, 2002

The executive summary is that the Parhelia will run Doom, but it is not
performance competitive with Nvidia or ATI.

Driver issues remain, so it is not perfect yet, but I am confident that Matrox
will resolve them.

The performance was really disappointing for the first 256 bit DDR card. I
tried to set up a “poster child” case that would stress the memory subsystem
above and beyond any driver or triangle level inefficiencies, but I was
unable to get it to ever approach the performance of a GF4.

The basic hardware support is good, with fragment flexibility better than GF4
(but not as good as ATI 8500), but it just doesn’t keep up in raw performance.
With a die shrink, this chip could probably be a contender, but there are
probably going to be other chips out by then that will completely eclipse
this generation of products.

None of the special features will be really useful for Doom:

The 10 bit color framebuffer is nice, but Doom needs more than 2 bits of
destination alpha when a card only has four texture units, so we can’t use it.

Anti aliasing features are nice, but it isn’t all that fast in minimum feature
mode, so nobody is going to be turning on AA. The same goes for “surround
gaming”. While the framerate wouldn’t be 1/3 the base, it would still
probably be cut in half.

Displacement mapping. Sigh. I am disappointed that the industry is still
pursuing any quad based approaches. Haven’t we learned from the stellar
success of 3DO, Saturn, and NV1 that quads really suck? In any case, we can’t
use any geometry amplification scheme (including ATI’s truform) in conjunction
with stencil shadow volumes.

Shadow Volume

Friday, March 15th, 2002

Mark Kilgard and Cass Everitt at Nvidia have released a paper on shadow volume
rendering with several interesting bits in it. They also include a small
document that I wrote a couple years ago about my discovery process during
the development of some of the early Doom technology.

http://developer.nvidia.com/view.asp?IO=robust_shadow_volumes

Nvidia vs. ATI

Monday, February 11th, 2002

Last month I wrote the Radeon 8500 support for Doom. The bottom line is that
it will be a fine card for the game, but the details are sort of interesting.

I had a pre-production board before Siggraph last year, and we were discussing
the possibility of letting ATI show a Doom demo behind closed doors on it. We
were all very busy at the time, but I took a shot at bringing up support over
a weekend. I hadn’t coded any of the support for the custom ATI extensions
yet, but I ran the game using only standard OpenGL calls (this is not a
supported path, because without bump mapping everything looks horrible) to see
how it would do. It didn’t even draw the console correctly, because they had
driver bugs with texGen. I thought the odds were very long against having all
the new, untested extensions working properly, so I pushed off working on it
until they had revved the drivers a few more times.

My judgment was colored by the experience of bringing up Doom on the original
Radeon card a year earlier, which involved chasing a lot of driver bugs. Note
that ATI was very responsive, working closely with me on it, and we were able
to get everything resolved, but I still had no expectation that things would
work correctly the first time.

Nvidia’s OpenGL drivers are my “gold standard”, and it has been quite a while
since I have had to report a problem to them, and even their brand new
extensions work as documented the first time I try them. When I have a
problem on an Nvidia, I assume that it is my fault. With anyone else’s
drivers, I assume it is their fault. This has turned out correct almost all
the time. I have heard more anecdotal reports of instability on some systems
with Nvidia drivers recently, but I track stability separately from
correctness, because it can be influenced by so many outside factors.

ATI had been patiently pestering me about support for a few months, so last
month I finally took another stab at it. The standard OpenGL path worked
flawlessly, so I set about taking advantage of all the 8500 specific features.
As expected, I did run into more driver bugs, but ATI got me fixes rapidly,
and we soon had everything working properly. It is interesting to contrast
the Nvidia and ATI functionality:

The vertex program extensions provide almost the same functionality. The ATI
hardware is a little bit more capable, but not in any way that I care about.
The ATI extension interface is massively more painful to use than the text
parsing interface from Nvidia. On the plus side, the ATI vertex programs are
invariant with the normal OpenGL vertex processing, which allowed me to reuse
a bunch of code. The Nvidia vertex programs can’t be used in multipass
algorithms with standard OpenGL passes, because they generate tiny differences
in depth values, forcing you to implement EVERYTHING with vertex programs.
Nvidia is planning on making this optional in the future, at a slight speed
cost.

I have mixed feelings about the vertex object / vertex array range extensions.
ATI’s extension seems more “right” in that it automatically handles
synchronization by default, and could be implemented as a wire protocol, but
there are advantages to the VAR extension being simply a hint. It is easy to
have a VAR program just fall back to normal virtual memory by not setting the
hint and using malloc, but ATI’s extension requires different function calls
for using vertex objects and normal vertex arrays.

The fragment level processing is clearly way better on the 8500 than on the
Nvidia products, including the latest GF4. You have six individual textures,
but you can access the textures twice, giving up to eleven possible texture
accesses in a single pass, and the dependent texture operation is much more
sensible. This wound up being a perfect fit for Doom, because the standard
path could be implemented with six unique textures, but required one texture
(a normalization cube map) to be accessed twice. The vast majority of Doom
light / surface interaction rendering will be a single pass on the 8500, in
contrast to two or three passes, depending on the number of color components
in a light, for GF3/GF4 (*note GF4 bitching later on).

Initial performance testing was interesting. I set up three extreme cases to
exercise different characteristics:

A test of the non-textured stencil shadow speed showed a GF3 about 20% faster
than the 8500. I believe that Nvidia has a slightly higher performance memory
architecture.

A test of light interaction speed initially had the 8500 significantly slower
than the GF3, which was shocking due to the difference in pass count. ATI
identified some driver issues, and the speed came around so that the 8500 was
faster in all combinations of texture attributes, in some cases 30+% more.
This was about what I expected, given the large savings in memory traffic by
doing everything in a single pass.

A high polygon count scene that was more representative of real game graphics
under heavy load gave a surprising result. I was expecting ATI to clobber
Nvidia here due to the much lower triangle count and MUCH lower state change
functional overhead from the single pass interaction rendering, but they came
out slower. ATI has identified an issue that is likely causing the unexpected
performance, but it may not be something that can be worked around on current
hardware.

I can set up scenes and parameters where either card can win, but I think that
current Nvidia cards are still a somewhat safer bet for consistent performance
and quality.

On the topic of current Nvidia cards:

Do not buy a GeForce4-MX for Doom.

Nvidia has really made a mess of the naming conventions here. I always
thought it was bad enough that GF2 was just a speed bumped GF1, while GF3 had
significant architectural improvements over GF2. I expected GF4 to be the
speed bumped GF3, but calling the NV17 GF4-MX really sucks.

GF4-MX will still run Doom properly, but it will be using the NV10 codepath
with only two texture units and no vertex shaders. A GF3 or 8500 will be
much better performers. The GF4-MX may still be the card of choice for many
people depending on pricing, especially considering that many games won’t use
four textures and vertex programs, but damn, I wish they had named it
something else.

As usual, there will be better cards available from both Nvidia and ATI by the
time we ship the game.

8:50 pm addendum: Mark Kilgard at Nvidia said that the current drivers already
support the vertex program option to be invariant with the fixed function path,
and that it turned out to be one instruction FASTER, not slower.

Quake 2 Source Code

Friday, December 21st, 2001

The Quake 2 source code is now available for download, licensed under the GPL.

ftp://ftp.idsoftware.com/idstuff/source/quake2.zip

As with previous source code releases, the game data remains under the
original copyright and license, and cannot be freely distributed. If you
create a true total conversion, you can give (or sell) a complete package
away, as long as you abide by the GPL source code license. If your projects
use the original Quake 2 media, the media must come from a normal, purchased
copy of the game.

I’m sure I will catch some flak about increased cheating after the source
release, but there are plenty of Q2 cheats already out there, so you are
already in the position of having to trust the other players to a degree. The
problem is really only solvable by relying on the community to police itself,
because it is a fundamentally unwinnable technical battle to make a completely
cheat proof game of this type. Play with your friends.

Driver Optimization

Friday, November 16th, 2001

Driver optimizations have been discussed a lot lately because of the quake3
name checking in ATI’s recent drivers, so I am going to lay out my
position on the subject.

There are many driver optimizations that are pure improvements in all cases,
with no negative effects. The difficult decisions come up when it comes to
“trades” of various kinds, where a change will give an increase in
performance, but at a cost.

Relative performance trades. Part of being a driver writer is being able to
say “I don’t care if stippled, anti-aliased points with texturing go slow”,
and optimizing accordingly. Some hardware features, like caches and
hierarchical buffers, may be advantages on some apps, and disadvantages on
others. Command buffer sizes often tune differently for different
applications.

Quality trades. There is a small amount of wiggle room in the specs for pixel
level variability, and some performance gains can be had by leaning towards
the minimums. Most quality trades would actually be conformance trades,
because the results are not exactly conformant, but they still do “roughly”
the right thing from a visual standpoint. Compressing textures automatically,
avoiding blending of very faint transparent pixels, using a 16 bit depth
buffer, etc. A good application will allow the user to make most of these
choices directly, but there is good call for having driver preference panels
to enable these types of changes on naive applications. Many drivers now
allow you to quality trade in an opposite manner — slowing application
performance by turning on anti-aliasing or anisotropic texture filtering.

Conformance trades. Most conformance trades that happen with drivers are
unintentional, where the slower, more general fallback case just didn’t get
called when it was supposed to, because the driver didn’t check for a certain
combination to exit some specially optimized path. However, there are
optimizations that can give performance improvements in ways that make it
impossible to remain conformant. For example, a driver could choose to skip
storing of a color value before it is passed on to the hardware, which would
save a few cycles, but make it impossible to correctly answer
glGetFloatv( GL_CURRENT_COLOR, buffer ).
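
The color-storing example can be modeled with two toy driver classes. This Python sketch is only an analogy: the class and method names are invented and do not correspond to any real driver interface. The one detail taken from OpenGL itself is the initial current color of (1, 1, 1, 1).

```python
class ConformantDriver:
    """Keeps a shadow copy of the current color so queries can be answered."""

    def __init__(self):
        self.current_color = (1.0, 1.0, 1.0, 1.0)  # OpenGL's initial value
        self.sent_to_hardware = []

    def color4f(self, r, g, b, a):
        self.current_color = (r, g, b, a)          # the few extra cycles
        self.sent_to_hardware.append((r, g, b, a))

    def get_current_color(self):
        return self.current_color


class ShortcutDriver:
    """Forwards the color straight to the hardware and keeps nothing."""

    def __init__(self):
        self.sent_to_hardware = []

    def color4f(self, r, g, b, a):
        self.sent_to_hardware.append((r, g, b, a))

    def get_current_color(self):
        return None  # nothing was stored, so there is no correct answer


def set_and_query(driver):
    driver.color4f(0.25, 0.5, 0.75, 1.0)
    return driver.get_current_color()


print(set_and_query(ConformantDriver()))
print(set_and_query(ShortcutDriver()))
```

Both drivers send identical data to the hardware, so rendering looks the same; the difference only shows up when an application asks for the state back.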

Normally, driver writers will just pick their priorities and make the trades,
but sometimes there will be a desire to make different trades in different
circumstances, so as to get the best of both worlds.

Explicit application hints are a nice way to offer different performance
characteristics, but that requires cooperation from the application, so it
doesn’t help in an ongoing benchmark battle. OpenGL’s glHint() call is the
right thought, but not really set up as flexibly as you would like. Explicit
extensions are probably the right way to expose performance trades, but it
isn’t clear to me that any conformant trade will be a big enough difference
to add code for.

End-user selectable optimizations. Put a selection option in the driver
properties window to allow the user to choose which application class they
would like to be favored in some way. This has been done many times, and is a
reasonable way to do things. Most users would never touch the setting, so
some applications may be slightly faster or slower than in their “optimal
benchmark mode”.

Attempt to guess the application from app names, window strings, etc. Drivers
are sometimes forced to do this to work around bugs in established software,
and occasionally they will try to use this as a cue for certain optimizations.

My positions:

Making any automatic optimization based on a benchmark name is wrong. It
subverts the purpose of benchmarking, which is to gauge how a similar class of
applications will perform on a tested configuration, not just how the single
application chosen as representative performs.

It is never acceptable to have the driver automatically make a conformance
tradeoff, even if they are positive that it won’t make any difference. The
reason is that applications evolve, and there is no guarantee that a future
release won’t have different assumptions, causing the upgrade to misbehave.
We have seen this in practice with Quake3 and derivatives, where vendors
assumed something about what may or may not be enabled during a compiled
vertex array call. Most of these are just mistakes, or, occasionally,
laziness.

Allowing a driver to present a non-conformant option for the user to select is
an interesting question. I know that as a developer, I would get hate mail
from users when a point release breaks on their whiz-bang optimized driver,
just like I do with overclocked CPUs, and I would get the same “but it works
with everything else!” response when I tell them to put it back to normal. On
the other hand, being able to tweak around with that sort of thing is fun for
technically inclined users. I lean towards frowning on it, because it is a
slippery slope from there down into “cheating drivers” of the
see-through-walls variety.

Quality trades are here to stay, with anti-aliasing, anisotropic texture
filtering, and other options being positive trades that a user can make, and
allowing various texture memory optimizations can be a very nice thing for a
user trying to get some games to work well. However, it is still important
that it start from a completely conformant state by default. This is one area
where application naming can be used reasonably by the driver, to maintain
user selected per-application modifiers.

I’m not fanatical on any of this, because the overriding purpose of software
is to be useful, rather than correct, but the days of game-specific mini-
drivers that can just barely cut it are past, and we should demand more from
the remaining vendors.

Also, excessive optimization is the cause of quite a bit of ill user
experience with computers. Byzantine code paths extract costs as long as they
exist, not just as they are written.

Doom 3 on a GeForce 3

Thursday, February 22nd, 2001

I just got back from Tokyo, where I demonstrated our new engine
running under MacOS-X with a GeForce 3 card. We had quite a bit of
discussion about whether we should be showing anything at all,
considering how far away we are from having a title on the shelves, so
we probably aren’t going to be showing it anywhere else for quite
a while.

We do run a bit better on a high end wintel system, but the Apple
performance is still quite good, especially considering the short amount
of time that the drivers had before the event.

It is still our intention to have a simultaneous release of the next
product on Windows, MacOS-X, and Linux.

Here is a dump on the GeForce 3 that I have been seriously working
with for a few weeks now:

The short answer is that the GeForce 3 is fantastic. I haven’t had such an
impression of raising the performance bar since the Voodoo 2 came out, and
there are a ton of new features for programmers to play with.

Graphics programmers should run out and get one at the earliest possible
time. For consumers, it will be a tougher call. There aren’t any
applications out right now that take proper advantage of it, but you should
still be quite a bit faster at everything than GF2, especially with
anti-aliasing. Balance that against whatever the price turns out to be.

While the Radeon is a good effort in many ways, it has enough shortfalls
that I still generally call the GeForce 2 Ultra the best card you can buy
right now, so Nvidia is basically dethroning their own product.

It is somewhat unfortunate that it is labeled GeForce 3, because GeForce
2 was just a speed bump of GeForce, while GF3 is a major architectural
change. I wish they had called the GF2 something else.

The things that are good about it:

Lots of values have additional internal precision, like texture coordinates
and rasterization coordinates. There are only a few places where this
matters, but it is nice to be cleaning up. Rasterization precision is about
the last thing that the multi-thousand dollar workstation boards still do
any better than the consumer cards.

Adding more texture units and more register combiners is an obvious
evolutionary step.

An interesting technical aside: when I first changed something I was
doing with five single or dual texture passes on a GF to something that
only took two quad texture passes on a GF3, I got a surprisingly modest
speedup. It turned out that texture filtering and bandwidth were the
dominant factors, not the frame buffer traffic that was saved with more
texture units. When I turned off anisotropic filtering and used
compressed textures, the GF3 version became twice as fast.

The 8x anisotropic filtering looks really nice, but it has a 30%+ speed
cost. For existing games where you have speed to burn, it is probably a
nice thing to force on, but it is a bit much for me to enable on the current
project. Radeon supports 16x aniso at a smaller speed cost, but not in
conjunction with trilinear, and something is broken in the chip that
makes the filtering jump around with triangular rasterization
dependencies.

The depth buffer optimizations are similar to what the Radeon provides,
giving almost everything some measure of speedup, and larger ones
available in some cases with some redesign.

3D textures are implemented with the full, complete generality. Radeon
offers 3D textures, but without mip mapping and in a non-orthogonal
manner (taking up two texture units).

Vertex programs are probably the most radical new feature, and, unlike
most “radical new features”, actually turn out to be pretty damn good.
The instruction language is clear and obvious, with wonderful features
like free arbitrary swizzle and negate on each operand, and the obvious
things you want for graphics like dot product instructions.
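
As a rough illustration of those operand modifiers, here is a small Python
emulation of a four-component dot product with a swizzle and negate applied
to each source. The function names (operand, dp4) and the sample registers
are hypothetical and only model the idea; this is not the actual vertex
program syntax, and it says nothing about how the hardware evaluates it.

```python
# Toy emulation of vertex-program-style operands (illustrative only).
# Registers are 4-component tuples (x, y, z, w). All names are hypothetical.

COMPONENTS = {"x": 0, "y": 1, "z": 2, "w": 3}

def operand(reg, swizzle="xyzw", negate=False):
    """Return reg with its components reordered and optionally negated."""
    if len(swizzle) != 4:
        raise ValueError("swizzle must name four components")
    sign = -1.0 if negate else 1.0
    return tuple(sign * reg[COMPONENTS[c]] for c in swizzle)

def dp4(a, b):
    """Four-component dot product."""
    return sum(x * y for x, y in zip(a, b))

position = (1.0, 2.0, 3.0, 1.0)
plane = (0.0, 0.0, 1.0, -5.0)

# Plain dot product: signed distance of the point from the plane z = 5.
distance = dp4(position, plane)

# Same operation with modifiers: negate one source, replicate .z on another.
negated = dp4(operand(position, negate=True), plane)
z_sum = dp4(operand(position, "zzzz"), (0.25, 0.25, 0.25, 0.25))
```

In this model the modifiers are just part of reading a source operand, so
reordering or flipping a vector needs no separate step before the dot
product.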

The vertex program instructions are what SSE should have been.

A complex setup for a four-texture rendering pass is way easier to
understand with a vertex program than with a ton of texgen/texture
matrix calls, and it lets you do things that you just couldn’t do hardware
accelerated at all before. Changing the model from fixed function data
like normals, colors, and texcoords to generalized attributes is very
important for future progress.

Here, I think Microsoft and DX8 are providing a very good benefit by
forcing a single vertex program interface down all the hardware
vendor’s throats.

This one is truly stunning: the drivers just worked for all the new
features that I tried. I have tested a lot of pre-production 3D cards, and it
has never been this smooth.

The things that are indifferent:

I’m still not a big believer in hardware accelerated curve tessellation.
I’m not going to go over all the reasons again, but I would have rather
seen the features left off and ended up with a cheaper part.

The shadow map support is good to get in, but I am still unconvinced
that a fully general engine can be produced with acceptable quality using
shadow maps for point lights. I spent a while working with shadow
buffers last year, and I couldn’t get satisfactory results. I will revisit
that work now that I have GeForce 3 cards, and directly compare it with my
current approach.

At high triangle rates, the index bandwidth can get to be a significant
thing. Other cards that allow static index buffers as well as static vertex
buffers will have situations where they provide higher application speed.
Still, we do get great throughput on the GF3 using vertex array range
and glDrawElements.
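
To get a feel for why index traffic can matter, a back-of-the-envelope
calculation helps. The triangle rates and index sizes below are assumptions
picked for illustration, not measurements of any card, and the helper name
is hypothetical.

```python
# Back-of-the-envelope index bandwidth for indexed triangle lists.
# All input figures are hypothetical, chosen only to show the arithmetic.

def index_bandwidth_mb_per_sec(triangles_per_sec, bytes_per_index=2):
    """Index data the application must send each second, in MB."""
    indices_per_triangle = 3  # independent triangles, no strips
    bytes_per_sec = triangles_per_sec * indices_per_triangle * bytes_per_index
    return bytes_per_sec / 1_000_000

for rate in (1_000_000, 10_000_000, 20_000_000):
    print(rate, "tris/sec ->", index_bandwidth_mb_per_sec(rate), "MB/sec")
```

Under these assumptions the index stream grows linearly with triangle rate,
which is the traffic a static index buffer would keep off the bus.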

The things that are bad about it:

Vertex programs aren’t invariant with the fixed function geometry paths.
That means that you can’t mix vertex program passes with normal
passes in a multipass algorithm. This is annoying, and shouldn’t have
happened.
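
The invariance problem comes down to floating point: two paths that are
mathematically the same transform can round differently, so the depth
values from one pass don't exactly match the other. The sketch below uses
ordinary Python floats and two made-up summation orders to show the effect;
it does not model how any particular chip evaluates either path.

```python
# Two mathematically identical ways to sum the terms of a transformed
# depth value. The term values and the orderings are hypothetical.

def depth_path_a(terms):
    """Sum left to right: (t0 + t1) + t2."""
    t0, t1, t2 = terms
    return (t0 + t1) + t2

def depth_path_b(terms):
    """Sum right to left: t0 + (t1 + t2)."""
    t0, t1, t2 = terms
    return t0 + (t1 + t2)

terms = (0.1, 0.2, 0.3)
z_a = depth_path_a(terms)
z_b = depth_path_b(terms)

# A later pass that relies on an exact depth match would fail here.
print(z_a, z_b, z_a == z_b)
```

The difference is only in the last bit, but a multipass algorithm that
depends on later passes landing on exactly the same depth values cannot
tolerate even that.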

Now we come to the pixel shaders, where I have the most serious issues.
I can just ignore this most of the time, but the way the pixel shader
functionality turned out is painfully limited, and not what it should have
been.

DX8 tries to pretend that pixel shaders live on hardware that is a lot
more general than the reality.

Nvidia’s OpenGL extensions expose things much more the way they
actually are: the existing register combiners functionality extended to
eight stages with a couple tweaks, and the texture lookup engine is
configurable to interact between textures in a list of specific ways.

I’m sure it started out as a better design, but it apparently got cut and cut
until it really looks like the old BumpEnvMap feature writ large: it does
a few specific special effects that were deemed important, at the expense
of a properly general solution.

Yes, it does full bumpy cubic environment mapping, but you still can’t
just do some math ops and look the result up in a texture. I was
disappointed on this count with the Radeon as well, which was just
slightly too hardwired to the DX BumpEnvMap capabilities to allow
more general dependent texture use.
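
As an illustration of what a general dependent texture read would allow,
the sketch below computes a value with arithmetic and then uses it as the
coordinate into a lookup table. The table contents (a specular power
curve), its size, and the function names are assumptions chosen for the
example, not a description of any shipping hardware or API.

```python
# Sketch of a general dependent texture read: arbitrary math produces a
# coordinate, which then indexes a 1D lookup texture. Names are hypothetical.

TABLE_SIZE = 256
SPECULAR_EXPONENT = 16

# 1D "texture" holding x ** exponent for x in [0, 1].
specular_table = [(i / (TABLE_SIZE - 1)) ** SPECULAR_EXPONENT
                  for i in range(TABLE_SIZE)]

def dot3(a, b):
    """Three-component dot product."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def sample_1d(table, coord):
    """Nearest-neighbor lookup with the coordinate clamped to [0, 1]."""
    coord = min(max(coord, 0.0), 1.0)
    return table[round(coord * (len(table) - 1))]

def specular(normal, half_vector):
    # Math op first (N . H), then the result is looked up in the table.
    return sample_1d(specular_table, dot3(normal, half_vector))
```

The point of the table is that the curve can be swapped for any other
function without changing the math that feeds it.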

Enshrining the capabilities of this mess in DX8 sucks. Other companies
had potentially better approaches, but they are now forced to dumb them
down to the level of the GF3 for the sake of compatibility. Hopefully
we can still see some of the extra flexibility in OpenGL extensions.

The future:

I think things are going to really clean up in the next couple years. All
of my advocacy is focused on making sure that there will be a
completely clean and flexible interface for me to target in the engine
after DOOM, and I think it is going to happen.

The market may have shrunk to just ATI and Nvidia as significant
players. Matrox, 3Dlabs, or one of the dormant companies may surprise
us all, but the pace is pretty frantic.

I think I would be a little more comfortable if there was a third major
player competing, but I can’t fault Nvidia’s path to success.