http://www.byte.com/abrash/chapters/gpbb20.pdf
Abrash, Michael
Pentium Rules
Graphics Programming Black Book
Coriolis Group Books, 1997
The journey of x86 superscalar optimization starts with hand-tuned assembly language for the two pipelines of the original Pentium. Michael Abrash walks through many of the steps that modern compilers probably already take when generating code suited to simultaneous execution across multiple CPU units. This chapter is a good introduction to the dependency difficulties encountered in superscalar optimization, with a few notes on resource conflicts involving cached memory. Though dated, it still highlights the first widely available superscalar processor, the Intel Pentium, and the foibles of obtaining maximum performance from it.
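The dependency difficulty that chapter introduces can be sketched with a toy model. The snippet below is an illustration only, not taken from Abrash's text and not the real Pentium pairing rules. It assumes a hypothetical two-pipe, in-order issue stage in which two adjacent instructions share a cycle only when the second neither reads nor overwrites the register the first one writes. The register names and the `count_cycles` helper are made up for the example.

```python
# Toy model of a two-pipe, in-order issue stage.
# ASSUMPTION: this is a simplification for illustration, not the actual
# Pentium U/V pairing rules. Each instruction is (destination, sources).

def count_cycles(instructions):
    """Count issue cycles when adjacent independent instructions may pair."""
    cycles = 0
    i = 0
    while i < len(instructions):
        cycles += 1
        if i + 1 < len(instructions):
            first_dst, _ = instructions[i]
            second_dst, second_srcs = instructions[i + 1]
            # The second instruction pairs only if it neither reads nor
            # overwrites the register written by the first.
            if first_dst not in second_srcs and first_dst != second_dst:
                i += 2  # both instructions issued in this cycle
                continue
        i += 1  # only the first instruction issued in this cycle
    return cycles


# Two independent two-step chains, written one chain after the other.
in_source_order = [
    ("a", ("a", "b")),  # a = a + b
    ("c", ("a",)),      # c = a + 1   (needs the new a)
    ("d", ("d", "e")),  # d = d + e
    ("f", ("d",)),      # f = d + 1   (needs the new d)
]

# The same four instructions, interleaved so neighbours are independent.
interleaved = [
    ("a", ("a", "b")),
    ("d", ("d", "e")),
    ("c", ("a",)),
    ("f", ("d",)),
]

print(count_cycles(in_source_order))  # 3
print(count_cycles(interleaved))      # 2
```

Reordering changes nothing about what is computed. It only moves dependent instructions apart so that the second pipe has work to do, which is the kind of manual scheduling the chapter describes.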
http://byte.com/art/9401/sec7/art3.htm
Ryan, Bob
M1 Challenges Pentium
Byte Magazine, Jan. 1994
Foreshadowing the ongoing wars between Intel and AMD, this Byte Magazine article predates them by almost 15 years. While superscalar optimization had seen great success in mainframes and high-performance computing, Cyrix's M1 offering proved that there was not a tremendous drive in the consumer market to improve substantially upon released products. The article presents the Cyrix M1 as a direct comparison to the Intel Pentium, contrasting it with, and rebutting, many particular points of the Pentium's architecture. In particular, Cyrix counted on the Pentium's reliance on CPU-specific compiled code being a weakness, and contended that increasing the size and number of pipelines in a CPU would improve efficiency on pre-compiled code. The Cyrix M1 made many correct predictions about the direction of x86 growth, including integrated CPU cache, the need for code to run well without recompiling, and branch prediction enhancements.
http://www.pattosoft.com.au/Articles/ModernMicroprocessors/
Patterson, Jason
Modern Microprocessors: A 90 Minute Guide
Sept. 2003
A comprehensive introduction to processors currently in use, as well as the hardware of the prior decade, this article covers nearly all of the bases regarding internal CPU architecture in an easy-to-digest, cumulative manner. By dealing only with theoretical units of instruction performance and comparison, the article allows an unbiased review of the superscalar and superpipelining enhancements that have been introduced into the x86 family over the years. It highlights several instruction difficulties that slow down superscalar execution, and also covers the still-ongoing debate between intelligent CPUs and intelligent compilers with respect to program efficiency.
http://citeseer.ist.psu.edu/wang94precise.html
Wang, Ko-Yang
Precise Compile-Time Performance Prediction for Superscalar-Based Computers
IBM T.J. Watson Research Center
1994
This research paper discusses an approach to benchmarking superscalar processors by simulating a compiler's optimizations and translating high-level code into atomic operations, each of which has defined limits on how much of the various system components it consumes. In addition, sections of code whose running time or branch probabilities are ambiguous can be treated as algebraic expressions to increase accuracy. Many of these techniques are the underpinnings of hardware-based superscalar architecture, so it is the most accurate way to rate these processors for efficiency.
http://citeseer.ist.psu.edu/tullsen95simultaneous.html
Tullsen, Dean M., Eggers, Susan J., Levy, Henry M.
Simultaneous Multithreading: Maximizing On-Chip Parallelism
Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995
This research paper tries to get ahead of the technology curve and predict the likely successor to superscalar processors. While the superscalar processor is a powerful tool for increasing throughput, it gains a large boost when combined with multithreading, an extremely flexible approach to inherently parallel code. By minimizing waste both in processor elements left idle and in processor cycles spent waiting on data dependencies, a 4-fold gain in efficiency is theoretically possible over typical multithreaded or superscalar processors alone. This technology is most recently visible in Intel's Xeon processor line, and in the IBM Cell processor powering the Sony PlayStation 3.
http://cdrom.amd.com/21860/18522f.pdf
Advanced Micro Devices
AMD-K5 Processor Data Sheet
Jan. 1997
Probably considered dry reading by some, the original data sheets for the AMD K5 processor are a wealth of information from one of the premier suppliers of x86 superscalar processors. When originally introduced, this chip family supported a very minimal superscalar core, limited to 2 ALUs, an FPU, a branch unit, and 2 load/store units. Even with this limited architecture, the inherent parallelism in pre-compiled code yielded a large performance increase compared to prior generations of x86 processors. Various pre-execution optimizations are applied to x86 code in order to convert it efficiently into RISC-like Operations (ROPs), which can then be easily distributed to the processor's work units. Well worth the read.
Here You Are
I don't care much for writing, nor for webpages. Please read the articles / papers / books linked in descending order for the full effect.