Professor Cullen Lecture Notes

Computer Architecture and Assembly Language


Topic 1
Introduction to Computer Technology



Computing Machine Types

Special Purpose -- Machines that are designed for a single specialized task and are not easily re-programmable. Hardware usually must be modified to reprogram.

General Purpose -- These machines can be re-programmed for multiple disparate applications. Reprogramming requires no hardware modifications.







Programming Languages

Microcode or Firmware -- Programming code that instructs the hardware on how to carry out a very low-level task, such as opening a circuit gate, or ORing, ANDing, and negating current flows.

Machine -- Code that is understandable by the hardware. All information, which includes instructions, instruction operands and the data operated upon, is reduced to binary strings of ones and zeroes.

Assembly -- (such as IBM Basic Assembly and Intel x86 Assembly). Low-level programming language in which primitive mnemonic instruction codes are used to represent machine language. The mnemonics make the code "easier" to read and remember, versus tediously encoding binary patterns of ones and zeroes. Assembly code is still somewhat difficult for humans to read or write, but after translation via an “assembler” into machine language it runs extremely fast.

High-Level -- (such as C++, COBOL, Fortran, Python). Human-readable source code is translated via a compiler (or interpreter: BASH, REXX) into the assembly language of the target architecture, which is then further assembled into machine language of ones and zeroes.

Java -- Java is a "pivot" language, compiled into an architecture-neutral code (known as byte-codes). Java employs an interpreter (called a Java Virtual Machine) on each of the different machine instruction architectures. The byte-codes are translated at execution time into the machine language instructions that are unique to the physical hardware environment.







Translators

Assemblers translate symbolic mnemonic code programming (source code) to binary Machine Language (object code) suitable for execution on the architecture on which the program will run. Machine language consists of strings of ONES and ZEROES, whereas assembly code consists of a memorable instruction mnemonic (e.g., MOV) and operands that provide symbolic names and numbers referencing objects in memory.

Compilers are programs that translate human-readable source programming code into the assembly language of the machine on which a program will run. The first compiler was written by Grace Hopper (the A-0 system, which predates COBOL; her later work heavily influenced COBOL).

Interpreters translate human readable source programming into machine code at execution time. This is performed “on-the-fly” as each statement is read from the source code.

The CPU interprets the binary machine language code as instructions, then executes them.





Processes (post Translation)

Linkers, or Linkage-Editors, take all of the independently assembled machine language object programs and bind them together into a single executable program or load module. The linker resolves references in your object code to other programming (often written by others in the past). These object code pieces are stitched together to build a machine-readable object. The final product is sometimes called: an executable, a binary, or a load module.

The Loader locates the executable load module that resides as a file in auxiliary storage (hard drive, flash drive, etc.); it then allocates memory of sufficient size, loads the program into memory, and finally makes the program addressable by providing the caller with the entry point into its executable code.





Q: How was the first assembler written without an assembler to perform the assembly of the source code?







The Instruction

An instruction is the smallest fundamental unit of software programming. This is the interface between hardware and software. The instruction tells the hardware, via simple electronic signals and circuits, how to carry out some function (such as addition or division). Each instruction comes with a proof of correctness: given its defined inputs, it will produce a valid output.

Instructions form the basis for creating greater functional programs.

Anything implementing logic smaller than an instruction is electrical engineering.

Anything creating logic composed of instructions is software engineering.

Programs are strings of instructions that tell the computer to carry out some task.







Major Software Categories

Operating System (supervisory tasks)

Permitted by the hardware to execute any instruction of the full instruction set. This programming is said to run in privileged mode. In this mode no instruction is restricted by the computer architecture from being executed.

(Synonyms include: system, supervisory, privileged)



Applications (application tasks)

Permitted by the architecture hardware to execute only a subset of the instruction set. This subset consists of the non-privileged instructions. Since privileged instructions could adversely affect the operation of the Operating System (OS) and other application tasks, only the native OS is permitted by the hardware to execute privileged instructions.

The set of privileged and unprivileged instructions are defined by the architecture.

Another way of looking at non-privileged instructions: they are the instructions in the instruction set whose execution can have no adverse effect on other applications, the operating system, or the hardware. Note that the application can still have adverse effects on itself.

(Synonyms include: user, problem, unprivileged, application, app)







Architecture

"The truth always turns out to be simpler than you thought."
Richard P. Feynman

Architecture is the process of planning, designing, and structuring some artifact. The practice of architecture is to fulfill both practical and expressive requirements serving both utilitarian and aesthetic ends, while providing the greatest stable functionality with the least expense.

Computer architecture is the conceptual design of the machine organization that specifies the fundamental operational structure of a computing system. It describes the requirements and design implementation for both the hardware components and the native operating software to perform instruction execution, input/output and memory access.







Components of a Computer Architecture:

Central Processing Unit (CPU)

The CPU is where instructions are obtained from memory, validated for content, and executed; the results are then stored. In terms of computing, the CPU is the most important element of a computer.

Two typical components of a CPU are the following:


(1) The Arithmetic Logic Unit (ALU), which performs arithmetic and logical operations, such as addition, division, comparisons, and logical relationships (ANDing and ORing).


(2) The Control Unit (CU), which extracts instructions from real memory, then decodes and executes them, calling on the ALU when necessary.

Decoding and validating the bit strings consumes the greatest portion of the clock cycles used by the CPU.


All arithmetic and logical operations must occur in the CPU.





Registers

Special, high-speed storage areas packaged with the CPU. These are the storage areas physically closest to the CPU. Accessibility is virtually instantaneous.

All data must be represented in a register before it can be processed. For example, if two numbers are to be added then both values must reside in registers (or be addressed by a register), and the result is also placed in a register (or is pointed to by an address contained in a register).

("Pointed to " means that the register contains the address of a memory location where data resides, rather than containing the actual value of the data itself.)

The number of registers available to the CPU, as well as the size of each register (their length in bits), determines the addressability of the computer architecture.

For example, a 64-bit machine is one in which each register is 64 bits wide. Therefore, each instruction can accommodate 64 bits of data as operands; furthermore, it can address (reach) locations up to 2**64 bytes away. That addressability in decimal is about 18 quintillion.





Program Status Word (PSW) or Program Control Register (PC)

The PSW is a hardware register which contains information about the current state of processing. This register is relied upon by both the operating system and the underlying hardware. It contains the address of the next instruction to be fetched for processing by the CPU.

The PSW also contains the information required for proper program execution. Some of the many fields:

-- an error status field,
-- a condition code set by prior instruction execution,
-- an interrupt enable/disable masking bit,
-- and a supervisor/problem state mode bit.

In general, the PSW is used to control instruction sequencing and to indicate the status of the system in relation to the program currently being executed.

The active or controlling PSW is called the CURRENT PSW. The current PSW is represented by a hardware register. There are other PSWs located in fixed positions in real memory, one OLD PSW and one NEW PSW for each type of interruption possible. These PSWs are swapped into/from the CURRENT PSW upon a particular interruption event.

By storing (saving) the current PSW during an interruption, the status of the CPU can be preserved for subsequent inspection by the OS. By loading a NEW PSW (restoring) or modifying the contents of some part of a PSW, the state of the CPU can be initialized or changed.





Word Length and Data Alignment

8 bits constitute a byte. Each byte must begin on an 8th-bit boundary, that is bit 0, 8, 16, 24, etc. The byte is a conventional architectural boundary.

A Word of data, by convention, is usually the number of bits determining the architecture (16, 32, 64) aligned on a byte boundary whose address is divisible by the number of bytes in the Word.

A Half-Word is half the length of a Word, aligned along a byte boundary whose address is divisible by halfword length.

A Double-Word is twice the length of a word aligned along a byte boundary whose address is divisible by 2*Word_length.

An architectural "even" byte boundary is any location divisible by 2, such as 0, 2, 4, 6, 8. Early architectures required that all instructions originate only on an even boundary.

For simplicity, only bit location ZERO of each byte is checked for the start of instructions or data; bit locations 1 through 7 are not. Checking them would require a much more complex and expensive CPU with its associated logic.

Since all data other than single bits must begin on a byte boundary the CPU can therefore ignore most of the bit locations of every byte insofar as looking for an address or the start of a data field.

Should instructions begin on an "odd numbered" boundary location then extra work would have to be performed by the CPU.

The CPU loads the number of bytes that could be accommodated by its largest instruction format with each fetch.

On IBM systems, bytes containing instructions must be aligned on an “even numbered” boundary. This relieves the CPU from looking for instructions on half of all the possible addressable memory locations.





Addressability --

Each individually accessible unit (byte) of information in storage is selected using its numerical memory address (location). In modern computers, location-addressable storage is limited to the range of real storage.

The ability to reach a particular location in memory is limited by the numerical range that can be accommodated by a register.

How high can the CPU count and have the result fit in a register?

Examples:
16-bit registers allow access to 65,536 bytes (64 KiB).
32-bit registers expand that addressability to about 4.3 billion bytes (4 GiB).
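
These limits are easy to check; a quick Python sketch (illustrative only) computes the number of addressable bytes for each register width:

    # Number of addressable bytes for each register width
    for width in (16, 32, 64):
        print(f"{width}-bit register: {2**width:,} addressable bytes")

This prints 65,536 for 16 bits, 4,294,967,296 for 32 bits, and 18,446,744,073,709,551,616 for 64 bits.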



The greater the addressability, the more information (programming and data) can fit into a sufficiently large real memory.





Fixed Storage Locations --

Areas in computer memory where the hardware places status information regarding external and internal exception events, and where the storing and loading of program status words (PSWs) occurs when an interruption takes place.

The first 4096 bytes of real memory are known as the “Prefixed Storage Area” or “Page Zero”. The manufacturers of operating systems must always adhere to this interface in order that the hardware can correctly communicate with the operating system. The owner of Page Zero will be "recognized" by the hardware as the owner of the computer. The architecture recognizes the owner as the only program that can issue privileged instructions.

Note that it is the owning program (the operating system) that sets its initial PSW mode field to indicate that it executes in privileged mode. This is done at the initial program loading of the OS by the boot mechanism. It is also the OS manufacturer that determines the architectural privilege mode following each interruption, by any mechanism, simply by preparing each NEW PSW loaded at initial program loading time (booting).





Instruction Set

The instruction set comprises the operations that are the smallest programmed units that the CPU can interpret as the functional guidelines to accomplish some operation or manipulative task. It represents the abstract programming interface between the hardware and the lowest-level software. The instruction set encompasses all the information necessary to write a machine language program that will run correctly, and can be re-purposed for different tasks.

You can think of each instruction as a provably reliable function or procedure that is executed by the hardware. An instruction is composed of electrically engineered circuits that correctly carry out the function using the hardware.

The instruction set layer abstracts away (hides) the details of the function of the underlying hardware. This is the boundary between software engineering and electrical engineering.

Each instruction is documented by its manufacturer regarding use, inputs expected and outputs provided. Furthermore, each instruction is also documented in the ways in which it could fail to operate.

The software engineer need not be concerned with the details of the electrical engineering.









Program Performance Measures and Factors

CPU Time --

Supervisory or System consumption time of the CPU.
Time consumed by architecture on behalf of the operating system.
Also known as computing “overhead”.

Problem or User application use of CPU time.
Time consumed by the architecture executing applications or by the OS on behalf of the application.



Wall Clock Time --
This is a measure of elapsed real time.

I/O execution time vs. CPU execution time --
Time durations for moving data from place to place (known as I/O)
vs.
Time spent consuming CPU cycles in computation (Add, Multiply, Shift, etc.)





What are architectural ways to affect and improve performance?

Throw more, better and faster hardware at the problem.

Reduce thermal energy loss.

Improve the algorithm to more quickly arrive at a solution, by using fewer instructions.

Use faster instructions (those that use less time and power) which produce the same result as other instructions.

Reduce the size of the Instruction Set.

Identify and provide a means to run instructions in parallel.

Provide a cache between real memory and the CPU for most used instructions.

Remove bottlenecks occurring with CPU speed, Memory availability or Input-Output traffic.







Interruption Processing

What is an Interrupt?

An interrupt is a signal to the machine emitted by hardware (or software in the event of a Service Call) indicating an event that needs immediate attention. Interruption is the method by which the outside world communicates with the operating system. There is a special software routine in each operating system associated with each type of interruption that could occur. These are called Interrupt Handlers. You can think of the interrupt handlers as the entry points into the operating system. They are the points where control is transferred by the computer architecture from Application processing to Operating System processing.



What happens during Interrupt Processing?

The proper operation of the interrupt system depends on an intimate interaction between a specifically designed hardware subsystem behavior and a set of sophisticated software operating system routines. Successful operation depends on a clear channel of communication between the two separate functions. Central to this requirement is the permanent allocation of certain storage locations for specific uses (Fixed Storage Area). All of these locations are at the extreme low end of memory (Real Address Zero).

The machine can be in one of two states, enabled or disabled for interruption.

When the machine is disabled for interruptions the operating system cannot be notified of any request for work.

When the machine is enabled for interruptions the machine is currently either:

a) waiting for work (IDLE), or
b) currently executing an instruction (RUNNING).

When an interruption occurs and the machine is running, the current executing instruction is either:

a) completed,
b) terminated, or
c) suppressed

before the interrupt is honored. Thus, if an interrupt condition exists and the source causing the interruption is not masked, then the interrupt is said to be taken.

Taking an interrupt is accomplished as a hardware function by causing the current PSW to be stored in the permanently assigned storage location for the old PSW that is associated with the interrupt level causing the interrupt. A code is also stored with the old PSW identifying the specific interrupt condition. The contents of location assigned to the same level for a new PSW are then loaded into the current PSW, thus becoming the active PSW. By the introduction of this new PSW, a different program operating in the privileged machine state can be brought into service through the specification of a new instruction address, condition code, masks and key. This new program is usually a program module of the operating system written by its manufacturer to specifically service the interrupt that has just occurred.

This servicing of an interrupt is performed by a software routine called the First-Level Interrupt Handler, rather than a hardware function. Once the interrupt condition has been analyzed and serviced by the interrupt handling routine, either the previous routine or an entirely different one can be re-initiated by the OS service routine by execution of another specific privileged instruction, that not only dispatches the next task, but switches the machine operating states.

Thus the next software module loaded, likely an application, will operate in the non-privileged problem state.
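
The PSW swap described above can be sketched in a few lines of Python. This is a toy model, not real firmware: the dictionary standing in for the fixed storage locations, the field names, and the addresses are all invented for illustration.

    # Toy model of the hardware PSW swap taken on an interrupt.
    # 'fixed_storage' stands in for the permanently assigned low-memory
    # slots: one OLD/NEW PSW pair per interruption class.
    fixed_storage = {
        ("io", "new"): {"mode": "privileged", "next_instr": 0x1000},  # prepared at boot by the OS
        ("io", "old"): None,
    }

    current_psw = {"mode": "problem", "next_instr": 0x8F20}  # an application is running

    def take_interrupt(kind, code, storage):
        global current_psw
        storage[(kind, "old")] = dict(current_psw, interrupt_code=code)  # store CURRENT into OLD
        current_psw = dict(storage[(kind, "new")])                       # load NEW into CURRENT
        # control now passes to the OS interrupt handler at current_psw["next_instr"]

    take_interrupt("io", code=0x0F, storage=fixed_storage)
    print(current_psw)                   # the OS handler address, privileged mode
    print(fixed_storage[("io", "old")])  # the preserved application state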



What are the different types of Interruptions?

There are six classes of interruptions. We will study four of them:

1) Program-Check. This applies to exceptions due to improper use of an instruction. Software engineers generally view these as undesirable, since they point to a design flaw in the programming.

2) External. These apply to events caused by Timer Pops, Operator Attention, and the signaling messages between CPUs in a multiprocessing environment.

3) Input-Output. These are caused by the signaling from external devices that an I/O operation has completed successfully, or has failed.

4) Supervisor (or Service) Call. These apply to an application requesting that a privileged operation be performed on the application's behalf.



The other two that we are not studying:

5) Restart

6) Machine-Check









Topic 2
Language and Instructions


Binary Number Representation

Some History

4th-3rd Century BC -- Plato described the components of the world to be made of "form" and "matter". Form being the information and rules that govern the world; and matter being the physical tangible environment.

9th Century -- The word "algorithm" takes its name from the Latinized Algoritmi, the name of the Persian textbook author Abū ‘Abd Allāh Muhammad ibn Mūsā al-Khwārizmī, from Khwarazm (in modern Uzbekistan). He was known for his step-by-step procedural instructions. The term algebra also derives from al-jabr, a word in the title of his book on equations.

1605 -- Francis Bacon discussed a system whereby letters of the alphabet could be reduced to sequences of binary digits. This method could be used with any objects provided the objects employed be capable of only a twofold difference.

1679 -- Gottfried Leibniz is credited with the modern binary number system in his publication "Explanation of Binary Arithmetic" in which only the characters 1 and 0 are used.

1801 -- Joseph Jacquard invented a loom which employed punched cards that encoded the designs for the reproducible textile products.

1831 -- Charles Babbage designs the Difference Engine. Later, in 1837, the Analytical Engine incorporated an arithmetic logic unit, control flow in the form of conditional branching and loops, and an integrated memory, making it the first design for a general-purpose computer. The machine was never completed because this era did not have the mechanical engineering needed to build the thousands of gears, cams, pins, rollers, springs, etc. with sufficient precision and tight tolerances required to keep a computation from skipping or jamming. The program ("form") was separate from the hardware ("matter").

1854 -- George Boole, a British mathematician, published a landmark paper "Laws of Thought", detailing an algebraic system of logic that would become known as Boolean algebra. His objective was to show how complex human reasoning could be represented in a logical mathematical form. He demonstrated that a natural language statement could be reduced to a mathematical one of variables resulting in true or false. His logical calculus consisted of formulas of logical expressions connected by the operators OR, AND, and NOT. This was to become instrumental in the design of both digital electronic circuitry and symbolic logic that laid the foundation of computing languages.

1880 -- John Venn conceived diagrams to illustrate simple set relationships in logic and, later, computer science.

1936 -- Alan Turing linked logic and computational machines. Turing investigated the "Decision Problem": whether a mechanical process of computation could, in a finite number of steps, determine if a proposition in predicate logic was true or false. Turing proposed an abstract machine that processes a strip of tape with logical symbols written on it, and showed that the simplest computing machine can be built using zeroes and ones. Anything that can be computed can be computed by employing only a handful of basic rules or operations with a sufficiently long tape acting as memory.

Read the contents of the memory under the tape head.
Write (1 or 0) or Erase the contents at the location under the tape head.
Shift the tape position left one location.
Shift the tape position right one location.
Increment or Decrement by one.

A "Turing-Complete" machine would become to be known as any computer that could simulate the operation of any other computer. Furthermore, Turing proved that there was no way to inspect a program to determine if its loops were infinite.

Turing designed the electromechanical computer, the "Bombe", which successfully deciphered the German Nazi "Enigma" code.

1937 -- Claude Shannon implemented Boolean algebra and binary arithmetic using electronic relays and switches for the first time. Shannon's thesis founded practical digital circuit design. Shannon combined logic with electronics, showing how Boolean logic can be applied to electronic circuits to construct and resolve logical numeric relationships. Shannon is credited as the Father of Information Processing.

1944 -- Tommy Flowers, engaged by Turing, creates the first digital electronic computer using electric valves (known as tubes), used to rapidly decrypt the German Lorenz cipher used by the German High Command to send messages to its field commanders. (First deployed June 1, 1944. Kept secret for the next 30 years.) His machine, called Colossus, was Turing-Complete.

1945 -- John Mauchly and Presper Eckert build the first electronic vacuum-tube general purpose re-programmable computer (ENIAC).

1946 -- John Von Neumann described a design architecture for the electronic computer called EDVAC, still in use today. Though first posited by Presper Eckert, Von Neumann is credited with the "stored program" model that keeps both its instructions and data in memory. A central processor fetches instructions and data from memory for processing then places the result back to memory. Information to/from the outside world was called I/O, for input and output. Von Neumann computers are easily re-programmed without the need of laborious setting of switches and re-wiring.

1947 -- William Shockley, John Bardeen and Walter Brattain of Bell Labs in New Jersey, invented the transistor.

Transistors (a blend of the words “transfer” and “resistor”) are physical switches that rely on a quirk of quantum mechanics known as an “electron hole.” A hole is the lack of an electron at a spot where one could exist in semiconducting material. By introducing an electric signal to a transistor, electric fields are created that force holes and electrons to swap places. This allows regions of the transistor that normally insulate to conduct (or vice versa). The transistor is the analog of a "valve" or “switch”.



1956 -- William Shockley leaves Bell Labs to start Shockley Semiconductor Laboratory in Mountain View, CA. He seeks out the eight brightest new PhDs from engineering schools across the country to develop the next generation of transistors. The eight men were Julius Blank, Victor Grinich, Jean Hoerni, Eugene Kleiner, Jay Last, Gordon Moore, Robert Noyce and Sheldon Roberts.

1957 -- Noyce and "the group of eight" leave their erratic mentor, Shockley, to create their own company that became a new division of Fairchild Camera and Instrument.

1959 -- Robert Noyce invents the integrated circuit.

1968 -- Robert Noyce and Gordon Moore leave Fairchild to found Intel.

1971 -- Ted Hoff invents the microprocessor at Intel. The 4-bit addressable 4004, followed by the 8-bit 8008 processor in 1972.









Binary and Hexadecimal Number Representation

For data to be "process-able" it must be represented in the machine as signals or a measurable physical disturbance in the storage media.

All information, which includes data, computer programs, instructions and numbers are represented in the computer as BINARY digits (called bits).

The entire operating system (OS) and its application programs (APPS) along with their data can be represented as a single string of ones and zeros.

The bit is a digit in the base 2 numbering system that represents the fundamental unit of information.

Bits are physically grouped into strings of 8 units called bytes. Each byte is addressable by the computer architecture. ( Note: bits are NOT addressable )

The byte has a corresponding hex code in the EBCDIC or ASCII code tables which provides its binary code equivalent.



Binary is base 2. Functional for machines.

BINARY: 0, 1 times 2**i

Example: 0111     =     0x8 + 1x4 + 1x2 + 1x1     =     7,
1000 = 8,
1001 = 9



Decimal is base 10. Comfortable for humans.

DECIMAL: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 times 10**i

Example: 456     =     4x100 + 5x10 + 6x1



Hexadecimal is base 16. Helps humans interpret binary representation.

HEXADECIMAL: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F times 16**i

Example: x15 = 21,     x2F = 47,     xFF = 255

2AF     =     2x256 + 10x16 + 15x1     =     512 + 160 + 15     =     687



Binary code is easily converted to hexadecimal code because both are powers of 2. Each consecutive group of 4 binary bits makes up one hexadecimal character. Hex code is used by engineers because it is much easier for humans to read, whereas binary is extremely difficult to read and translate. Machine code, when printed, is displayed in hexadecimal so that humans have a fighting chance at reading it.
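
These conversions are easy to verify in Python (a quick sketch; the built-ins perform the positional arithmetic described above):

    n = 0x2AF
    print(n)                # 687
    print(bin(n))           # 0b1010101111
    print(hex(0b0111))      # 0x7
    # each hex digit corresponds to exactly 4 binary bits:
    print(f"{0x2AF:012b}")  # 001010101111 -> 0010 1010 1111 = 2 A F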





Why use Binary Numbers ?

Binary-coded numbers and their associated mechanical arithmetic use fewer components than their decimal-coded equivalents. Circuits that need only distinguish between two possible values are more reliable than circuits that must discern more than two values. Two-state circuits are also less difficult to construct.

The binary system is a two-state natural system; that is, it is well-suited for describing physical phenomena that we find to occur in nature:

North and South poles of a magnet,
On/Off or Positive/Negative voltages or currents for electrical switches,
The peak and trough of a radio frequency signal,
Pits and Lands to encode DVDs.

There exists no partial state between the two discrete values and therefore little room for error.

Employing Boolean logic of two input signals, and a simple logical operation on those signals, you produce a single output (a value of ONE or a ZERO). Only two possible outputs can result from the four permutations of the two inputs.

Binary digital circuits can be engineered with a proof of correctness; they are virtually “bulletproof”.

Strings of binary digits are sufficient to represent any possible information.

Binary information can account for any two-digit arithmetic operation or the true/false evaluation of conditional premise statements.







Computer Limited Representation of Numbers

Not every real number is computable. Numbers are represented as approximations in the computer architecture. There is a limited amount of bits to hold any number.

A rational number is one which can be expressed in the form of:

p / q

(the ratio of two integers). Most real numbers cannot be so expressed.

The decimal representations of most real numbers are so completely lacking in pattern that there simply is no finite table of instructions that can be followed for calculating the nth digit of the representation for an arbitrary number. If you attempted to perform the calculation by hand you would run out of paper to write your results.

Decimal representations for irrational numbers such as PI and e; also rational numbers such as 1/7 and 1/3, will therefore be approximations.

For a precise finite decimal representation, the fractional part must be expressible as a finite sum of terms of the form 1/10**n:

a*(1/10) + b*(1/100) + c*(1/1000) + d(1/10000) + ......



In binary, the fractional part must be expressible as a finite sum of terms of the form 1/2**n:

a*(1/2) + b*(1/4) + c*(1/8) + d(1/16) + ......
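
A short Python check (illustrative) shows which fractions have a finite binary form; a fraction in lowest terms is finitely representable in binary exactly when its denominator is a power of 2:

    from fractions import Fraction

    for num, den in [(1, 2), (3, 8), (1, 10), (1, 3)]:
        f = Fraction(num, den)                               # reduced to lowest terms
        finite = (f.denominator & (f.denominator - 1)) == 0  # power-of-2 test
        print(f"{num}/{den}: finite in binary? {finite}  float stores {num/den:.20f}")

The run shows 1/2 and 3/8 as exact, while 1/10 is stored as 0.10000000000000000555... and 1/3 as 0.33333333333333331483..., the approximations described above.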



In fact, computable numbers are relatively scarce among the real numbers. There are only countably many computable numbers, whereas there are uncountably many real numbers. (A set of numbers is countable if and only if either it is finite or its members can be put into a one-to-one correspondence with the integers.)





Endianness

The term "Endian" refers to the order of storing bytes in computer memory.

Each byte of data in memory has its own address.

There are two different conventions for ordering bytes within a larger data object:

Big-endian systems (such as IBM) store the most significant byte of a word in the lowest address and the least significant byte in the highest address.

Take the hexadecimal number x'06971A2F'.

This value can be represented in memory as 06971A2F ( a set of 4 bytes called a “word”, written out using left-to-right positional, hexadecimal notation)

Consider the four memory locations with addresses: a, a+1, a+2 and a+3.

Then, in big-endian systems, byte x'06' is stored in location a, x'97' in a+1, x'1A' in a+2 and x'2F' in a+3. The most significant part of the number is stored in the lowest address.

It looks like this:

06971A2F



Little-endian systems, in contrast, store the least significant byte in the lowest address.

In little-endian systems (such as Intel), the order is reversed, with x'2F' stored in location a, x'1A' in a+1, x'97' in a+2, and x'06' in location a+3.

It looks like this:

2F1A9706
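
Python's struct module makes the two byte orders concrete (a quick sketch):

    import struct

    value = 0x06971A2F
    big = struct.pack(">I", value)     # big-endian 32-bit word
    little = struct.pack("<I", value)  # little-endian 32-bit word
    print(big.hex())     # 06971a2f -- most significant byte at the lowest address
    print(little.hex())  # 2f1a9706 -- least significant byte at the lowest address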







Instruction Set

The alphabet of computer-generated information is given by the ASCII or EBCDIC conventions. These are the codes of symbols understood by the computing operating environment.

ASCII = American Standard Code for Information Interchange
EBCDIC = Extended Binary Coded Decimal Interchange Code

All computers must be able to understand the same alphabet of symbols for effective communication.

The language of the computer architecture is the instruction set. This is the set of binary strings representing instructions that are understood by the particular machine on which the instructions execute. Each instruction represents an item in the vocabulary of the computer architecture’s language.



Both instructions and the data to be operated upon are stored in the computer memory as binary strings. Known as the stored-program concept, this allowed computers to be reprogrammed relatively easily, for a variety of different applications.



Q: How does the computer differentiate between data and instruction programming?



The stored program concept of computing makes it possible to switch the interpretation of a string of bits found in memory between data and instruction. That is, the bits that represent "things" such as data, and bits that represent instructions, to "do things".

Information comes in two flavors:

(1) data which represents objects to be used and manipulated,

(2) and executables which represent functions that operate on data.

Every computer architecture must provide a set of instructions for performing fundamental operations, such as arithmetic, comparing, loading and storing data.

The binary instructions, useful to machine hardware, are translated to hexadecimal when rendered on paper or a display surface for human readability. Each instruction is assigned a useful mnemonic so that they are more easily recalled by programmers.



IBM examples:

Binary Machine code = Hex Machine Code = Assembly mnemonics

    1101 0010      =      x'D2'      =      MVC ( Copies characters to an address

    0101 1000      =      x'58'      =      L ( LOAD from addressable storage

    0101 0000      =      x'50'      =      ST ( STORE to addressable storage

    0001 1010      =      x'1A'      =      AR ( ADD register contents

    0001 1001      =      x'19'      =      CR (COMPARE register contents







Register Usage

The operands of many instructions are restricted to using a limited number of special locations built into the hardware called registers. These primitive features are different from real memory locations in that:

their number is limited;
they are physically close to the CPU;
access is extremely fast;
addressing is simple, needing only 4 bits to represent the hex numbers 0-F;
and their data can never be paged out from memory.



The size of the register (its width in bits) is limited to that of the underlying architecture (16, 32, 64-bit).

Smaller-width registers, and fewer of them, result in faster access and manipulation of the data they contain. But wider registers allow a much longer reach (addressability) into real memory.



Registers are of several types:

General Purpose -- Used for numeric data and address arithmetic.

Control -- Used for controlling reconfiguration of the CPU functions

I / O -- Used for managing input and output operations

Program Status -- Used for maintaining state of processor, next instruction and the results of the last instruction, interrupt masking, etc.

We will study the General Purpose Registers and the PSW.







Instruction Operands

Each instruction represents a function that has at least one input operand. The operand either designates a register that contains a numeric value, or represents the effective address of some data in real memory. That effective address points to the location of some byte in real memory that represents the starting point of an information field.

Please note that the effective address does not tell us how much data originating from that memory location will be used in computation. We also need to know the length of data on which to operate starting from that location.

The operand(s) give both the starting location of the data to be used by the function and the length of the data to be manipulated. The length of the data which starts at the effective address is either explicitly stated, or architecturally implicit.

Architecturally implicit means that the length is designed into the system. The CPU "knows" the length based upon the definition of the instruction being processed. That length is known as a byte, half-word, word, and double-word.

Example: MOV      [opbuf],    RAX         Copy the full reg contents to opbuf



If the length of data is not architecturally known by the CPU then the operand will include an explicit length code used with the instruction.

Example: MOV      AL,    byte [RDX]         Copy a single byte to AL



The instruction is composed of an operation code and operands.

The operation code is also known as a combinational element. They tell the CPU what function to perform, such as add, divide, move, etc.

The operands are known as state elements. They hold or point to a value.



Note: the CPU does not see the names that we assign to data variables or methods; those concepts are for us humans. The CPU is concerned only with addresses and lengths of the data values to be manipulated. The assembler converts each procedure or variable name to an address.



The operands of an instruction may be:

One or more registers containing a numerical data value or
the address (or offset) of data in real memory (ie. a pointer), or
the length of some data, or
a literal data value acting as an immediate operand.



The effective address in memory identifies the location in real memory where data is loaded from or to be stored into.

This location is calculated from the contents of the operands which consists of the register(s) contents and optionally a displacement value.

Effective address   =   contents of register(s)     +     [some displacement in memory]

While a register may contain a numerical value it generally points to a location in memory that in turn contains a value. The effective address is the sum of the following values:

( Base register    +    [ Index register ]    +    [ Displacement ] )
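
A sketch of the effective-address calculation in Python (the register numbers and contents below are invented for illustration):

    def effective_address(regs, base, index=None, displacement=0):
        """Effective address = base register + optional index register + displacement."""
        addr = regs[base] + displacement
        if index is not None:
            addr += regs[index]
        return addr

    regs = {2: 8192, 10: 100}   # hypothetical register contents
    print(effective_address(regs, base=2, index=10, displacement=1024))  # 9316

The 9316 result anticipates the LOAD example worked later in these notes.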



If the instruction relies on an implicit length then the amount of data acted upon is an architecture defined fixed amount such as a:

bit -- a single fundamental unit of information

byte -- eight bits aligned on an 8-bit boundary

halfword -- two bytes aligned on a byte boundary that is a multiple of 2

word -- four bytes aligned on a byte boundary that is a multiple of 4

doubleword -- eight bytes aligned on a boundary that is a multiple of 8

Otherwise, the instruction uses an explicit length code that is included on one or both operands.





Immediate operands do not have a register or memory location association; therefore, there is no need to fetch information from memory or a register. Immediate operands are supplied in the program code itself. They are used to indicate a literal data value or a bit-pattern (mask) to be applied directly to the computation from the instruction itself.

The OR IMMEDIATE (OI) and AND IMMEDIATE (NI) instructions will alter the bit contents of real storage at some displacement from the location pointed to by a given register.

Example: OI      0(R2),    X'0F'         Turn ON the low-order 4 bits

Example: NI      0(R2),    X'FE'         Turn OFF the low-order single bit
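
The same masks behave identically on Python integers; a sketch of the OI/NI effect on one byte (the sample byte value is arbitrary):

    b = 0b0101_0001            # some byte in storage
    print(f"{b | 0x0F:08b}")   # OI ...,X'0F' -> 01011111, low-order 4 bits turned ON
    print(f"{b & 0xFE:08b}")   # NI ...,X'FE' -> 01010000, low-order bit turned OFF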





Instruction Interpretation

For each instruction defined in the computer architecture the CPU must be able to resolve:

1) whether the set of bits being examined is truly an instruction, or just non-executable data,

2) what service or operation the instruction is to provide,

3) where the operands are located (in memory, a register, the PSW, or immediately available within the program code),

4) the length of the data to be manipulated (the address only tells us where the first byte of data starts),

5) and finally, where to place the result (memory, a register, a condition code).





Some Examples:

A load data transfer operation copies data from real memory to a register.

L      R7, 1024 ( R2, RA )     load reg7 with the data from memory location at effective address of register2 + register10 + displacement 1024.

5872A400 (in hex)
01011000011100101010010000000000 (in binary)

If R2 contained 8192, and RA contained 100 then the effective address would be location 9316, (8192+100+1024).

R7 would then be loaded with the contents that start at location 9316 for a length of 8 bytes.

The length is architecturally implicit; the width of the register.





A store data transfer operation copies data from a register to real memory.

ST      R7, 2048 ( R2, RA )     store contents of reg7 into memory location at effective address of register2 + register10 + 2048 displacement.

5072A800 (in hex)
01010000011100101010100000000000 (in binary)

If R2 continues to contain 8192, and RA contained 100 then the effective address would be location 10340, (8192+100+2048).

R7 would then store its contents at the location 10340 for 8 bytes.

The length is architecturally implicit; the width of the register.





A move characters operation copies data from one memory location to another.

MVC      512 ( 80, R2 ), 0 ( R4 )     move the contents from the memory location at effective address of register4 + a displacement 0 to the memory location at effective address of register2 + a displacement 512 for an explicit length of 80 bytes.

D24F22004000 (in hex) [opcode, length code, B1, D1, B2, D2]
110100100100111100100010000000000100000000000000 (in binary)

(Note: the machine length code is x'4F' = 79, one less than the 80 bytes moved; the hardware moves length code + 1 bytes. The displacements x'200' and x'000' are 512 and 0 in decimal.)

If R2 now contains 4096, and R4 contains 1024, then the effective address of the source would be location 1024, (1024 + 0).
And the effective address of the target would be location 4608, (4096 + 512).

This instruction would move 80 bytes of data from address 1024 to address 4608. The length of 80 is explicit.





Instructions must be aligned in memory, generally on architectural boundaries divisible by 2. This way the CPU need only concern itself with data starting at the "even" numbered memory locations when looking for the next instruction. Furthermore, the CPU can ignore all bit locations other than the zero bit location that begins each evenly addressed location. Alignment permits a more efficient interpretation, lowers thermal buildup, and facilitates speed.



Most instructions set a Condition Code that represents the success or failure (and other ending conditions) of the executed instruction. Since the CPU cannot “remember” what operation it last performed, this status information is stored in the Program Status Word (or Program Control Register). The condition code is available for access and interrogation by the next instruction. It allows the next instruction to choose the same or different path for continued program execution. (Think IF then ELSE)







Bit Manipulations used for Logical Operations

Logic Gates

A gate is an electrically engineered device that performs a basic operation on electrical signals.

(Please note that a gate can be built out of many different materials, and the signals received and sent are not required to be electrical.)

The gates in the computer circuitry hardware are referred to as logic gates. Each performs just a single logical function. That is, each logic gate accepts one or two input values and produces a single output value. Because we are dealing with binary information, each input and output value is either a zero (0) or a one (1), corresponding to some reliable physical process. The type of gate and the input values determine the output value.

Logic Gates are combined into circuits to perform increasingly more complicated tasks. The following logic operations were designed in circuitry hardware to efficiently manipulate bits to perform operations used in arithmetic calculation and determination of logical relationships.

Below are the Boolean truth tables that define the function of each gate by listing all possible input combinations that these logic gates could encounter, along with the corresponding outputs produced:



NOT Flips the bit to become the opposite:

NOT   1    =    0
NOT   0    =    1



AND The intersection of two bits, both bits must be one:

1   AND   1    =    1
1   AND   0    =    0
0   AND   1    =    0
0   AND   0    =    0




OR The union of two bits, either bit is one or both bits are one:

1   OR   1    =    1
1   OR   0    =    1
0   OR   1    =    1
0   OR   0    =    0



XOR Exclusive OR (union minus the intersection), either bit is one, but NOT both bits:

1   XOR   1 =    0
1   XOR   0 =    1
0   XOR   1 =    1
0   XOR   0 =    0







Other logic gates are constructed by combining two of the previously discussed gates. In particular, the NAND and NOR gates simply run the AND and OR outputs through a NOT gate.



NAND (NOT AND) produces the opposite of the AND gate:

1   NAND   1    =    0
1   NAND   0    =    1
0   NAND   1    =    1
0   NAND   0    =    1



NOR (NOT OR) produces the opposite of the OR gate:

1   NOR   1    =    0
1   NOR   0    =    0
0   NOR   1    =    0
0   NOR   0    =    1







Electronic logic gates are built from transistors. The transistor acts like a valve for controlling the flow of electrons.

Instructions are combinations of logic gates. The instruction takes input, then performs some function, and finally produces a result of a single bit value.

Note that, other than the NOT gate, the logic gates are not reversible; that is, we cannot run the outputs through the gate again to recover the inputs.

Fundamental logic gates provide us with the capability to use electrons to process intelligence.
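
Each gate is a one-line function in Python. The sketch below also checks the standard result that NAND alone is universal: NOT, AND, and OR can all be rebuilt from it.

    NAND = lambda a, b: 1 - (a & b)

    # NOT, AND, OR rebuilt from NAND alone
    NOT = lambda a: NAND(a, a)
    AND = lambda a, b: NAND(NAND(a, b), NAND(a, b))
    OR  = lambda a, b: NAND(NAND(a, a), NAND(b, b))

    for a in (0, 1):
        for b in (0, 1):
            assert AND(a, b) == (a & b)
            assert OR(a, b) == (a | b)
        assert NOT(a) == 1 - a
    print("NAND rebuilds NOT, AND and OR correctly")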









Architectural Considerations regarding Alterations

The size of the register, memory addressability, fixed storage locations, instruction and operand alignment, the length of a page of storage, the condition codes set by each instruction are all simultaneously bound by the underlying architecture. Make a change to one aspect of the architecture and you will have to re-engineer all of them.







How small can we make a computer in principle? How little energy is actually needed to carry out a computation?

"The Fundamental Physical Limits of Computation", by C.H. Bennet and R. Landaur.

Please note that with each of the above logic gates, aside from the NOT, you cannot always derive the inputs from the outputted result. That is, the computation is not reversible; information about the input is lost.

When circuits get very small you must consider Brownian motion, which will introduce some randomness in the path of electrons (thermal oscillation). At current energies (e.g., 2 volts) the chance of an electron jumping backward against its intended flow is extremely low (10**-43). But what if we wanted to reduce the energy considerably, such that we now have to consider possible path reversals?

At low energies the calculation proceeds forward, then possibly goes backwards to "un-calculate", then proceeds forward again.

Using normal logic gates information is lost when the electron flow moves backward. We need a reversible gate to accommodate such behavior. Take the NAND gate and let's make it reversible. It now has 3 inputs (A, B, C) and three outputs (A', B', C').

Two of the outputs A' and B' are the same as the inputs A and B, but the third output is dependent on the following:

C' is the same as C unless A and B are both 1, in which case C' is the opposite of whatever C is.

Now we can discover what went into the gate by what came out, and therefore the gate is reversible.

Place two of these modified NAND gates in series, then add a little electrical bias to move forward, and you are assured of the correct computation; even though it may reverse itself some.
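
This reversible three-input gate is known in the literature as the Toffoli gate. A Python sketch confirms the reversibility: applying the gate to its own output recovers the original inputs.

    def toffoli(a, b, c):
        """A and B pass through; C is flipped only when A and B are both 1."""
        return a, b, c ^ (a & b)

    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                assert toffoli(*toffoli(a, b, c)) == (a, b, c)  # reversible
    print("the gate undoes itself: no information is lost")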









Topic 3
Machine Arithmetic


Does hardware actually perform arithmetic as humans are taught in elementary school?

What happens when an arithmetic operation results in a bigger number than can be represented in the hardware?

How do we represent negative numbers?

How do we partition the permutations of bits available in the register's width to accommodate both positive and negative numbers?

How do we represent zero?

And lastly, what mechanical operations are applied to these binary number representations such that we can correctly affect arithmetic computation?





Why use Binary for Machine Arithmetic over Decimal?

Engineers settled on binary because binary-coded arithmetic uses far fewer components than decimal-coded arithmetic. Circuits distinguishing between only two possible values are much more reliable than circuits having to determine a range of ten values.



Signed and Unsigned Number Representations

In all number bases the high-order digits are the most significant (greatest magnitude), and the low-order digits the least significant.

All numeric values are based upon the following format:

A digit d multiplied by the power base**i

Example of base 2:    ( 1 x 2**24 )

Example of base 10:    ( 987 x 10**8 )



Positive and Negative numbers are represented by both a sign and a magnitude.

(The magnitude of a number indicates how far to the right or left of the origin (0) it appears in our well known number line.)

In binary, the negative sign is represented by a 1-bit in the high-order position (that is, the leftmost digit), whereas a positive number will have a zero-bit as its high-order digit.

Note that the high-order digit cannot be used in determining the magnitude of a number, so this reduces the greatest possible number (magnitude) by half.





Unfortunately, simply attaching a high-order 1-bit to the magnitude does not give us a usable negative value. For instance, it has the undesirable side-effect of creating two zeros (positive and negative).

Instead engineers have come up with a scheme that removes this problem. The requirements for providing both usable negative numbers and performing reliable arithmetic operations:

a) Virtually equal number of bit permutations on each side of the number line.
b) There will be only a single zero.
c) All permutations must function correctly under arithmetic operations.



Electrical engineers have also determined that it is more efficient to ADD binary numbers than it is to perform a SUBTRACTION of them. So, when a difference is called for, the number to be subtracted (the subtrahend) is converted to a special form called TWOs COMPLEMENT.

In TWOs COMPLEMENT each bit is inverted (producing the ONEs COMPLEMENT), then a 1 is added to the result. The new binary number (the TWOs COMPLEMENT) is then added to the first number, and a correct difference will result.





Representing Negative numbers with Twos Complement

Examples of Two's Complement:

One's Complement number representation takes each bit of the binary number and inverts it. That is, the bits are sent through a NOT gate.

Example:
7 = 0111
One's Complement of 0111 is 1000

Two's complement number representation takes the one's complement and then adds one to it (with any carries).

Example:
7 = 0111
One's Complement of 0111 is 1000
Two's Complement is 1001
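
In Python (whose integers are unbounded) the same manipulation can be done by masking to a fixed width, a sketch:

    def twos_complement(x, bits):
        """Invert the bits (ones complement), add 1, keep only 'bits' bits."""
        mask = (1 << bits) - 1
        return (~x + 1) & mask

    print(f"{twos_complement(7, 4):04b}")        # 1001
    print((7 + twos_complement(7, 4)) & 0b1111)  # 0 -- 7 + (-7), carry out dropped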



Two's Complement numbers are a way to encode negative numbers into binary, such that you can add the twos complemented number to another number and in effect result in a subtraction.

Adding -1 + 1 should always equal 0.

If, during addition with a two's complemented number, there is a carry out to a position to the left of the high-order digit available in the register, then the hardware will ignore the carry. (It doesn't fit and drops away.)

The Two's Complement method successfully splits the number range evenly between positive and negative numbers on the number line.



Twos Complement ensures that there is only a single zero configuration, that being 0000....0 propagated throughout the entire register.



The smallest magnitude positive number is 0000...0001

The greatest magnitude positive number is 0111...1111

Note that a zero is in the high-order position.



The smallest magnitude negative number is 1111...1111

The greatest magnitude negative number is 1000...0000

Note that a one is in the high-order position.



(-) |--------------------...--------------------|--------------------...--------------------| (+)
    ^                                           0                                            ^
1000...0000                                                                         0111...1111
(greatest magnitude negative)                                     (greatest magnitude positive)





What may appear unintuitive about the negative range is that the negative number with the greatest magnitude has all zeroes in its low-order bits, along with the negative indicator at the high-order bit location.

The negative number with the smallest magnitude (-1) has all of its bits set to ONEs. Aside from the sign bit, the negative range appears to run backwards.





Addition and Subtraction

If a number is to be subtracted from another number, the subtrahend is converted via a twos complement manipulation, then that result is added to the original minuend. Any carries past the high-order position are ignored (dropped by the hardware).

Detection of overflow generates an exception.

Note that “adding” is performed by XOR operations, with a carry generated when both input bits are one (1 XOR 1 results in 0, with a carry of 1).

(see examples in the BinaryArithmetic diagram)



Overflow

Overflow occurs when the result from an operation cannot be represented in the receiving hardware register. Arithmetic operations can generate results that cannot fit in the architecture's fixed size register. A maximum of 65 bits may be required to add two large 64 bit numbers.

Overflow can occur only when two operands with the same sign are added.

Overflow cannot occur when the signs differ.

The hardware will "know" the sign of the result prior to performing the computation.

For signed numbers, the high-order bit is used to indicate whether the contents are positive or negative.

During overflow from adding, the high-order sign bit is set with the value of the result through carries. When the carries do not result in the sign expected then the hardware will detect and report an exception.

When adding two positive numbers the sign will indicate negative, if overflow occurs.

When adding two negative numbers the sign will indicate positive, when overflow occurs.
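
A sketch of fixed-width addition with overflow detection, following the sign rule above (the 8-bit width is chosen arbitrarily):

    def add_signed(x, y, bits=8):
        """Add two 'bits'-wide two's complement values and report overflow."""
        mask = (1 << bits) - 1
        sign = 1 << (bits - 1)
        raw = (x + y) & mask   # any carry past the high-order bit is dropped
        # overflow only if the operands share a sign and the result's sign differs
        overflow = (x & sign) == (y & sign) and (raw & sign) != (x & sign)
        return raw, overflow

    print(add_signed(127, 1))  # (128, True): 127 + 1 overflows to the bit pattern for -128
    print(add_signed(3, 5))    # (8, False)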







Multiplication

The multiplicand is multiplied by the multiplier resulting in a product.

The intermediate products are shifted 1 bit to the left.

The product of an n-bit multiplicand and an m-bit multiplier can be up to n+m bits long. That is, n+m bits are required to accommodate all the possible products.

Generally multiply instructions are provided with a product area that is two registers in length.

A full 128 bits may be required to hold the product of two 64-bit numbers.

Overflow cannot occur because two registers are used to contain the result.

Hardware multiplication is a series of shifts and ANDs applied to the bits. “Multiplying” a digit in the multiplier by a digit in the multiplicand is performed by AND operations.

The hardware will "know" the sign of the result prior to performing the computation.



Multiplication Steps:

The multiplier bits are logically ANDed with each bit in the multiplicand, with the result stored as a row.

The next multiplier bit to the left performs likewise, with the result stored in the next row but shifted 1 bit to the left.

The row is added (logically XORed, with possible carry) to the prior row to produce an intermediate result.

The process continues until the multiplier bits are exhausted for length of the architectural number of bits in a register.

The last result becomes the product.

Note that shifting bits 1 bit to the left has the same effect as multiplying by 2, with more efficient hardware performance.

Faster multiplication is accomplished by a pipelined series of adders. One for each bit of the multiplier. One input is the multiplicand ANDed with the multiplier bit, and the other is the output of the prior adder. Pipelining means connecting the outputs of one adder with the inputs of the next.

For signed multiplication, simply recall the signs of the multiplier and multiplicand. Following simple mechanical arithmetic rules: If they are different then the result must be negative otherwise the result is positive. (see examples in the BinaryArithmetic diagram)
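
The shift-and-AND scheme can be written out directly; a sketch for unsigned values:

    def multiply(multiplicand, multiplier):
        """Unsigned binary multiplication by shifts and adds."""
        product = 0
        shift = 0
        while multiplier:
            if multiplier & 1:                     # AND with the current multiplier bit
                product += multiplicand << shift   # add the shifted partial product
            multiplier >>= 1
            shift += 1
        return product

    print(multiply(0b1011, 0b0110))   # 11 x 6 = 66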







Division

Dividend = (Quotient x Divisor) + Remainder

The Remainder will always be smaller than the Divisor.

The problem when dividing decimal numbers is that you must know how many times the divisor goes into the portion of the dividend you are currently working with.

With binary numbers the divisor goes into the intermediate portion of the dividend either 0 times or 1 time. If the dividend has more significant high-order digits than the divisor then you know that the divisor can be subtracted from the dividend again.



Division Steps:

The high-order portion of the dividend is examined to see if it is at least as large as the divisor. If so, then a 1 is generated in the quotient, else a zero.

With each step the divisor is subtracted (using twos complement) from the high-order portion of the dividend being processed.

Then the next digit in the dividend is appended to the intermediate remainder, and the process iterates until complete.

The final remainder and quotient are the result. (see examples in the BinaryArithmetic diagram)
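
A minimal Python sketch of the shift-and-subtract procedure just described; the 8-bit width and the names are illustrative assumptions:

    def divide(dividend, divisor, bits=8):
        quotient = 0
        remainder = 0
        for i in range(bits - 1, -1, -1):
            # append the next dividend digit to the intermediate remainder
            remainder = (remainder << 1) | ((dividend >> i) & 1)
            if remainder >= divisor:             # the divisor "goes in" 0 or 1 times
                remainder -= divisor             # subtract (hardware: twos complement add)
                quotient |= 1 << i               # generate a 1 in the quotient
        return quotient, remainder

    print(divide(43, 5))   # (8, 3): 43 = 8 x 5 + 3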



For signed division, simply recall the signs of the divisor and dividend. Following simple mechanical arithmetic rules: If they are different then the result must be negative otherwise the result is positive.









Floating-Point Binary Number Representation

First, know that binary numbers can have a decimal point. It works more or less the same way that the decimal point does with decimal numbers. For example, the decimal 22.589 is merely 22 and 5x10^-1 + 8x10^-2 + 9x10^-3.

Similarly, the binary number 101.001 is:

1x2^2  +  0x2^1  +  1x2^0  +  0x2^-1  +  0x2^-2  +  1x2^-3, or rather simply 2^2  +  2^0  +  2^-3 = 5.125.

Second, know that binary numbers, like decimal numbers, can be represented in scientific notation.

Example: The decimal 923.52 can be represented as 9.2352 x 10^2.

Similarly, binary numbers can be expressed that way as well. Say we have the binary number 101011.101 (which is 43.625 in decimal).

This would be represented using scientific notation as 1.01011101 x 2^5.





In 32-bit architecture the single precision floating point value is a packet of 32 bits, divided into three sections: one bit, eight bits, and twenty-three bits, in that order.



Sign 1 bit        Exponent 8 bits        Mantissa 23 bits



the representation has three fields:

     +-+-------+-------------------------+
     |S|   E   |            M            |
     +-+-------+-------------------------+


S is one bit representing the sign of the number
E is an 8-bit biased integer representing the exponent
M is an unsigned integer holding the mantissa bits

the decimal value represented is:

          value = (-1)^S  x  m  x  2^e , where

          e = E - bias            (the bias is 127 for single precision)

          m = ( M / (2^n) ) + 1   (n = the number of mantissa bits, 23)





Sign Field

The first section, the sign, is one bit in length. 0 indicates that the number is positive, 1 negative. The number 1.01011101 x 2^5 is positive, so this field would have a value of 0.



Exponent Field

The second section is eight bits long (also called the characteristic), and serves as the "exponent" of the number as it is expressed in scientific notation as explained above. A field eight bits long can have values ranging from 0 to 255.

To cover the case of negative values, this "exponent" is actually 127 greater than the "real" exponent "a" of the 2^a term in the scientific notation. The range of real exponents is therefore -127 to +128, although the two extreme field values are reserved for special cases. In our 1.01011101 x 2^5 number, the 8-bit exponent field would have a decimal value of 5 + 127 = 132. In binary this is 10000100.



Mantissa Field

The third section is twenty-three bits long, and serves as the "mantissa." (The mantissa is sometimes called the significand.) The mantissa is merely the "other stuff" multiplying the 2^a term in our binary number represented in scientific notation. In our 1.01011101 x 2^5 number, you would think that the 23-bit mantissa would take the value 10101110100000000000000, but it does not. If you think about it, normalized numbers expressed in binary scientific notation always have a leading 1. Why? In decimal scientific notation a value is never expressed with a leading 0, like 0.2392 x 10^3; it would be expressed instead as 2.392 x 10^2. The point is, there is never a leading 0 in scientific notation, and it makes no difference whether we're talking about binary, decimal, hexadecimal or any other number base. The advantage of binary, though, is that if a digit can't be 0, it must be 1! So, the 1 is assumed to be there and is left out to give us just that much more precision. Thus, the mantissa for our number would in fact be 01011101000000000000000. (The long stream of 0s at the end has one more zero than the alternative number at the beginning of this paragraph.)
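
You can verify the three fields for 43.625 (our 1.01011101 x 2^5 example) in Python by reinterpreting the 32-bit image of the float. This sketch uses the standard struct module; the function name is an invention for illustration:

    import struct

    def float_fields(x):
        # reinterpret the single precision value as its 32-bit image
        bits = struct.unpack('>I', struct.pack('>f', x))[0]
        sign     = bits >> 31
        exponent = (bits >> 23) & 0xFF          # biased by 127
        mantissa = bits & 0x7FFFFF              # 23 bits, hidden leading 1 omitted
        return sign, exponent, mantissa

    s, e, m = float_fields(43.625)
    print(s, e, format(m, '023b'))   # 0 132 01011101000000000000000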







Double Precision vs. Single Precision

In addition to the single precision floating point described here, there are also double precision floating point units. These have 64 bits instead of 32, and instead of field lengths of:

1, 8, and 23,

we have field lengths of:

1, 11, and 52.

The exponent field contains a value that is actually 1023 larger than the "true" exponent, rather than being larger by 127 as in single precision. Otherwise, it works exactly the same.

Sign 1 bit        Exponent 11 bits        Mantissa 52 bits







Floating Point Operations

Support for fractions and real numbers (including rational numbers expressed as p/q).

Numbers are expressed in scientific notation, with a single digit to the left of the radix point.

A floating point number is normalized if it has no leading zeroes.

The sign is the first bit, followed by an exponent, then the fraction.

Overflow means that the exponent is too large.

Underflow means that the exponent is too small.

A double implies a floating point value that is twice the bit length. Both the exponent and the fraction contain an increased number of significant bits, providing greater precision and magnitude.





The Number Representation Problem

Infinitely many real numbers cannot be squeezed into a finite number of bits, and therefore requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in the number of bits provided by a register. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using the available bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.
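
The rounding is easy to observe in any language that uses binary floating point. In Python, for example:

    print(0.1 + 0.2)          # 0.30000000000000004, not 0.3
    print(0.1 + 0.2 == 0.3)   # False: neither 0.1 nor 0.2 is exactly representable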







Topic 4
The Processor


What are the processes by which instructions get executed in the CPU?

How does the CPU know where in memory to go to get the next instruction to be processed?

How does the CPU recognize the difference between executable code and non-executable data fields?





Steps in instruction execution



Fetch the instruction from memory.

Examine the address of the next instruction via the Program Counter (PC) or the Program Status Word (PSW).

Go to the memory location containing the next instruction to be executed.

Bring the instruction (which includes the opcode and its operands) from real addressable memory into the CPU hardware buffer. This generally consists of the maximum amount of data needed to populate the longest instruction format. The data starts at the memory address to which the PSW points.





Decode the instruction.

Each instruction falls into a category of like instructions, such as arithmetic instructions, moves from and to memory, instructions that perform logical operations, etc. These categories are known as the Instruction Formats. Instructions of the same format share characteristics like their length in bytes, the use of registers, a possible displacement and an explicit length code.

Using the Instruction Set list and the Instruction Formats, the Decode stage determines the Operation Code (opcode) and reads the contents of the register(s).

The opcode field denotes the operation to be performed and the format of the instruction.

Further, the Decode stage sets the control and data signals based upon the format of the particular instruction. Each format has a unique requirement for processing. These signals effectively reconfigure the CPU “on the fly”.

Compute the address of the next instruction to be fetched and update the Program Status Word (PSW).





Execute the instruction.

This work takes place in the Arithmetic Logic Unit (ALU),

Uses the control and data signals set by the Decode stage as a guide to determine the work to perform and the data to act on.

May calculate a next instruction address in memory to which to transfer program flow (a branch). If the next inline address does not contain the instruction to be fetched next, then the next instruction address in the PSW that was calculated in the Decode stage is overridden.





Data memory access.

Ensure that the operands' memory addresses can be reached, that is, that they are addressable by the architecture. Ensure that the address computed from the contents of the registers and displacements actually exists.

Ensure access to any register(s) needed.

Set the Condition Code in the PSW.





Write the result back to memory.

Place the data results of computation into the targeted memory location or into the result register.
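
The five steps can be illustrated with a toy machine in Python. This is purely a sketch; the instruction encoding, register names and opcodes are invented and bear no relation to any real architecture:

    # toy "memory" holding both instructions and one data word
    memory = {0: ('LOAD',  'R1', 100),     # R1 <- mem[100]
              1: ('ADD',   'R1', 1),       # R1 <- R1 + 1
              2: ('STORE', 'R1', 100),     # mem[100] <- R1
              3: ('HALT',  None, None),
              100: 41}
    registers = {'R1': 0}
    pc = 0                                 # program counter / PSW instruction address

    while True:
        opcode, reg, operand = memory[pc]  # fetch the instruction
        pc += 1                            # compute the next instruction address
        if opcode == 'HALT':               # decode selects one of the actions below
            break
        elif opcode == 'LOAD':
            registers[reg] = memory[operand]              # data memory access
        elif opcode == 'ADD':
            registers[reg] = registers[reg] + operand     # execute in the "ALU"
        elif opcode == 'STORE':
            memory[operand] = registers[reg]              # write the result back

    print(memory[100])   # 42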









Logic Elements

A combinational element is one that carries out some operation, such as an AND or XOR gate, in the Arithmetic Logic Unit (ALU). This element will determine how the information provided by the two state elements (the operands) is processed or combined by the operation of the instruction.



A state element is a contiguous set of bits residing at a memory location or occupying a register. This data represents the input contents or the results of a logic operation. These are the elements that must be saved and restored upon context-switching the current task (interrupting, saving, restoring and re-dispatching) in the event of an interruption.

The state element is somewhat like a data variable in a high-level language. It holds the state or the current value of an operand used by the instruction.



An instruction found in memory appears as a string of bits that consists of both a combinational element (the operation code) and state elements (the operands and values pointed to by the operands).





A datapath element is a unit, similar to state elements, used to operate on or hold transient, or temporary, data within the CPU. This may include intermediate results from the instructions, the ALU, and the adders.









Clocking

Clocking is the rhythmic ticking of the computer's internal timepiece.

Clocking is measured in Hertz, (cycles per second), and is currently on the order of billions of cycles per second.

It takes time for logic gates to propagate signals from their input registers to their output registers. Operations have a certain amount of time to start and complete. If new inputs are provided to be read before the circuits "settle" then the outputs are likely to be misinterpreted by subsequent circuits.

Clocking defines when signals can be read and when they can be written; as well as the time interval in which a unit of work must start and then finish executing.

Every logical step in computation must be completed before the next step is attempted. The interval between ticks is long enough to guarantee that the circuit is fully settled before the output is stored. This provides the software programmer the guarantee that each instruction will be correctly finished before the next step is attempted and that the inputs are available. The instructions are processed reliably in the order in which they were written.

Synchronization is important so that all work units execute in lock-step. If operations did not synchronize into a tight tolerance then a signal could be read at the same time it is being written thereby causing unpredictability and loss of data integrity.

A logic gate has the interval of time from one clock edge to the next to complete a unit of work. Finishing early is not a problem. Finishing late would prove catastrophic.

Any values updated in a sequential logic element (register or variable in memory) occur at a clock edge and are maintained for the entire clock interval.







Signals

Control Signals are used for the proper selection or the directing of the operation of a functional unit. Each instruction is a member of a particular format category. The control signals reconfigure subsequent stages on the basis of the format for the executing instruction. Such as:

1) the instructional requirements to access a register or memory location,
2) the string of logic gates used to carry out the function or computation,
3) resolving the locations of the inputs and output of the calculation.

The setting of the control lines is completely determined by the operation code (opcode) of the instruction. The instruction format tells the CPU how to "retool" at decode stage time. Somewhat like a car assembly line being reconfigured for a particular model with each vehicle being produced.



A Data Signal is used to provide content information (the operands) of the instruction that is operated on by the subsequent stages. That is, identification to the further stages as to the source of the inputs (register, condition code, memory location, immediate data) and the content of the outputs. This information is encoded in the operands' fields of the instruction format.







How important are signals?

Without signals we would need a different, separate special purpose CPU to handle each type of instruction format. The signals allow a generalized CPU to be customized “on-the-fly” to accommodate the different instruction formats, with their various inputs and outputs.

With special purpose CPUs you would not need the full implementation of the DECODE step, thus making instruction execution faster, but only a single CPU could be operating at a time. The multiple CPUs would be much more expensive to implement.

With a generic CPU, the DECODE stage consumes additional clock cycles per instruction, but the compromise makes the hardware relatively inexpensive.









Single-Cycle Implementation

Performing the above in single cycle fashion will work correctly, but is not used in modern designs because it is inefficient.

The clock cycle must be the same length for each instruction. The cycle length is determined by the instruction with the longest possible path through the CPU, the worst-case delay.

Each instruction is executed serially (one at a time) to completion.

The next instruction does not begin processing until the prior has vacated the CPU.







Pipelined Datapath and Control

Pipelining is a technique that exploits parallelism among multiple instructions in a sequential instruction stream. The instructions are overlapped in execution.

The division of an instruction into five stages means a five-phase pipeline, which in turn means that up to five instructions (theoretically) will be in CPU processing simultaneously during any single clock cycle.

Pipelining increases the number of simultaneously processing instructions and the rate at which instructions are started and completed. Pipelining does not reduce the time it takes to complete any individual instruction.



As each step in the instruction procedure finishes:

the first step of the next instruction is fetched,

the second step of the prior instruction is decoded,

the third step of an earlier instruction is executed,

the fourth step of a still earlier instruction accesses data memory,

the last step of the earliest instruction writes its result.



Pipelining does NOT change the order in which instructions are executed. The programmer can consider the programmed instruction order as "set in stone".





Pipeline (ideal) performance:
If stages are perfectly balanced with ideal processing conditions then:

Time between instructions = Time between non-piped instructions / Number of pipe stages
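
For example, if a non-pipelined instruction completes in 1000 ps, a perfectly balanced five-stage pipeline would ideally start (and finish) an instruction every 1000 / 5 = 200 ps, a fivefold increase in throughput.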





Pipeline vs. Single Cycle instruction performance
Just as the single-cycle design must take the worst-case number of clock cycles to process the complete instruction, the pipeline need only consider the worst-case amount of time needed to complete the slowest step in the pipeline.



Outside of the memory system, the effective operation of the pipeline is usually the most important factor in determining the Clock Cycles Per Instruction of the processor and hence its performance.





Designing Improvements for PIPELINES

1) Have very few differing instruction formats, thereby requiring less CPU re-configuring.

2) Make every instruction format in the instruction set the same length, making it easier to compute the location of the next instruction.

3) Capability to operate on operands that are in main memory.

4) Ensure that instruction operands reside in the same "page" of storage as the instruction itself.





PIPELINE considerations

Given a large number of instructions the increase in speed is proportional to the number of pipe stages.

Pipelining improves performance by increasing instruction throughput. It does not decrease the execution time for any single instruction.

Modern processors employ a number of pipeline stages in an attempt to maximize the processor clock rate. The circuit designers divide the instruction execution into a large number of very simple steps so that each stage can minimize the delay.









Pipeline Hazards

A pipeline hazard is a situation where the next instruction in a stream cannot execute due to conditions set up by the processing of instructions currently preceding it in the pipeline. Hazards impact the ideal performance theoretically achievable by pipelining.

Let's look at four types:



A structural hazard is one in which the instruction cannot execute in the same clock cycle because the hardware does not support all the possible combinations of instructions that are set to run simultaneously in overlapped execution. There exists a conflict, such as a common resource that is needed by more than one operation and that resource cannot be shared.

Examples:
1) An instruction that executes a compare operation and has not yet set the condition code, (the common resource), in the PSW. The following instruction has the need to take action on the basis of the condition code.


2) Two instructions require access to the same memory location and the memory has a lock or a single path. The first instruction may want to write (store), the second to read (load).





A data hazard occurs when an instruction cannot execute in the same clock cycle because the input data that is needed by the instruction has not yet been made available; it still remains in the process of being computed or has not yet begun to be computed.

Examples:
1) A DIVIDE instruction needs the result of a prior ADD instruction which is still in the pipeline and has not yet stored its sum.

2) A MOVE instruction that puts spaces into a field in memory is followed immediately by another MOVE of data into the same location. The first instruction must complete its work before the subsequent instruction, otherwise corruption ensues.







Approaches to remediate structural and data hazards

Scheduling -- The programmer (or compiler) explicitly avoids scheduling instructions whose sequence could possibly create data hazards. The programmer does not do this work today.

Duplicating -- Add additional hardware to the design so that each instruction can access separate and independent resources at the same time.

Stalling -- A pipeline stall is a necessary "wait" introduced into the processing that is required to resolve a hazard.

A NOP (No Operation) is an instruction used as a programmed stall to consume time (clock cycles) but performs no computational work. The NOP takes up space in a stage, adding elasticity (like a bubble) to the pipeline without loss of information or computational results.

Dynamically generated NOP instruction(s) are injected into the execute stage each time there is a need to hold back an instruction that wants to use data, until the source instruction ahead has produced that data. The NOP does not cause any changes to the registers, memory, condition code or program status.



Forwarding or bypassing -- In the case of data hazards, forwarding or bypassing is a method of resolving a data hazard by retrieving the missing state elements from the CPU internal buffers rather than waiting for results to arrive at the programmer visible registers or the expected place of storage in memory.

Forwarding is valid only if the destination stage of instruction pipeline 'Y' occurs at a later time than the source stage in pipeline 'X'.

Bypassing comes from passing the result around the register file to the desired destination unit.
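
The stalling alternative can be sketched in Python. Everything here is an invented illustration: the instruction tuples, the assumption of no forwarding, and the rule that a result may be consumed only by instructions issued at least GAP slots after its producer:

    GAP = 3   # assumed: a consumer must issue at least 3 slots after the producer

    program = [('LOAD', 'R1', ()),             # produces R1
               ('ADD',  'R2', ('R1', 'R3')),   # consumes R1 immediately: hazard
               ('SUB',  'R4', ('R5', 'R6'))]   # independent: no hazard

    scheduled = []
    for op, dest, sources in program:
        for src in sources:
            # look back over the most recent slots for a too-close producer
            window = scheduled[-(GAP - 1):]
            for back, prior in enumerate(reversed(window), start=1):
                if prior[1] == src:
                    for _ in range(GAP - back):
                        scheduled.append(('NOP', None, ()))   # stall bubble
                    break
        scheduled.append((op, dest, sources))

    for instr in scheduled:
        print(instr)           # LOAD, NOP, NOP, ADD, SUB

With forwarding, most of those bubbles disappear because the result is taken from the internal buffers instead of waiting for write-back.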







A control hazard (better known as a branch hazard) occurs when the instruction cannot execute in the pipeline cycle because the next instruction fetched is not the instruction that is needed. That is, the logic flow of instruction accesses has "jumped" or "branched" away from the address of the next expected instruction. That next logical instruction is located somewhere else in memory, not adjacent to the instruction just executed.

This gives rise to the "IF-THEN-ELSE" construct with which programmers are familiar.

The pipeline cannot possibly have future knowledge of what the next instruction to execute should be if it is currently processing a branch type instruction.

A solution would be to stall the instruction stream until the next effective instruction address was computed, but this would cause a slowdown in processing.

Branch prediction is a method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome. A "guess" based upon the recent past flow of instructions.

Prediction will not slow down the pipeline when it is correct. However, when wrong, the subsequent instructions need to be flushed from the pipeline and those pipeline stages must be reprocessed with the correct string of instructions.

Longer pipelines exacerbate the problem by raising the cost of an error in prediction.





Program-Check Exception Events

An exception is an event (usually software) that disrupts the normal program execution and leads to an interruption. Exceptions can be normal and expected such as a supervisor service call, or abnormal unexpected events such as a program-check. In either case the computer architecture recognizes the event and proceeds by initiating the interruption processing.

The abnormal exceptions are hazards. In particular program-check exceptions were initially designed to handle unexpected events from within the processor, like arithmetic overflow, divide by zero, etc.



An interrupt is an asynchronous hardware event that alters the sequence in which the processor executes instructions. The term asynchronous used here means that the interrupt event is not serialized in time with some other event. It does not wait for some other process to finish. Multiple interruptions can happen simultaneously and occur without coordination with each other. Typically asynchronous interruptions are generated by an I/O device, a Timer or Operator intervention.



When an exception or interruption occurs, the program's flow of control is suspended and the execution path is immediately changed. For example, if application code had been executing, then control passes to operating system code. If the event were a program-check, the architecture must save the address of the offending instruction, then transfer control to a special OS routine. This routine, called an interrupt handler, takes corrective, supervisory or recovery action. Immediately following the event, the status, context and register contents of the suspended task are saved by the OS in a unique recovery area of memory (a data structure).



Note that any unfinished instructions in the pipeline ahead of the instruction causing the exception must be allowed to run to completion.

Furthermore, any instructions in the pipeline following the instruction causing the exception must be flushed from the system as though they never happened.

From the software engineer's point of view, he/she must be assured by the computer architecture that the instructions were processed as though executed serially, one after another, in time.



The pipelined implementation treats exceptions as another form of a control hazard. The hardware will stop the offending instruction in midstream, let all prior instructions run to completion, and flush all the instructions following the exception from the CPU pipeline.









The von Neumann Bottleneck

The shared bus between the program memory and data memory leads to the von Neumann bottleneck, the limited throughput (data transfer rate) between the CPU and memory compared to the amount of memory. Because instructions and data must travel between the memory and CPU across a bus, throughput is lower than the rate at which the CPU can work. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data. The CPU is continually forced to wait for needed data to be transferred to or from memory. Since CPU speed and memory size have increased much faster than the throughput between them, the bottleneck has become more of a problem, a problem whose severity increases with every newer generation of CPU.









Topic 5
Exploiting Memory Hierarchy




Virtually all computers are connected by the Internet. No computer has immediate access to all the information that it may process. The information exists remotely, not locally in the machine's real addressable memory. Information must be physically moved to within reach of the CPU to be acted upon.

Using limited addressable memory, we find that the most important program property we regularly endeavor to exploit is the information's locality of reference. This comes from the observation that programs tend to reuse data and instructions that they have used recently.

It can be predicted with reasonable accuracy which instructions and data a program will use in the near future based upon its history of accesses in the recent past.





Locality

The Principle of Locality states that programs access a relatively small portion of their address space during any interval of time.

Temporal Locality is the principle stating that once a location containing data or instructions is referenced then it is likely that it will be referenced repeatedly in the very near future. Such as a programming loop.

Spatial Locality is the principle stating that if a data location is referenced, then subsequent references will be to data with addresses in the same physical neighborhood.



Both temporal and spatial locality will define the working-set size of a program, that is the set of instructions that make up the normal operating core of a program and tend to be executed or reused over again.

A program's working-set consists of those pages that the operating system considers (measured) to exhibit the greatest locality.



In general, it is desirable to develop programs with a high degree of locality because locality helps to keep the CPU busy. This means that the CPU is not waiting for instructions and data to be fetched from non-local areas (such as auxiliary storage volumes). Exploiting locality principles will reduce the Von Neumann bottleneck.
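
Loop order is the classic illustration. Traversing a two-dimensional array row-by-row touches consecutive memory; column-by-column jumps between rows on every access. The effect is most dramatic in languages with contiguous arrays (C, Fortran, NumPy); this plain Python sketch shows the two access patterns, though interpreter overhead mutes the timing difference:

    import time

    N = 1000
    grid = [[1] * N for _ in range(N)]

    def sweep(row_major):
        total = 0
        for a in range(N):
            for b in range(N):
                # row-major: consecutive addresses; column-major: scattered ones
                total += grid[a][b] if row_major else grid[b][a]
        return total

    for order in (True, False):
        start = time.perf_counter()
        sweep(order)
        print('row-major' if order else 'column-major', time.perf_counter() - start)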



Programmers no longer manage this consciously. We rely on the operating system and other mechanisms to track and determine which information exhibits relatively high degrees of locality.





Memory Hierarchy

A memory hierarchy is an application of the locality principles: a structure that uses multiple levels of memories (or storage).

As the distance of information from the CPU increases, the access time needed to retrieve that data also increases.

The closer memories are generally smaller, faster, and more expensive.

The memories further out toward the base of the hierarchy pyramid decrease in expense due to their spatial or temporal distance.





The Data Pyramid

(fastest and closest to the CPU at the top; slowest and most distant at the bottom)

CPU
Buffers
General Registers
L1/L2/L3.../Lx Cache
Real Addressable Memory
Non-Addressable Storage
Solid State Fixed Disk Storage
Direct Access Storage Device Cache
Direct Access Storage Devices (hard disk)
The INTERNET / CLOUD Servers
Physical Detachable Magnetic TAPE Drive Storage Silos





The goal of a memory hierarchy is to present the user with as much memory as is available using the least expensive technology, while providing access speed that is as close as possible to that offered by the fastest memory.

By implementing the memory system as a hierarchy, the user is presented with the illusion of a memory that is as large as the largest level of the hierarchy, but can be accessed as if it were all built from the fastest memory, closest to the CPU.



Benefits:

Memory hierarchies take advantage of temporal locality by tracking the more recently accessed data and then keeping it physically closer to the CPU.

Memory hierarchies take advantage of spatial locality by moving multiple contiguous blocks in memory (data residing in the same neighborhood) in anticipation of use to upper levels of the hierarchy.





The price of main memory on the IBM 360 Model 65 was about $1.60 per byte!

The price on the IBM 370 Model 168 was about $0.20 per byte.

Today the cost per byte is much less than $0.00001. This is “less than dirt cheap”.







The Basics of Caching

A cache is a smaller, faster memory which stores copies of instructions most recently executed and the data accessed from the most frequently used memory locations.





Cache Hits and Misses

A memory hierarchy will consist of multiple levels. Data is copied, or propagated, between only two adjacent levels at a time.

If the data that is being requested by the processor is present in some block in the upper level of the hierarchy then we have a "hit".

If the data being requested is not found in the upper level then we are said to have a "miss".

The hit ratio is the percentage of memory accesses that successfully found target data in the upper level. This is used as a measure of performance of the memory hierarchy.

The miss ratio is the percentage of the memory accesses that failed to find its data in the upper level.

The hit time is the time required to access a level of the memory hierarchy, which includes the time needed to determine whether the access is in fact a hit.

The miss penalty is the time required to fetch a block into a level of the memory hierarchy from the lower level, which includes the time to access the block and transmit it from one level to the next, then insert it into the upper level that experienced the miss.
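
These measures combine into a standard figure of merit, the average memory access time:

     average access time = hit time + ( miss ratio x miss penalty )

For example, with a 1 ns hit time, a 5% miss ratio and a 100 ns miss penalty, the average access takes 1 + 0.05 x 100 = 6 ns.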





Cache Structures

A cache structure in which each memory block is mapped to exactly one block in the cache is called direct-mapped cache. A computation is employed that directly locates the block's position if and when that block is present in the cache. As you will see, this is the scheme widely used for an operating system memory management technique called “demand paging”.

A fully associative cache structure is one in which a memory block can be placed in any location in the cache. There is no precise location for the block within the cache, so all blocks may have to be searched sequentially to find the correct one. The benefit comes from the complete flexibility of placement, which reduces misses.

A set-associative cache is one which has a fixed number of locations where each block can be placed. A computation gives a "neighborhood" in which the block must reside if present in the cache. This is a hybrid of the direct-mapped and fully-associative methods. You need only search a single neighborhood to determine if the block sought is present in the cache. (Think of a dictionary, and finding a word, as an example.)



Performance of Cache Structures, by speed of locating a block ( low << .. << high )

fully-associative       <<       set-associative       <<       direct-mapped
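
A direct-mapped cache can be simulated in a few lines of Python. The slot is computed as (block number) modulo (number of slots); the cache size, block size and reference string below are invented for illustration:

    CACHE_SLOTS = 8
    BLOCK_SIZE = 16                      # bytes per block

    cache = {}                           # slot -> block number currently resident
    hits = misses = 0

    for address in [0, 4, 8, 160, 0, 164, 32, 0]:
        block = address // BLOCK_SIZE    # which memory block holds this byte
        slot = block % CACHE_SLOTS       # the single slot this block may occupy
        if cache.get(slot) == block:
            hits += 1
        else:
            misses += 1                  # fetch the block, evicting any occupant
            cache[slot] = block

    print(hits, misses)                  # 5 3: blocks 10 and 2 collide in slot 2

The final miss shows the direct-mapped weakness: two blocks that map to the same slot evict each other even when the rest of the cache is empty.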





Handling Cache Writes

To ensure consistency in data both memory and cache may have to be updated on writes.

write-through -- The information is written to both the memory block in the cache and the memory block in the lower level of memory via a write buffer. If the write buffers cannot be emptied fast enough then the CPU will incur a stall condition until a write buffer becomes available.

write-back -- The information is written only to the memory block in the cache. The changed memory block is written to the lower level only when the block is replaced. If the machine were to lose power or the OS were to abnormally end then the cache would be lost and the data would now be in an inconsistent state.







Memory Management



What is an Address Space?

An Address Space is the contiguous range of addresses that are assigned to a process for housing its instructions and storing data. (This task is handled by the operating system, following its invention.) This range of addresses starts at zero and can extend to the highest address permitted by the architecture.



What does VIRTUAL mean?

Virtual [anything] implies using techniques that extend some commonly used mechanism, producing an illusion that approaches being indistinguishable from the Real [anything].

With regards to computer architecture and memory management this implies extending the apparent reach of the CPU such that the universe of available information becomes far greater than what would otherwise be available in the real addressable memory of the local computer.





Logical (Virtual) vs. Physical (Real) Addressing

A virtual address is an address based upon the "view," from the programmer's perspective, of the "address space" in which the program executes. The virtual address can far exceed the amount of real memory.

Physical or Absolute address is a real location on the memory chip. It must be between 0 (the first byte of physical memory) and the highest addressable physical byte of real memory.



Address Binding

Symbolic address – The addresses as viewed from the perspective of the programmer. This is very similar to the virtual address. Symbolic (or logical) addresses start at logical location zero and continue through N consecutive memory locations. Symbolic addresses are not bound to physical addresses. Furthermore, each process or program has its own set of symbolic addresses.

Relocatable address – These are addresses in a format that can be altered dynamically at execution time. The addresses in a program are either hard-coded (immediate) or based upon the contents of Index + Base register contents plus a Displacement, giving us an "effective address" into a block of storage (that we will later call a "page"). Adjusting the contents of the base register is performed by some mechanism during relocation.





Address Mapping

Without virtual memory (only real memory), mapping between symbolic address and real addresses was performed by adding an offset to each reference of storage by the application program. The offset value held in a relocation register was the displacement of the beginning memory address of the program from zero (0).

With virtual memory run-time mapping between virtual and physical addresses is performed by a hardware mechanism called Dynamic Address Translation (DAT) or the Memory-Management Unit (MMU).

Keep in mind that many instructions contain two addresses, one for each operand. The value of these addresses must be changed BEFORE execution of the instruction. The original symbolic address coded by the programmer is changed to the actual (real) address of the location that the data resides.

Why is this necessary? Because you cannot know at program writing time where in memory the operating system will decide to load your program. For efficiency the OS needs the flexibility to move information to wherever memory space is currently available.

The application program never sees the real physical addresses.

So we have logical (virtual) addresses in the range 0 to “any address”.

And physical addresses in the range of real 0 to “last physical address” in real memory.







Memory Allocation



Single Contiguous Memory Allocation

This is the implementation of a simple address space that could accommodate a single application. The entire program must be able to fit into the address space.

Symbolic addresses = Real addresses. Each instruction and data address is mapped directly to a real address within the address space.

The unused portion of storage went wasted (called a fragment).

If the entire program cannot fit in the address space, then it cannot be run; or it must be segmented (split apart) by the programmer.

The processor also required a card reader, card punch, printer and tape drives. (These are known as unit record devices because they process a single card or line-of-print at a time. These are very slow devices relative to CPU speeds).





Partitioned Allocation

Static Partitioned. Address space that could accommodate a fixed number of partitions and therefore more than one application. A new breed of program called an operating system was needed; it was loaded at the beginning of memory (location zero). Each application partition, aside from the first where the operating system resided, was offset (see graphic) by a fixed displacement from memory location zero.





Since programs are written symbolically with the presumption of running relative to address zero, each memory reference had to be manipulated by an offset representing the beginning partition address so that it could address the proper physical address relative to the partition start boundary.

Symbolic addresses no longer matched physical addresses.

Any remaining unused portions of each partition called fragments could not be used for other programming executions.





Dynamic Partitioned. Had all the features of Static Partitioned with the exception that the address space could accommodate a variable number of partitions and applications. The new OS would dynamically adjust the partition size from the un-allocated contiguous memory available. The variable number of partitions was set by a human operator at machine startup (IPL or boot) time.

Both static and dynamic partitioned required that each program must be able to fit in the partition allocated to it.



Again, programs running in partitions, other than the first partition, must have their memory address references offset by the displacement of the start of the partition in which the program resides.

Still the unused portion of storage in each partition was unusable for any other program. Fragmentation was exacerbated over time.

And, when the entire program did not fit in any of the partitions, then it could not be run; or it must be segmented (re-built) by the programmer.

There was no technology to collect the fragments into a single contiguous partition.

All tasks shared a single address space. A single program failure could endanger the remaining processes. Programmers developed and tested off-shift because their programs posed a risk to operation.

Multiple sets of Unit Record equipment: You could now run multiple programs at the same time but you needed to have a card reader, punch and printer allocated to each partition. If you had 4 partitions running you may have been required to purchase 4 sets of unit record equipment.

Because the unit record equipment functioned in real time, it took a long time to read a card or write a line of print. Due to the I/O delay in reading/writing, the CPU was prevented from attaining peak utilization; the CPU spent much of its time waiting for work.





Spooling

The SPOOLer (which stood for Simultaneous Peripheral Operations Online) would capture the formatted output (unit record work) destined for the individual printer/punch and queue it onto a disk device. The same for any card reader data destined as input for an individual partition.

Spooling allowed multiple partitions to run at the same time while the users may have had only a single printer, card reader and card punch. In effect, the use of unit record devices was virtualized for each partition. Each partition "saw" its own unique set of unit record devices, where in reality there were none; only simulations.

Without spooling, most programs would be relegated to patterns of fast CPU processing followed by long waits for the printer and card processing.

Use of spooling removed much of the WAIT from the execution system, causing the CPU to be tasked more frequently with productive computational work and thereby undergo greater utilization. The result was faster throughput and the ability to run more programming in a smaller schedule window.

Other results:

You no longer needed a copy of Unit Record Devices for each partition.

Users could delay their needs for more computing power.

Savings in electrical power and computer room footprints.

Excess capacity to develop more useful programming.







Relocatable Partitioned Scheme

This was an attempt to have the benefits of dynamic partitioned allocation with reduced fragmentation. The relocatable scheme had the ability to reclaim the unused fragments of each partition.

By suspending operation and "percolating" the used portions of each partition to the bottom of memory, the partition sizes were reduced dynamically. At the same time, "compacting" the fragments created a new unused partition at the top of memory, representing the aggregate of the former fragments. That new partition could then be used to run another application.



Problems were the cost and complexity of relocation. The logical addresses had to be relocated dynamically during execution. Compaction time was substantial. Computer operations were required to be suspended, then each partition's instruction operands had to be manipulated to reflect a new displacement, or offset, from the beginning of the address space.

There was no guarantee that the new partition would be large enough to run a particular application.

The user had no control of when relocation took place.

Failures required restarting the application(s) from their beginning.







Paged Memory Management

Break up real storage into small blocks called page frames. Each page frame, usually of size 4K, would eventually contain a section of the application programming.

Each application program was also carved into 4K blocks known as pages. The memory for an application no longer must be physically contiguous; blocks can exist wherever in storage a free page frame is found.

Logical page addresses which run consecutively do not match physical page addresses.

Each page has its own offset from the beginning of the address space. We call that offset the page number or block number.

The pages are linked together to form a logically contiguous address range by way of an array called the Page Map Table (PMT), a data structure maintained within and by the operating system. The computer architecture is provided with a fixed address via a hardware register that contains the address of the PMT array located within the OS.

Note: You still couldn’t load more pages of information than the available real physical memory.



Since pages could be located anywhere in real storage, each instruction operand must be manipulated to refer to the actual physical page that contained the true offset into real storage needed by an application. Recall the program was originally written with only knowledge of the logical addresses beginning from location zero. Now at execution time the OS relocates each page to wherever a frame is available.

Therefore each instruction had to be translated in CPU pipeline processing by using the contents of the Page Map Table and thereby directing the execution operations to the actual real physical page frame where information resides. This mapping must happen before the instruction is executed. The stage is called Dynamic Address Translation (DAT).
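
A sketch of the translation in Python. With 4K pages the low-order 12 bits of an address are the offset within the page, and the remaining high-order bits are the logical page number; the frame numbers in the table are invented for illustration:

    PAGE_SIZE = 4096

    page_map_table = {0: 7, 1: 3, 2: 9}           # logical page -> real frame

    def translate(virtual_address):
        page   = virtual_address // PAGE_SIZE     # logical page number
        offset = virtual_address % PAGE_SIZE      # byte offset within the page
        frame  = page_map_table[page]             # consult the PMT
        return frame * PAGE_SIZE + offset         # real physical address

    print(hex(translate(0x1A2C)))   # page 1, offset 0xA2C -> frame 3 -> 0x3a2c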

Results:

Fragmentation was remedied. The problem became so small as to be insignificant.

Allowed a dynamic number of concurrent partitions.

Program size still may not exceed the size of real memory.

All tasks still shared a single address space and therefore an errant process could still adversely affect the others. Programmers continue to test off-shift.

You were still limited in the number of programs that could be run concurrently.







Demand-Paged Memory Management

The motivation here was to address the problem of running an application(s) that exceeded the size of the real memory available. Recall that the furthest horizon of the computer architecture is the last byte of addressable real memory. Any information further out than this location requires another mechanism to bring it within reach of the computer architecture. That mechanism is the operating system.

(Prior to this time programmers were responsible to manage memory allocation by dividing a program that could not fit in the available memory into overlays. The overlays were loaded and unloaded under user program control during execution. The programmer needed to ensure that the code loaded never attempted to exceed the real storage.)



The concept of Virtual Storage

The operating system, using a combination of hardware and software, maps memory addresses used by a program, now called virtual addresses, into physical addresses. Main storage, as viewed by a process, appears as a contiguous address space or collection of contiguous memory segments. The OS manages virtual address spaces and the assignment of real memory to virtual memory through the DAT mechanism which automatically translates virtual addresses to physical addresses.

What this means is that we effectively use a "trick" on the computer architecture to have it "believe" that it can see a far greater universe of real memory. Memory is virtualized. In reality, the real memory has not changed, but an illusion is presented to the computer architecture.

The OS extends these capabilities to provide a virtual address space that can far exceed the capacity of real memory and thus reference more memory than is physically present in the system. Effective addresses can be presented for processing far in excess of real memory locations.

Virtual Storage is a technique that used main real memory as a “cache” for secondary or auxiliary storage. The auxiliary storage is usually an array of disk drives that contains all the pages for all the programmed applications running in the computer.

The virtual address corresponds to a location in the program's symbolic address space and must be re-mapped to a real physical address when memory is accessed. Real storage is used as a cache in which only the “active” parts of the application reside. Each cache block is known as a page. The OS brings pages into the cache when the page is demanded by the needs of the program running. The OS determines what parts of the program remain in the cache through its measurement of locality.



If, while executing an instruction, the CPU fetches information located at a particular virtual address then its operands' effective addresses must first be translated to the corresponding physical address. This is done by a hardware component which looks up the real address using a special control register which contains the base address of the Page Map Table.

Dynamic Address Translation (DAT), is the process of translating a virtual address during a storage reference into the corresponding real physical address.

If the physical page sought is found not to be resident in real memory (therefore not found in the PMT) then a cache miss occurs, better known as a page fault. The ensuing interruption would cause the OS to fetch the needed page from auxiliary storage (a page slot), bring it into real memory (a page frame), update the PMT, and then re-dispatch the task that incurred the fault.

Likely the OS would need to remove a page prior to replacing with the page needed. The choice of the victim page is usually based upon a Least Recently Used (LRU) algorithm. The page replaced is the one that has not been used for the greatest amount of time. That is, the page that exhibits the least locality.
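
The fault-and-replace cycle can be sketched in Python with an ordered dictionary standing in for the frames of real memory; the frame count and reference string are invented for illustration:

    from collections import OrderedDict

    FRAMES = 3
    resident = OrderedDict()       # page -> contents; insertion order tracks recency
    faults = 0

    for page in [1, 2, 3, 1, 4, 1, 2]:
        if page in resident:
            resident.move_to_end(page)        # a hit: now the most recently used
        else:
            faults += 1                       # page fault: fetch from a page slot
            if len(resident) >= FRAMES:
                resident.popitem(last=False)  # evict the least recently used victim
            resident[page] = 'page contents'  # place into a page frame, update the "PMT"

    print(faults)   # 5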



Results:

Demand paged memory management had all the advantages of paged memory management with relief from the problem of “too big” applications.

Virtually no limit on the number of applications that could run concurrently.

There is some increased processor overhead and OS complexity.

All tasks continued to share a single address space. Programmers still tested off-shift.







Segmented Memory Management

Segmented Memory Management allowed for multiple address spaces to be constructed.

Up until the invention of segmented memory management all processes shared a single address space. Each process would occupy a partition of memory carved from a contiguous portion of this address space. The logical view of the process memory space was mapped by a relocation register into some displacement into the one address space.

Under a single address space scheme it was possible for one partition to adversely affect another partition.

With multiple virtual address spaces, errors are confined to one address space, thus improving system reliability and making error recovery easier. Programs in separate address spaces are protected from each other. Isolating data in its own address space also protects the data.

The use of address spaces allows the OS to maintain the distinction between the programs and data belonging to each address space. The private areas in one user's address space are isolated from the private areas in other address spaces, and this address space isolation provides much of an operating system's security.



SMM allowed all the benefits of demand paging while using multiple address spaces. Each application was architecturally isolated from programs running in other address spaces.

Segmentation supports the user's logical view of memory by allowing each process to run in its own protected address space.

The architecture works closely with the OS to support multiple address spaces. The architecture provides a special control register called the Segment Table Origin (STO) which points to a unique individual Page Map Table associated with the address space. The OS loads the STO register just before dispatching each task, thereby providing capability to context switch among multiple independent address spaces.

Now, no process running in its address space can reach and affect the execution of another process running in some other address space.

The argument could be made that it was now safe to run production and testing processes concurrently on the same physical machine.

SMM forms the basis for memory management on all major computer architectures.







Topic 6
Virtual Machine




The VM Hypervisor and its Guest Operating Environments

Virtual Machines are those operating environments that provide emulation methods that present a standard software interface to their underlying emulated operating environments. The standard software interface is indistinguishable from the computer architecture being simulated.

Aside from a close inspection of time passage, you cannot tell that a process or an operating system is running under virtual machine.



The concept of Virtual Machine was first developed in 1967 and has remained an important part of enterprise computing since. However, it was not until the 1990's that Virtual Machine gained wide popularity, owing to several converging factors.



Those factors include:

The ubiquity of business computing.

The increasing importance of isolation and data security.

The failures in reliability attributable to current standard operating systems.

The sharing of a single physical computer among many unrelated users, such as in a data center or "cloud".

The dramatic increases in the speed and number of processors.

The development of microcode (firmware), which makes the overhead of emulating computer architecture more acceptable.



The special operating systems that create these emulated environments are known as "hypervisors", "virtual machine monitors" or "Control Programs".

The Control Program generates multiple instances of the same computer architecture as the one on which the hypervisor is natively running. Additionally, later hypervisors can emulate other architectures through software simulation of the hardware components of the different architecture. (Generally, this means the simulation of an architecture’s instruction set.)



The Control Program maps virtual resources to physical resources. The physical resource may be one that is:

a) time-shared, such as a CPU.

b) separated into logical partitions, such as real memory and disk storage.

c) or emulated wholly in software, like Unit-Record equipment. Today, UR devices continue to be used, but no real devices exist anymore.



Examples of today's hypervisors include:

IBM's zVM – emulates IBM 370/390/ESA/Z architectures (the first hypervisor)
Bowler and Maynard's Hercules – emulates IBM/390/ESA/Z on Intel
EMC's VMWARE – para-emulation of Intel architectures
RedHat KVM – para-emulation of Intel/AMD architectures
Oracle's VirtualBox – para-emulation of Intel/AMD architectures



Virtual Machine is essentially the embodiment of Computer Architecture. The instantiated computer architecture is indistinguishable (from the perspective of the OS and its applications) from the operation of a real hardware architecture.

Each instance of computer architecture is called a Virtual Machine.

Operating systems that run in these virtual environments are known as “guest” operating systems. Each guest operating system can run its own set of applications. The OS and applications running in each guest are independent from those of any other guest machine. Guests are logically isolated from each other.

Each virtual environment comes equipped with a unique set of possible virtual hardware devices, namely:

CPU
memory
I/O devices
tape drives
disk drive storage
unit-record devices
optical devices
network interface devices
channel-to-channel adapters
hipersockets
routers
consoles



The hypervisor intercepts each I/O request and creates a precise "look and feel" of the targeted equipment for the virtual machine attempting to use it. An instruction issued in a virtual machine will receive a response that is not unlike that received by a real native OS using real hardware. Any simulated hardware signals received by the Control Program look exactly as what would be expected from real hardware.

A program running in a virtual machine cannot tell that the emulated environment is anything but the real thing. There is no test that a program could perform (ie. a query instruction execution) that will provide information that the environment is virtual. From the VM Guest's perspective the operating environment consists of real hardware.


As the native operating system and owner of the real Page Zero (the Fixed Storage Locations), the VM Control Program is respected by the hardware as the owner and master of the real hardware processor. The Control Program is the only process that runs in the real privileged (or supervisor) state.



The Guest operating systems must necessarily be dispatched by the Control Program in the real problem (or unprivileged application) state. From the hardware computer architecture's perspective the guest systems are just applications, and as such are disallowed from presenting privileged instructions to the CPU.



Q: If only the “native” OS is permitted by the hardware to execute privileged instructions, then how do “guest” OSes run successfully in a virtual machine environment?





When a guest attempts to execute a privileged instruction such as an I/O request, the real hardware will trap and capture a Program Check Interruption through the normal interruption process, and invoke the VM hypervisor's Program-Check Interrupt Handler for processing. The guest is suspended (stopped in its tracks).

Next, one of two logic paths is taken:

(1) If the hypervisor determines that the guest was in the proper "virtual" privileged state, then the hypervisor will execute the instruction on behalf of the guest. The hypervisor will then provide signal and completion statuses to the guest's operating environment, in its Fixed Storage Area.

(2) If the hypervisor determines that the guest was NOT in the proper virtual privileged state, then the instruction will fail. The hypervisor will alter the virtual configuration of the guest in exactly the same way as would occur on a real machine, then re-dispatch the guest. The guest resumes, "sees" the occurrence of a virtual program-check interruption, and then takes the appropriate action against the offending application.

This process of providing, through software emulation (and the manipulation of data structures), the illusion of the same service that would have occurred on a real computer architecture is called reflection.

The hypervisor will then schedule the guest to be re-dispatched. The guest will have no indication that it did not perform the same instruction operation. From the perspective of the guest operating system the instruction was performed by itself.
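
The two paths can be sketched in C. This is a minimal, self-contained illustration of the decision logic only; every name in it (GuestState, handle_program_check, and SSCH chosen as the privileged-instruction example) is an assumption for teaching purposes, not code from any real hypervisor such as z/VM.

    #include <stdio.h>

    /* Sketch of the hypervisor's program-check handler. The guest is
       implicitly suspended while the Control Program runs this code. */

    typedef struct {
        const char *name;
        int virtual_privileged;  /* the guest's *virtual* supervisor-state bit */
    } GuestState;

    static void handle_program_check(GuestState *g, const char *insn)
    {
        if (g->virtual_privileged) {
            /* Path (1): execute the instruction on the guest's behalf and
               post completion status into the guest's Fixed Storage Area. */
            printf("%s: CP emulates '%s' and posts completion status\n",
                   g->name, insn);
        } else {
            /* Path (2): reflect a virtual program-check interruption so the
               guest sees exactly what real hardware would have shown it. */
            printf("%s: CP reflects a virtual program check for '%s'\n",
                   g->name, insn);
        }
        /* In both paths the guest is then re-dispatched, none the wiser. */
    }

    int main(void)
    {
        GuestState os  = { "guest-OS",  1 };   /* virtually privileged   */
        GuestState app = { "guest-app", 0 };   /* virtually unprivileged */
        handle_program_check(&os,  "SSCH");    /* privileged I/O request */
        handle_program_check(&app, "SSCH");
        return 0;
    }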



The guest operating system generally cannot detect that it is running in a virtual machine, aside from the unaccounted-for wall-clock time as measured by the guest.



The VM hypervisor is, in essence, a very elaborate Program Check Interrupt handler.







Why use Virtual Machines?

Employing multiple guest operating systems in a single real hardware environment has the following advantages:

Pooling of Resources -- All guests can share the CPU(s), Memory and I/O devices. This includes disks, network interfaces, unit record and special devices.

CPU Utilization – Greater possibilities of keeping the CPU busy with productive work. Idle time occurring with any guest OS can be consumed by an active guest OS. Reduction of the Von Neumann bottleneck.

Memory Utilization -- All guest OSes reside in their own address space, and safely run simultaneously. Memory idled by any guest can be used by other active guests. Implementation of locality principles.

Code Sharing -- Guests of the same OS type can simultaneously share the same pages of Read-Only memory. That is, pages of memory that never undergo change.

Input/Output -- Multiple guests can multiplex traffic over the same shared channels to transmit and receive data.

Networking -- The connections among the OSes are much faster than those that employ physical cables and router hardware. LANs and subnets can be accurately simulated and maintained as software constructs.

Virtual Hardware Creation -- Hardware that does not exist as real devices can be virtualized for the virtual machine.

Production and Development Machines – Test and live production operating environments can co-exist safely and unobtrusively on the same physical computer.

Footprint -- The physical space taken by a single real computer is smaller than that of multiple servers.

Multiple Instruction Sets -- Today's hypervisors can emulate the different instruction sets of various architectures.

Disaster/Recovery -- Requires only a single off-site hardware machine. All the virtual machines can be transmitted over the Internet, or via tape.

User View -- The users will see no difference in functionality from that presented by a real machine.

Hardware Cost -- Fewer hardware requirements due to sharing, less expense, and a smaller, simpler physical environment.

Security -- Only a single real machine to secure and locate in a hardened facility.

Serviceability and Maintenance -- Less labor, since there is less equipment and physical networking. All work is centered upon maintaining software configurations, much of which can be automated.

Labor -- The entire complex of virtual machine environments can be maintained by a single systems programmer vs a large technical team.









Topic 7
Storage and Input/Output




Input/Output Work Distribution

Can the computing system multitask by delegating work to other mechanisms that specialize in performing I/O functions?

How do these other mechanisms safely coordinate accessing shared resources with the operating system and the computer architecture?







The I/O Transaction

An I/O transaction is a sequence of operations that includes a request and likely includes a response, both of which contain control information and either of which may carry data. A transaction is initiated by a single request and may include many individual operations, such as "sense", "position", "read", etc.





Synchronous v. Asynchronous I/O Operations

A synchronous operation is one that includes a clock in the control lines and a fixed, rigid communication protocol that is dependent on the "ticks" of some clocking mechanism. It has the disadvantage that every device on the communication bus must run at the same clock rate and be synchronized to the same ticking rate. Clock-skew problems can occur if the buses are physically too long.

The sender initially transmits "ticks" to the receiver so that the receiver can get into a synchronicity with the rhythm of the sender.





An asynchronous operation uses a protocol that is NOT dependent on a clock for data transmission between parties. To coordinate the transmission of data between sender and receiver the asynchronous bus employs a simple flexible handshaking protocol. The handshaking protocol is a series of steps in which the sender and receiver proceed to the next step only after both parties agree that the current step has completed. Each step is called a state transition. Any state will transition to a usually small number of next possible states. Each state is well-defined and understood by all communicating parties.

Asynchronous methods can accommodate a wide variety of devices of differing speeds. There is no clock skew, and therefore buses can be arbitrarily long. There are no synchronization problems.
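
A small self-contained C sketch of such a handshake follows; the signal names (DataReq, Ack) and the four steps are illustrative assumptions modeled on a typical textbook read request, not on any particular bus specification.

    #include <stdio.h>

    /* Four-step asynchronous handshake for one read request. Each entry
       is one state transition; neither party advances until both agree
       the current step has completed -- no shared clock is involved. */

    int main(void)
    {
        const char *transitions[] = {
            "sender raises DataReq and places the address on the bus",
            "receiver sees DataReq, latches the address, raises Ack",
            "sender sees Ack and lowers DataReq",
            "receiver sees DataReq fall, completes the transfer, lowers Ack",
        };
        for (int i = 0; i < 4; i++)
            printf("state transition %d: %s\n", i + 1, transitions[i]);
        return 0;
    }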







I/O Signaling

The process of rapidly and periodically checking the status of I/O device activity to determine the need to service the device is called polling.

Polling is the simplest way for an I/O device to communicate status to the processor that initiated the transmission. The I/O device simply stores information in a status register, and the processor eventually will loop around to interrogate the register to obtain information of a change or signal.

Polling is inefficient because it can waste much processor time repeatedly checking for results that have yet to be posted. While the polling may continue for many thousands of iterations, the I/O signal will post only once.
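
A minimal C sketch of a polling loop follows. To keep the example self-contained and runnable, the device status register is simulated by an ordinary variable; on real hardware it would be a memory-mapped register.

    #include <stdio.h>
    #include <stdint.h>

    #define READY 0x1u

    /* Stand-in for a memory-mapped device status register. */
    static volatile uint32_t device_status = 0;

    int main(void)
    {
        long spins = 0;
        while ((device_status & READY) == 0) {   /* processor loops around */
            spins++;                             /* ...wasting CPU cycles  */
            if (spins == 100000)                 /* simulate the device    */
                device_status |= READY;          /* finally posting status */
        }
        printf("signal posted once, after %ld polling iterations\n", spins);
        return 0;
    }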





Event-driven or Interrupt-driven I/O employs interruptions to notify the processor that a signaling device is reporting status, or is in need of attention.

An I/O interrupt is asynchronous with respect to CPU instruction processing. Unlike other interrupts caused by page faults and programming exceptions, its event does not prevent the instruction currently running in the CPU from running to completion.

If the I/O interrupt cannot be processed immediately, then it is deferred for the short time it takes the OS to complete its current processing and then respond to servicing the I/O interrupt.

The I/O interrupt presents to the processor information regarding the identity of the device generating the interrupt. It is a “tap on the shoulder” to the computer architecture, which in turn notifies the OS that it has I/O information pending.
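
As a user-space analogy, a POSIX signal handler plays the role of an interrupt handler. In this self-contained C sketch, SIGALRM stands in for the signaling device; the main computation never polls a status register, it is simply tapped on the shoulder.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t io_pending = 0;

    /* The "interrupt handler": records that a device reported status. */
    static void on_signal(int signo)
    {
        (void)signo;
        io_pending = 1;
    }

    int main(void)
    {
        signal(SIGALRM, on_signal);
        alarm(1);                /* the "device" will interrupt in ~1 second */
        while (!io_pending)
            pause();             /* sleep until any signal arrives */
        puts("I/O interrupt received and serviced");
        return 0;
    }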







Transferring data between an I/O Device and Memory

Both polling and interrupt-driven I/O transfers place a burden of moving data and managing the transfer on the main CPU processor.

While polling consumes an extraordinary number of CPU cycles, notification of the I/O event occurs within one polling iteration.

Interrupt-driven I/O relieves the OS of high-frequency testing for a report from the I/O subsystem, but notification of the I/O event may be slightly delayed (on the order of a millisecond) by the processing of other work currently being performed by the CPU.



DMA Memory Access and the Memory System

Engineers have designed a mechanism for offloading work from the processor and delegating that work to a device controller or a channel processor, which handles the transfer of data directly to/from memory without involving the main processor. This allows the CPU to work on more important computation.

This I/O mechanism is called Direct Memory Access (DMA).

The interrupt mechanism is used by the I/O processing to signal the completion of the data transfer, to alert the OS when an error occurs, or to post milestones to the OS during the I/O operation. These milestone events may be Channel End, Control Unit End and Device End events. In the meantime, the processor is free to work on other tasks' computation.

DMA is a specialized controller that transfers data between an I/O device and memory independent of the processor. There may be multiple DMA devices in the computer.

The DMA follows a script known as an I/O Program or Channel Program. This program provides the list of instructions to be followed to complete the I/O request.

The DMA can now perform the data transfer without much need of OS attention or consumption of CPU cycles. Both the DMA and the CPU, however, must serialize their access to the common shared real memory.
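
A hedged C sketch of a channel program as a chained list of command words follows. The command set and field layout here are illustrative assumptions, loosely inspired by channel command words; real hardware defines these formats precisely.

    #include <stdio.h>

    enum op { SEEK, SEARCH, READ_DATA };

    typedef struct {
        enum op  op;      /* what the DMA/channel should do   */
        unsigned addr;    /* real memory address for the data */
        unsigned count;   /* number of bytes to transfer      */
        int      chain;   /* 1 = another command follows      */
    } ChannelCommand;

    int main(void)
    {
        /* Position the device, locate the record, then read one block
           directly into memory -- no CPU involvement per byte moved. */
        ChannelCommand program[] = {
            { SEEK,      0,        0,    1 },
            { SEARCH,    0,        0,    1 },
            { READ_DATA, 0x80000u, 4096, 0 },  /* hypothetical frame address */
        };
        for (int i = 0; i < 3; i++)
            printf("cmd %d: op=%d addr=0x%x count=%u chain=%d\n",
                   i, program[i].op, program[i].addr,
                   program[i].count, program[i].chain);
        return 0;
    }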



Acting independently from the CPU, the DMA will have the same difficulties as the operating system when dealing with a virtual memory system. The page frames that will be the source, or targets, of a data transfer will have both a physical and a logical address.

The DMA will necessarily have to perform address translation for the page frame.

Because of ongoing paging operations, the OS will reserve or "lock" the page(s) that contain I/O data in real storage until the completion of the transfer operation. The I/O completion is signaled through the computer architecture to the OS by a subsequent I/O interruption. After receipt of the interrupt event the OS may release the page(s).









Topic 8
Multiprocessors




Uniprocessor vs. Multiprocessor

A uniprocessor is a computer that operates using a single processor or CPU. At any time the single CPU is operating on the execution of a single instruction. The processor is said to work on instructions serially, or in the case of multitasking, concurrently.



A multiprocessor is a computer system with at least two processors that work more or less cooperatively. At any time the CPUs together may be processing the execution of more than one instruction simultaneously. The processors are said to work on instructions in parallel. (Two processors never operate on the same thread of instructions).





Resource Sharing among Multiple CPUs (vs. Single)

In a single CPU environment only a single instruction can be running at any time. Multiprocessing allows simultaneous processing of multiple tasks and therefore multiple instructions may be simultaneously executing.

If you want to employ multiple CPUs to act on a single problem, then you need to design specialized, sophisticated programming to operate efficiently.

A problem that is exacerbated when employing multiple processors is that they may both need to access a shared resource (such as memory) at the same time. There is a danger that more than one processor accessing the same resource at the same time will corrupt the data value; this is called a data race.

An interlock mechanism must provide serialization and synchronization when sharing a common (the same) resource. This interlock mechanism can be implemented by both:

(1) a programming construct that is referred to as a semaphore, mutex or enqueue flag. The semaphore acts as a traffic light serializing access to a road intersection. Keep in mind that when one CPU has the shared resource then the other CPUs are WAITING for access.

(2) and a special external interrupt called a Signal Processor (SIGP), that allows message exchange between two CPUs.



However, if a computation can be made independent from other computations for a given problem, then parallel processing can radically reduce the time to result.





Synchronization

Processors in a multi-processor configuration must communicate with one another since they will share common resources such as real storage. Any access to a shared resource must first safely “lock out” other manipulating processors before accessing. Therefore, we must have a set of instructions that provide an interlock mechanism to serialize access to a shared resource.

The following instructions are "atomic" and are non-interruptible by the hardware; once the instruction begins it must run to completion:

Compare and Swap

Test and Set

Both of the above instructions test to see if any other process is using the resource and, if not, reserve the resource for the issuer's own use. (If the resource is already reserved, then the issuer must wait for that resource to be freed.) Once reserved, the resource information is processed quickly through programming code known as a critical section. Finally, the resource is released.
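
A minimal C sketch of this reserve / critical-section / release pattern follows, using the C11 atomic compare-exchange, which compilers implement with the hardware's atomic instructions (Compare and Swap on IBM architectures, comparable instructions elsewhere):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int resource_lock = 0;    /* 0 = free, 1 = reserved */

    static void acquire(void)
    {
        int expected = 0;
        /* Atomically: if the lock is 0, set it to 1 and proceed.
           Otherwise another CPU holds it, so reset and spin (WAIT). */
        while (!atomic_compare_exchange_weak(&resource_lock, &expected, 1))
            expected = 0;
    }

    static void release(void)
    {
        atomic_store(&resource_lock, 0);    /* free the resource */
    }

    int main(void)
    {
        acquire();
        puts("critical section: shared data processed quickly");
        release();
        return 0;
    }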





Multi-Threading and Multi-Processing

Do not confuse multi-threading with multi-processing. A multi-tasking or multi-threading program is one that sub-divides itself into multiple independent executable parts, and then coordinates these execution streams so that together they cooperatively accomplish the objective of the program.

Multi-processing hardware makes available multiple CPUs to work on a given process by letting the process be dispatched on any available CPU, and/or to extend multi-threading by employing multiple CPUs to simultaneously work on multiple threads.

Multi-threading software can also run on uniprocessor hardware. On a uniprocessor only a single thread can be running at any instant. The threads of the process are interleaved and context-switched by the operating system. Context switching by the OS can take many hundreds of processor cycles and instructions.
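
A minimal C sketch using POSIX threads follows. On multiprocessor hardware the two threads may execute in parallel on separate CPUs; on a uniprocessor the OS interleaves them by context switching, exactly as described above.

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread is one independent executable part of the program. */
    static void *worker(void *arg)
    {
        printf("thread %ld doing its share of the work\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)2L);
        pthread_join(t1, NULL);   /* coordinate the streams so that     */
        pthread_join(t2, NULL);   /* together they finish the program's */
        return 0;                 /* objective before main exits        */
    }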



Are More CPUs Better?

Theoretically, aggregating processors rests on the expectation that more CPU engines brought to bear on a given problem are more efficient than one. By taking advantage of parallelism, using multiple processors to operate on independent programs simultaneously allows a process to potentially handle multiple requests at the same time. In general, the goals of parallel processing are to

reduce the time a computation takes,
increase CPU productive utilization,
increase the capacity of a server,
and improve the responsiveness of a server.





Addressing Programming Challenges

What areas must be addressed to get better performance and efficiency from a parallel processing program on a multiprocessor?

Since a single physical address space is shared by all processors, enqueueing techniques must be employed to ensure serialization of shared resources. Some type of reserve/release interlock mechanism is required to ensure integrity when accessing shared data.

Waits and Posts are needed to keep threads of a program synchronized.

Handling of hardware and software exceptions and failures.

Processors must tightly coordinate and cooperate using signal-processor-type external interrupts. They communicate actively and frequently via message exchange.

To send and receive messages each processor must be made aware of when a message should be expected, received and accepted. One processor must signal the other.

Each processor should know whether its message did, in fact, arrive or failed to reach its destination. This means that each message received should be acknowledged and confirmed.

When a thread is dispatched on a different CPU than its last, all cached information may have to be copied to become accessible by the next CPU.

The most important challenge is educating the programmer to think about a problem solution in a parallel manner as opposed to serially.









Hardware vs. Software Multi-Threading

Multi-Threading is the running of multiple, generally independent and asynchronous tasks on a processor. Each task is given a share and a timeslice of the processing resources.

Multi-threading hardware can be employed to increase processor utilization by switching to another thread when the current thread is found to be stalled, waiting or yielding. This is sometimes referred to as “hyperthreading” or “simultaneous multi-threading”.

Advantage: Very high speed as opposed to OS context-switching. The OS “sees” this as multiple CPUs; the hardware is virtualizing the view of each CPU.

A multicore microprocessor is one that is engineered to contain multiple logical CPUs on a single integrated circuit. The multicore processor is made to appear to the OS as multiple logical CPUs sharing a single physical processor.

To accomplish this feat of sharing the processor, the hardware must duplicate the independent state of each thread. Each thread must have a separate copy of the registers, memory caches, interrupt mechanisms, page zero of memory and various save areas. This copy must be backed up and restored as needed by some lower-level switching mechanism.





What is Simultaneous Multi-threading ?

It is the ability of a single physical processor, or core, to run more than one stream of instructions at a time. Each stream of instructions is called a thread. The threads share the hardware assets on the core. When the core cannot make progress on one thread, it can perhaps move computation forward on another. Some examples are:

Cache miss,
One thread issues an SVC call to the OS,
A thread is waiting for human intervention (e.g., keyboarding),
A program voluntarily gives up the CPU (e.g., wait and post),
A program requests performance of an I/O operation.

Each of these introduces a WAIT into the core processing system. When recognized by the architecture, a switch to another thread occurs. This can increase the overall core capacity to complete instructions.



Much research is currently being done in exploiting multiple processors.








EXTRAS



Dependability, Reliability, Availability and Serviceability

Computer system dependability is the quality of delivered service such that reliance can justifiably be placed on this service.

How can one determine dependability?

1) Service accomplishment, where the service is delivered as specified.

2) Service interruption, where the delivered service is different from the specified expected service.



Transitions from state-1 to state-2 are called service failures.

Transitions from state-2 to state-1 are called service restorations.





Reliability is a measure of the continuous service accomplishment or the mean time to failure (MTTF).





Availability is a measure of service accomplishment with respect to the alternation between the two states of accomplishment and interruption. This is the ratio that represents the system's “uptime”.

Availability = Service Accomplishment / (Service Accomplishment + Service Interruption)
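
For example (a hypothetical illustration, measuring accomplishment by mean time to failure, MTTF, and interruption by mean time to repair, MTTR): a system with an MTTF of 999 hours and an MTTR of 1 hour has Availability = 999 / (999 + 1) = 0.999, i.e., 99.9% uptime.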



Serviceability is a measure of the ease (or difficulty) of applying upgrades or corrective service to the computing system. Can service be performed on a running system, or must the system experience a reduction in availability?





Classes of Computing Equipment

All computing equipment shares the same basic design and organization.

Low-end to High-end performance and capacity:

Handheld (Smart Phones, Tablets, Pads, Watches, etc)

Laptop and Desktop.

Servers (Intel or AMD based)

Mainframe or Enterprise Server (370, 390, ESA, IBM z/Architecture)

Supercomputers (Cray)





Memory Allocation

Byte -- consists of 8 consecutive bits

Kilobytes 2**10 or 10**3

Megabytes 2**20 or 10**6

Gigabytes 2**30 or 10**9

Terabytes 2**40 or 10**12

Petabytes 2**50 or 10**15

Exabytes 2**60 or 10**18

....and so on.







Memory and Media Types

Paper

Magnetic

Optical

Electronic and Solid State

Steel and Vinyl

What media type do you think holds the record for longevity?





Input and Output

Input Devices – Devices that accept information and transfer that information to memory.

Output Devices – Devices that pass information from memory to the external environment.



CPU Performance --
Clock cycles -- The number of "clock ticks" needed to execute some instruction. A faster clock ticks more rapidly and therefore delivers a greater number of instruction execution cycles per unit time.

Instruction performance -- The number of clock cycles that are required for each type of instruction. Different manufacturers' instructions consume different numbers of clock cycles.
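
These two factors combine in the classic textbook relation (a standard formula, stated here for illustration, not taken from earlier in these notes):

CPU time = instruction count x CPI / clock rate

For a hypothetical program of 2 x 10**9 instructions averaging 2 clock cycles per instruction (CPI) on a 4 GHz processor: (2 x 10**9 x 2) / (4 x 10**9) = 1 second.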





Hardware Performance

Moore's Law and the Power Wall

The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trend would continue for the foreseeable future. In subsequent years, the pace slowed down, but data density has doubled approximately every 18 months.



Amdahl's Law

Amdahl's law is named after IBM's former chief computer architect for the OS/360, Gene Amdahl, and is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum increase in speed using multiple processors. Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
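
As a worked illustration with hypothetical numbers: if a fraction P = 0.9 of a program's running time can use the faster (parallel) mode and N = 10 processors are applied, then

Speedup = 1 / ((1 - P) + P/N) = 1 / (0.1 + 0.09) = about 5.3

and no number of processors can push the speedup past 1 / (1 - P) = 10, because the serial 10% always remains.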



Goals in Architecture Design

Speed

Clock cycle governs the speed of the hardware.

The number of clock cycles consumed per instruction determines the speed of a given instruction.

The number of transactions (each of which is made up of a string of instructions) executed per unit of wall-clock time determines the speed of the computation.

Addressability does not directly relate to speed of computation, only to the ability to reach greater displacements in memory.

Simplicity favors Regularity

Software-Extensible Instruction Set (new instructions can be added; old instructions are never removed).

Microcode reprogramming of a generic hardware design.
Ability to correct or enhance an instruction by the distribution of software patches vs. the replacement of physical hardware.

Smaller and Less is faster

Because the speed of light is fixed, fewer components that are physically closer together allow signals to arrive earlier and therefore result in faster processing speeds.

Make the Common Case fast

Make the set of instructions that are used the most consume fewer clock cycles, such as fixed-point arithmetic on machines to be used for business applications.

Good design demands good compromises

The best design may be too expensive to implement. The cheapest design may take an inordinate amount of time or resources to run your applications.

An instruction set that is too small may place a burden on your human labor resources to write software.

A processor that requires a nitrogen bath to dissipate excess heat could be both unwieldy and expensive.





