Professor Cullen Lecture Notes

COMS223 Data Structures


Chapter 1
Arrays


Algorithmic Methods of Inquiry

Design
Design is the process of planning how to build something. The things that computer scientists design range from the very abstract, such as software algorithms, to the very concrete, such as the hardware of computers themselves. Computer scientists also design programs, but programming is just one of many forms of design in computer science. No matter what you are designing, remember that design is planning for building. Building a product from a good set of plans should be straightforward, and the product usually performs as expected; building from poor plans is likely to lead to unexpected problems, and the product generally does not function as expected. The most common mistake computer science students make is to try to write a program without first doing the due diligence of planning the actual steps to take.



Theory
Theory is the process of predicting how an algorithm will behave when executed. Theory allows you to discover much about an algorithm without ever executing it. Can the problem be solved by a particular algorithm at all? There are properties, independent of how the algorithm is executed, that affect every implementation of the algorithm.



Empirical Testing
Empirical Testing is the process of learning through experiment and observation. The more experiments and observations you conduct, the more confident you become in the algorithm's performance and the smaller the chance of reaching a false conclusion.



Education
The passing of knowledge through time.



Discovery
The creation or uncovering of new knowledge.



Abstraction in Programming Design

Abstraction means deliberately ignoring the details of a solution in order to concentrate on the overall features that are essential to the job at hand. Designing algorithms can often be complicated. Ignoring some of the complicating details and working toward the solution at a higher level can help turn an overwhelmingly large problem into a series of smaller, individually manageable subproblems (divide and conquer).



In this course we will attempt to build data structures that will stand the test of time and facility. That is, we will build into the structure the minimum set of behaviors needed to simulate the object of concern. Nothing more. The code to be designed and tested shall be as small as possible.

Our data structure will be required to concentrate only on a single element at a time.

No displays of data will take place in the data structure itself, except for debugging purposes. All information regarding the performance of the structure must be unambiguously communicated to the calling program by way of return codes.

Should we find that a feature might be desirable, then we will first investigate whether the feature could be attained by using a combination of those methods already available in the structure.

Furthermore, if we can argue that a feature should not be included as a data structure behavior, then that feature, if needed, would be in the province of the calling program.





The Array

An ARRAY is an ordered set of a specific, finite number of same-type nodes, whose structural properties involve the relative positions of the nodes. The nodes are randomly (or directly) accessible by an integer index. The addresses of array members occur at contiguous locations in the virtual storage of the computer.

The Array by definition is fixed in size. If you require a larger or smaller array you cannot alter an existing Array, but must create another, then populate it.

Order is not of great importance in an Array.

There are two basic operations for accessing an Array: extraction and storing.

The number of elements in an Array is called its range.

The first element in an Array will be in the zeroth slot.

The maximum defined element will occupy slot (N-1).

You populate an Array by storing information into the first available free slot.

To query or access a stored element you directly address it using the slot location index.



Further behaviors would require custom design, such as:

To store an element into an already populated position may require that all elements on one side of the insertion point be shifted by one slot location to accommodate the change.

Conversely, to remove an element without leaving a “gap” would require that all remaining elements are shifted by one slot location to contract the empty position.

To find an unordered element would require searching every element position.



While almost all programming languages provide us with an implementation of the Array, they may not provide all the behaviors that we need. Array implementations do commonly include the ability to query, store to, and extract from a particular location.

But we may also want to store into the first available empty location, find the location of a given value, and provide the count of occupied slots. These functions we will have to program ourselves.
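For illustration only, here is a hedged sketch (in Java) of how those extra behaviors might be wrapped around a plain fixed-size array. The class and method names are made up for this example and are not the names required in the lab files.

    // Illustrative sketch only: wraps a fixed-size int array and adds the
    // behaviors the built-in array lacks. All names here are hypothetical.
    public class SimpleIntArray {
        private final int[] slots;      // contiguous, fixed-size storage
        private final boolean[] used;   // tracks which slots are occupied
        private int count = 0;          // number of occupied slots

        public SimpleIntArray(int range) {
            slots = new int[range];
            used = new boolean[range];
        }

        // Store into the first available free slot; return its index, or -1 if full.
        public int storeFirstFree(int value) {
            for (int i = 0; i < slots.length; i++) {
                if (!used[i]) {
                    slots[i] = value;
                    used[i] = true;
                    count++;
                    return i;
                }
            }
            return -1;   // boundary condition: fully populated
        }

        // Return the slot index of the first occurrence of value, or -1 if absent.
        public int find(int value) {
            for (int i = 0; i < slots.length; i++) {
                if (used[i] && slots[i] == value) return i;
            }
            return -1;
        }

        // Count of occupied slots.
        public int size() { return count; }
    }

Notice that the structure reports its outcome to the caller through return codes (the slot index, or -1) rather than displaying anything itself.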



Minimal Design Considerations
When designing a data structure you should seek to implement the fewest behaviors that will successfully and safely allow the greatest degree of desired manipulation of a single element. Any manipulation required by a user above and beyond this basic functional set should be accomplished, if necessary, through a combination of the basic behavior functions, such as querying the contents of all elements via a loop.




BOUNDARY CONDITIONS are those special or unique circumstances that present themselves when dealing with the logic necessary to process the endpoints of a given data structure. The boundary conditions describe the behavior at the “edges” of the simulation region.

Boundary Conditions include the following:

Logic to handle addition of next element in a FULLY POPULATED data structure.

Logic to handle removal of an element from an UNPOPULATED data structure.
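As a rough sketch of how these boundary conditions translate into code, the guards usually sit at the top of the insert and remove methods and report the outcome to the caller through a return code. The class below is purely illustrative; the constant names and values are assumptions, not a lab requirement.

    // Hypothetical sketch of boundary-condition guards expressed as return codes.
    public class BoundedBag {
        public static final int OK = 0;
        public static final int ERR_FULL = -1;   // add attempted on a fully populated structure
        public static final int ERR_EMPTY = -2;  // remove attempted on an unpopulated structure

        private final int[] slots;
        private int count = 0;

        public BoundedBag(int capacity) { slots = new int[capacity]; }

        public int add(int value) {
            if (count == slots.length) return ERR_FULL;   // boundary: structure full
            slots[count++] = value;
            return OK;
        }

        public int removeLast() {
            if (count == 0) return ERR_EMPTY;             // boundary: structure empty
            count--;
            return OK;
        }
    }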




An ELEMENT is an instance of a single data object that is used to populate a data structure. It defines the contents of the NODE.

For our discussions, SLOTS or buckets are addressable virtual storage areas that hold elements of data. Slots are in logically contiguous storage areas.


Example: Let's say that the data element, the integer 10, is stored in slot number 42. Then, if we went to slot location 42, we would find that its contents contain the number 10.







A note about your labs. All labs must conform to the following structure:

Programming will adhere to Object-Oriented Programming principles. We will have a User Interface that serves as the main program. The user interface instantiates the data structure that we are working with and calls on its methods to perform the required functions.

The Data Structure will contain all the methods (routines) and properties (variables) that will define its behaviors and its current state.

Each program class instantiates the next level of data structure. For example, the user interface instantiates the Array class, and the Array class in turn instantiates copies of the NodeElement class.

By the time we study Hash Tables we will have a chain of dependent structures working for us. The User Interface program will call the methods of the Hash class, the Hash class will call the Double-Linked Queue class, the Double-Linked Queue class will call the Single-Linked Stack class, and the Single-Linked Stack will call NodeElement class.

Do Not Attempt to place all components into the same Class. You will be LOST.





To access a sample of User Interface source code for both JAVA and C++: arraytest.java, arraytest.cpp,



To access the NodeElement data structure source code used by our labs, for both JAVA and C++: NodeElement.java, NodeElement.h, NodeElement.cpp,









Chapter 2
Stacks




A STACK is a linear arrangement based upon the foundation of an array with the requirement that all insertions and deletions (and queries) are made at a single end of the structure, usually known as the TOP of the stack.

The order in which elements can be retrieved from a stack is Last-In-First-Out (LIFO).

Inserting a data item onto the top of the stack is called "pushing" onto the stack, while removing an item from the top is called "popping" off the stack.

Examining the contents of the top slot of the stack is called "peeking" or "querying the top".



The Stack can be thought of as an Array with certain additional constraints placed upon access to its elements. While any of an Array's nodes can be directly accessed, by definition you can only "see" the TOP node of the Stack.



The TOP property is all important to the Stack because it points to the only view that the user has of the Stack's contents.





Note: To save some coding effort you can consider making the Stack an extension of Array.
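Here is a minimal sketch of that idea. The small Array class shown is only a stand-in for illustration; the lab's actual Array class may expose different methods.

    // Sketch only; the real lab Array class may look different.
    class Array {
        private final int[] slots;
        Array(int size) { slots = new int[size]; }
        int getSize()            { return slots.length; }
        void store(int i, int v) { slots[i] = v; }      // store into slot i
        int extract(int i)       { return slots[i]; }   // query slot i
    }

    // The Stack reuses the Array's storage and only exposes access to the TOP.
    class Stack extends Array {
        private int top = -1;                           // -1 means the stack is empty

        Stack(int size) { super(size); }

        boolean push(int value) {                       // insert at the TOP
            if (top == getSize() - 1) return false;     // boundary: fully populated
            store(++top, value);
            return true;
        }

        Integer pop() {                                 // remove from the TOP
            if (top == -1) return null;                 // boundary: unpopulated
            return extract(top--);
        }

        Integer peek() {                                // query the TOP without removing it
            return (top == -1) ? null : extract(top);
        }
    }

The point of the extension is that the Stack adds the TOP bookkeeping and restricts access, while reusing the Array's contiguous storage.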



Types of Stacks

Simple Stack
Self-contained mechanism in which a fixed number of nodes are pre-allocated and sequentially ordered within contiguous storage locations. The "top" pointer simply moves forward or backward one element length at a time. Element slots allocated and reserved by some process must be returned in the exact reverse order.



Linked Stack
Self-contained mechanism like the simple stack with a fixed number of nodes, but each node is linked to its next element which is not necessarily in an adjacent storage location. This allows the stack to maintain a pool of storage elements that can be shared by multiple calling programs that wish to concurrently use the same stack. Element slots can be reserved and returned without regard to the order in which the slots were reserved.



Dynamic Linked Stack
A dynamic linked stack has all the capabilities of a linked stack but depends on some other storage mechanism to perform the allocation/deallocation of its nodes and can therefore expand and contract dynamically in size.


The TOP variable always points to the current element at the logical top of the stack.



Where Stacks are used

Cafeteria plates
Parentheses validation in arithmetic expressions (see the sketch after this list)
Program Calls
Recursion
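For example, the parentheses-validation use just listed can be sketched with a stack as follows. This illustration uses java.util.ArrayDeque as a stand-in stack rather than the lab's own Stack class.

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class ParenCheck {
        // Returns true if every ( [ { has a matching closer in the right order.
        public static boolean balanced(String expr) {
            Deque<Character> stack = new ArrayDeque<>();
            for (char c : expr.toCharArray()) {
                if (c == '(' || c == '[' || c == '{') {
                    stack.push(c);                               // opener: push it
                } else if (c == ')' || c == ']' || c == '}') {
                    if (stack.isEmpty()) return false;           // closer with no opener
                    char open = stack.pop();
                    if ((c == ')' && open != '(') ||
                        (c == ']' && open != '[') ||
                        (c == '}' && open != '{')) return false; // mismatched pair
                }
            }
            return stack.isEmpty();                              // leftover openers mean unbalanced
        }

        public static void main(String[] args) {
            System.out.println(balanced("(a + b) * [c - d]"));  // true
            System.out.println(balanced("(a + b]"));            // false
        }
    }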











Chapter 3
Recursion




An algorithm is RECURSIVE if it defines a procedure in terms of a simpler case of itself.


A recursive program is one that invokes itself. Every invocation leads to a reduced or simpler problem. The same computation recurs, or occurs repeatedly, until the problem is solved.



For a recursive definition to work, the definition in any given case must be well-defined, avoiding an infinite regress. That is:

1) Every recursive call must simplify the computation in some way.

2) There must be a special termination condition, or stopper case, to handle the simplest computations directly, without the need to call an even simpler case of itself.
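
A small factorial example illustrates both rules: each call passes a simpler argument (n - 1), and n <= 1 is the stopper case handled directly. (This is only a sketch; the Factorial source referenced later in these notes may differ in detail.)

    public class Factorial {
        // Recursive definition: n! = n * (n-1)!, with 0! = 1! = 1 as the stopping case.
        static long factorial(long n) {
            if (n <= 1) return 1;              // stopper case: trivial result, no further calls
            return n * factorial(n - 1);       // each call works on a simpler case (n - 1)
        }

        public static void main(String[] args) {
            System.out.println(factorial(5));  // prints 120
        }
    }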



Sometimes the term inductive definition is used as a synonym, because, like induction, recursion builds the general case from simpler specific cases.



A recursive call can result in many further recursive calls because each call divides the problem into new, smaller subproblems, or smaller cases of the same problem. The recursive function is said to converge to a solution.



For a recursive method to terminate, the problem must eventually be reduced to a stopping case. The stopping case is a trivial result at the current level of the recursive procedure that requires virtually no computation to produce an answer. The trivial result is usually a fundamental constant or a conventional mathematical identity.



For at least one argument, a recursive function must be defined in terms that do not involve calling the function itself. There must be a "way out" of the sequence of recursive calls. When reached, the method returns a result to its immediate caller. This caller, in turn, continues the computation at the point that it left off, and returns another result to its own caller. This process continues unwinding the sequence of recursive calls until a result is percolated back up the calling chain to the original caller.



The difference between a circular definition and a recursive definition is that a recursive definition must have base cases. Base cases are those that satisfy the definition without being defined in terms of the definition itself, and all other cases comprising the definition must be successively "smaller".



Pitfalls

Infinite recursion happens either because the parameter values passed during a recursive call do not get simpler or because a special terminating "stopper" case is missing or never reached. (i.e., the function provides no guarantee of convergence; it diverges or oscillates.)

Missing base case.

Excessive memory requirements.

Excessive recomputation performed.





Recursive function call implementation requirements:
1)    Passing arguments.
2)    Allocating and initializing local variables.
3)    Transferring control to and regaining control from the function.



Each call to the recursive function causes a new variable data save area to be allocated and each reference to a variable in the function's data save area is local to the most recently allocated area.

Each return or unwinding causes the current data save area to be released, and the data area allocated immediately prior to the current area becomes reconstituted as the new local data area.







Recursion vs. Iteration

The difference between a recursive function and an iterative function is that the iterative type calls for the explicit repetition of some process until a certain condition has been met.

Recursion is generally more expensive in computing resources than non-recursive iterative methods. The run environment must store and recover all local and temporary variables and the "return control" addresses prior to calling and upon returning from each recursion operation.

A stack is generally used for this purpose. The calls to push and pop, as well as the handling of boundary conditions, can become expensive.

In a non-recursive procedure the stacking of unnecessary variables can be eliminated.

However, from a human programming point of view, the recursive solution may be the most natural and logical way of solving a problem.
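
To make the comparison concrete, here is a sketch of the same summation ("sumtorial") computed both ways. The recursive form reads naturally but stacks a call frame for every level; the iterative form repeats explicitly in a loop with no extra frames.

    public class SumCompare {
        // Recursive: each call stacks its own copy of n until the stopper case is reached.
        static int sumRecursive(int n) {
            if (n == 0) return 0;              // stopper case
            return n + sumRecursive(n - 1);    // simpler case of the same problem
        }

        // Iterative: explicit repetition until the loop condition fails; no extra call frames.
        static int sumIterative(int n) {
            int total = 0;
            for (int i = 1; i <= n; i++) total += i;
            return total;
        }

        public static void main(String[] args) {
            System.out.println(sumRecursive(10));  // 55
            System.out.println(sumIterative(10));  // 55
        }
    }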







To access the JAVA source code: recursetest Factorial Fibonacci Sumtorial

To view Tower of Hanoi Algorithm demonstration Launch









Chapter 4
Queues and Lists




ALLOCATION is the acquiring of storage to accommodate the data structures and supporting code of a given task.

DEALLOCATION is the releasing of previously allocated virtual storage.




The new operator initiates the class "instantiation".

Instantiation is an allocation of storage into which variables and program code are loaded at a unique location in virtual memory to support an instance (a copy) of that program. The instance is called an object.

The instantiation process returns as its output a reference variable that serves as a pointer to the object created in virtual memory. The reference variable provides addressability to the instance.

A special method in the instance, called a Constructor, is then invoked to initialize, if so desired, the instance's state properties (a.k.a. variables).

The operating system (OS) or Java Virtual Machine (JVM) services the request for instantiation.
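
A tiny illustration of these steps is shown below. SampleNode is a made-up stand-in class (not the lab's NodeElement), used only to show the new operator, the constructor, and the returned reference variable.

    public class SampleNode {
        private int data;                         // a state property of the instance

        public SampleNode(int initialData) {      // Constructor: initializes the instance's state
            this.data = initialData;
        }

        public int getData() { return data; }

        public static void main(String[] args) {
            // "new" requests instantiation: the JVM allocates virtual storage for the
            // object and returns a reference variable that provides addressability to it.
            SampleNode node = new SampleNode(10);
            System.out.println(node.getData());   // prints 10
        }
    }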




GARBAGE COLLECTION is the reclaiming of deallocated storage used by code and data structures so that it may be reused by other processes.




A Destructor causes deallocation of the object's storage, the variables and program code that resulted from some prior class instantiation. Destructors are not explicitly used in JAVA.





Simple Queues

A QUEUE is a linear list (if linked members) or an array (if contiguous members) with the characteristic that all insertions are made at one end of the list, and all deletions are made from the other end. Data is processed First-In-First-Out (FIFO).

Like a stack, a queue can be thought of as an array with certain constraints placed upon access to its member elements.


Placing an item into a queue is called "insertion", which is done at the end of the queue called the "tail".


Removing an item from a queue is called "deletion", which is done at the other end of the queue called the "head".






Types of queues include:

BOUNDED QUEUES, which are of fixed length, such as the "pipe queue" and the "circular queue".

In a pipe the items are inserted at one end and then migrate towards the other end.

In a circular queue, items are placed in the next sequential slot and do not move; instead, the queue employs a set of pointers whose values change (see the sketch after this list of queue types).



UNBOUNDED QUEUES which can expand or contract, such as the "linked queue".

A PRIORITY QUEUE has all the attributes of the standard QUEUE with the added capability that items can be inserted into (or removed from) the structure at any position (prioritizing).
This attribute does violate the strict definition of a queue.
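
Returning to the circular queue described above, here is a minimal sketch of the pointer bookkeeping. The class and method names are illustrative only.

    // Minimal circular (bounded) queue sketch: elements stay in place while the
    // head and tail pointers wrap around the array.
    public class CircularQueue {
        private final int[] slots;
        private int head = 0;     // index of the next element to remove
        private int tail = 0;     // index of the next free slot to insert into
        private int count = 0;    // number of occupied slots

        public CircularQueue(int capacity) { slots = new int[capacity]; }

        // Insert at the tail; returns false when the queue is fully populated.
        public boolean insert(int value) {
            if (count == slots.length) return false;
            slots[tail] = value;
            tail = (tail + 1) % slots.length;   // wrap the tail pointer
            count++;
            return true;
        }

        // Delete from the head; returns null when the queue is unpopulated.
        public Integer delete() {
            if (count == 0) return null;
            int value = slots[head];
            head = (head + 1) % slots.length;   // wrap the head pointer
            count--;
            return value;
        }
    }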







LISTS

A LIST is an ordered set of elements. Order is imposed on the list by linkages to another node of the data structure. Lists differ from arrays in that they are not necessarily arranged in contiguous storage locations. The next element may appear anywhere in storage.

Lists have a disadvantage compared to Arrays in that to access an element stored in a List you must traverse the structure by following the linkages, whereas the Array index permits direct random addressability.

Lists have an advantage over Arrays when inserting or removing a node anywhere other than at the structure's endpoints. Only a simple adjustment of the linkage pointers is required.





A LINKED LIST is an ordered set of related elements that are logically connected where each record is linked to the next by a single pointer.

A DOUBLE LINKED LIST is an ordered set of related elements that are logically connected where each record is linked to the next and the previous elements by a pointer.
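
As a sketch, the difference between the two comes down to the number of link fields carried by each node (the field names here are illustrative):

    // Singly linked node: one pointer, to the next element only.
    class ListNode {
        int data;
        ListNode next;          // link to the next node, or null at the end of the list
    }

    // Doubly linked node: pointers to both neighbors, so the list can be
    // traversed (and spliced) in either direction.
    class DoubleListNode {
        int data;
        DoubleListNode next;    // link to the following node
        DoubleListNode prev;    // link to the previous node
    }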







Chapter 5
Binary Trees




Up until now we have studied LINEAR data structures (Array, Stack, Queue, List).

With linear structures data is arranged in a sequential manner.

Each of these structures has a logical beginning node and a logical ending node.

Each node has a previous node and a following node (except at the endpoints).





You want to choose the best structure for your data on the basis of:

Logical representation of the type of data.

The efficiency or cost of operations (CPU and Time considerations).

Memory consumption.

Ease of implementation and maintenance.







A TREE is a non-linear structure formed from a collection of nodes that allows us to organize data in a hierarchical (pyramid like) manner. That is:

There exists a primary node, and every other node is connected to it, directly or indirectly, by a single path.

Each node in the tree may contain references both to a node closer to the primary node and to nodes farther away from it.



Real-world tree applications occur in many places: genealogy, sports tournaments, corporate organizations, military ranking, operating system file-management structures, network routing, etc.

Since trees are structurally more complex than linear representations, they have a more extensive nomenclature for referring to their various components.

Root, descendants, leaves, branches, etc.





Some general tree constraints:

No node can be its own ancestor or its own descendant.

Two nodes may not have the same child.

(Either of the above would introduce a loop during traversal operations.)







Binary Trees

A special case of the tree data structure is the BINARY TREE. The binary tree consists of a finite set of single-linked elements that is either empty or partitioned into at most 3 disjoint subsets.

The first subset contains a single element called the ROOT of the tree.

The other subsets are called the left and right subtrees of the original tree. These subtrees are in themselves binary trees.



A tree NODE represents a single informational element of the tree structure.



Each node has exactly one PARENT, which is its ancestor, except for the ROOT, which by definition has no parent.


A parent may have up to two CHILDREN, a left-child and a right-child. These are known as the descendants of the parent.



A node with no children is called a LEAF.


Two nodes are called SIBLINGS if they are children of the same parent. No child may have more than one parent node, thereby avoiding loops during a traversal.



The LEVEL of a node in a tree is the node's distance (number of links) from the root.



The HEIGHT of a node in a tree is the node's distance (number of links) from its farthest descendant leaf.

The DEGREE of a node in a tree is the number of child nodes attached to it. A node of a binary tree has a maximum degree of 2.



The series of nodes connecting an ancestor to a descendant is called a PATH. Its length is known as the PATHLENGTH.



The number of CHILD links is always the number of nodes of a tree minus one.





Tree as a Recursive Data Structure

The tree's root links to nodes which are, in effect, the roots of subtrees. Each subtree is a subset of the original tree. The tree is a recursive structure because we can reduce it into subtrees in a self-similar manner.





Definition of an Ordered Tree

A tree clearly is not a linear sequence. In order to use a tree for sorting, we need to restate the definition of "sorted" in terms of functionality rather than in terms of a static or positional description. We say that a collection a[0], a[1],...,a[n] is sorted if the organization facilitates sequential traversal of the structure. In other words, there must be a way to visit each node systematically in ascending (or descending) order.

A tree is said to be inorder if the following two requirements hold:

(A) Its left subtree contains only those values that are less than (or equal to) its root value.
(B) Its right subtree contains only those values that are greater than its root.





Visiting the nodes of a tree in a specific order is called a TRAVERSAL.

Three traversal paths we will look at are:

PRE-ORDER TRAVERSAL (also known as depth-first order) visits the root, then the left subtree in preorder, then the right subtree in preorder.



IN-ORDER TRAVERSAL (also known as symmetric order) visits the left subtree in inorder, then the root, then the right subtree in inorder.



POST-ORDER TRAVERSAL visits the left subtree in postorder, then the right subtree in postorder, followed by the root.
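
A hedged sketch of the three traversals over a minimal, made-up node class is shown below; the "visit" here simply prints the node's data. Note that each traversal is itself naturally recursive, as discussed in Chapter 3.

    // Illustrative binary tree node and the three traversal orders.
    class TreeNode {
        int data;
        TreeNode left, right;
        TreeNode(int data) { this.data = data; }
    }

    class Traversals {
        static void preOrder(TreeNode n) {          // root, then left subtree, then right subtree
            if (n == null) return;
            System.out.print(n.data + " ");
            preOrder(n.left);
            preOrder(n.right);
        }

        static void inOrder(TreeNode n) {           // left subtree, then root, then right subtree
            if (n == null) return;
            inOrder(n.left);
            System.out.print(n.data + " ");
            inOrder(n.right);
        }

        static void postOrder(TreeNode n) {         // left subtree, then right subtree, then root
            if (n == null) return;
            postOrder(n.left);
            postOrder(n.right);
            System.out.print(n.data + " ");
        }

        public static void main(String[] args) {
            TreeNode root = new TreeNode(5);
            root.left = new TreeNode(3);
            root.right = new TreeNode(8);
            inOrder(root);                          // prints 3 5 8
            System.out.println();
        }
    }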







A THREADED Tree is one constructed so that the right links of all leaf nodes point to the next node to be visited during an in-order traversal. This provides a "short cut", or improvement in the number of node visits, while traversing the tree. Programming demands for storage and computation are significantly reduced for a threaded tree.





Chapter 6
Sorting




A FIELD is a set of bytes that represents some abstract information.

A RECORD is a collection of related fields.

A FILE is a collection of records.


A KEY is one or more fields or sub-fields of the record used for sorting and searching.



A SORT is a process that rearranges the records of a file into a sequence that is sorted on one or more key(s).



A sort is known as stable if the relative arrangement of all records with the same key is preserved after the sort.



A merge is the process of combining two or more sorted files into a resultant sorted file.





Considerations in determining the efficiency of a particular sort


Efficiency is measured by the consumption of one or more of the following resources:

CPU time
the number of CPUs
memory availability
external storage space
wall clock time


Different sorts may exploit one resource over another.

Programs that require less wall clock time to perform a sort usually require more memory and more frequent CPU cycles.

Sorts that require less memory and CPU generally need more external storage to move data and more elapsed wall clock time.

Sorts whose work can be broken up into disjoint, independent units can be dispatched on separate CPUs, where the processing can execute in parallel.





The speed classification or order of sorts


Sorts of polynomial order, which run in polynomial time, are more efficient than sorts of exponential order, which require exponential execution time.

An algorithm is said to be of polynomial time if its running time is upper bounded by a polynomial expression in the size of the input for the algorithm, i.e., T(n) = O(n**k) for some constant k



Exponential order sorts grow more rapidly in the processing time required with the increase in the number of items sorted (N).

An algorithm is said to be exponential time if T(n) is bounded by O(2**(n**k)) for some constant k.

That is, as the number of items to be sorted increases linearly, the time to perform the sort increases exponentially.



Sub-exponential time is used to express that the running time of some sort algorithm may grow faster than any polynomial but is still significantly smaller than an exponential. In this sense, problems that have sub-exponential time algorithms are somewhat more tractable than those that only have exponential algorithms.



Fastest …............. Faster …............. Somewhat Fast …............ Not Fast

Logarithmic    <<    Linear    <<    Polynomial    <<    Sub-Exponential    <    Exponential



Graph of Sort Performance



Sort Complexity

The common sorting algorithms can be divided into two classes by the complexity of their algorithms. Algorithmic complexity is a non-trivial subject; suffice it to say that there is a direct correlation between the complexity of an algorithm and its relative efficiency.



Orders of Functions

Algorithmic complexity is generally written in a form known as Big-O notation, where the O represents the complexity of the algorithm and a value n represents the size of the set that the algorithm is run against.

Big-O notation is frequently used to describe a sort algorithm's use of computational resources usually with regard to consumed CPU cycles and wall-clock time.



For example, O(n) means that an algorithm has linear complexity. In other words, it takes ten times longer to operate on a set of 100 items than it does on a set of 10 items (ten times the items, ten times the time).



A complexity of O(log n) means that an algorithm has logarithmic complexity. It takes only about one and a half times as long to operate on a set of 1000 items as on a set of 100 items (log 1000 / log 100 = 3/2). This is extremely efficient.



If the complexity were O(n**2) (quadratic complexity, a polynomial order), then it would take 100 times longer to operate on a set of 100 items than on a set of 10 items.



The class of strongly polynomial algorithms, which run in O(n**2) or less, includes the bubble, insertion, selection, and shell sorts.



If the sort is on the order of O(c**n), it has geometric complexity.

If the sort is on the order of O(n!), it has factorial complexity.

Weakly polynomial sorts would include any that run in O(n**k) time, such as O(n log n) (polynomial order); this class includes the heap, merge, and quick sorts.




EXCHANGE SORTS are sorts that exchange pairs of items until the list is sorted.

A Bubble Sort passes through the file sequentially several times. Each pass consists of comparing each element in the file with its successor neighbor ( x[i] with x[i +1] ) and interchanging the two elements if they are not in the proper order. Note that after each pass or iteration, the (n - i)th element is in its proper position, so you can improve the sort time by reducing each scan to the remaining (n - i - 1) items. The bubble sort is generally considered to be the most inefficient sorting algorithm in common usage. Under best-case conditions (the list is already sorted), the bubble sort can approach a linear O(n) level of complexity. The general case is an abysmal O(n**2).
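
A sketch of the basic bubble sort, including the shrinking-pass improvement described above (this is an illustration, not required lab code):

    public class BubbleSortDemo {
        // Each pass bubbles the largest remaining element to the end of the
        // unsorted region, so every pass can scan one fewer element.
        static void bubbleSort(int[] x) {
            for (int i = 0; i < x.length - 1; i++) {
                for (int j = 0; j < x.length - i - 1; j++) {
                    if (x[j] > x[j + 1]) {          // neighbors out of order: interchange them
                        int tmp = x[j];
                        x[j] = x[j + 1];
                        x[j + 1] = tmp;
                    }
                }
            }
        }

        public static void main(String[] args) {
            int[] data = {5, 1, 4, 2, 8};
            bubbleSort(data);
            System.out.println(java.util.Arrays.toString(data));  // [1, 2, 4, 5, 8]
        }
    }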



A Bi-directional Bubble Sort improves on the Bubble Sort by alternating the direction of successive passes through the file between the top and the bottom. In this way the smaller elements move quickly to the top of the order in the same way that the larger elements move to the bottom of the file. The sort saves some time by not resetting its pointers back to the beginning of the remaining unsorted items. Every successive pass compares the remaining unsorted items of the previous pass.



A Selection Sort works by scanning for the smallest item remaining in the unsorted part of the list, and then swapping it into the next position to be filled in the sorted part of the list. The selection sort has a complexity of O(n**2). A relatively slow sort.



The Quick Sort, also called the partition exchange sort, first uses a partitioning method to move a chosen element A (called the pivot) to its final position J in the partition. At this point, all elements in positions 0 through J-1 are less than or equal to A, and all elements in positions J+1 through N are greater than or equal to A. The process is recursively reapplied to both newly formed left and right partitions until all partitions are of size 1 or less. Now the file is sorted.


The partitioning method logic works as follows:


Setup: Choose the pivot P to be the contents of the rightmost element of the partition, A[ RIGHT ] (this choice is arbitrary).
Let LOW = the LEFT index of the current partition.
Let HIGH = the RIGHT index of the current partition.
While the condition ( LOW < HIGH ) holds, continue to loop through Step 1 to Step 3.
Step 1: Repeatedly increase the cursor LOW by +1 until A[ LOW ] >= P.
Step 2: Repeatedly decrease the cursor HIGH by -1 until A[ HIGH ] <= P (stopping if HIGH reaches LEFT).
Step 3: If LOW < HIGH, interchange the contents of A[ LOW ] and A[ HIGH ], then go back to Step 1.
Step 4: Otherwise, interchange the contents of A[ RIGHT ] and A[ LOW ].
Done: the pivot P has reached its final position, LOW. All elements to the left of the pivot are less than or equal to P, and all elements to the right of the pivot are greater than or equal to P.

Now recursively sort the new partition to the left of the pivot, and then sort the new partition to the right of the pivot (not including the pivot in either sort).
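
Putting the steps together, here is one hedged Java sketch of the quick sort with the rightmost element as the pivot. This is only one common way to code the partition; the course's own version may differ.

    public class QuickSortDemo {
        // Recursive quick sort over a[left..right].
        static void quickSort(int[] a, int left, int right) {
            if (left >= right) return;                // partition of size 0 or 1: already sorted
            int p = partition(a, left, right);        // pivot lands in its final position p
            quickSort(a, left, p - 1);                // sort the left partition (excluding the pivot)
            quickSort(a, p + 1, right);               // sort the right partition (excluding the pivot)
        }

        // Rearranges so that everything left of the returned index is <= pivot and
        // everything right of it is >= pivot; returns the pivot's final position.
        static int partition(int[] a, int left, int right) {
            int pivot = a[right];
            int low = left - 1, high = right;
            while (true) {
                while (a[++low] < pivot) ;                          // Step 1: advance LOW
                while (pivot < a[--high]) if (high == left) break;  // Step 2: retreat HIGH
                if (low >= high) break;                             // cursors crossed: stop looping
                int tmp = a[low]; a[low] = a[high]; a[high] = tmp;  // Step 3: interchange
            }
            int tmp = a[low]; a[low] = a[right]; a[right] = tmp;    // Step 4: place the pivot
            return low;
        }

        public static void main(String[] args) {
            int[] data = {5, 1, 9, 3, 7, 2};
            quickSort(data, 0, data.length - 1);
            System.out.println(java.util.Arrays.toString(data));   // [1, 2, 3, 5, 7, 9]
        }
    }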



The efficiency of this algorithm is impacted by which element is chosen as the pivot point. The worst-case efficiency of the quick sort, O(n**2/2), occurs when the list is already sorted and an endpoint element is always chosen as the pivot. Randomly choosing a pivot point rather than using an endpoint element is recommended if the data to be sorted isn't random. As long as the pivot point is chosen randomly, the quick sort has an expected algorithmic complexity of O(n log n).





The Heap Sort is an implementation of the general selection sort using the input array as a heap representing a descending priority queue. The pre-processing phase creates a heap of size N using the sift up operation, and the selection phase redistributes the elements of the heap in order as it deletes elements from the priority queue using the sift down operation. The Heap Sort has a complexity of O(n log n). In the average case, the heapsort is not as efficient as the quicksort. The Heap Sort requires twice as much time as Quicksort for random input.







INSERTION SORTS are similar to Exchange Sorts, but the essence of their algorithms is to insert each record into an already sorted portion of the file.

A Simple Insertion Sort is one that sorts a set of records by inserting each record into an existing sorted file, much as you would sort a hand of cards. Like the Bubble Sort, the Insertion Sort has a complexity of O(n**2). Although it has the same complexity, the insertion sort is a little over twice as efficient as the bubble sort.
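
A sketch of the simple insertion sort (illustrative only):

    public class InsertionSortDemo {
        // Each pass takes the next unsorted element and inserts it into its proper
        // place within the already-sorted portion to its left, like sorting a hand of cards.
        static void insertionSort(int[] x) {
            for (int i = 1; i < x.length; i++) {
                int item = x[i];
                int j = i - 1;
                while (j >= 0 && x[j] > item) {   // shift larger sorted elements one slot right
                    x[j + 1] = x[j];
                    j--;
                }
                x[j + 1] = item;                  // drop the item into the opened slot
            }
        }

        public static void main(String[] args) {
            int[] data = {9, 4, 6, 2, 8};
            insertionSort(data);
            System.out.println(java.util.Arrays.toString(data));  // [2, 4, 6, 8, 9]
        }
    }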



A Shell Sort sorts separate subfiles of the original file. These subfiles contain every Kth element of the original file. Each subfile is sorted by simple insertion, a new smaller K is chosen, and the file is again partitioned into a new set of subfiles and sorted again. This process is repeated with smaller values for K until K is set to 1. At this point the entire file is sorted. Invented by Donald Shell in 1959, the shell sort is the most efficient of the O(n**2) class of sorting algorithms. Of course, the shell sort is also the most complex of the O(n**2) algorithms.




An Address Calculation Sort is sorting by hashing. In this method a function f is applied to each key. The result of this function determines which of several subfiles the record is to be placed in. The function should have the property that if x <= y, then f(x) <= f(y). Such a function is called order-preserving.







Parallel Execution SORTS are sorts where the work can be divided into disjoint units that can be dispatched simultaneously on separate processors (CPUs). While these sorts can be run on uniprocessors, they will not be high performance.

The Odd-Even Transposition Sort and the Shear Sort are examples of parallel execution sorts.





MERGE SORTS involve combining two or more sorted files into a third sorted file.

The merge sort sorts a file by dividing the file into N subfiles and merging adjacent pairs of subfiles. The process is repeated until only a single sorted file of size N remains. Each pass involves fewer than N comparisons, so the merge sort performs no more than O(n log n) comparisons in total; but it does, on average, require twice as many assignments as quicksort.







To view a comparative Sort Algorithm demonstration Launch Here








Chapter 7
Hashing




The HASH Table is an associative array data structure that relates keys to data values. The primary operations it supports are efficient, rapid store and retrieval/lookup functions. It is employed where speed of access and quick retrieval of data are of primary importance.

The data structure elements are divided into (preferably) equal-sized categories, or buckets, to allow quick access to the elements. The hash function determines in which bucket an element belongs.


A HASH Function is a function or algorithm applied to a data key found in the record to be accessed or stored. The hash function transforms the key into a displacement, or index value into a table, or some other reference to a location, called a hash. The hash is used to locate the desired record in the Hash Table. The key for the record could be, for example, a person's name or social security number.


A HASH Key is the output of the hash function that serves as the index entry into the hash table.


A HASH Collision occurs when 2 or more records resolve to the same position or location reference based upon their hash keys. Their hash function computations have produced equivalent results and therefore point to the same location.

Thus the second part of the hashing search (or storage) requires a collision-resolution process.


Separate Chaining is a method of resolving hash collisions. It involves keeping a distinct linked list (or other suitable data structure) for all records whose keys hash into the same value. Each record resolving to the same key is placed appropriately on the list that is anchored to the hash table index entry.



Hash Key Computation

The arithmetic computation is usually simple to implement, but you must use caution to avoid various subtle pitfalls.  

If the resultant hash table is to hold N items or keys, then we must ensure that our hashing computation results in a number in the range [0, N-1].

The ideal hashing function should evenly distribute its output across the span of the table's index range, avoiding both clumping due to collisions and gaps.

You must be familiar with the data, and its variance, to be able to generate a suitable hashing function and to predict the table size needed.

Hashing provides a way to use a reasonable amount of both memory and computation time to strike a balance between the consumption of these two resources.
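
As an illustration of these points, here is a hedged sketch of a string hash computation combined with separate chaining. The folding formula and class names are assumptions for this example; the hashstringcalc pseudocode referenced below may use a different calculation.

    import java.util.LinkedList;

    public class ChainedHashTable {
        private final LinkedList<String>[] buckets;   // one chain per hash table slot

        @SuppressWarnings("unchecked")
        public ChainedHashTable(int n) {
            buckets = new LinkedList[n];
            for (int i = 0; i < n; i++) buckets[i] = new LinkedList<>();
        }

        // Hash function: fold the characters of the key into a number, then take the
        // remainder by the table size so the result always falls in [0, N-1].
        private int hash(String key) {
            int h = 0;
            for (int i = 0; i < key.length(); i++) {
                h = (31 * h + key.charAt(i)) % buckets.length;
            }
            return h;
        }

        // Store: append the key to the chain anchored at its hash index.
        public void insert(String key) {
            buckets[hash(key)].add(key);
        }

        // Lookup: hash to the bucket, then search only that chain.
        public boolean contains(String key) {
            return buckets[hash(key)].contains(key);
        }

        public static void main(String[] args) {
            ChainedHashTable table = new ChainedHashTable(13);
            table.insert("Cullen");
            System.out.println(table.contains("Cullen"));   // true
            System.out.println(table.contains("COMS223"));  // false
        }
    }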



Pseudocode for sample LinkListds linklistds

Pseudocode for sample String Hashing Calculation hashstringcalc

100 Random integers (1-1000) for input to lab exercise Click Here




