Professor Cullen Lecture Notes

COMS223 Data Structures


Chapter 1
Arrays


Definition:

An ARRAY is a ordered set of a specific, finite number of same-type nodes, whose structural properties involve the relative positions of the nodes. The nodes are randomly (or directly) accessible by a numeric index. The addresses of array members are usually contiguous in the virtual storage of the computer. 

Order is not of great importance in an Array.

There are two basic operations for accessing an Array: extraction and storing.

You populate an Array by storing into the first available free slot.

To query or access a stored element you directly address its slot location.

To store or extract an element from the middle of a populated Array requires that all elements on one side of the insertion/extraction point are shifted by one slot to either make room or to contract the space needed to accomodate  the change.   


BOUNDARY CONDITIONS are those special or unique circumstances that present themselves when dealing with the logic necessary to process the endpoints of a given data structure.  The boundary conditions describe the behavior of the algorithmic simulation at the edges of the simulation region.

Boundary Conditions include the following:
Logic to handle addition of next element in a FULLY POPULATED data structure.
Logic to handle removal of an element from an UNPOPULATED data structure. 


An ELEMENT is a single data object that is used to populate a data structure. It defines the contents of the NODE.

For our discussions, SLOTS or buckets are addressable virtual storage areas that hold elements of data. Slots may or may not be in contiguous storage areas.

Example: Lets say that the data element integer 10 is found in slot number 42. Therefore if we went to location 42 we would find that its contents would contain the number 10. 



To access the pseudo-code source code: arraytest.java, arraytest.cpp,


Chapter 2
Stacks


Definition:

A STACK is a linear list or array with the constraint that all insertions and deletions (and queries) are made at a single end of the list, usually known as the TOP of the stack.  The order in which elements can be retrieved from a stack is  Last-In  First-Out (LIFO) or First-In-Last-Out (FILO).

Inserting a data item on the top of the stack is called "pushing" into the stack, while removing an item from the top is called "popping" off the stack. 

Examining the contents of the top slot of the stack is called "peeking" or "querying the top".

The Stack can be thought of as an Array with certain constraints place upon access to its elements.  While any of an Array's nodes can be directly accessed, you can only "see" the TOP node of the Stack. 

Note: To save some coding effort you can make Stack an extension of Array.

Types of Stacks

Simple Stack
Self-contained mechanism in which a fixed number of nodes are pre-allocated and sequentially ordered within contiguous storage.  The "top" pointer simply moves forward/backward one element length.

Dynamic Stack
Structure begins with zero nodes, places allocation/deallocation requests to some other storage management mechanism, such as an array, that obtains clean nodes and recovers free nodes from a pool.  The dynamic stack will expand and contract in virtual storage.

Linked Stack
Self-contained mechanism like the simple stack, but each node is linked to its adjacent next element.  This allows the stack to maintain a pool of storage elements that can be shared by multiple calling programs that wish to concurrently share the stack.  Elements slots can be fetched and returned without regard to their ordering.

Dynamic Linked Stack
A Linked Stack that uses some other storage mechanism to perform allocation/deallocation of its nodes and therefore can expand and contract.


The TOP variable points to the current element at the top of the stack.



To access the pseudo-code source code: NodeElement,


Chapter 3
Recursion


Definition:

An algorithm or method is RECURSIVE if it defines a procedure in terms of a simpler case of itself.
A recursive program is one that invokes itself. The same computation recurs, or occurs repeatedly, as the problem is solved.  

For a recursive definition to work, the definition in any given case must be well-founded, avoiding an infinite regress.  That is:

1) Every recursive call must simplify the computation in some way.

2) There must be special termination condition or stopper cases to handle the simplest computations directly, without the need to call an even simpler case of itself.

Sometimes the term inductive definition is used as a synonymn.

A recursive call can result in many further recursive calls because the each call is dividing the problem in new smaller subproblems, or smaller instances of the same problem.

For a recursive method to terminate, the problem must eventually be reduced to a stopping case.  

For at least one argument, a recursive function "f" must be defined in terms that do not involve "f".   There must be a "way out" of the sequence of recursive calls.  When reached, the method returns a result to its immediate caller.  This caller, in turn, performs a computation and returns a result to its own caller.  This process continues until a result is percolated back up the calling chain to the original caller.

The difference between a circular definition and a recursive definition is that a recursive definition must have base cases.  Base cases are those that satisfy the definition without being defined in terms of the definition itself, and all other cases comprising the definition must be "smaller" (closer to those base cases that terminate the recursion).

Infinite recursion happens either because the parameter values passed during a recursive call do not get simpler or because a special terminating "stopper" case is missing or never reached.

Recursive function call implementation requirements:
1)    Passing arguments.
2)    Allocating and initializing local variables.
3)    Transferring control to and regaining control from the function.

Each call to the recursive function causes a new variable data area to be allocated and each reference to a variable in the function's data area is local to the most recently allocated area.  

Each return causes the current data area to be released, and the data area allocated immediately prior to the current area becomes the new current data area.

Recursion vs. Iteration

The difference between a recursive function and an iterative function is that the iterative type calls for the explicit repetition of some process until a certain condition has been met.

Recursion is generally more expensive in computing resources than non-recursive iterative methods.  The run environment must save and recover all local and temporary variables and the "return control" locations prior to and upon return from each recursion instance.   

A stack is generally used for this purpose.  The costs for calls to push, pop and peek as well as handling of boundary conditions could get expensive.

In a non-recursive procedure the stacking of unnecessary variables can be eliminated.

However, from a programming point of view, the recursive solution may be the most natural and logical way of solving a problem.



To access the pseudo-code source code: Factorial


Chapter4
Queues and Lists


Definition:

ALLOCATION is the acquiring of storage to accommodate the data structures and supporting code of a given task. 

DEALLOCATION is the releasing of previously allocated virtual storage.


A Constructor causes class "instantiation".  Instantiation is an allocation of storage into which variables and program code are loaded at a unique location in virtual memory to support an instance (a copy) of that program.   The Constructor is called to request the building of the instance.  The Constructor returns as its output an reference to the program and gives the calling program addressability to the instance.  The operating system or Java Virtual Machine services the request for construction.


GARBAGE COLLECTION is the reclaiming of deallocated storage used by code and data structures so that is may be reused by other processes. 


A Destructor causes Object de-allocation of storage that represents variables and program code caused from some prior class instantiation that created the object. Destructors are not used in JAVA.

Queues

A QUEUE is a linear list or array with the constraint that all insertions are made at one end of the list, and all deletions are made from the other end.  Data is processed First-In-First-Out (FIFO).  Like a stack, a queue can be thought of as an array with certain constraints placed upon access to its member elements. 


Placing an item into a queue is called "insertion", which is done at the end of the queue called "tail". 


Removing an item from a queue is called "deletion", which is done at the other end of the queue called "head". 


Types of queues include:

BOUNDED QUEUES which are of fixed length such as the "pipe queue" and "circular queue" and

UNBOUNDED QUEUES which can expand or contract, such as the "linked queue".

A PRIORITY QUEUE has all the attributes of the standard QUEUE with the added capability that items can be inserted into the structure at any position (prioritizing).
This violates the strict definition of a queue.

LISTS

A LIST is an ordered array. Order is usually imposed on the array by linkages to another node of the array. Lists differ from arrays in that they are not necessarily arranged in contiguous storage locations. 

Lists have a disadvantage compared to Arrays in that to access an element stored in a List you must traverse the structure by following the linkages; whereas Arrays permit direct random addressability.

Lists have an advantage over Arrays when inserting or removing a node from other than the structure's endpoints. Only a simple adjustment of the linkage pointers is required.

A RECORD is a set of related fields. 

In JAVA and NetRexx a record is defined by an Object.
In C++ a record is defined and mapped by a Structure.

Both the JAVA Object and C++ Structure contains non-executable data.


In COBOL a record is an 01 Level plus its sublevel (02 through N) definitions.

A LINKED LIST is an ordered array that is logically connected where each record is linked to the next by a single pointer.

A DOUBLE LINKED LIST is an ordered array that is logically connected where each record is linked to the next and the previous by a pointer.

// C++ Code Skeleton (HINTS)
//
// Purpose: Demonstrates 2 classes (Stack object and Queue Object) and a user interface.
// The UI interfaces with the Queue object,
// The Queue object interfaces with the Stack object,
// The Stack object maintains the storage structure NodeElement.
// The code within the skeleton methods is for the student to create.
// Feel free to use the Main User Interface to demonstrate your work.

To access the C++ source skeleton code file: Click Here


Chapter 5
Binary Trees


Definition:

A TREE structure allows us to organize data hierarchically.

Real-world trees occur in many places: geneology, sports tournaments, corporate ornanizations, operating system file structures, etc.

Since trees are structurally more complex than lists, they have a more extensive nomenclature for referring to their various components.

Root, nodes, leaves, branches.


No node can be its own ancestor or descendant.

Two nodes may not have the same child.


Binary Trees

A BINARY TREE is a finite set of single-linked elements that is either empty or partitioned into at most 3 disjoint subsets. 

The first subset contains a single element called the ROOT of the tree.

The other subsets are called the left and right subtrees of the original tree.  These subtrees are in themselves binary trees. 

A tree NODE is a single element of the tree.

Each node other than the ROOT has exactly one PARENT which is its ancestor. 


A parent may have up to two CHILDREN, a left-child and a right-child. These are known as the descendants of the parent. 

A node with no children is called a LEAF


Two nodes are called SIBLINGS if they are children of the same parent.  No child may have more than one parent node.  

The LEVEL  or HEIGHT of a node in a tree is the node's distance from the root.

The DEGREE of a node in a tree is the number of child nodes attached to it.   A node of a binary tree has a maximum degree of 2.

The series of nodes connecting an ancestor to a descendant is called a PATH.  Its length is known as the PATHLENGTH.

Definition of an Ordered Tree
A tree clearly is not a linear sequence.  In order to use a tree for sorting, we need to restate the definition of "sorted" in terms of functionality rather then in terms of a static or positional description.  We say that a collection a[0], a[1],...,a[n] is sorted if the organization facilitates sequential traversal.  In other words, a way to visit each node systematically in-order.

A tree is said to be inorder if the following two requirements hold:

(A) Its left subtree contains only those values that are less than (or =) its root value.
(B) Its right subree contains only those values that are greater than its root.

Visiting the nodes of a tree in a specific order is called a TRAVERSAL. 

PRE-ORDER TRAVERSAL (also known as depth-first order) visits the root, then the left subtree in preorder, then the right subtree in preorder. 

IN-ORDER TRAVERSAL (also known as symmetric order) visits the left subtree in inorder, then the root, then the right subtree in inorder. 

POST-ORDER TRAVERSAL(also known as postorder) visits the left subtree in postorder, then the right subtree in postorder, followed by the root.

 

A THREADED Tree is one constructed so that the right links of all leaf nodes point to the next node required to be visited during an in-order traversal.   This provides a "short cut" in traversing the tree.  Programming demands for storage and computation are reduced.



Chapter 6
Sorting


Definition:

A FIELD is a set of N bytes that represents some abstract information. 

A RECORD is a collection of N related fields.

A FILE is a collection of N records. 

A DATABASE is a collection of N related files.

A KEY is a one or more fields or subfields of the record used for sorting and searching.

A SORT is a process that rearranges the records of a file into a sequence that is sorted on one or more key(s).

A sort is known as stable if the arrangement of all records of the same key are preserved after the sort.

A merge is the process of combining two or more sorted files into a resultant sorted file.


Efficiency considerations for sorting are: CPU time, memory and external storage space, and wall clock time.
Different sorts may exploit one resource over another.  Programs that require less wallclock time to perform the sort usually require more memory and more frequent CPU cycles.  Sorts that require less memory and CPU generally need more external storage to move data and elapsed wallclock time.


Sorts of polynomial order are more efficient than exponential order.

Exponential order sorts grow more rapidly in the processing time required with the increase in the number of items sorted (N).  

That is as the number of items to be sorted increased linearly, the time to perform the sort increases exponentially.

Orders of Functions

The common sorting algorithms can be divided into two classes by the complexity of their algorithms. Algorithmic complexity is a complex subject, suffice it to say that there's a direct correlation between the complexity of an algorithm and its relative efficiency.

Algorithmic complexity is generally written in a form known as Big-O notation, where the O represents the complexity of the algorithm and a value n represents the size of the set that the algorithm is run against. 

Big-O notation is frequently used to describe a sort algorithm's use of computational resources usually with regard to consumed CPU cycles and wall-clock time.

 

For example, O(n) means that an algorithm has a linear complexity. In other words, it takes ten times longer to operate on a set of 100 items than it does on a set of 10 items (10 * 10 = 100).  

Complexity of O(log n) means that an algorithm has logarithmic complexity.  It takes the two times longer to operate on a set of 1000 items vs a set of 100 items.  This is extremely efficient.

The complexity was O(n**2) (quadratic complexity with exponential order), then it would take 100 times longer to operate on a set of 100 items than it does on a set of 10 items.

The classes of strongly polynomial algorithms which run in  O(n**2) and less, include the bubble, insertion, selection, and shell sorts; 

If the sort is on order of O(c**n) this has geometric complexity.

If the sort is on order of  O(n!) this has factorial complexity.

Weakly polynomial would include any sorts which run in O(nk) time and O(n log n) (polynomial order) which includes the heap, merge and quick sorts.


EXCHANGE SORTS are sorts that exchange pairs of items until the list is sorted.

A Bubble Sort passes through the file sequentially several times. Each pass consists of comparing each element in the file with its successor neighbor ( x[i] with x[i +1] ) and interchanging the two elements if they are not in the proper order. Note that after each pass or iteration, the (nth - i) element is in the proper order, so you can improve the sort time by reducing the scans to the remaining (n - i - 1) items.  The bubble sort is generally considered to be the most inefficient sorting algorithm in common usage. Under best-case conditions (the list is already sorted), the bubble sort can approach a constant O(n) level of complexity. General-case is an abysmal O(n2).

A Bi-directional Bubble Sort improves on the Bubble Sort by alternating the direction of successive passes through the file between the top and the bottom. In this way the smaller elements move quickly to the top of the order in the same way that the larger elements move to the bottom of the file. Every successive pass compares the remaining unsorted items of the previous pass.

A Selection Sort works by selecting the smallest unsorted item remaining in the list, and then swapping it with the item in the next position to be filled. The selection sort has a complexity of O(n2).  A relatively slow sort.



The Quick Sort, also called the partition exchange sort, first provides a partitioning method for the Ath element (called the pivot) to seek its final position J in the partition. At this point elements in positions 0 through J-1 are less than or equal to A; and all elements in positions J+1 through N are greater than or equal to A. The process is recursively reapplied on both newly formed left and right partitions until all partitions are of size 1.   Now the file is sorted. 


The partitioning method logic works as follows: 
Setup: Choose a pivot A to be the contents of the rightmost element ( X[RIGHT] ),  this is arbitrary.  (X[RIGHT] now holds the pivot point value.)
Let LOW = LEFT index of the current partition.
Let HIGH = RIGHT index of the current partition.
While LOW < HIGH loop Step1 thru Step3.
Step1: Repeatedly increase the cursor pointer named  LOW  by  +1  until  X[LOW] > A.

Step2: Repeatedly decrease the cursor pointer named  HIGH  by  -1  until  X[HIGH] < A.
Step3: If  LOW < HIGH,  interchange contents X[LOW] with X[HIGH].    Then GO to Step1.
Step4: Else interchange the contents of X[RIGHT] with X[HIGH].

Done, the pivot "A" has reached its final position.  Now all elements to the left of the pivot are less than or equal, and elements to the right of the pivot are greater or equal.

Now recursively sort the new partition to the left of the pivot, and then sort the new partition to the right of the pivot (not including the pivot in either sort).

The efficiency of the algorithm is impacted by which element is choosen as the pivot point. The worst-case efficiency of the quick sort, O(n2/2), occurs when the list is sorted and the left-most element is chosen. Randomly choosing a pivot point rather than using the left-most element is recommended if the data to be sorted isn't random. As long as the pivot point is chosen randomly, the quick sort has an algorithmic complexity of O(n log n).



The Heap Sort is an implementation of the general selection sort using the input array as a heap representing a descending priority queue.  The preprocessing phase creates a heap of size N using the siftup operation, and the selecton phase redistributes the elements of the heap in order as it deletes elements from the priority queue using the siftdown operation.  The Heap Sort has a complexity of O(n log n).  In the average case, the heapsort is not as efficient as the quicksort.  The Heap Sort requires twice as much time as Quicksort for random input.



INSERTION SORTS are similar to Exchange Sorts, but the essence of their algorithms insert into an existing sorted file.

A Simple Insertion Sort is one that sorts a set of records by inserting records into an existing sorted file. Such as you would sort a hand of cards. Like the Bubble Sort, the Insertion Sort has a complexity of O(n2). Although it has the same complexity, the insertion sort is a little over twice as efficient as the bubble sort.

A Shell Sort sorts separate subfiles of the original file. These subfiles contain every Kth element of the original file. Each subfile is sorted by simple insertion, a new smaller K is chosen, and the file is again partitioned into a new set of subfiles and sorted again. This process is repeated with smaller values for K until K is set to 1. At this point the entire file is sorted. Invented by Donald Shell in 1959, the shell sort is the most efficient of the O(n2) class of sorting algorithms. Of course, the shell sort is also the most complex of the O(n2) algorithms.


A Address Calculation Sort is sorting by hashing. In this method a function f is applied to each key. The result of this function determines which of several subfiles the record is to be placed in. The function should have the property that if x =< y, then f(x) =< f(y). Such a function is called order-preserving.



MERGE SORTS involve combining two or more sorted files into a third sorted file.

The merge sort uses a technique to sort a file by dividing the file into N subfiles and merge adjacent pairs of files.  The process is repeated until there is only a single file remaining of size N.  There are no more than O(n log n) comparisons in the mergesort involving less than N comparisons, but is does on average require twice as many assignments as quicksort.



To view Sort Algorithm demonstration Click Here



Chapter 7
Hashing


Definition:

The HASH Table  is an associative array data structure that relates keys with values. The  primary operation it supports is to efficiently and rapidly provide store and retrieval/lookup functions.  It is employed where speed of the access and quick retrieval of data is of primary importance.

The data structure elements are divided into (preferably) equal-sized categories, or buckets, to allow quick access to the elements. The hash function determines in which bucket an element belongs.

 
A HASH Function is a function or algorithm applied to a data key found in the record to be accessed or stored. The hash function transforms the key into a displacement, or index value into a table, or some other reference to a location, called a hash.  The hash is used to locate the desired record in the Hash Table.   The key for the record could be, for example, a person's name or social security number.  


A HASH Key is the output of the hash function that serves as the index entry into the hash table. 


A HASH Collision occurs when 2 or more records resolve to the same position or location reference based upon their hash keys. Their hash function computation have produced equivalent results and therefore point to the same location.   

Thus the second part of the hashing search (or storage) requires a collision-resolution process.


Separate Chaining is a method of resolving hash collisions. It involves keeping a distinct linked list (or other suitable data structure) for all records whose keys hash into the same value.  Each record resolving to the same key is placed appropriately on the list that is anchored to the hash table index entry.

Hash Key Computation

The arithmetic computation is usually simple to implement, but you must use caution to avoid various subtle pitfalls.  

If the resultant hash table is to hold N number of items or keys then we must ensure that our hashing computation results in a number between [0, N-1]. 

The ideal hashing function should evenly distribute its output keys across the span of the table.  Avoid clumpiness due to collisions, and gaps.

You must be familiar with the data to be able to generate a suitable hashing funtion and to predict the table size needed.

Hashing provides a way to use a reasonable amount of both memory and computation time to strike a balance between the consumption of these two resources.

Pseudocode for sample LinkListds linklistds

Pseudocode for sample String Hashing Calculation hashstringcalc

100 Random integers (1-1000) for input to lab exersize Click Here


[Return to Professor Page]