This paper was done for a collaborative recommendation report in Technical Writing (yes, it was a required course), but it ended up being completely uncollaborative and all my own work. And as I was alone, two sections are missing that were originally planned (Operating Systems and APIs). Nevertheless, I spent a solid month doing almost nothing but researching this paper. So, while it probably has some incorrect facts here and there, it is nothing but extensive.

Parallel Computing Projects

for Academic Environments

 

 

Puevfgbcure Unecre, Senior in Computer Science, Longwood University

 

December 13, 2007

 

 

Abstract

 

Experience in parallel computing is an increasingly necessary skill for today’s upcoming computer scientists as processors are hitting a serial execution performance barrier and turning to parallel execution for continued gains.  However, parallel programming is not a trivial undertaking; it has many unique challenges that are best learned through hands-on experience on suitable parallel computing setups.  This paper aims to assist students and professors in deciding on the best ways to learn parallel computing by providing them with an adequate assessment of the current state of the subject.

 

 

Outline

 

1)      Definitions

a)      Parallel Computing

i)        Architectures

(1)   Centralized Multiprocessors

(a)   Multicore

(b)   Symmetric Multiprocessing (SMP)

(2)   Distributed Multiprocessors

(a)   Grid Computing

ii)       Roles

(1)   Computational

(2)   Load Balancing

2)      Projects in Parallel Computing

a)      Factors

i)        Evaluation Criteria

(1)   Cost

(2)   Complexity

(3)   Educational Value

(4)   Innovation

ii)       Intrinsic Factors

(1)   Speedup

(2)   Scalability

(3)   Overhead

b)      Project Components

i)        Architectures

(1)   Centralized Multiprocessors

(a)   Multicore

(b)   Symmetric Multiprocessing

(2)   Distributed Multiprocessors

(a)   Grid Computing

(i)      Virtual Supercomputer

(b)   Beowulf Cluster

(c)   Commodity Cluster

(3)   Specialized Multiprocessors

(a)   Massively Parallel Processor (“Supercomputer”)

(i)      Vector Processor

(ii)    Processor Array

(b)   Stream Processing

(i)      GPGPU (General-Purpose computing on Graphics Processing Units)

ii)       Roles

(1)   Load-Balancing

(a)   Distributed Load-Balancing

(b)   Multithreading

(2)   Computation

(a)   Assembly Line (pipelining)

(b)   Redundancy

(c)   Data Clustering

3)      Conclusion

a)      Quantitative Assessment

b)      Selected Projects in Parallel Computing

i)        Simple Fork on Embarrassingly Parallel Problem

ii)       Benchmarking Performance for Known Parallel Setups and Applications

iii)     Distributed Load Balancing

iv)     Data Clustering for a Basic Search Engine

 


Definitions

 

Parallel Computing (or Parallel Processing) is the utilization of multiple computer processors working together concurrently for increased performance and thus faster computation.  This is possible because programming algorithms can often be divided into smaller, independent parts that can be executed in parallel, side-by-side.  However, beyond this one common goal of increased performance, all other aspects of parallel computing splinter off into many forms with many uses[1].

 

There are a few general forms (architectures) of parallel computers, which can be separated into three major types: centralized (multiprocessor computers), distributed, and specialized multiprocessors.  The former is named so because it contains two or more processors that communicate over the memory or bus of a single computer.  Multiprocessors that communicate over shared memory are typically multicore processors as they have two processing units in a single CPU package.  This is contrasted by symmetric multiprocessors, which communicate over a bus (an electrical channel between computer hardware components)[2].  Although hardware implementation for these two flavors of multiprocessors differs, they both function similarly enough at a software level to be programmed alike.

 

Distributed multiprocessors (or computer clusters) are a collection of individual computers that communicate over a network.  Machines in a cluster are often virtualized to form one autonomous computer.  However, they can also be made from individually interfaced or administered computers.  Another type of cluster is referred to as grid computing because its topology is similar to a power grid.  And like a power grid, grid computing is not geographically bound due to the advent of the Internet.  Even though distributed multiprocessors can function in most of the same ways as centralized multiprocessors, their major limiting factor is the speed of their communication, which can be too slow or unreliable to perform some operations.

 

As with the two major architectures of parallel computing, there are generally two major roles in which multiprocessors can perform.  These roles are based around the levels of parallelism that different algorithms can be adapted (see Figure 1).  Instruction-level parallelism has different processing units (or nodes) executing multiple independent instructions.  Data-level parallelism has different nodes executing the same instructions on different data.  And task-level parallelism has different nodes executing different instructions (and possibly different data) altogether[1].

 

In a strictly computational role, many nodes work on different parts of a single problem.  While this role can work with any level of parallelism, it lends itself particularly well to the instruction and data levels.  The algorithm can split up the instructions or the data onto different nodes, whichever performs better.  Load balancing is the opposing role to computation; it attempts to keep the workload the same across all nodes.  As such, load balancing performs best with data- or task-parallelism because the algorithm relies on either different data and/or instructions that can be separated into different execution threads.

 

 

Figure 1. Illustrating Levels of Parallelism with Cutting a Block of Cheese

 

 

Projects in Parallel Computing

 

The bulk of the information presented in this section is composed of brief descriptions and evaluations of the different components of parallel computing.  Four major categories of components were realized, which could generally be arranged into any configuration necessary.  However, some components are understandably more suitably matched in a configuration than others.  (Only two categories are present in this version because of time constraints.)  The categories are as follows:

 

  • Architectures are the designs and structures of computing hardware.
  • Roles are the parallel actions to be performed.
  • Platforms are the system software (operating systems) that allow interaction with the hardware.
  • APIs (Application Programming Interfaces) simplify interaction between programs and parallel computing resources.

 

Criteria for Evaluation

Before delving into the projects, it is necessary to define the criteria that will be used for evaluating them.  These factors would be important to an academic venture that has limited resources and previous participant experience. 

 

  • Cost – This one is rather straightforward; it includes not only the original cost of materials but also the usage and maintenance costs (i.e. electricity).
  • Complexity – This is the time it would likely take to implement the project given previous participant experience.
  • Educational Value – While there is no quantitative measure for educational value, projects can be assessed by their ability to teach fundamental concepts of parallel computing such as synchronization and communication.
  • Innovation – Usually an opposing factor to educational value, innovation is an objective measure of the degree to which a project is on the cutting edge of computing.

 

Intrinsic Factors

Several additional factors exist when considering different components of parallel computing (especially architectures), which, while they are not necessarily the deciding factors, are still important intrinsic factors.

 

  • Speedup is simply the ratio of performance increase of the parallel execution of some algorithm compared to the serial execution of the same algorithm, i.e. parallel execution time divided by serial execution time.  The quality of the algorithm programming is particularly important to speedup, but the architecture’s overhead is also a factor.
  • Scalability is the ability of a parallel processor to continue increasing its performance by the ratio of speedup as more processors are added.
  • Overhead is the amount of processing performance lost due to factors necessary to the operation of a parallel processor, such as communication and synchronization[3].

 

 

Project Components

 

Architectures, Centralized Multiprocessors

Centralized multiprocessors are named so because they contain two or more processors that communicate over the memory or bus of a single computer

 

Centralized Multiprocessors, Multicore

  • Cost:  Prices for multicore CPUs have already dropped substantially since their entry into the consumer market only a couple years ago.  The cheapest of the dual-core processors can be purchased for only ~$60, whereas the most costly processors are ~$700 or ~$1500 for desktop- and server-grade respectively.  The cheapest quad-core processors are selling for no less than $200 currently.  Considering a decently performing barebones system of $300 without the processor (or peripherals), a single multicore machine or a whole lab of multicore machines can be purchased for relatively cheap.
  • Complexity:  System assembly is rather straightforward; machines can be built with a minimum of hardware and operating system knowledge.  A couple years of experience in any high-level language is sufficient to be able to construct threaded programs, which are the basis of parallel processing on multicore systems.
  • Educational Value:  These multiprocessor machines are probably the best way to learn the basic concepts of parallel programming.  Communication and synchronization are comparatively simple to program and results can be seen quickly.
  • Innovation:  While multicore processors are only a couple years old, their introduction was not as groundbreaking as one might think.  Server-grade systems have been using SMP architectures for many years now, and they are programmed in much the same way as multicore processors.

 

Centralized Multiprocessors, Symmetric Multiprocessing (SMP)

  • Cost:  Since multicore processors were released, SMP architectures have been almost exclusively for server systems.  As such, the processor, motherboard, and memory prices are typically about twice more than for a multicore system.  Their only advantage over multicore would be explicit use in server operations and programming.
  • Complexity: See Multicore
  • Educational Value: See Multicore
  • Innovation:  As noted for multicore architectures, this technology has been around awhile and is rather well rooted in consumer-level systems and programming.

 

Architectures, Distributed Multiprocessors

Distributed multiprocessors are a collection of individual computers that communicate over a network.

 

Distributed Multiprocessors, Beowulf Cluster
Beowulf clusters are made from a collection of off-the-shelf computers that are stripped down to barebones components (processor, memory, motherboard, network interface, power supply, and enclosure only) and connected over a LAN[4].  These nodes are controlled by a head computer, which assigns tasks for them and balances load.  As with all distributed multiprocessors, Beowulf clusters need specially designed programs to allow them to be of any use.  However, their scalability is rather good until a certain size (roughly 32 nodes) where they can become unwieldy because of networking, space, heat, and power concerns[5].

  • Cost:  While Beowulf clusters are most well known for their use of surplus or discarded machines, they can still be built with a uniform set of new, high-performance barebones machines.  Thus, they can really be built for as little or as much as necessary.  If only for educational demonstration purposes, they can even be made for free (barring electricity needs) from discarded institutional materials.
  • Complexity:  Network and system setup for a Beowulf cluster can be a time-consuming logistical challenge, especially with a large number of nodes.  Programming them to do something useful is probably not as tedious as the setup, but it will still require several years of experience in a high-level language (probably C/C++), networking, and Unix to get even a simple algorithm running efficiently.
  • Educational Value:  A Beowulf cluster may be the best way to learn distributed parallel programming because the nodes are designated for parallel program use only, they can be cheaply built, and they are locally maintained.
  • Innovation:  Most, if not all, decent-sized universities have built at least one Beowulf cluster over the years.  As such, this architecture is really nothing new.

 

Distributed Multiprocessors, Commodity Cluster
A commodity cluster (also known as a “Cluster of Workstations” or “Pile of PC’s”) is similar to a Beowulf cluster in how parallel processing is implemented, but its node topology is laid out in a way that allows users to interact with the machines, such as a computer lab or office computer network.  I/O demands from users on the cluster’s machines thus lower the overall performance of parallel computing due to overhead[5].

  • Cost:  Since this type of cluster is generally made from existing machines, its use as a parallel computer requires little to no additional cost.
  • Complexity:  Although the only setup requirement for nodes is distributing the parallel programs, like the Beowulf cluster, the commodity cluster needs to be programmed explicitly for parallel processing to gain any benefit.  As such, the experience requirements are the same (substituting whatever operating system is used).
  • Educational Value:  Depending on the environment of the cluster’s nodes, a commodity clusters may be as useful for teaching distributed multiprocessing as a Beowulf cluster.  For example, a computer lab would be excellent as it allows all the users to collaborate on and interact with the multiprocessor.
  • Innovation:  See Beowulf

 

Distributed Multiprocessors, Grid Computing

Grid computing is essentially clustering over a larger geographic area, such as a campus, region, or the entire world.  Since the communication speeds of more dispersed nodes are slower and more variable, grid computing requires highly parallel algorithms that need only limited or no communication between nodes[6].

  • Cost:  As grid computing typically scavenges cycles from existing machines, there is little cost to creating and scaling them up.  Assuming coordinating tasks is not also distributed across the grid, the only costs will come from maintaining a head machine(s) and bandwidth.
  • Complexity:  Setting up a server to handle coordination is the only hardware aspect required.  But grid computing has other unique challenges, such as finding an appropriately parallel problem and programming it to be almost completely parallel.  Also, distribution of the client parallel programs will require additional consideration and probably the development of a website.  Expect experience needed in a high-level programming language (possibly portable), networking and network programming, and web programming.
  • Educational Value:  Although some aspects of parallel computing are needed more than others (less communication and more synchronization), the basics are still present.  However, debugging on a large-scale distributed multiprocessor is impractical, and thus a commodity cluster would be needed for testing before distribution.
  • Innovation:  As grid computing has not existed for more than a decade (it is particularly dependent on the widespread availability of consumer-level computers), there is still room for improvement.  Any project with enough innovation or a prominent purpose has the potential to attract thousands of users willing to give up their spare processing power.

 

Grid Computing, Virtual Supercomputer
These are grid multiprocessors with comparable processing power to a traditional supercomputer.  However, unlike a traditional supercomputer, virtual supercomputers cannot count on a consistent amount of this processing power as users and cycles come and go from the network.  Virtual supercomputers are generally thought of as having a global reach.

 

Architectures, Specialized Multiprocessors
These processors were developed to be highly efficient at handling a certain type of instruction on a certain type of data in parallel.  However, these processors can occasionally be adapted to other roles which they were not designed for and yet maintain a performance advantage over general-purpose parallel processors.

 

Specialized Multiprocessors, Massively Parallel Processor (“Supercomputer”)
MPPs are designed explicitly to be the most scalable parallel processors, so that they can easily contain the several thousand processors needed for supercomputer status (though the actual number of processors is arbitrary)[7].  Although supercomputers of the past often used specialized processors (i.e. vector processors), the trend in today’s supercomputers is to utilize consumer-level general-purpose processors (i.e. IBM PowerPC, Intel Xeon) with little or no modification.  They are typically setup as a mix of distributed and centralized multiprocessors where, for example, each cabinet might hold 128 processors in a centralized MP and all the cabinets are connected to form a distributed MP[8].

  • Cost:  Depending on the number of processors intended for the computer, they can cost $100 thousand to $100 million.  Even with a huge grant, these massive machines are not practical for a small university environment.
  • Complexity:  Excluding the logistics of setting up and maintaining a supercomputer, programming them remains a huge undertaking.  Algorithms would need to be extremely parallel; communication would be mind-boggling; and special tools would be needed just to measure performance.
  • Educational Value:  Their massive size is much too unwieldy to learn the basics of multiprocessing.  Additionally, having students interact with such an expensive and overwhelming system is a substantial liability.
  • Innovation:  While supercomputers have been around in many forms over the last sixty years, the newest ones are still at the cutting edge of performance, typically handling computations for some of science’s greatest pursuits such as weather, biology, and electronics.  Possessing a supercomputer will cause any institution to instantly become cream of the crop in computer science.

 

Massively Parallel Processor, Vector Processor
Specialized for doing vector arithmetic on a huge scale, these supercomputers are only good at “crunching numbers”[9].

 

Massively Parallel Processor, Processor Array
Similar to a stream processor, each specialized or general-purpose processor in the array executes instructions on different streams of data (or a single stream demuxed to many).  They are particularly well suited to large-scale video or audio processing.

 

Specialized Multiprocessors, Stream Processing
Stream processors use strictly data-level parallelism to execute the same operation on numerous data streams.

 

Stream Processing, GPGPU
General-Purpose computing on Graphics Processing Units is a new trend that uses the programmable stream processors of the latest GPUs to execute code on non-graphics data.  GPUs use a stream processing architecture because they frequently need to perform the same transformation on many pixels or vertices, of which there can be millions in a single frame[10].

  • Cost:  Graphics adapters that support these programmable stream processors vary widely in cost, but the newest generations have been roughly the same cost as the newest CPU.  Considering that a CPU is also a requirement to build a system with a GPU, that puts GPGPU at a cost disadvantage compared to just a general-purpose multicore system.
  • Complexity:  GPGPU programming uses C-like languages that function at a low level (as in the shader languages they are designed for), which may be particularly difficult to code without several years of C/C++ experience as well as some shader language and assembler experience[11].
  • Educational Value:  Stream processing with GPUs is not necessarily the best architecture for learning parallel processing fundamentals because of their non-traditional operation with respects to functionality and programming.
  • Innovation:  Programmable graphics adapters are an emergent technology which have not quite matured enough for widespread acceptance—only the most recent generation of graphics adapters have support for code branching.  However, if the selected role requires massively data-level parallelism, GPGPU should not be overlooked as it can provide better performance over general-purpose CPUs, particularly for floating-point intensive algorithms.

 

 

Roles, Load Balancing
Load balancing is an approach to parallel processing that is concerned with distributing operations evenly across all nodes in a multiprocessor so as to keep the utilization high.  Typically, load balancing itself is considered to be a role for distributed multiprocessors only, but the concept applies to other roles.

 

Load Balancing, Distributed Load Balancing
Distributed load balancing is often employed on network service clusters to keep service requests from overwhelming a single node.  Load balancers (directors) handle access and redirection of requests, which allows the cluster to be viewed as a single server from the outside.  If any node fails, the service’s availability is maintained by the other nodes in the cluster[12].

  • Cost:  At least two nodes are needed for load balancing.  As the number of nodes increases, an additional director node is usually required to simplify access and reduce overhead.
  • Complexity:  The most complex aspect of load balancing is the method of balancing, either collectively or through a director.  Aside from that, one only needs to ensure that all nodes have the same programs.
  • Educational Value:  Although the load balancing algorithm does require some forms of communication and synchronization (redirecting requests and monitoring load), they differ from those traditionally taught.
  • Innovation:  Distributed load balancing is not new.  There are many hardware and software products that can automate part if not all of a load balancing cluster.

 

Load Balancing, Multithreading (and Multitasking)
Multithreading and multitasking are hardware and operating system features that can be exploited by programs to allow load balancing of threads and processes across multiple processors[13].  Multitasking and multithreading function in generally the same ways, except that multitasking refers to simultaneous processes instead of threads[14].  Multithreading can also be used for computation by creating several threads of the same instructions and providing them with different data to crunch.

  • Cost:  At least two nodes are needed, of course.
  • Complexity:  Multithreading difficulty depends on the needs of the algorithm.  Many will require communication and synchronization between threads (using shared memory or mutexes, for example), adding to the complexity; but some algorithms may simply need thread creation and termination.
  • Educational Value:  Multithreading is perhaps the best way to learn parallel computing; all the major concepts play a role—even race conditions.  Actually, parallel hardware is not even a requirement for multithreading, increasing the availability of machines to learn with.
  • Innovation:  Multithreading has become particularly popular recently as processors have moved away from raw serial execution speed and towards parallel processing.  New methods for multithreading that can utilize multicore processor performance will be well received.

 

Roles, Computation (Number Crunching)
Computation is the act of applying arithmetic operations on data; however, in parallel computing computations, there are always vast amounts of data.  Therefore, computational algorithms are almost always data-parallel.

  • Cost:  The number of nodes in a computational multiprocessor is completely variable as the architecture and algorithm should be highly scalable.
  • Complexity:  Although computational algorithms can be very intricate or massive, they are typically straightforward when it comes to making them parallel. 
  • Educational Value:  Parallel computation usually requires task assignment, synchronization and communication between nodes, and returning results—all perfectly suitable topics for learning parallel computing.
  • Innovation:  Most of the world’s groundbreaking high-performance computing is involved in number-crunching vast amounts of data for everything from weather modeling to gene surveys and mapping the stars.  However, these types of projects are obviously out of reach for most institutions.  Still, starting with a less groundbreaking algorithm is a good foundation for the other.

 

Computation, Pipelining (Assembly Line)
Pipelining is the only type of parallel computing that works with serial algorithms.  This is because the instructions of an algorithm can be spread out over multiple nodes like an assembly line; as one stage finishes, it passes its results to the next stage, and so on[15].  Pipelining is best used when an algorithm is very data dependent on previous instructions, and thus cannot be easily made data-parallel.  The stages of a pipeline are split so that every stage needs the same time-slice to complete its instructions; buffers should be used between stages to prevent any one stage from stalling the entire pipeline[16].

  • Cost:  Typically, pipelining requires as many nodes as possible without affecting speedup through communication overhead or low node utilization.  It should be expected that a minimum of several nodes are needed, but really, it does not matter.
  • Complexity:  Since serial algorithms are used in pipelining, its difficulty comes from splitting an algorithm into appropriately sized stages.  Unless a program is capable of automatically splitting and load-balancing stages (difficult), a pipeline’s program will need to be rewritten if the length of the pipeline is changed or balancing is needed for better efficiency.
  • Educational Value:  As pipelining is generally serial programming on parallel hardware, it is not useful for learning parallel computing concepts.
  • Innovation:  Although assembly lines are not quite as revolutionary as they were in the early 20th century, they are occasionally still the best way to handle fine-grained, data-dependent algorithms.

 

Computation, Redundancy
In computing, redundancy is often used to guard against errors (fault tolerance).  Processing can also use redundancy in this manner to ensure that output is consistent; however, this is really only useful for mission critical systems.

  • Cost:  Only two nodes (or multiprocessor systems) are really needed.  More than three nodes is overkill.
  • Complexity:  Redundancy simply requires the duplication of instructions on another node, possibly with a data stream duplication mechanism placed at the beginning and a consistency check at the end.
  • Educational Value:  Like pipelining, parallel computing redundancy is serial programming on parallel hardware, and, although some synchronization is required, redundancy is still poor for learning parallelism.  It is also exceptionally boring until (if ever) a node actually produces inconsistent results.
  • Innovation:  Redundancy for fault tolerance has been around for decades and hardly even changed, so there is really no way to innovate here.  However, there may still be uses for redundancy that have yet to be explored.

 

Computation, Data Clustering
Data clustering is a popular computational technique involved with the organization of data into clusters.  It can be used for any data that can be ordered into groups but is particularly useful in visual recognition (through image segmentation), information organization and retrieval (i.e. search engines), and data mining.  Data mining attempts to find meaningful patterns in data, as opposed to standard clustering, which is already aware of the patterns.  Because data clustering is concerned with executing the same instructions on hoards of data (the definition of data parallelism), it is relatively simple to make a data clustering algorithm execute in parallel.[17]

  • Cost:  The number of nodes is variable.
  • Complexity:  Not only are the algorithms for clustering complicated (especially mathematically), but reading in the data and finding patterns can be even more difficult (depending on the type of data).
  • Educational Value:  Clustering relies heavily on communication to update the current analysis results, which are needed by all nodes.  In fact, the amount of communication required may be so much as to cause significant overhead[18].  These kinds of concerns plus the complexity of clustering algorithms themselves may make data clustering not be the best subject for beginners.
  • Innovation:  As computers have become more popular, faster, and capable of handling large amount of data (either large data items or large numbers of smaller data items) made copious by the Internet, data clustering has become even more processing-intensive.  Therefore, research into making them faster or more efficient is at the forefront of many cutting-edge technologies in information indexing, biological processes, marketing, and government data mining[19].

 

 

Conclusion

 

To make the findings for the evaluation criteria more understandable at a glance, the data was quantified so that it could be viewed visually (Figure 2).  For each component, a rating was given of one to five for each criterion as to how suitable (cheapest, simplest, most educational, most innovative) that component would be in an academic setting.  Ratings were assigned relative to the suitability of other components’ criteria (i.e. A is better than B is better than C) to make the scale of measurement from worst to best (bottom to top, respectively).

 

Figure 2. Quantitative Assessment of the Evaluation Criteria

 

Obviously, multicore, beowulf clusters, and commodity clusters are the most suitable architectures; however, grid computing may be more appealing as it is becoming increasingly popular through recent innovation.  Multithreading is the best role for centralized multiprocessors, because it is easy to implement (even without parallel hardware), is widely accepted, exhibits many of the fundamental properties of parallel computing, and is versatile (used for either load balancing or computation).  For distributed multiprocessors, computation is the most popular role as it offers shear speed to simple number-crunching algorithms; for a change of pace, data clustering and load balancing make for more interesting algorithms and implementations.

 


Selected Projects in Parallel Computing

 

Simple Fork on Embarrassingly Parallel Problem

  • Architecture:  Centralized Multiprocessors
  • Role:  Multithreading / Multitasking

 

Probably the most basic project possible in parallel programming, forking a process or thread to create multiple execution threads can be used on a simple embarrassingly parallel problem to teach basic concepts of parallel computing in addition to thread safety.  Embarrassingly parallel problems are ones that are easy to separate into parallel parts and require no communication dependencies.  A couple examples would be simple arithmetic (counting, multiplying matrices) and brute force computations (cracking a short document encryption key).  Gradually, problems that require more communication and thus synchronization can be built on top of the previous project for continued educational value.

 

 

Benchmarking Performance for Known Parallel Setups and Applications

  • Architecture:  Any
  • Role:  Computation

 

A variety of existing parallel applications can be run on any available parallel hardware to measure performance through a number of different criteria: completion time, utilization, communication overhead, speedup, etc.  This project is useful in exploring the different factors that can affect parallel algorithm performance and thus what algorithms perform better on what architectures.  It can be used as an introduction for students to create their own parallel programs, applying the knowledge gained to create algorithms that are more efficient.

 

 

Distributed Load Balancing

  • Architecture:  Distributed Multiprocessors
  • Role:  Load Balancing

 

Distributed load balancing is one of the best projects for illustrating the benefits of parallelism without all the parallel programming.  This can be particularly appealing for a parallel hardware-only course as existing software can be setup on worker nodes.  Thus, the only programming would be for job assignment (and possible result collection) from the head node.  The simple program requirements would allow for changes in node topology, interconnections, the number of nodes, and individual node performance to be made, demonstrating how these affect the performance of the cluster.

 

 

Data Clustering for a Basic Search Engine

  • Architecture:  Any
  • Role:  Data Clustering

 

For more advanced students, data clustering is a good balance of more challenging parallel and serial programming.  Creating a parallel local search engine is an easily implemented and practical way to learn clustering while gaining performance benefits from any choice of parallel hardware (though obviously not a grid computer).  Like their web-based counterparts, a local search engine can traverse file systems looking for plaintext files (or formatted text files like HTML for more challenge) and assigning items to clustering nodes.  The clustering nodes need not be too complex; a simple word usage count that filters common words is sufficient.  Optimizations can always be made later, time permitting.

 

Indexing is the biggest concern for the initial creation of a search engine.  While a simple text log of the cluster analysis would work, retrieving results would get increasingly time-consuming as more files are indexed.  Saving the analysis as a hash table structure will improve search time, but may be too complex and out of the project’s scope.  Perhaps the best method of indexing is to use proven existing hashing algorithms and/or databases to store results.

 

 


References

 



[1] ”Parallel Computing.”  Wikipedia.  Wikimedia Foundation, Inc.  29 Oct. 2007  <http://en.wikipedia.org/wiki/Parallel_Computing>

 

[2] Quinn, Michael J.  Parallel Programming in C with MPI and OpenMP.  McGraw-Hill Education, 2003.

 

[3] Barney, Blaise.  “Introduction to Parallel Computing.“  Lawrence Livermore National Laboratory.  13 Sept. 2007.  21 Nov. 2007.  <http://www.llnl.gov/computing/tutorials/parallel_comp/>

 

[4] ”Beowulf (computing).”  Wikipedia.  Wikimedia Foundation, Inc.  14 Nov. 2007  <http://en.wikipedia.org/wiki/Beowulf_(computing)>

[5] Brown, Robert G.  “Engineering a Beowulf-style Compute Cluster.”  Duke University Physics Department.  24 May 2004.  25 Nov. 2007.  <http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book/beowulf_book.html>

 

[6] ”Grid Computing.”  Wikipedia.  Wikimedia Foundation, Inc.  22 Nov. 2007  <http://en.wikipedia.org/wiki/Grid_computing>

 

[7] ”Massive Parallel Processing.”  Wikipedia.  Wikimedia Foundation, Inc.  29 Nov. 2007  <http://en.wikipedia.org/wiki/Massive_parallel_processing>

 

[8] ”Supercomputer.”  Wikipedia.  Wikimedia Foundation, Inc.  23 Nov. 2007  <http://en.wikipedia.org/wiki/Supercomputer>

 

[9] ”Vector Processor.”  Wikipedia.  Wikimedia Foundation, Inc.  23 Nov. 2007  <http://en.wikipedia.org/wiki/Vector_processor>

 

[10] Boggan, Sha’Kia and Daniel M. Pressel.  “GPUs: An Emerging Platform for General-Purpose Computation.”  Army Research Laboratory.  Aug. 2007.  2 Dec. 2007 <http://www.arl.army.mil/arlreports/2007/ARL-SR-154.pdf>

 

[11] ”GPGPU.”  Wikipedia.  Wikimedia Foundation, Inc.  21 Nov. 2007  <http://en.wikipedia.org/wiki/GPGPU>

 

[12] ”Load Balancing (computing).”  Wikipedia.  Wikimedia Foundation, Inc.  30 Nov. 2007  <http://en.wikipedia.org/wiki/Load_balancing_(computing)>

 

[13] ”Multithreading (computer hardware).”  Wikipedia.  Wikimedia Foundation, Inc.  30 Nov. 2007  <http://en.wikipedia.org/wiki/Multithreading_(computer_hardware)>

 

[14] ”Computer Multitasking.”  Wikipedia.  Wikimedia Foundation, Inc.  2 Dec. 2007  <http://en.wikipedia.org/wiki/Computer_multitasking>

 

[15] ”Pipeline (computer).”  Wikipedia.  Wikimedia Foundation, Inc.  1 Dec. 2007  <http://en.wikipedia.org/wiki/Pipeline_(computer)>

 

[16] ”Instruction Pipeline.”  Wikipedia.  Wikimedia Foundation, Inc.  1 Dec. 2007  <http://en.wikipedia.org/wiki/Instruction_pipeline>

 

[17] Jain, A.K., M.N. Murty, and P.J. Flynn.  “Data Clustering: A Review.”  ACM Computing Surveys. Vol. 31, No. 3. Sept 1999.  9 Dec 2007.  <www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf>

 

[18] Basta, Stefano And Domenico Talia.  "Analysis of Communication Overhead in Parallel Clustering of Large Data Sets with P-AUTOCLASS"  Parallel Computing: Advances and Current Issues - Proceedings ParCo 2001.  Imperial College Press. pp. 265-273, 2002.  10 Dec. 2007.  <http://eproceedings.worldscinet.com/9781860949630/preserved-docs/9781860949630_0033.pdf>

 

[19] ”Cluster Analysis.”  Wikipedia.  Wikimedia Foundation, Inc.  10 Dec. 2007  <http://en.wikipedia.org/wiki/Cluster_analysis>

 

Last Modified: 7-28-08