Parallel Computing Projects
for Academic Environments
Puevfgbcure Unecre, Senior in Computer Science,
December 13, 2007
Experience in parallel computing is an increasingly necessary skill for today’s upcoming computer scientists, as processors are hitting a serial execution performance barrier and turning to parallel execution for continued gains. However, parallel programming is not a trivial undertaking; it has many unique challenges that are best learned through hands-on experience on suitable parallel computing setups. This paper aims to assist students and professors in deciding on the best ways to learn parallel computing by providing an assessment of the current state of the subject.
Contents
- Symmetric Multiprocessing (SMP)
- Projects in Parallel Computing
- Massively Parallel Processor (“Supercomputer”)
- GPGPU (General-Purpose computing on Graphics Processing Units)
- Selected Projects in Parallel Computing
- Simple Fork on Embarrassingly Parallel Problem
- Performance for Known Parallel Setups and Applications
- Distributed Load Balancing
- Data Clustering for a Basic Search Engine
Parallel Computing (or
Parallel Processing) is the utilization of multiple computer processors working
together concurrently for increased performance and thus faster computation. This
is possible because programming algorithms can often be divided into smaller,
independent parts that can be executed in parallel, side-by-side. However,
beyond this one common goal of increased performance, all other aspects of
parallel computing splinter off into many forms with many uses.
There are a few general forms (architectures) of parallel computers, which can be separated into three major types: centralized (multiprocessor computers), distributed, and specialized multiprocessors. The first type is so named because it contains two or more processors that communicate over the memory or bus of a single computer. Multiprocessors that communicate over shared memory are typically multicore processors, as they have two or more processing units in a single CPU package. This is contrasted by symmetric multiprocessors, which communicate over a bus (an electrical channel between computer hardware components).
Although hardware implementation for these two flavors of multiprocessors differs,
they both function similarly enough at a software level to be programmed alike.
Distributed multiprocessors (or
computer clusters) are a collection of individual computers that communicate
over a network. Machines in a cluster are often virtualized to form one
autonomous computer. However, they can also be made from individually
interfaced or administered computers. Another type of cluster is referred to
as grid computing because its topology is similar to a power grid. And like a
power grid, grid computing is not geographically bound due to the advent of the
Internet. Even though distributed multiprocessors can function in most of the same ways as centralized multiprocessors, their major limiting factor is the speed of their communication, which can be too slow or unreliable to perform some roles effectively.
As with the major architectures of parallel computing, there are generally two major roles that multiprocessors can perform. These roles are based around the levels of parallelism to which different algorithms can be adapted (see Figure 1). Instruction-level parallelism has different processing units (or nodes) executing multiple independent instructions. Data-level parallelism has different nodes executing the same instructions on different data. And task-level parallelism has different nodes executing different instructions (and possibly on different data).
In a strictly computational
role, many nodes work on different parts of a single problem. While this role
can work with any level of parallelism, it lends itself particularly well to
the instruction and data levels. The algorithm can split up the instructions
or the data onto different nodes, whichever performs better. Load balancing is
the opposing role to computation; it attempts to keep the workload the same across all nodes. As such, load balancing performs best with data- or task-level parallelism, because the algorithm relies on different data and/or instructions that can be separated into different execution threads.
Figure 1. Illustrating Levels of Parallelism with Cutting a Block of Cheese
Projects in Parallel Computing
The bulk of the information presented in this section is composed of brief descriptions and evaluations of the different components of parallel computing. Four major categories of components were identified, which can generally be arranged into any configuration necessary. However, some components are understandably better matched in a configuration than others. (Only two categories are present in this version because of time constraints.) The categories are as follows:
- Architectures are the designs
and structures of computing hardware.
- Roles are the parallel
actions to be performed.
- Platforms are the system
software (operating systems) that allow interaction with the hardware.
- APIs (Application Programming Interfaces) simplify interaction between programs and parallel hardware.
Criteria for Evaluation
Before delving into the projects, it is necessary to define the criteria that will be used for evaluating them. These factors would be important to an academic venture that has limited resources and participants with limited prior experience.
- Cost – This one is rather
straightforward; it includes not only the original cost of materials but
also the usage and maintenance costs (e.g. electricity).
- Complexity – This is the
time it would likely take to implement the project given the participants’ previous experience.
- Educational Value – While
there is no quantitative measure for educational value, projects can be
assessed by their ability to teach fundamental concepts of parallel
computing such as synchronization and communication.
- Innovation – Usually an
opposing factor to educational value, innovation is an objective measure
of the degree to which a project is on the cutting edge of computing.
Several additional factors
exist when considering different components of parallel computing (especially
architectures), which, while they are not necessarily the deciding factors, are
still important intrinsic factors.
- Speedup is simply the ratio of performance increase of the parallel execution of some algorithm compared to the serial execution of the same algorithm, i.e. serial execution time divided by parallel execution time (a worked example follows this list). The quality of the algorithm’s programming is particularly important to speedup, but the architecture’s overhead is also a factor.
- Scalability is the ability
of a parallel processor to continue increasing its performance by the
ratio of speedup as more processors are added.
- Overhead is the amount of
processing performance lost due to factors necessary to the operation of a
parallel processor, such as communication and synchronization.
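As a worked example with purely illustrative numbers: if a serial run of an algorithm takes 60 seconds and the same algorithm on four processors finishes in 20 seconds, the speedup is 60 / 20 = 3.0 rather than the ideal 4.0; the missing performance is lost to overhead such as communication and synchronization. Scalability then asks whether that ratio keeps growing proportionally as more processors are added, or whether overhead eventually erases the gains.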
Centralized Multiprocessors
Centralized multiprocessors are so named because they contain two or more processors that communicate over the memory or bus of a single computer.
Centralized Multiprocessors, Multicore
- Cost: Prices for multicore
CPUs have already dropped substantially since their entry into the
consumer market only a couple years ago. The cheapest of the dual-core
processors can be purchased for only ~$60, whereas the most costly
processors are ~$700 or ~$1500 for desktop- and server-grade
respectively. The cheapest quad-core processors are selling for no less
than $200 currently. Considering a decently performing barebones system
of $300 without the processor (or peripherals), a single multicore machine
or a whole lab of multicore machines can be purchased for relatively little.
- Complexity: System
assembly is rather straightforward; machines can be built with a minimum
of hardware and operating system knowledge. A couple years of experience
in any high-level language is sufficient to be able to construct threaded
programs, which are the basis of parallel processing on multicore systems (a minimal sketch follows this list).
- Educational Value: These
multiprocessor machines are probably the best way to learn the basic
concepts of parallel programming. Communication and synchronization are
comparatively simple to program and results can be seen quickly.
- Innovation: While
multicore processors are only a couple years old, their introduction was
not as groundbreaking as one might think. Server-grade systems have been
using SMP architectures for many years now, and they are programmed in
much the same way as multicore processors.
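Because threaded programs are the basis of parallel processing on multicore systems, a minimal sketch may help make the idea concrete. It assumes a C++11 (or newer) compiler and shows two threads updating a shared counter, with a mutex providing the synchronization noted above; build with a threading-enabled compiler (e.g. g++ -std=c++11 -pthread).

    // Minimal sketch: two threads increment a shared counter, synchronized by a mutex.
    #include <iostream>
    #include <mutex>
    #include <thread>

    int main() {
        long counter = 0;
        std::mutex counter_mutex;

        auto work = [&]() {
            for (int i = 0; i < 100000; ++i) {
                std::lock_guard<std::mutex> lock(counter_mutex); // prevents a race condition
                ++counter;
            }
        };

        std::thread t1(work);
        std::thread t2(work);
        t1.join();
        t2.join();

        std::cout << "counter = " << counter << '\n'; // always 200000 with the mutex in place
        return 0;
    }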
Centralized Multiprocessors, Symmetric Multiprocessing (SMP)
- Cost: Since multicore
processors were released, SMP architectures have been almost exclusively
for server systems. As such, the processor, motherboard, and memory
prices are typically about twice those of a multicore system. Their only advantage over multicore would be their explicit suitability for server operations.
- Complexity: See Multicore
- Educational Value: See Multicore
- Innovation: As noted for
multicore architectures, this technology has been around awhile and is
rather well rooted in consumer-level systems and programming.
Distributed Multiprocessors
Distributed multiprocessors are a collection of individual computers that communicate over a network.
Distributed Multiprocessors, Beowulf Cluster
Beowulf clusters are made from a collection
of off-the-shelf computers that are stripped down to barebones components
(processor, memory, motherboard, network interface, power supply, and enclosure
only) and connected over a LAN. These
nodes are controlled by a head computer, which assigns tasks for them and
balances load. As with all distributed multiprocessors, Beowulf clusters need
specially designed programs to allow them to be of any use. However, their scalability is rather good up to a certain size (roughly 32 nodes), beyond which they can become unwieldy because of networking, space, heat, and power concerns.
- Cost: While Beowulf
clusters are most well known for their use of surplus or discarded machines,
they can still be built with a uniform set of new, high-performance barebones
machines. Thus, they can really be built for as little or as much as necessary.
If only for educational demonstration purposes, they can even be made for
free (barring electricity needs) from discarded institutional materials.
- Complexity: Network and
system setup for a Beowulf cluster can be a time-consuming logistical
challenge, especially with a large number of nodes. Programming them to
do something useful is probably not as tedious as the setup, but it will
still require several years of experience in a high-level language
(probably C/C++), networking, and Unix to get even a simple algorithm running (a minimal message-passing sketch follows this list).
- Educational Value: A
Beowulf cluster may be the best way to learn distributed parallel
programming because the nodes are designated for parallel program use
only, they can be cheaply built, and they are locally maintained.
- Innovation: Most, if not
all, decent-sized universities have built at least one Beowulf cluster
over the years. As such, this architecture is really nothing new.
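Programming a Beowulf cluster generally means message passing between nodes. The paper does not prescribe a particular API, so the following is only a hedged sketch using one common choice, the MPI library (it assumes an implementation such as Open MPI or MPICH is installed; compile with mpicxx and launch with mpirun):

    // Sketch: each node computes a partial sum of 1..1000 and the head node (rank 0)
    // collects the total -- a toy stand-in for task assignment and result collection.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this node's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of nodes

        long partial = 0;
        for (long i = rank + 1; i <= 1000; i += size) partial += i;

        long total = 0;
        MPI_Reduce(&partial, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) std::printf("sum = %ld (computed by %d nodes)\n", total, size);

        MPI_Finalize();
        return 0;
    }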
Distributed Multiprocessors, Commodity Cluster
A commodity cluster (also known as a “Cluster of
Workstations” or “Pile of PCs”) is similar to a Beowulf cluster in how parallel
processing is implemented, but its node topology is laid out in a way that
allows users to interact with the machines, such as a computer lab or office
computer network. I/O demands from users on the cluster’s machines thus lower
the overall performance of parallel computing due to overhead.
- Cost: Since this type of
cluster is generally made from existing machines, its use as a parallel
computer requires little to no additional cost.
- Complexity: Although the
only setup requirement for nodes is distributing the parallel programs, like
the Beowulf cluster, the commodity cluster needs to be programmed
explicitly for parallel processing to gain any benefit. As such, the
experience requirements are the same (substituting whatever operating
system is used).
- Educational Value: Depending
on the environment of the cluster’s nodes, a commodity cluster may be as
useful for teaching distributed multiprocessing as a Beowulf cluster. For
example, a computer lab would be excellent as it allows all the users to
collaborate on and interact with the multiprocessor.
- Innovation: See Beowulf
Distributed Multiprocessors, Grid Computing
Grid computing is essentially
clustering over a larger geographic area, such as a campus, region, or the entire
world. Since the communication speeds of more dispersed nodes are slower and
more variable, grid computing requires highly parallel algorithms that need
only limited or no communication between nodes.
- Cost: As grid computing
typically scavenges cycles from existing machines, there is little cost to
creating and scaling them up. Assuming task coordination is not also distributed across the grid, the only costs will come from maintaining one or more head machines and from bandwidth.
- Complexity: Setting up a
server to handle coordination is the only hardware aspect required. But
grid computing has other unique challenges, such as finding an
appropriately parallel problem and programming it to be almost completely
parallel. Also, distribution of the client parallel programs will require
additional consideration and probably the development of a website.
Expect to need experience in a high-level programming language (possibly
portable), networking and network programming, and web programming.
- Educational Value: Although
some aspects of parallel computing are needed more than others (less
communication and more synchronization), the basics are still present.
However, debugging on a large-scale distributed multiprocessor is
impractical, and thus a commodity cluster would be needed for testing.
- Innovation: As grid
computing has not existed for more than a decade (it is particularly
dependent on the widespread availability of consumer-level computers),
there is still room for improvement. Any project with enough innovation
or a prominent purpose has the potential to attract thousands of users
willing to give up their spare processing power.
Grid Computing, Virtual Supercomputer
These are grid multiprocessors with comparable
processing power to a traditional supercomputer. However, unlike a traditional
supercomputer, virtual supercomputers cannot count on a consistent amount of
this processing power as users and cycles come and go from the network.
Virtual supercomputers are generally thought of as having a global reach.
Specialized Multiprocessors
Specialized multiprocessors were developed to be highly efficient at handling a certain type of instruction on a certain type of data
in parallel. However, these processors can occasionally be adapted to other
roles which they were not designed for and yet maintain a performance advantage
over general-purpose parallel processors.
Specialized Multiprocessors, Massively Parallel Processor (MPP)
MPPs are designed explicitly to be the most
scalable parallel processors, so that they can easily contain the several
thousand processors needed for supercomputer status (though the actual number
of processors is arbitrary). Although
supercomputers of the past often used specialized processors (e.g. vector processors), the trend in today’s supercomputers is to utilize consumer-level general-purpose processors (e.g. IBM PowerPC, Intel Xeon) with little or no modification. They are typically set up as a mix of distributed and centralized multiprocessors
where, for example, each cabinet might hold 128 processors in a centralized MP
and all the cabinets are connected to form a distributed MP.
- Cost: Depending on the
number of processors intended for the computer, they can cost $100
thousand to $100 million. Even with a huge grant, these massive machines
are not practical for a small university environment.
- Complexity: Excluding the
logistics of setting up and maintaining a supercomputer, programming them remains
a huge undertaking. Algorithms would need to be extremely parallel; communication
would be mind-boggling; and special tools would be needed just to measure performance.
- Educational Value: Their
massive size is much too unwieldy to learn the basics of multiprocessing.
Additionally, having students interact with such an expensive and
overwhelming system is a substantial liability.
- Innovation: While
supercomputers have been around in many forms over the last sixty years,
the newest ones are still at the cutting edge of performance, typically handling
computations for some of science’s greatest pursuits such as weather,
biology, and electronics. Possessing a supercomputer will instantly place any institution among the cream of the crop in computer science.
Massively Parallel Processor, Vector Processor
Specialized for doing vector arithmetic on a
huge scale, these supercomputers are only good at “crunching numbers”.
Massively Parallel Processor, Processor Array
Similar to a stream processor, each specialized
or general-purpose processor in the array executes instructions on different
streams of data (or a single stream demuxed to many). They are particularly
well suited to large-scale video or audio processing.
Specialized Multiprocessors, Stream Processing
Stream processors use strictly data-level
parallelism to execute the same operation on numerous data streams.
Stream Processing, GPGPU
General-Purpose computing on Graphics Processing
Units is a new trend that uses the programmable stream processors of the latest
GPUs to execute code on non-graphics data. GPUs use a stream processing
architecture because they frequently need to perform the same transformation on
many pixels or vertices, of which there can be millions in a single frame.
- Cost: Graphics adapters
that support these programmable stream processors vary widely in cost, but
the newest generations have cost roughly the same as the newest CPUs. Considering that a CPU is also required to build a system with a GPU, that puts GPGPU at a cost disadvantage compared to a general-purpose CPU alone.
- Complexity: GPGPU programming uses C-like languages that function at a low level (like the shader languages they are derived from), which may be particularly difficult to code without several years of C/C++ experience as well as some shader language and graphics programming experience.
- Educational Value: Stream
processing with GPUs is not necessarily the best architecture for learning
parallel processing fundamentals because of their non-traditional
operation with respect to functionality and programming.
- Innovation: Programmable graphics adapters are an emergent technology that has not quite matured enough for widespread acceptance; only the most recent generation of graphics adapters supports code branching. However, if the selected role requires massive data-level parallelism, GPGPU should not be overlooked, as it can provide better performance than general-purpose CPUs, particularly for floating-point intensive algorithms.
Load Balancing
Load balancing is an approach to parallel
processing that is concerned with distributing operations evenly across all
nodes in a multiprocessor so as to keep the utilization high. Typically, load
balancing itself is considered to be a role for distributed multiprocessors only,
but the concept applies to other roles.
Load Balancing, Distributed Load Balancing
Distributed load balancing is often employed on
network service clusters to keep service requests from overwhelming a single
node. Load balancers (directors) handle access and redirection of requests,
which allows the cluster to be viewed as a single server from the outside. If
any node fails, the service’s availability is maintained by the other nodes in the cluster (a minimal sketch of the director idea follows the list below).
- Cost: At least two nodes
are needed for load balancing. As the number of nodes increases, an
additional director node is usually required to simplify access and reduce overhead.
- Complexity: The most
complex aspect of load balancing is the method of balancing, either
collectively or through a director. Aside from that, one only needs to
ensure that all nodes have the same programs.
- Educational Value: Although
the load balancing algorithm does require some forms of communication and
synchronization (redirecting requests and monitoring load), they differ
from those traditionally taught.
- Innovation: Distributed
load balancing is not new. There are many hardware and software products that can automate part, if not all, of setting up a load-balancing cluster.
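To make the director idea concrete, here is a hedged sketch with hypothetical node names and no real networking: the director simply assigns each incoming request to the next backend node in round-robin order. Real load balancers also track node health and current load, which is omitted here.

    // Minimal sketch of a round-robin director (hypothetical node names, no sockets).
    #include <iostream>
    #include <string>
    #include <vector>

    class RoundRobinDirector {
    public:
        explicit RoundRobinDirector(std::vector<std::string> nodes)
            : nodes_(std::move(nodes)) {}

        // Pick the node that should handle the next request.
        const std::string& assign() {
            const std::string& chosen = nodes_[next_];
            next_ = (next_ + 1) % nodes_.size();
            return chosen;
        }

    private:
        std::vector<std::string> nodes_;
        std::size_t next_ = 0;
    };

    int main() {
        RoundRobinDirector director({"node-a", "node-b", "node-c"});
        for (int request = 1; request <= 6; ++request)
            std::cout << "request " << request << " -> " << director.assign() << '\n';
        return 0;
    }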
Load Balancing, Multithreading (and Multitasking)
Multithreading and multitasking are hardware and
operating system features that can be exploited by programs to allow load
balancing of threads and processes across multiple processors.
Multitasking and multithreading function in generally the same ways, except
that multitasking refers to simultaneous processes instead of threads.
Multithreading can also be used for computation by creating several threads of
the same instructions and providing them with different data to crunch.
- Cost: At least two nodes
are needed, of course.
- Complexity: Multithreading
difficulty depends on the needs of the algorithm. Many will require
communication and synchronization between threads (using shared memory or mutexes,
for example), adding to the complexity; but some algorithms may simply
need thread creation and termination.
- Educational Value: Multithreading
is perhaps the best way to learn parallel computing; all the major
concepts play a role—even race conditions. Actually, parallel hardware is
not even a requirement for multithreading, increasing the availability of
machines to learn with.
- Innovation: Multithreading
has become particularly popular recently as processors have moved away
from raw serial execution speed and towards parallel processing. New
methods for multithreading that can utilize multicore processor
performance will be well received.
Computation
Computation is the act of applying arithmetic operations to data; however, in parallel computing there are always vast amounts of data to process. Therefore, computational algorithms are almost always data-parallel.
- Cost: The number of nodes
in a computational multiprocessor is completely variable as the
architecture and algorithm should be highly scalable.
- Complexity: Although computational
algorithms can be very intricate or massive, they are typically
straightforward when it comes to making them parallel.
- Educational Value: Parallel
computation usually requires task assignment, synchronization and
communication between nodes, and returning results—all perfectly suitable
topics for learning parallel computing.
- Innovation: Most of the
world’s groundbreaking high-performance computing is involved in
number-crunching vast amounts of data for everything from weather modeling
to gene surveys and mapping the stars. However, these types of projects
are obviously out of reach for most institutions. Still, starting with a less groundbreaking algorithm is a good foundation for the more ambitious ones.
Computation, Pipelining (Assembly Line)
Pipelining is the only type of parallel computing that works with serial
algorithms. This is because the instructions of an algorithm can be spread out
over multiple nodes like an assembly line; as one stage finishes, it passes its
results to the next stage, and so on. Pipelining
is best used when an algorithm’s instructions depend heavily on the results of previous instructions, and thus cannot easily be made data-parallel. The stages of a pipeline should be split so that every stage needs roughly the same time-slice to complete its instructions; buffers should be used between stages to prevent any one stage from stalling the entire pipeline (a minimal two-stage sketch follows the list below).
- Cost: Typically,
pipelining requires as many nodes as possible without affecting speedup
through communication overhead or low node utilization. It should be
expected that a minimum of several nodes are needed, but really, it depends on the algorithm.
- Complexity: Since serial
algorithms are used in pipelining, its difficulty comes from splitting an
algorithm into appropriately sized stages. Unless a program is capable of
automatically splitting and load-balancing stages (difficult), a
pipeline’s program will need to be rewritten if the length of the pipeline
is changed or balancing is needed for better efficiency.
- Educational Value: As
pipelining is generally serial programming on parallel hardware, it is not
useful for learning parallel computing concepts.
- Innovation: Although
assembly lines are not quite as revolutionary as they were in the early 20th
century, they are occasionally still the best way to handle fine-grained, serially dependent work.
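A minimal two-stage sketch of the assembly-line idea, assuming C++11 threads; a mutex/condition-variable queue plays the role of the buffer between stages, and a sentinel value signals the end of the stream (the workload itself is purely illustrative).

    // Minimal two-stage pipeline sketch (C++11 threads).
    // Stage 1 squares numbers; stage 2 sums the squares.
    // The queue is the inter-stage buffer described above; -1 is the end-of-stream sentinel.
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<long> buffer;
    std::mutex m;
    std::condition_variable cv;

    void push(long v) {
        std::lock_guard<std::mutex> lock(m);
        buffer.push(v);
        cv.notify_one();
    }

    long pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !buffer.empty(); });
        long v = buffer.front();
        buffer.pop();
        return v;
    }

    int main() {
        // Stage 1: produce squares of 1..10 and pass them downstream.
        std::thread stage1([] {
            for (long i = 1; i <= 10; ++i) push(i * i);
            push(-1); // sentinel: no more work
        });

        // Stage 2: consume squares until the sentinel arrives and sum them.
        long total = 0;
        std::thread stage2([&total] {
            for (long v = pop(); v != -1; v = pop()) total += v;
        });

        stage1.join();
        stage2.join();
        std::cout << "sum of squares = " << total << '\n'; // 385
        return 0;
    }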
Computation, Redundancy
In computing, redundancy is often used to guard
against errors (fault tolerance). Processing can also use redundancy in this
manner to ensure that output is consistent; however, this is really only useful
for mission critical systems.
- Cost: Only two nodes (or
multiprocessor systems) are really needed. More than three nodes is excessive.
- Complexity: Redundancy
simply requires the duplication of instructions on another node, possibly
with a data stream duplication mechanism placed at the beginning and a
consistency check at the end (a minimal sketch follows this list).
- Educational Value: Like
pipelining, parallel computing redundancy is serial programming on
parallel hardware, and, although some synchronization is required, redundancy
is still poor for learning parallelism. It is also exceptionally boring
until (if ever) a node actually produces inconsistent results.
- Innovation: Redundancy for
fault tolerance has been around for decades and has hardly changed, so
there is really no way to innovate here. However, there may still be uses
for redundancy that have yet to be explored.
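A hedged sketch of that consistency check (the computation is a stand-in; in a real system the two redundant runs would execute on separate nodes rather than separate threads):

    // Sketch: run the same computation redundantly and compare the results.
    #include <iostream>
    #include <thread>

    // Placeholder workload; in practice this is the mission-critical computation.
    long compute(long input) {
        long result = 0;
        for (long i = 1; i <= input; ++i) result += i * i;
        return result;
    }

    int main() {
        long a = 0, b = 0;
        std::thread first([&a] { a = compute(1000); });   // redundant copy 1
        std::thread second([&b] { b = compute(1000); });  // redundant copy 2
        first.join();
        second.join();

        if (a == b)
            std::cout << "results consistent: " << a << '\n';
        else
            std::cout << "inconsistency detected: " << a << " vs " << b << '\n';
        return 0;
    }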
Computation, Data Clustering
Data clustering is a popular computational
technique involved with the organization of data into clusters. It can be used
for any data that can be ordered into groups but is particularly useful in visual
recognition (through image segmentation), information organization and
retrieval (e.g. search engines), and data mining. Data mining attempts to find
meaningful patterns in data, as opposed to standard clustering, which is
already aware of the patterns. Because data clustering is concerned with
executing the same instructions on hordes of data (the definition of data parallelism), it is relatively simple to make a data clustering algorithm execute in parallel.
- Cost: The number of nodes is completely variable, as with any computational multiprocessor (see Computation).
- Complexity: Not only are
the algorithms for clustering complicated (especially mathematically), but
reading in the data and finding patterns can be even more difficult (depending
on the type of data).
- Educational Value: Clustering
relies heavily on communication to update the current analysis results,
which are needed by all nodes. In fact, the amount of communication
required may be so much as to cause significant overhead.
These kinds of concerns, plus the complexity of clustering algorithms themselves, may mean that data clustering is not the best subject for beginners.
- Innovation: As computers have become more popular, faster, and capable of handling the large amounts of data (either large data items or large numbers of smaller data items) made copious by the Internet, data clustering has become even more processing-intensive. Therefore, research into making clustering algorithms faster or more efficient is at the forefront of many cutting-edge technologies in information indexing, biological processes, marketing, and government data analysis.
To make the findings for the evaluation criteria more understandable at a glance, the data was quantified so that it could be viewed visually (Figure 2). For each component, a rating of one to five was given for each criterion as to how suitable (cheapest, simplest, most educational, most innovative) that component would be in an academic setting. Ratings were assigned relative to the suitability of other components’ criteria (i.e. A is better than B is better than C) to make the scale of measurement run from worst to best (bottom to top, respectively).
Figure 2. Quantitative Assessment of the Evaluation Criteria
Overall, multicore, Beowulf clusters, and commodity clusters are the most suitable architectures; however, grid computing may be more appealing as it is becoming increasingly popular through recent innovation. Multithreading is the best role for centralized multiprocessors, because it is easy to implement (even without parallel hardware), is widely accepted, exhibits many of the fundamental properties of parallel computing, and is versatile (used for either load balancing or computation). For distributed multiprocessors, computation is the most popular role as it offers sheer speed to simple number-crunching algorithms; for a change of pace, data clustering and load balancing make for more interesting algorithms and implementations.
Selected Projects in Parallel Computing
Simple Fork on Embarrassingly Parallel Problem
- Architecture: Centralized
- Role: Multithreading / Computation
Probably the most basic
project possible in parallel programming, forking a process or thread to create
multiple execution threads can be used on a simple embarrassingly parallel
problem to teach basic concepts of parallel computing in addition to thread
safety. Embarrassingly parallel problems are ones that are easy to separate
into parallel parts and require no communication dependencies. A couple
examples would be simple arithmetic (counting, multiplying matrices) and brute
force computations (cracking a short document encryption key). Gradually,
problems that require more communication and thus synchronization can be built
on top of the previous project for continued educational value.
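As a hedged sketch of such a starting point (C++11 threads; the problem and its size are chosen purely for illustration), the following splits a large summing task into independent chunks, one per thread, with no communication until the final join:

    // Sketch: embarrassingly parallel sum of 1..N split across hardware threads.
    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const long long N = 100000000;   // size of the toy problem
        const unsigned workers = std::max(2u, std::thread::hardware_concurrency());

        std::vector<long long> partial(workers, 0);
        std::vector<std::thread> threads;

        for (unsigned w = 0; w < workers; ++w) {
            threads.emplace_back([&, w] {
                // Each thread sums its own independent slice -- no communication needed.
                long long begin = w * (N / workers) + 1;
                long long end   = (w == workers - 1) ? N : (w + 1) * (N / workers);
                for (long long i = begin; i <= end; ++i) partial[w] += i;
            });
        }
        for (auto& t : threads) t.join();

        long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
        std::cout << "sum = " << total << '\n';  // equals N*(N+1)/2
        return 0;
    }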
Performance for Known Parallel Setups and Applications
- Architecture: Any
- Role: Computation
A variety of existing
parallel applications can be run on any available parallel hardware to measure
performance through a number of different criteria: completion time,
utilization, communication overhead, speedup, etc. This project is useful in
exploring the different factors that can affect parallel algorithm performance
and thus what algorithms perform better on what architectures. It can be used
as an introduction for students to create their own parallel programs, applying
the knowledge gained to create algorithms that are more efficient.
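A simple way to begin such measurements is sketched below (C++11 <chrono>; the serial and parallel routines are empty placeholders to be replaced with whatever implementations are actually being compared):

    // Sketch: timing a serial and a parallel run of the same workload to compute speedup.
    #include <chrono>
    #include <functional>
    #include <iostream>

    // Time an arbitrary piece of work in seconds.
    double time_seconds(const std::function<void()>& work) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(stop - start).count();
    }

    int main() {
        // Placeholders: substitute the real serial and parallel implementations here;
        // with the bodies empty, the measured times are meaningless.
        auto serial_version   = [] { /* run the algorithm on one processor  */ };
        auto parallel_version = [] { /* run the algorithm on all processors */ };

        double t_serial   = time_seconds(serial_version);
        double t_parallel = time_seconds(parallel_version);

        // Speedup = serial time / parallel time (see the evaluation criteria above).
        std::cout << "serial:   " << t_serial   << " s\n"
                  << "parallel: " << t_parallel << " s\n"
                  << "speedup:  " << (t_serial / t_parallel) << "x\n";
        return 0;
    }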
Distributed Load Balancing
- Architecture: Distributed
- Role: Load Balancing
Distributed load balancing is
one of the best projects for illustrating the benefits of parallelism without
all the parallel programming. This can be particularly appealing for a
parallel hardware-only course, as existing software can be set up on worker
nodes. Thus, the only programming would be for job assignment (and possible
result collection) from the head node. The simple program requirements would
allow for changes in node topology, interconnections, the number of nodes, and
individual node performance to be made, demonstrating how these affect the
performance of the cluster.
Data Clustering for a Basic Search Engine
- Architecture: Any
- Role: Data Clustering
For more advanced students,
data clustering is a good balance of more challenging parallel and serial
programming. Creating a parallel local search engine is an easily implemented
and practical way to learn clustering while gaining performance benefits from
any choice of parallel hardware (though obviously not a grid computer). Like its web-based counterparts, a local search engine can traverse file systems looking for plaintext files (or formatted text files like HTML for more of a challenge) and assign items to clustering nodes. The clustering nodes need not be too
complex; a simple word usage count that filters common words is sufficient. Optimizations
can always be made later, time permitting.
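As a hedged sketch of the word-counting step just described (the file name and stop-word list are illustrative, and a real project would hand each clustering node its own subset of files):

    // Sketch: per-file word-usage counts with a common-word (stop-word) filter.
    #include <cctype>
    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>
    #include <unordered_map>

    // Count word occurrences in one plaintext file, skipping common words.
    std::unordered_map<std::string, int> count_words(const std::string& path) {
        static const std::set<std::string> common = {"the", "a", "an", "and", "of", "to", "in", "is"};
        std::unordered_map<std::string, int> counts;  // hash table keyed by word
        std::ifstream in(path);
        std::string word;
        while (in >> word) {
            // Normalize to lowercase letters only.
            std::string clean;
            for (char c : word)
                if (std::isalpha(static_cast<unsigned char>(c)))
                    clean += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            if (!clean.empty() && common.count(clean) == 0) ++counts[clean];
        }
        return counts;
    }

    int main() {
        // Illustrative file name; a real engine would traverse the file system instead.
        for (const auto& entry : count_words("example.txt"))
            std::cout << entry.first << ": " << entry.second << '\n';
        return 0;
    }

Note that the std::unordered_map used here is itself a hash table, which anticipates the indexing discussion below.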
Indexing is the biggest
concern for the initial creation of a search engine. While a simple text log
of the cluster analysis would work, retrieving results would get increasingly
time-consuming as more files are indexed. Saving the analysis as a hash table
structure will improve search time, but may be too complex and out of the
project’s scope. Perhaps the best method of indexing is to use proven existing
hashing algorithms and/or databases to store results.