6/6/10

Fundamental Concepts of Parallel Programming

By Richard Gerber and Andrew Binstock , June 01, 2010
Parallel programming requires designers to rethink the idea of process flow
Richard Gerber has worked on numerous multimedia projects, 3D libraries, and computer games for Intel. Andrew Binstock is the principal analyst at Pacific Data Works and author of "Practical Algorithms for Programmers". They are the authors of Programming with Hyper-Threading Technology.
--------------------------------------------------------------------------------
Parallel programming makes use of threads to enable two or more operations to proceed in parallel. The entire concept of parallel programming centers on the design, development, and deployment of threads within an application and the coordination between threads and their respective operations. This article examines how to break up traditional programming tasks into chunks that are suitable for threading. It then demonstrates how to create threads, how these threads are typically run on multiprocessing systems, and how threads are run using HT Technology.

Designing for Threads
Developers unacquainted with parallel programming are generally comfortable with traditional programming models such as single-threaded declarative programs and object-oriented programming (OOP). In both cases, a program begins at a defined point -- such as main() -- and works through a series of tasks in succession. If the program relies on user interaction, the main processing instrument is a loop in which user events are handled. From each allowed event -- a button click, for example -- an established sequence of actions is performed that ultimately ends with a wait for the next user action.
When designing such programs, developers enjoy a relatively simple programming world because only one thing is happening at any given moment. If program tasks must be scheduled in a specific way, it's because the developer chooses a certain order to the activities, which themselves are designed to flow naturally into one another. At any point in the process, one step generally flows into the next, leading up to a predictable conclusion, based on predetermined parameters -- the job completed -- or user actions.
Moving from this model to parallel programming requires designers to rethink the idea of process flow. Now, they must try to identify which activities can be executed in parallel. To do so, they must begin to see their programs as a series of discrete tasks with specific dependencies between them. The process of breaking programs down into these individual tasks is known as decomposition. Decomposition comes in three flavors: functional, data, and a variant of functional decomposition, called producer/consumer. As you shall see shortly, these different forms of decomposition mirror different types of programming activities.

Functional Decomposition
Decomposing a program by the functions it performs is called "functional decomposition", also called "task-level parallelism". It is one of the most common ways to achieve parallel execution. Using this approach, individual tasks are catalogued. If two of them can run concurrently, they are scheduled to do so by the developer. Running tasks in parallel this way usually requires slight modifications to the individual functions to avoid conflicts and reflect that these tasks are no longer sequential.
If discussing gardening, functional decomposition would suggest that gardeners be assigned tasks based on the nature of the activity: If two gardeners arrived at a client's home, one might mow the lawn while the other weeded. Mowing and weeding are separate functions broken out as such. To accomplish them, the gardeners would make sure to have some coordination between them, so that the weeder is not sitting in the middle of a lawn that needs to be mowed.
In programming terms, a good example of functional decomposition is word processing software, such as Microsoft Word. When a very long document is opened, the user can begin entering text right away. While the user is doing this, document pagination occurs in the background, as can readily be seen by the quickly increasing page count that appears in the status bar. Text entry and pagination are two separate tasks that, in the case of Word, Microsoft has broken out by function and run in parallel. Had it not done this, the user would be obliged to wait for the entire document to be paginated before being able to enter any text. Many of you will recall that this wait was common on early PC word processors.

Producer/Consumer
Producer/consumer (P/C) is so common a form of functional decomposition that it is best examined by itself. Here, the output of one task, the producer, becomes the input to another, the consumer. The important aspects of P/C are that both tasks are performed by different threads and the second one -- the consumer -- cannot start until the producer finishes some portion of its work.
Using the gardening example, one gardener prepares the tools -- puts gas in the mower, cleans the shears, and other similar tasks -- for both gardeners to use. No gardening can occur until this step is mostly finished, at which point the true gardening work can begin. The delay caused by the first task creates a pause for the second task, after which both tasks can continue in parallel. In computer terms, this particular model occurs frequently.
In common programming tasks, P/C occurs in several typical scenarios. For example, programs that must rely on the reading of a file are inherently in a P/C scenario: the results of the file I/O become the input to the next step, which might well be threaded. However, that step cannot begin until the reading is either complete or has progressed sufficiently for other processing to kick off. Another common programming example of P/C is parsing: an input file must be parsed, or analyzed semantically, before the back-end activities, such as code generation in a compiler, can begin.
The P/C model has several interesting dimensions absent in the other decompositions:
•The dependence created between consumer and producer can cause formidable delays if this model is not implemented correctly. A performance-sensitive design seeks to understand the exact nature of the dependence and diminish the delay it imposes. It also aims to avoid situations in which consumer threads are idling while waiting for producer threads.
•In the ideal scenario, the hand-off between producer and consumer is completely clean, as in the example of the file parser. The output is context-independent and the consumer has no need to know anything about the producer. Many times, however, the producer and consumer components do not enjoy such a clean division of labor, and scheduling their interaction requires careful planning.
•If the consumer is finishing up while the producer is completely done, one thread remains idle while other threads are busy working away. This issue violates an important objective of parallel processing, which is to balance loads so that all available threads are kept busy. Because of the logical relationship between these threads, it can be very difficult to keep threads equally occupied in a P/C model.

Data Decomposition
Data decomposition, also known as "data-level parallelism", breaks down tasks by the data they work on, rather than by the nature of the task. Programs that are broken down via data decomposition generally have many threads performing the same work, just on different data items. For example, consider a program that is recalculating the values in a large spreadsheet. Rather than have one thread perform all the calculations, data decomposition would suggest having two threads, each performing half the calculations, or n threads performing 1/nth the work.
If the gardeners used the principle of data decomposition to divide their work, they would both mow half the property and then both weed half the flower beds. As in computing, determining which form of decomposition is more effective depends a lot on the constraints of the system. For example, if the area to mow is so small that it could not warrant two mowers, it would be better done by just one gardener -- functional decomposition -- and data decomposition could be applied to other task sequences, such as when the mowing is done and both gardeners begin weeding in parallel.

Implications of Different Decompositions
Different decompositions provide different benefits. If the goal, for example, is ease of programming and tasks can be neatly partitioned functionally, then functional decomposition is more often than not the winner. Data decomposition adds some additional code-level complexity to tasks, so it is reserved for cases where the data is easily divided and performance is important.
However, the most common reason for threading an application is performance. And here the choice of decompositions is more difficult. In many instances, the choice is dictated by the problem domain: some tasks are much better suited to one type of decomposition. But some tasks have no clear bias. Consider for example, processing images in a video stream. In formats with no dependency between frames, programmers have a choice of decompositions. Should they choose functional, in which one thread does decoding, another color balancing, and so on, or data decomposition, in which each thread does all the work on one frame and then moves on to the next? To return to the analogy of the gardeners, the query would take this form: If two gardeners need to mow two lawns and weed two flower beds, how should they proceed? Should one gardener only mow -- that is, they choose functional -- or should both gardeners mow together then weed together?
In some cases -- for instance when a resource constraint exists, such as only one mower -- the answer emerges quickly. In others where each gardener has a mower, the answer comes only through careful analysis of the constituent activities. In the case of the gardeners, functional decomposition looks better, because the start-up time for mowing is saved if only one mower is in use. Ultimately, the right answer for parallel programming is determined by careful planning and testing. The empirical approach plays a more significant role in design choices in parallel programming than it does in standard single-threaded programming.
As mentioned previously, P/C situations are often unavoidable, but nearly always detrimental to performance. The P/C relation runs counter to parallel programming because it inherently makes two activities sequential. This sequential aspect occurs twice in a P/C situation: once at the start, when the consumer thread is idling as it waits for the producer to produce some data, and again at the end, when the producer thread has completed its work and is idling as it waits for the consumer thread to complete. Hence, developers who recognize a P/C relation are wise to try to find another solution that can be parallelized. Unfortunately, in many cases, this cannot be done. Where P/C cannot be avoided, developers should work to minimize the delay caused by forcing the consumer to wait for the producer. One approach is to shorten the activity required before the consumer can start up. For example, if the producer must read a file into a buffer, can the consumer be launched after the first read operation rather than waiting for the entire file to be read? By this means, the latency caused by the producer can be diminished. Table 1 summarizes these forms of decomposition.

Challenges
The use of threads enables performance to be significantly improved by allowing two or more activities to occur simultaneously. However, developers cannot fail to recognize that threads add a measure of complexity that requires thoughtful consideration to navigate correctly. This complexity arises from the inherent fact that more than one activity is occurring in the program. Managing simultaneous activities and their possible interaction leads to confronting four types of problems:
•Synchronization is the process by which two or more threads coordinate their activities. For example, one thread waits for another to finish a task before continuing.
•Resource limitations refer to the inability of threads to work concurrently due to the constraints of a needed resource. For example, a hard drive can only read from one physical location at a time, which deprives threads of the ability to use the drive in parallel.
•Load balancing refers to the distribution of work across multiple threads so that they all perform roughly the same amount of work.
•Scalability is the challenge of making efficient use of a larger number of threads when software is run on more-capable systems. For example, if a program is written to make good use of four processors, will it scale properly when run on a system with eight processors?

Each of these issues needs to be handled carefully to maximize application performance.
--------------------------------------------------------------------------------
This article is based on material found in book Programming with Hyper-Threading Technology by Andrew Binstock and Richard Gerber. Sphere: Related Content

No hay comentarios: