Visual Studio C++ Parallel Programming

From Minor Miracle Software


  • Source Material
Visual Studio C++ Parallel Programming, March 2012[1] Source code is here[2]

Referred to as the Campbell Book.

  • Libraries

Parallel Patterns Library (PPL)[3][4]
Asynchronous Agents Library (AAL)[5]
These libraries do not work under Linux.
Intel's C++ compiler supports parallel programming and runs under Linux.

  • Exercises

The exercises offered are not rigorous enough. They should be hard programming problems. I'll have to look elsewhere for parallel programming exercises.

Parallel Programming Patterns

  1. Parallel Loop Pattern - apply an independent operation to multiple inputs simultaneously.
  2. Parallel Tasks Pattern - parallel control flow as a fork-and-join.
  3. Parallel Aggregation Pattern - merges partial results.
  4. Parallel Futures Pattern - emphasizes data flow dependencies between tasks.
  5. Parallel Dynamic Task Pattern - divide and conquer approach by spawning new tasks on demand.
  6. Parallel Pipeline Pattern - queue inputs and outputs.
  • Multithreaded[6] vs. Parallel[7]

Multithreaded execution interleaves separate threads on the same processor, each interrupting the others for resources. Parallel execution is one program executing across many processors, with the parts not interrupting each other.

  • Synchronization in any form is serialization.
  • A well-written parallel program runs at approximately the same speed as a sequential program when there is only one core.

A Word about Terminology

  • Concurrency and Parallelism

Concurrency is related to multitasking and asynchronous input-output (I/O). It usually refers to multiple threads of execution, each of which may get a time slice to execute before being preempted by another thread, which then gets its own time slice. Concurrency is necessary for a program to react to external stimuli such as user input, devices, sensors, and other interrupts. Operating systems and games, by their very nature, are concurrent, even on one core. Concurrency is about correctness and responsiveness.

With parallelism, concurrent threads execute at the same time on multiple cores. Parallel programming focuses on improving application performance by using multiple processor cores; when enough cores are available, threads are not constantly interrupted. Parallelism is about throughput.

  • Potential Parallelism

Amdahl's Law[8][9] limits parallelism: the sequential fraction of a program puts a hard ceiling on the speedup that adding cores can deliver.
The real-world performance increase is less than Amdahl's Law predicts due to increased overhead from coordinating between parallel threads and cores. $$ \text{Speedup According to Amdahl's Law}=\frac{1}{(1-P) + (P/N)}$$ Where P is the potential parallelism in the program, N is the processor count, and 1-P is the sequential part.

How much parallelism a program has depends on how many independent tasks it contains: tasks that share no variables and need no synchronization. Any algorithm that relies on shared variables, synchronization, or coordination reduces the potential parallelism.

What hardware will the program run on? How many CPU cores will it have, or have available? Since software outlives hardware, the best approach is to let a library, like PPL, manage task scheduling. Hand-crafting threads, mutexes, and locks is a sure way to turn a perfectly good parallel program into a sequential dinosaur.

  • Coordinating Tasks

Task order and potential parallelism in a program are constrained by the underlying algorithms used. Constraints can arise from control flow (algorithm steps) or data flow (input and output availability). Understanding the constraints will reveal which parallel patterns are useful. Remember, a program might use more than one pattern.

  • Scalable Data Sharing

When tasks running in parallel need to share data, each task, running in a different context, risks a race condition when it tries to update or read the same memory location.

Traditional concurrent programs synchronize threads by using locks, atomic compare-and-swap operations, and semaphores. Applying these tools to parallel programs just shifts the race condition from a variable to the locks. Avoid programming with locks; it usually fails anyway. Techniques for avoiding locks include read-only data, message passing between tasks, and queues.

Parallel Loops Pattern

This pattern requires task and data independence. Apply it where an independent operation is applied to independent data for some iteration count. The tools used are the parallel_for and parallel_for_each functions.

  • The AAL and PPL rely heavily on Lambda Expressions[10] so be familiar with them

The degree of parallelism is not expressed in the code; it is chosen at run time by the library.

How can you tell if the steps in a loop are independent? A variable shared by two or more steps makes them dependent and introduces errors, because that variable would hold different values at different times depending on which parallel task last used it. Objects in the loop might share an underlying variable that introduces subtle bugs for the same reason. Consider a loop that uses data accessors on an object (the body shown here is illustrative):

for( int i = 0; i < n; i++ )
  myObject[i].GetParent().Update();

There is a possibility that GetParent accesses a single, shared object, which would make the iterations dependent.

  • Loop-carried dependence

If this loop ran in parallel, the iterations would execute on different threads, each with a different value of i. Because step i reads data[i - 1], which a different thread may not have written yet, the results would be wrong.

for(int i = 1; i < N; i++)
  data[i] = data[i] + data[i - 1];

Parallel methods do not guarantee any particular execution order. Unlike a sequential method, some higher-valued indices might be processed before some lower-valued indices.

  • Example 1. Parallel For Loop

Notice that results is captured by reference and that the loop's execution order does not matter.

vector<double> results = ...
int workLoad = ...
size_t n = results.size();

parallel_for(0u, n,
  [&results, workLoad](size_t i)
  {
    results[i] = DoWork(i, workLoad);
  });

  • Exceptions

An exception thrown in any iteration is rethrown in the calling thread, or it propagates up to the enclosing parallel method.

  • Advanced Parallel Scientific Patterns

Your best options are to look elsewhere in your program for opportunities for parallelism or to analyze your algorithm and see if it matches some of the advanced parallel patterns that occur in scientific computing. Parallel scan and parallel dynamic programming are examples of these patterns.

  #include <concrt.h>
  #include <concrtrm.h>
  • Partitioning ranges in parallel_for loops

Each iteration of the outer (parallel) loop handles a range of indices instead of individual indices. By grouping iterations into ranges, you can avoid some of the overhead of a normal parallel loop.

    • Does not work well when the iteration count does not divide evenly into ranges.
    • Does not work well when the loop body is large.
  • Controlling The Degree of Parallelism

The term degree of parallelism refers to how many cores are used to process iterations simultaneously. Usually the degree is managed by libraries.

Anti-Patterns: The things that can go wrong.

  • Hidden loop body dependencies - A loop body that silently depends on state written by another iteration.
  • Sharing a class instance that is not thread safe.
  • Sharing common variables that are not synchronized and protected.
  • Small loop bodies with few iterations. - The overhead for a parallel loop outweighs the benefit of using a parallel_for loop.
  • Duplicates in the Input Enumeration - Using the same variable, class, or pointer twice in a loop will create a race condition.
  • Scheduling Interactions with Cooperative Blocking - If every iteration of a parallel loop performs a cooperative blocking operation, the task scheduler may create more threads than intended. Perform cooperative blocking operations infrequently within a parallel loop.
  • Related Patterns - The Parallel Aggregation Pattern extends the Loop Pattern.
  • Breaking from a loop early.

Using the traditional break; statement does exit the loop but does not clean up the executing tasks. Use task_group::cancel to properly exit a parallel loop early and clean up the executing tasks.

  • Parallelism Degree

The term degree of parallelism refers to how many cores are used to process iterations simultaneously. While not recommended, these can be set by adjusting the MinConcurrency and MaxConcurrency policies in the SchedulerPolicy class.

From Chapter 2 Exercises, page 47, I completed Exercise 2 by writing a parallel program to count word frequency in text files.

Parallel Tasks Pattern

Parallel Loops is applying a single operation to many data elements. Data Parallelism.

Distinct asynchronous operations that can run simultaneously.

  • Parallel tasks are asynchronous operations that can run simultaneously, each with its own data. Task Parallelism.
  • The program's control flow can fork with tasks that can execute in parallel.
    • This is also called the Fork/Join or Master/Worker pattern.
  • New tasks do not necessarily begin execution immediately, unlike threads. They sit in a queue and run when the scheduler executes them. Execution is based on available processor resources.
  • Exception Handling. In PPL an unhandled exception is deferred until the task_group::wait method is called; the exception is then rethrown in the wait method's scope. If more than one exception is thrown, one is chosen to be rethrown when all the tasks complete and the others are ignored. An unhandled exception with no corresponding wait call cancels the whole task group. Use the is_canceling method to check whether a task is canceling or running.

The Basics - New Section. The Mechanism. Use the parallel_invoke function. The example below invokes the sequential methods DoLeft() and DoRight() in parallel.

  parallel_invoke(
    []() { DoLeft(); },
    []() { DoRight(); }
  );

  • The parallel_invoke function in the Concurrency namespace creates a group of parallel tasks and waits for them all to complete.
  • Coordinating Tasks with Cooperative Blocking

A task can suspend its execution and relinquish control to the task scheduler until a specific condition is met.

  • Canceling a Task Group

Use the cancel method to cancel all the tasks in a task group from any thread. If two tasks in different groups are cooperating and one task is canceled, the other task will throw an exception. A deadlock will occur if two tasks from different groups are cooperating and one is waiting for the other after a cancellation. Cancellations can propagate between task groups under certain circumstances. A cancellation request automatically propagates to another task group if a call to that group’s wait method is blocking a task from a task group that is being cancelled.

  • Speculative Execution

Predictive Speculative Execution: guess what a current computation will produce and start a new computation based on that value; if the guess is wrong, pay the time penalty and start over. Parallel Asynchronous Speculative Execution: start two or more computations when only one is needed. For example, start three searches for an item and, when it is found, cancel the other searches.

In the Parallel Asynchronous Speculative Execution example each search invokes the task_group::cancel method once a solution is found.

  task_group tg;
  tg.run([&tg](){ SearchLeft(tg); });
  tg.run([&tg](){ SearchRight(tg); });
  tg.run_and_wait([&tg](){ SearchCenter(tg); });

The SearchLeft method shown below invokes the cancel command upon completion.

 void SearchLeft(task_group& tg)
 {
    bool result =
                DoCpuIntensiveOperation(g_TaskMilliseconds/5, &tg);
    wcout << L" Left search finished, completed = " << result
    << endl;
    tg.cancel();  // cancel the sibling searches
 }

  • Anti-Patterns - When things go bad
    • Variables Captured by Closures

In C++, a closure[11] can be created using a lambda expression that represents an unnamed (anonymous) function. Closures can refer to variables defined outside their lexical scope, such as local variables that were declared in a scope that contains the closure.

In an incorrectly written closure captured variables might not behave as expected, especially in parallel programs. Problems occur when referencing a variable without considering its scope. Here’s an example.

 task_group tg;
 for (int i = 0; i < 4; i++)
    // WARNING: BUGGY CODE, i has unexpected value
    tg.run([&i]() { wcout << i << endl; });
 tg.wait();

This code is intended to print 0, 1, 2, and 3 in an arbitrary order, but that won't happen. The variable i is captured by reference, so a single i is shared by all the closures created by the for loop. By the time a task starts, the value of i will differ from its value when the task was created.

The solution is passing the shared variable by value in the appropriate scope.

 task_group tg;
 for (int i = 0; i < 4; i++)
    // GOOD CODE, i has the expected value
    tg.run([i]() { wcout << i << endl; });
 tg.wait();

Now it prints the numbers 0, 1, 2, 3 in an arbitrary order as intended.

  • Unintended Propagation of Cancellation Requests

API calls into the parallel library may create task groups that are internal to the library. If a task calls APIs you might unintentionally create a situation where your task group is waiting on a task group inside the API. Invoking the cancel method may implicitly cause the cancellation of task groups that were created by the library. This is called Transitive Propagation Cancellation and might not be the intended behavior. In some cases, you may prefer the library functions run to completion even though a higher-level component is beginning to shut down.

To avoid unintended propagation, call into library functions that may depend on task-group wait operations from a neutral, non-PPL thread context.

  • Task Group Calling Conventions

After invoking run, call the object's wait method so that all the tasks complete before the task group is destroyed. Otherwise the runtime will throw a missing_wait exception.

  • Tasks and Threads

When a task begins to run, the applicable task scheduler invokes the task’s work function in a thread it chooses. Tasks do not migrate among threads at run time. This lets you use thread-affine abstractions, like critical sections, because you don't have to worry about the task operating in different threads.

How to change a program from task parallelism to data parallelism?
Data parallelism and task parallelism are two ends of a spectrum. Data parallelism occurs when a single operation is applied to many inputs. Task parallelism uses multiple operations, each with its own input.

Answer Question 1.
Using Data Parallelism instead of Task Parallelism: first rewrite the program to take the images, perform all operations on each image, then blend them - data centric rather than task centric. So you'd simply put a while loop in the ParallelTaskGroupImageProcessing method. In this example there is no real difference either way.

  • Need a better programming exercise here.

Pattern 4 Parallel Aggregation

An aggregation, say computing a sum, breaks under parallel loops because the 'count' variable is shared by all iterations. The Parallel Aggregation Pattern, a.k.a. the Parallel Reduction Pattern, solves this.
p.46 The answer is not to apply synchronization operations to the existing sequential algorithm in order to make it “safe” for parallel execution.

Exercises. No programming exercise. I'll need to find them elsewhere.
Done, onto chapter 5

Pattern 5 Futures

The “Parallel Tasks Pattern” allows forking the flow of control in a program. In the "Parallel Future Pattern" the control flow and data flow can be integrated. A future is a stand-in for a computational result that is initially unknown but becomes available at a later time. Calculating the result can occur in parallel with other computations. The Futures Pattern integrates task parallelism with arguments and return values.

  • Futures are asynchronous functions.
  • This differs from the std::future class introduced in C++11.

Futures express potential parallelism by decomposing a sequential operation into futures. This can result in faster execution if hardware resources are available for parallel execution. However, if all cores are otherwise occupied, futures will be evaluated without parallelism.

A future is a task that returns a value when a value is ready. No need to use an explicit wait command. If the task has already finished, its result is waiting and is immediately returned. If the task is running but has not yet finished, the thread that needs the result blocks until the result value becomes available. If the task hasn’t started yet, the pending task may be executed in the current thread context.

The single_assignment class used here is a messaging buffer. The send and receive functions allow concurrency-safe communication of single data values. Messaging buffers are covered in “Pipelines.”

p. 62

External References

  1. Duffy, Joe. "Concurrent Programming on Windows", Addison-Wesley, 2008.
  2. Mattson, Timothy G., Beverly A. Sanders, and Berna L. Massingill. "Patterns for Parallel Programming". Addison-Wesley, 2004.
  3. OPL, Our Pattern Language for Parallel Programming ver2.0,2010.[12]

Internal Links

Parent Article: Main Page