By Ivar Jacobson and Steve Cook, May 12, 2010
UML comes of age -- but now what?
Ivar Jacobson is founder and CTO of Ivar Jacobson International and co-developer of UML. Steve Cook is a software architect at Microsoft and represents Microsoft at the OMG.
--------------------------------------------------------------------------------
More than 12 years have passed since the Unified Modeling Language (UML), became a standard. During these years opinions of UML have varied between delight and distaste. In this article, we discuss the deficiencies of the current UML specification and propose how to make it agile, leaner, smarter, and more flexible -- in short, how to prepare it for the future so that users can be confident that their investments in UML today will increase in value going forward.
At the beginning of the '90s there were 26 published methods on object-orientation, most with its own notation with its own set of icons. It was in this environment that UML was born. Although Grady Booch, Jim Rumbaugh, and Ivar Jacobson initiated the design of what became UML, many other contributors (including Steve Cook) quickly joined the effort and the Object Management Group (OMG) launched the result. UML quickly made most other methods -- along with their own notations, obsolete -- UML eventually became the standard we had hoped for, and toolbuilders and practitioners rapidly adopted the new approach.
Since UML first had outstanding success, we all knew that the pendulum would swing in the opposite direction some day -- and we were right. After a few years the setback arrived, but admittably for good reasons. For instance, at the outset there weren't many good UML tools. Some were very advanced, but hard to use. That disappointed many users and hindered the wide adoption of UML. The language received criticism from the academic world; for example David Parnas nicknamed it the "Undefined Modeling Language". The criticism was exaggerated, but not unfounded. Likewise, the original leaders of the agile movement were negative to modeling. They said "no modeling -- just code". Many people were skeptical about the tools, so they worked more with UML sketches on the whiteboard than formally with the tools themselves.
However, the pendulum is now swinging back. The tools have become better. Criticism from academics has mostly stopped. Agility has come to big companies and modeling is agile if done sensibly (and not as a "silver bullet"). Microsoft, for instance, has implemented UML in Visual Studio 2010, alongside domain-specific languages. Other important standards such as SysML are implemented as extensions to UML.
Thus it seems that today the world looks upon UML with a more balanced view. UML is not the panacea that it was sometimes sold as 10 years ago. Nor is it as bad as some academics, agilistas, and competitors have claimed. It is a practical tool to raise the level of abstraction of software from code to system level. Many organizations claim benefits from their use of UML, ranging from improved communication among developers to productivity gains using code generation from UML models.
UML is a good language to describe software systems at a higher level than code. Properly used UML can improve productivity and quality. After several years of consolidation, the OMG (the owner of the UML) is taking the initiative to improve it, having issued a Request for Information (RIF) on the Future Development of UML in 2009 under the leadership of Steve Cook.
The results of this RFI showed that users and vendors want UML to be leaner, more expressive, easier to learn, easier to integrate with other modeling and programming languages, and more relevant to today’s technologies. For example, some users want to use UML to drive the interactive behavior of a mobile application. Others want to use UML diagrams to automatically visualize the structure and dependencies within a massive distributed service-oriented application. Some would like models to be deeply integrated with modern programming languages without any semantic conflicts. Increasingly, users would like to use UML to model applications deployed in the Cloud. To address these needs, the OMG and UML vendors are working together towards making UML smarter and more agile.
Size and Complexity
One of the biggest complaints about UML is that it is too large and too complex. Typically a project that uses UML only uses 20% of the specification to help build 80% of the code and other deliverables. This 20% can vary according to the type of project: real-time systems, telecommunications, web applications, business applications, etc. What is considered essential may vary according to the kind of project, but in all cases the unused 80% obscures and complicates the essential.
To address this complaint, UML should be described differently for different user groups. There are ordinary users such as analysts, designers, website builders, database designers, developers, operators, architects, and testers, each bringing a different -- but valid -- perspective that uses different but overlapping subsets of UML. A particular class of users comprises the designers of UML itself and UML tool builders. It goes without saying that if the language is complex, these designers will have a hard time creating a language that is complete, consistent, extensible, and able to integrate with other languages, and the number of specification defects will become high.
Figure 1 depicts the main components of UML 2 and the dependencies between them. Although there is some layering, the overall structure contains a lot of cyclic dependencies, which makes it difficult to define useful subsets of the language. The UML specification does define formal "compliance points" which supposedly describe legal subsets of UML, but UML tool vendors have taken little or no notice of these, because they do not correspond to important use cases for UML users.
Figure 1: Components and Dependencies of UML 2.
A key point with the current UML is that there is no way in a compliant implementation to use the simple version of a concept without having the complicated version; for example, take Class. Most users think of a Class as a simple thing that has attributes, operations, inheritance etc. But a UML 2 class also has Ports, Parts, Connectors, and Receptions -- concepts only useful in specialized domains. There is no way to have just the simple one, so all users are burdened with the understanding required by advanced users. This can be -- and is -- mitigated to some extent by good tools. However, we believe that the simple options should be inherent in the language definition itself. Furthermore, the UML Class differs in detail from the concept of Class found in any particular programming language, which introduces additional conceptual barriers between UML and those of its users who are software developers. Again, flexibility of these definitions should be inherent in the language, so that it can be fine-tuned to match to modern programming technologies.
In summary, there are two major challenges to be addressed: complexity of the UML specification itself, and the need to describe UML in coherent subsets that address the actual needs of users in particular domains.
To address the first challenge, as a direct response to the feedback from the RFI, the OMG has embarked on a program of simplification for the UML specification. By the middle of 2011, a much simplified version of the UML specification should be available in which cyclic dependencies and redundancy have been greatly reduced. This specification will be compatible with existing tools and models, but described in a way that makes it much more amenable to further simplification and integration.
Refactoring UML
Once the simplification of the UML specification is complete in 2011, we will be able to move onto the next phase, which will be to refactor UML so that it can more effectively address the changing needs of many different classes of users. This section proposes some techniques that we can apply for this. It will be very important to retain backwards compatibility with the existing UML while doing this: we must not introduce changes that invalidate existing investments in tools, models or training.
We suggest it is possible to create a very small kernel of no more than 20 elements such as objects, links, types and actions, so that almost everyone can learn it in a few hours. In these elements, we will only add things that are necessary to understand the kernel, so that it in a way becomes complete. The kernel by itself is not particularly useful to any developer as it is, although it might suffice for abstract conceptual modeling. It is intended to serve as a basis to describe useful elements.
On top of the kernel we now add more useful concepts. The most essential UML use cases can be defined as extensions to the kernel. An example of an essential use case of UML would be: "Developing from requirements to test".
Figure 2: Suggested simplification of UML through separation of concerns.
The key idea here is that the kernel should be kept small and only includes generic elements and attributes needed by most use cases of UML. Specific elements or attributes to existing elements in the kernel which are needed to provide support for the use case "Developing from requirements to test" are added along with that use case. Thus they don't belong to the kernel. Everything new is added with the new use case, since it supports the use case. This is the whole point. This is what is called an "aspect-oriented structure", where the kernel can be kept clean and simple without knowledge of how it will be extended. The new additions to the kernel will be added with the new elements coming with the use cases that need them.
Returning to the use case example of "Developing from requirements to test".... There are many specializations of this use case; for example, "Developing a web application from requirements to test", and variants such as complex system, distributed system, which all wouldn't be essentials, but extensions to the essential use cases.
Figure 3: Adding a non-essential use case.
Since the use case "Developing from requirements to test" in itself is a large use case, we would need to find its constituent smaller use cases: Use case modeling, Designing, Implementing, Testing, Deploying. They are all different aspects or separate concerns of UML (see " Aspect-Oriented Software Development With Use Cases). Note that there is a danger in breaking down the complete use case into constituent use cases, which is that it might create silos and increase the risk of getting waste, contradicting principles of lean development. We mitigate this by using the Framework design principle where "each component should only do what only that component can do". This forces clean minimal factoring.
Figure 4: Constituent use cases of the larger use case "Developing from requirements to test".
Technically, we need some enhancements to the mechanisms used to extend UML. Today there is a mechanism called Profiles which provides some of the required capability but is not well integrated with the rest of the UML architecture. A simple new mechanism for extending UML -- multiple dynamic classification of model elements -- is currently under development at the OMG, and is scheduled to be available as an OMG standard in 2011. This enables, for example, the specification of Ports to be completely separated from the specification of Classes, so that Ports can be added to Classes only for those advanced use cases where they are required.
Structuring UML like this will help eliminate today’s dilemma of choosing between UML and one or more domain-specific languages (DSLs). When UML is structured as a simple kernel plus extensions, new domains can be addressed by crafting further extensions that can be seamlessly integrated with the existing structures. Another mechanism currently under development at the OMG -- Diagram Definition -- will enable new standard diagram types to be specified and integrated with existing ones. Diagram Definition is also scheduled to be available as an OMG standard in 2011, and will be applied to UML and to other modeling languages including BPMN (Business Process Model and Notation).
To help users get started we should define a basic subset of UML, here called "Essential UML", which can be learnt quickly in a few days. This would include the simple aspects of most of the familiar UML diagrams. The rest of the UML can be added as a set of seamlessly interwoven, deeply integrated yet completely independent extensions without altering what already has been described, and taught. This is smart!
During the last 10 years UML has not adapted quickly enough to developments in the software world. Ten years ago, for example we didn’t have so many frameworks. Today we are inundated with frameworks. The work of programming has increasingly moved to become a job of connecting already developed components or to use existing frameworks. The amazing production of apps for mobile phones is a perfect example. Today, we have not found a place for UML in this context. That does not mean that such a place does not exist, just that we have not yet adapted UML to this context or made UML appetizing to use in these contexts.
As we move forward with improving the UML, however, the first step is to simplify the existing UML, so we get it to become what UML should have been. We need to shrink UML first to lay the foundation for radical expansion of UML into many different new and exciting domains. As the basis for this work is being laid, we can begin to enrich UML with the possible new constructs we need today and tomorrow.
As a consequence of these developments, users can carry on investing in UML with confidence that the value of their investment will grow as UML moves into new domains, and becomes easier to apply and to integrate with other technologies.
Sphere: Related Content
7/6/10
The Boost.Threads Library
By Bill Kempf, May 01, 2002
Standard C++ threads are imminent and will derive from the Boost.Threads library, explored here by the library's author.
Important Update
For an update of the Boost.Threads library, see the article What's New in Boost Threads? by Anthony Williams, maintainer of the Boost.Threads Library. This update appears in the November 2008 issue of Dr. Dobb's Journal.
--------------------------------------------------------------------------------
Just a few years ago it was uncommon for a program to be written with multiple threads of execution. Today Internet server applications run multiple threads of execution to efficiently handle multiple client connections. To maximize throughput, transaction servers execute services on separate threads. GUI applications perform lengthy operations in a separate thread to keep the user interface responsive. The list goes on.
The C++ Standard doesn’t mention threads, leaving programmers to wonder whether it’s even possible to write multithreaded C++ programs. Though it is not possible to write standards-compliant multithreaded programs, programmers none the less write multithreaded programs in C++ using the libraries provided by their OS that expose the system’s support for threads. However, there are at least two major problems with doing this: these libraries are almost universally C libraries and require careful use in C++, and each OS provides its own set of libraries for handling multithreaded support. Therefore, the resulting code is not only non-standard, but also non-portable [1]. Boost.Threads is a library designed to address both problems.
Boost [2] is an organization started by members of the C++ Standards Committee Library Working Group to develop new source libraries for C++. Its current membership includes approximately 2,000 members. Many libraries can be found in the Boost source distribution [3]. To make these libraries thread-safe, Boost.Threads was created.
Many C++ experts provided input to the design of Boost.Threads. The interface was designed from the ground up and is not just a simple wrapper around any C threading API. Many features of C++ (such as the existence of constructors/destructors, function objects, and templates) were fully utilized to make the interface more flexible. The current implementation works for POSIX, Win32, and Macintosh Carbon platforms.
Thread Creation
The boost::thread class represents a thread of execution in the same way the std::fstream class represents a file. The default constructor creates an instance representing the current thread of execution. An overloaded constructor takes a function object called with no arguments and returning nothing. This constructor starts a new thread of execution, which in turn calls the function object.
At first it appears that this design is less useful than the typical C approach to creating a thread where a void pointer can be passed to the routine called by the new thread, which allows data to be passed. However, because Boost.Threads uses a function object instead of just a function pointer, it is possible for the function object to carry data needed by the thread. This approach is actually more flexible and is type safe. When combined with functional libraries, such as Boost.Bind, this design actually allows you to easily pass any amount of data to the newly created thread.
Currently, not a lot can be done with a thread object created in Boost.Threads. In fact only two operations can be performed. Thread objects can easily be compared for equality or inequality using the == and != operators to verify if they refer to the same thread of execution, and you can wait for a thread to complete by calling boost::thread::join. Other threading libraries allow you to perform other operations with a thread (for example, set its priority or even cancel it). However, because these operations don’t easily map into portable interfaces, research is being done to determine how they can be added to Boost.Threads.
Listing 1 illustrates a very simple use of the boost::thread class. A new thread is created that simply writes “Hello World” out to std::cout, while the main thread waits for it to complete.
Mutexes
Anyone who has written a multithreaded program understands how critical it is for multiple threads not to access shared resources at the same time. If one thread tries to change the value of shared data at the same time as another thread tries to read the value, the result is undefined behavior. To prevent this from happening, make use of some special primitive types and operations. The most fundamental of these types is known as a mutex (the abbreviation for “mutual exclusion”). A mutex allows only a single thread access to a shared resource at one time. When a thread needs to access the shared resource, it must first “lock” the mutex. If any other thread has already locked the mutex, this operation waits for the other thread to unlock the mutex first, thus ensuring that only a single thread has access to the shared resource at a time.
The mutex concept has several variations. Two large categories of mutexes that Boost.Threads supports include the simple mutex and the recursive mutex. A simple mutex can only be locked once. If the same thread tries to lock a mutex twice, it deadlocks, which indicates that the thread will wait forever. With a recursive mutex, a single thread may lock a mutex several times and must unlock the mutex the same number of times to allow another thread to lock the mutex.
Within these two broad categories of mutexes, there are other variations on how a thread can lock the mutex. A thread may attempt to lock a mutex in three ways:
1. Try and lock the mutex by waiting until no other thread has the mutex locked.
2. Try and lock the mutex by returning immediately if any other thread has the mutex locked.
3. Try and lock the mutex by waiting until no other thread has the mutex locked or until a specified amount of time has elapsed.
It appears that the best possible mutex type is a recursive type that allows all 3 forms of locking. However, overhead is involved with each variation,so Boost.Threads allows you to pick the most efficient mutex type for your specific needs. This leaves Boost.Threads with six mutex types, listed in order of preference based on efficiency: boost::mutex, boost::try_mutex, boost::timed_mutex, boost::recursive_mutex, boost::recursive_try_mutex, and boost::recursive_timed_mutex.
Deadlock may occur if every time a mutex is locked it is not subsequently unlocked. This is the most common possible error, so Boost.Threads is designed to make this impossible (or at least very difficult). No direct access to operations for locking and unlocking any of the mutex types is available. Instead, mutex classes define nested typedefs for types that implement the RAII (Resource Acquisition in Initialization) idiom for locking and unlocking a mutex. This is known as the Scoped Lock [4] pattern. To construct one of these types, pass in a reference to a mutex. The constructor locks the mutex and the destructor unlocks it. C++ language rules ensure the destructor will always be called, so even when an exception is thrown, the mutex will always be unlocked properly.
This pattern helps to ensure proper usage of a mutex. However, be aware that although the Scoped Lock pattern ensures that the mutex is unlocked, it does not ensure that any shared resources remain in a valid state if an exception is thrown; so just as with programming for a single thread of execution, ensure that exceptions don’t leave the program in an inconsistent state. Also, the locking objects must not be passed to another thread, as they maintain state that’s not protected from such usage.
Listing 2 illustrates a very simple use of the boost::mutex class. Two new threads are created, which loop 10 times, writing out an id and the current loop count to std::cout, while the main thread waits for both to complete. The std::cout object is a shared resource, so each thread uses a global mutex to ensure that only one thread at a time attempts to write to it.
Many users will note that passing data to the thread in Listing 2 required writing a function object by hand. Although the code is trivial, it can be tedious writing this code every time. There is an easier solution, however. Functional libraries allow you to create new function objects by binding another function object with data that will be passed to it when called. Listing 3 shows how the Boost.Bind library can be used to simplify the code from Listing 2 by removing the need for a hand-coded function object.
Condition Variables
Sometimes it’s not enough to lock a shared resource and use it. Sometimes the shared resource needs to be in some specific state before it can be used. For example, a thread may try and pull data off of a stack, waiting for data to arrive if none is present. A mutex is not enough to allow for this type of synchronization. Another synchronization type, known as a condition variable, can be used in this case.
A condition variable is always used in conjunction with a mutex and the shared resource(s). A thread first locks the mutex and then verifies that the shared resource is in a state that can be safely used in the manner needed. If it’s not in the state needed, the thread waits on the condition variable. This operation causes the mutex to be unlocked during the wait so that another thread can actually change the state of the shared resource. It also ensures that the mutex is locked when the thread returns from the wait operation. When another thread changes the state of the shared resource, it needs to notify the threads that may be waiting on the condition variable, enabling them to return from the wait operation.
Listing 4 illustrates a very simple use of the boost::condition class. A class is defined implementing a bounded buffer, a container with a fixed size allowing FIFO input and output. This buffer is made thread-safe internally through the use of a boost::mutex. The put and get operations use a condition variable to ensure that a thread waits for the buffer to be in the state needed to complete the operation. Two threads are created, one that puts 100 integers into this buffer and the other pulling the integers back out. The bounded buffer can only hold 10 integers at one time, so the two threads wait for the other thread periodically. To verify that it is happening, the put and get operations output diagnostic strings to std::cout. Finally, the main thread waits for both threads to complete.
Thread Local Storage
Many functions are not implemented to be reentrant. This means that it is unsafe to call the function while another thread is calling the same function. A non-reentrant function holds static data over successive calls or returns a pointer to static data. For example, std::strtok is not reentrant because it uses static data to hold the string to be broken into tokens.
A non-reentrant function can be made into a reentrant function using two approaches. One approach is to change the interface so that the function takes a pointer or reference to a data type that can be used in place of the static data previously used. For example, POSIX defines strtok_r, a reentrant variant of std::strtok, which takes an extra char** parameter that’s used instead of static data. This solution is simple and gives the best possible performance; however, it means changing the public interface, which potentially means changing a lot of code. The other approach leaves the public interface as is and replaces the static data with thread local storage (sometimes referred to as thread-specific storage).
Thread local storage is data that’s associated with a specific thread (the current thread). Multithreading libraries give access to thread local storage through an interface that allows access to the current thread’s instance of the data. Every thread gets its own instance of this data, so there’s never an issue with concurrent access. However, access to thread local storage is slower than access to static or local data; therefore it’s not always the best solution. However, it’s the only solution available when it’s essential not to change the public interface.
Boost.Threads provides access to thread local storage through the smart pointer boost::thread_specific_ptr. The first time every thread tries to access an instance of this smart pointer, it has a NULL value, so code should be written to check for this and initialize the pointer on first use. The Boost.Threads library ensures that the data stored in thread local storage is cleaned up when the thread exits.
Listing 5 illustrates a very simple use of the boost::thread_specific_ptr class. Two new threads are created to initialize the thread local storage and then loop 10 times incrementing the integer contained in the smart pointer and writing the result to std::cout (which is synchronized with a mutex because it is a shared resource). The main thread then waits for these two threads to complete. The output of this example clearly shows that each thread is operating on its own instance of data, even though both are using the same boost::thread_specific_ptr.
Once Routines
There’s one issue left to deal with: how to make initialization routines (such as constructors) thread-safe. For example, when a “global” instance of an object is created as a singleton for an application, knowing that there’s an issue with the order of instantiation, a function is used that returns a static instance, ensuring the static instance is created the first time the method is called. The problem here is that if multiple threads call this function at the same time, the constructor for the static instance may be called multiple times as well, with disastrous results.
The solution to this problem is what’s known as a “once routine.” A once routine is called only once by an application. If multiple threads try to call the routine at the same time, only one actually is able to do so while all others wait until that thread has finished executing the routine. To ensure that it is executed only once, the routine is called indirectly by another function that’s passed a pointer to the routine and a reference to a special flag type used to check if the routine has been called yet. This flag is initialized using static initialization, which ensures that it is initialized at compile time and not run time. Therefore, it is not subject to multithreaded initialization problems. Boost.Threads provides calling once routines through boost::call_once and also defines the flag type boost::once_flag and a special macro used to statically initialize the flag named BOOST_ONCE_INIT.
Listing 6 illustrates a very simple use of boost::call_once. A global integer is statically initialized to zero and an instance of boost::once_flag is statically initialized using BOOST_ONCE_INIT. Then main starts two threads, both trying to “initialize” the global integer by calling boost::call_once with a pointer to a function that increments the integer. Next main waits for these two threads to complete and writes out the final value of the integer to std::cout. The output illustrates that the routine truly was only called once because the value of the integer is only one.
The Future of Boost.Threads
There are several additional features planned for Boost.Threads. There will be a boost::read_write_mutex, which will allow multiple threads to read from the shared resource at the same time, but will ensure exclusive access to any threads writing to the shared resource. There will also be a boost::thread_barrier, which will make a set of threads wait until all threads have “entered” the barrier. A boost::thread_pool is also planned to allow for short routines to be executed asynchronously without the need to create or destroy a thread each time.
Boost.Threads has been presented to the C++ Standards Committee’s Library Working Group for possible inclusion in the Standard’s upcoming Library Technical Report, as a prelude to inclusion in the next version of the Standard. The committee may consider other threading libraries; however, they viewed the initial presentation of Boost.Threads favorably, and they are very interested in adding some support for multithreaded programming to the Standard. So, the future is looking good for multithreaded programming in C++.
Listing 1: The boost::thread class
#include ^boost/thread/thread.hpp^
#include ^iostream^
void hello()
{
std::cout <<
"Hello world, I'm a thread!"
<< std::endl;
}
int main(int argc, char* argv[])
{
boost::thread thrd(&hello);
thrd.join();
return 0;
}
— End of Listing —
Listing 2: The boost::mutex class
#include ^boost/thread/thread.hpp^
#include ^boost/thread/mutex.hpp^
#include ^iostream^
boost::mutex io_mutex;
struct count
{
count(int id) : id(id) { }
void operator()()
{
for (int i = 0; i < 10; ++i)
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout << id << ": "
<< i << std::endl;
}
}
int id;
};
int main(int argc, char* argv[])
{
boost::thread thrd1(count(1));
boost::thread thrd2(count(2));
thrd1.join();
thrd2.join();
return 0;
}
— End of Listing —
Listing 2: DEAD LINK
Listing 4: The boost::condition class
#include ^boost/thread/thread.hpp^
#include ^boost/thread/mutex.hpp^
#include ^boost/thread/condition.hpp^
#include ^iostream^
const int BUF_SIZE = 10;
const int ITERS = 100;
boost::mutex io_mutex;
class buffer
{
public:
typedef boost::mutex::scoped_lock
scoped_lock;
buffer()
: p(0), c(0), full(0)
{
}
void put(int m)
{
scoped_lock lock(mutex);
if (full == BUF_SIZE)
{
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout <<
"Buffer is full. Waiting..."
<< std::endl;
}
while (full == BUF_SIZE)
cond.wait(lock);
}
buf[p] = m;
p = (p+1) % BUF_SIZE;
++full;
cond.notify_one();
}
int get()
{
scoped_lock lk(mutex);
if (full == 0)
{
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout <<
"Buffer is empty. Waiting..."
<< std::endl;
}
while (full == 0)
cond.wait(lk);
}
int i = buf[c];
c = (c+1) % BUF_SIZE;
--full;
cond.notify_one();
return i;
}
private:
boost::mutex mutex;
boost::condition cond;
unsigned int p, c, full;
int buf[BUF_SIZE];
};
buffer buf;
void writer()
{
for (int n = 0; n < ITERS; ++n)
{
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout << "sending: "
<< n << std::endl;
}
buf.put(n);
}
}
void reader()
{
for (int x = 0; x < ITERS; ++x)
{
int n = buf.get();
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout << "received: "
<< n << std::endl;
}
}
}
int main(int argc, char* argv[])
{
boost::thread thrd1(&reader);
boost::thread thrd2(&writer);
thrd1.join();
thrd2.join();
return 0;
}
— End of Listing —
Listing 5: The boost::thread_specific_ptr class
#include ^boost/thread/thread.hpp^
#include ^boost/thread/mutex.hpp^
#include ^boost/thread/tss.hpp^
#include ^iostream^
boost::mutex io_mutex;
boost::thread_specific_ptr ptr;
struct count
{
count(int id) : id(id) { }
void operator()()
{
if (ptr.get() == 0)
ptr.reset(new int(0));
for (int i = 0; i < 10; ++i)
{
(*ptr)++;
boost::mutex::scoped_lock
lock(io_mutex);
std::cout << id << ": "
<< *ptr << std::endl;
}
}
int id;
};
int main(int argc, char* argv[])
{
boost::thread thrd1(count(1));
boost::thread thrd2(count(2));
thrd1.join();
thrd2.join();
return 0;
}
— End of Listing —
Listing 6: A very simple use of boost::call_once
#include ^boost/thread/thread.hpp^
#include ^boost/thread/once.hpp^
#include ^iostream^
int i = 0;
boost::once_flag flag =
BOOST_ONCE_INIT;
void init()
{
++i;
}
void thread()
{
boost::call_once(&init, flag);
}
int main(int argc, char* argv[])
{
boost::thread thrd1(&thread);
boost::thread thrd2(&thread);
thrd1.join();
thrd2.join();
std::cout << i << std::endl;
return 0;
}
— End of Listing —
NOTES
[1] The POSIX standard defines multithreaded support in what’s commonly known as the pthread library. This provides multithreaded support for a wide range of operating systems, including Win32 through the pthreads-win32 port. However, this is a C library that fails to address some C++ concepts and is not available on all platforms.
[2] Visit the Boost website at http://www.boost.org.
[3] See Bjorn Karlsson’s article, “Smart Pointers in Boost,” in C/C++ Users Journal, April 2002.
[4] Douglas Schmidt, Michael Stal, Hans Rohnert, and Frank Buschmann. Pattern-Oriented Software Architecture Volume 2 — Patterns for Concurrent and Networked Objects (Wiley, 2000).
William E. Kempf received his BS in CompSci/Math from Doane College. He’s been in the industry for 10 years and is currently a senior application developer for First Data Resources, Inc. He is the author of the Boost.Threads library, and an active Boost member. He can be contacted at wekempf@cox.net. Sphere: Related Content
Standard C++ threads are imminent and will derive from the Boost.Threads library, explored here by the library's author.
Important Update
For an update of the Boost.Threads library, see the article What's New in Boost Threads? by Anthony Williams, maintainer of the Boost.Threads Library. This update appears in the November 2008 issue of Dr. Dobb's Journal.
--------------------------------------------------------------------------------
Just a few years ago it was uncommon for a program to be written with multiple threads of execution. Today Internet server applications run multiple threads of execution to efficiently handle multiple client connections. To maximize throughput, transaction servers execute services on separate threads. GUI applications perform lengthy operations in a separate thread to keep the user interface responsive. The list goes on.
The C++ Standard doesn’t mention threads, leaving programmers to wonder whether it’s even possible to write multithreaded C++ programs. Though it is not possible to write standards-compliant multithreaded programs, programmers none the less write multithreaded programs in C++ using the libraries provided by their OS that expose the system’s support for threads. However, there are at least two major problems with doing this: these libraries are almost universally C libraries and require careful use in C++, and each OS provides its own set of libraries for handling multithreaded support. Therefore, the resulting code is not only non-standard, but also non-portable [1]. Boost.Threads is a library designed to address both problems.
Boost [2] is an organization started by members of the C++ Standards Committee Library Working Group to develop new source libraries for C++. Its current membership includes approximately 2,000 members. Many libraries can be found in the Boost source distribution [3]. To make these libraries thread-safe, Boost.Threads was created.
Many C++ experts provided input to the design of Boost.Threads. The interface was designed from the ground up and is not just a simple wrapper around any C threading API. Many features of C++ (such as the existence of constructors/destructors, function objects, and templates) were fully utilized to make the interface more flexible. The current implementation works for POSIX, Win32, and Macintosh Carbon platforms.
Thread Creation
The boost::thread class represents a thread of execution in the same way the std::fstream class represents a file. The default constructor creates an instance representing the current thread of execution. An overloaded constructor takes a function object called with no arguments and returning nothing. This constructor starts a new thread of execution, which in turn calls the function object.
At first it appears that this design is less useful than the typical C approach to creating a thread where a void pointer can be passed to the routine called by the new thread, which allows data to be passed. However, because Boost.Threads uses a function object instead of just a function pointer, it is possible for the function object to carry data needed by the thread. This approach is actually more flexible and is type safe. When combined with functional libraries, such as Boost.Bind, this design actually allows you to easily pass any amount of data to the newly created thread.
Currently, not a lot can be done with a thread object created in Boost.Threads. In fact only two operations can be performed. Thread objects can easily be compared for equality or inequality using the == and != operators to verify if they refer to the same thread of execution, and you can wait for a thread to complete by calling boost::thread::join. Other threading libraries allow you to perform other operations with a thread (for example, set its priority or even cancel it). However, because these operations don’t easily map into portable interfaces, research is being done to determine how they can be added to Boost.Threads.
Listing 1 illustrates a very simple use of the boost::thread class. A new thread is created that simply writes “Hello World” out to std::cout, while the main thread waits for it to complete.
Mutexes
Anyone who has written a multithreaded program understands how critical it is for multiple threads not to access shared resources at the same time. If one thread tries to change the value of shared data at the same time as another thread tries to read the value, the result is undefined behavior. To prevent this from happening, make use of some special primitive types and operations. The most fundamental of these types is known as a mutex (the abbreviation for “mutual exclusion”). A mutex allows only a single thread access to a shared resource at one time. When a thread needs to access the shared resource, it must first “lock” the mutex. If any other thread has already locked the mutex, this operation waits for the other thread to unlock the mutex first, thus ensuring that only a single thread has access to the shared resource at a time.
The mutex concept has several variations. Two large categories of mutexes that Boost.Threads supports include the simple mutex and the recursive mutex. A simple mutex can only be locked once. If the same thread tries to lock a mutex twice, it deadlocks, which indicates that the thread will wait forever. With a recursive mutex, a single thread may lock a mutex several times and must unlock the mutex the same number of times to allow another thread to lock the mutex.
Within these two broad categories of mutexes, there are other variations on how a thread can lock the mutex. A thread may attempt to lock a mutex in three ways:
1. Try and lock the mutex by waiting until no other thread has the mutex locked.
2. Try and lock the mutex by returning immediately if any other thread has the mutex locked.
3. Try and lock the mutex by waiting until no other thread has the mutex locked or until a specified amount of time has elapsed.
It appears that the best possible mutex type is a recursive type that allows all 3 forms of locking. However, overhead is involved with each variation,so Boost.Threads allows you to pick the most efficient mutex type for your specific needs. This leaves Boost.Threads with six mutex types, listed in order of preference based on efficiency: boost::mutex, boost::try_mutex, boost::timed_mutex, boost::recursive_mutex, boost::recursive_try_mutex, and boost::recursive_timed_mutex.
Deadlock may occur if every time a mutex is locked it is not subsequently unlocked. This is the most common possible error, so Boost.Threads is designed to make this impossible (or at least very difficult). No direct access to operations for locking and unlocking any of the mutex types is available. Instead, mutex classes define nested typedefs for types that implement the RAII (Resource Acquisition in Initialization) idiom for locking and unlocking a mutex. This is known as the Scoped Lock [4] pattern. To construct one of these types, pass in a reference to a mutex. The constructor locks the mutex and the destructor unlocks it. C++ language rules ensure the destructor will always be called, so even when an exception is thrown, the mutex will always be unlocked properly.
This pattern helps to ensure proper usage of a mutex. However, be aware that although the Scoped Lock pattern ensures that the mutex is unlocked, it does not ensure that any shared resources remain in a valid state if an exception is thrown; so just as with programming for a single thread of execution, ensure that exceptions don’t leave the program in an inconsistent state. Also, the locking objects must not be passed to another thread, as they maintain state that’s not protected from such usage.
Listing 2 illustrates a very simple use of the boost::mutex class. Two new threads are created, which loop 10 times, writing out an id and the current loop count to std::cout, while the main thread waits for both to complete. The std::cout object is a shared resource, so each thread uses a global mutex to ensure that only one thread at a time attempts to write to it.
Many users will note that passing data to the thread in Listing 2 required writing a function object by hand. Although the code is trivial, it can be tedious writing this code every time. There is an easier solution, however. Functional libraries allow you to create new function objects by binding another function object with data that will be passed to it when called. Listing 3 shows how the Boost.Bind library can be used to simplify the code from Listing 2 by removing the need for a hand-coded function object.
Condition Variables
Sometimes it’s not enough to lock a shared resource and use it. Sometimes the shared resource needs to be in some specific state before it can be used. For example, a thread may try and pull data off of a stack, waiting for data to arrive if none is present. A mutex is not enough to allow for this type of synchronization. Another synchronization type, known as a condition variable, can be used in this case.
A condition variable is always used in conjunction with a mutex and the shared resource(s). A thread first locks the mutex and then verifies that the shared resource is in a state that can be safely used in the manner needed. If it’s not in the state needed, the thread waits on the condition variable. This operation causes the mutex to be unlocked during the wait so that another thread can actually change the state of the shared resource. It also ensures that the mutex is locked when the thread returns from the wait operation. When another thread changes the state of the shared resource, it needs to notify the threads that may be waiting on the condition variable, enabling them to return from the wait operation.
Listing 4 illustrates a very simple use of the boost::condition class. A class is defined implementing a bounded buffer, a container with a fixed size allowing FIFO input and output. This buffer is made thread-safe internally through the use of a boost::mutex. The put and get operations use a condition variable to ensure that a thread waits for the buffer to be in the state needed to complete the operation. Two threads are created, one that puts 100 integers into this buffer and the other pulling the integers back out. The bounded buffer can only hold 10 integers at one time, so the two threads wait for the other thread periodically. To verify that it is happening, the put and get operations output diagnostic strings to std::cout. Finally, the main thread waits for both threads to complete.
Thread Local Storage
Many functions are not implemented to be reentrant. This means that it is unsafe to call the function while another thread is calling the same function. A non-reentrant function holds static data over successive calls or returns a pointer to static data. For example, std::strtok is not reentrant because it uses static data to hold the string to be broken into tokens.
A non-reentrant function can be made into a reentrant function using two approaches. One approach is to change the interface so that the function takes a pointer or reference to a data type that can be used in place of the static data previously used. For example, POSIX defines strtok_r, a reentrant variant of std::strtok, which takes an extra char** parameter that’s used instead of static data. This solution is simple and gives the best possible performance; however, it means changing the public interface, which potentially means changing a lot of code. The other approach leaves the public interface as is and replaces the static data with thread local storage (sometimes referred to as thread-specific storage).
Thread local storage is data that’s associated with a specific thread (the current thread). Multithreading libraries give access to thread local storage through an interface that allows access to the current thread’s instance of the data. Every thread gets its own instance of this data, so there’s never an issue with concurrent access. However, access to thread local storage is slower than access to static or local data; therefore it’s not always the best solution. However, it’s the only solution available when it’s essential not to change the public interface.
Boost.Threads provides access to thread local storage through the smart pointer boost::thread_specific_ptr. The first time every thread tries to access an instance of this smart pointer, it has a NULL value, so code should be written to check for this and initialize the pointer on first use. The Boost.Threads library ensures that the data stored in thread local storage is cleaned up when the thread exits.
Listing 5 illustrates a very simple use of the boost::thread_specific_ptr class. Two new threads are created to initialize the thread local storage and then loop 10 times incrementing the integer contained in the smart pointer and writing the result to std::cout (which is synchronized with a mutex because it is a shared resource). The main thread then waits for these two threads to complete. The output of this example clearly shows that each thread is operating on its own instance of data, even though both are using the same boost::thread_specific_ptr.
Once Routines
There’s one issue left to deal with: how to make initialization routines (such as constructors) thread-safe. For example, when a “global” instance of an object is created as a singleton for an application, knowing that there’s an issue with the order of instantiation, a function is used that returns a static instance, ensuring the static instance is created the first time the method is called. The problem here is that if multiple threads call this function at the same time, the constructor for the static instance may be called multiple times as well, with disastrous results.
The solution to this problem is what’s known as a “once routine.” A once routine is called only once by an application. If multiple threads try to call the routine at the same time, only one actually is able to do so while all others wait until that thread has finished executing the routine. To ensure that it is executed only once, the routine is called indirectly by another function that’s passed a pointer to the routine and a reference to a special flag type used to check if the routine has been called yet. This flag is initialized using static initialization, which ensures that it is initialized at compile time and not run time. Therefore, it is not subject to multithreaded initialization problems. Boost.Threads provides calling once routines through boost::call_once and also defines the flag type boost::once_flag and a special macro used to statically initialize the flag named BOOST_ONCE_INIT.
Listing 6 illustrates a very simple use of boost::call_once. A global integer is statically initialized to zero and an instance of boost::once_flag is statically initialized using BOOST_ONCE_INIT. Then main starts two threads, both trying to “initialize” the global integer by calling boost::call_once with a pointer to a function that increments the integer. Next main waits for these two threads to complete and writes out the final value of the integer to std::cout. The output illustrates that the routine truly was only called once because the value of the integer is only one.
The Future of Boost.Threads
There are several additional features planned for Boost.Threads. There will be a boost::read_write_mutex, which will allow multiple threads to read from the shared resource at the same time, but will ensure exclusive access to any threads writing to the shared resource. There will also be a boost::thread_barrier, which will make a set of threads wait until all threads have “entered” the barrier. A boost::thread_pool is also planned to allow for short routines to be executed asynchronously without the need to create or destroy a thread each time.
Boost.Threads has been presented to the C++ Standards Committee’s Library Working Group for possible inclusion in the Standard’s upcoming Library Technical Report, as a prelude to inclusion in the next version of the Standard. The committee may consider other threading libraries; however, they viewed the initial presentation of Boost.Threads favorably, and they are very interested in adding some support for multithreaded programming to the Standard. So, the future is looking good for multithreaded programming in C++.
Listing 1: The boost::thread class
#include ^boost/thread/thread.hpp^
#include ^iostream^
void hello()
{
std::cout <<
"Hello world, I'm a thread!"
<< std::endl;
}
int main(int argc, char* argv[])
{
boost::thread thrd(&hello);
thrd.join();
return 0;
}
— End of Listing —
Listing 2: The boost::mutex class
#include ^boost/thread/thread.hpp^
#include ^boost/thread/mutex.hpp^
#include ^iostream^
boost::mutex io_mutex;
struct count
{
count(int id) : id(id) { }
void operator()()
{
for (int i = 0; i < 10; ++i)
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout << id << ": "
<< i << std::endl;
}
}
int id;
};
int main(int argc, char* argv[])
{
boost::thread thrd1(count(1));
boost::thread thrd2(count(2));
thrd1.join();
thrd2.join();
return 0;
}
— End of Listing —
Listing 2: DEAD LINK
Listing 4: The boost::condition class
#include ^boost/thread/thread.hpp^
#include ^boost/thread/mutex.hpp^
#include ^boost/thread/condition.hpp^
#include ^iostream^
const int BUF_SIZE = 10;
const int ITERS = 100;
boost::mutex io_mutex;
class buffer
{
public:
typedef boost::mutex::scoped_lock
scoped_lock;
buffer()
: p(0), c(0), full(0)
{
}
void put(int m)
{
scoped_lock lock(mutex);
if (full == BUF_SIZE)
{
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout <<
"Buffer is full. Waiting..."
<< std::endl;
}
while (full == BUF_SIZE)
cond.wait(lock);
}
buf[p] = m;
p = (p+1) % BUF_SIZE;
++full;
cond.notify_one();
}
int get()
{
scoped_lock lk(mutex);
if (full == 0)
{
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout <<
"Buffer is empty. Waiting..."
<< std::endl;
}
while (full == 0)
cond.wait(lk);
}
int i = buf[c];
c = (c+1) % BUF_SIZE;
--full;
cond.notify_one();
return i;
}
private:
boost::mutex mutex;
boost::condition cond;
unsigned int p, c, full;
int buf[BUF_SIZE];
};
buffer buf;
void writer()
{
for (int n = 0; n < ITERS; ++n)
{
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout << "sending: "
<< n << std::endl;
}
buf.put(n);
}
}
void reader()
{
for (int x = 0; x < ITERS; ++x)
{
int n = buf.get();
{
boost::mutex::scoped_lock
lock(io_mutex);
std::cout << "received: "
<< n << std::endl;
}
}
}
int main(int argc, char* argv[])
{
boost::thread thrd1(&reader);
boost::thread thrd2(&writer);
thrd1.join();
thrd2.join();
return 0;
}
— End of Listing —
Listing 5: The boost::thread_specific_ptr class
#include ^boost/thread/thread.hpp^
#include ^boost/thread/mutex.hpp^
#include ^boost/thread/tss.hpp^
#include ^iostream^
boost::mutex io_mutex;
boost::thread_specific_ptr
struct count
{
count(int id) : id(id) { }
void operator()()
{
if (ptr.get() == 0)
ptr.reset(new int(0));
for (int i = 0; i < 10; ++i)
{
(*ptr)++;
boost::mutex::scoped_lock
lock(io_mutex);
std::cout << id << ": "
<< *ptr << std::endl;
}
}
int id;
};
int main(int argc, char* argv[])
{
boost::thread thrd1(count(1));
boost::thread thrd2(count(2));
thrd1.join();
thrd2.join();
return 0;
}
— End of Listing —
Listing 6: A very simple use of boost::call_once
#include ^boost/thread/thread.hpp^
#include ^boost/thread/once.hpp^
#include ^iostream^
int i = 0;
boost::once_flag flag =
BOOST_ONCE_INIT;
void init()
{
++i;
}
void thread()
{
boost::call_once(&init, flag);
}
int main(int argc, char* argv[])
{
boost::thread thrd1(&thread);
boost::thread thrd2(&thread);
thrd1.join();
thrd2.join();
std::cout << i << std::endl;
return 0;
}
— End of Listing —
NOTES
[1] The POSIX standard defines multithreaded support in what’s commonly known as the pthread library. This provides multithreaded support for a wide range of operating systems, including Win32 through the pthreads-win32 port. However, this is a C library that fails to address some C++ concepts and is not available on all platforms.
[2] Visit the Boost website at http://www.boost.org.
[3] See Bjorn Karlsson’s article, “Smart Pointers in Boost,” in C/C++ Users Journal, April 2002.
[4] Douglas Schmidt, Michael Stal, Hans Rohnert, and Frank Buschmann. Pattern-Oriented Software Architecture Volume 2 — Patterns for Concurrent and Networked Objects (Wiley, 2000).
William E. Kempf received his BS in CompSci/Math from Doane College. He’s been in the industry for 10 years and is currently a senior application developer for First Data Resources, Inc. He is the author of the Boost.Threads library, and an active Boost member. He can be contacted at wekempf@cox.net. Sphere: Related Content
6/6/10
Understanding Parallel Performance
By Herb Sutter, October 31, 2008
Understanding parallel performance. How do you know when good is good enough?
Herb is a bestselling author and consultant on software development topics, and a software architect at Microsoft. He can be contacted at www.gotw.ca.
--------------------------------------------------------------------------------
Let's say that we've slickly written our code to apply divide-and-conquer algorithms and concurrent data structures and parallel traversals and all our other cool tricks that make our code wonderfully scalable in theory. Question: How do we know how well we've actually succeeded? Do we really know, or did we just try a couple of tests on a quad-core that looked reasonable and call it good? What key factors must we measure to understand our code's performance, and answer not only whether our code scales, but quantify how well under different circumstances and workloads? What costs of concurrency do we have to take into account?
This month, I'll summarize some key issues we need to keep in mind to accurately analyze the real performance of our parallel code. I'll list some basic considerations, and then some common costs. Next month, I have a treat in store: We'll take some real code and apply these techniques to analyze its performance in detail as we successively apply a number of optimizations and measure how much each one actually buys us, under what conditions and in what directions, and why.
Fundamentals
To understand our code's scalability, we need to know what to measure and what to look for in the results. First, identify the workload variables: What are the key different kinds of work and/or data? For example, we may want to measure how well a producer-consumer system performs with varying numbers of producer threads and consumer threads, or measure a container by how well it scales when holding data items of different sizes.
Second, use stress tests to measure throughput, or the total amount of work accomplished per unit time, while varying each of these dimensions so that we can measure their relative impact. Look for scalability trends, the change in throughput as we add hardware resources: Can we effectively use more cores to either get the answer faster or to get more work done?
Figures 1 and 2 show two useful visualization tools that will help us understand our code's parallel performance. Figure 1 shows a sample scatter graph that charts throughput results for a selected algorithm against different numbers of two kinds of workers: producer threads and consumer threads. The larger the bubble, the greater the throughput. We can get a sense of how this particular algorithm scales, and in what directions, by examining how and where throughput grows and shrinks. In this example, we have good scalability up to a total of about 15 threads before we start to peak and realize no further gains, and we can see that scalability is better when there are somewhat more producers than consumers in the system.
Figure 1: Sample graph for measuring scalability of the same algorithm for different workloads.
Figure 2 directly compares different candidate algorithms running the same kind of workload. Peak throughput occurs, naturally enough, at the peak of each curve. Scalability shows up directly as the left-hand ascending side of the curve; the steeper it is, and the farther to the right that it goes before topping out, the more scalable our code will be. Here, the blue algorithm demonstrates the horror of negative scalability; it actually manages to get less work done using additional cores, which probably won't earn us our next raise.
Figure 2: Sample graph for measuring scalability of alternative algorithms for the same workload.
But both figures also let us see two basic penalties. Contention arises when different workers interfere with each other by fighting for resources, such as contending for mutexes, cache space, or cache lines via false sharing. In the most extreme case, adding a new worker might actually cost more in contention than it adds in extra work, resulting in less total work being done, and we can see this extreme effect in both graphs: In Figure 1, in several directions we reach areas where adding more workers makes total throughput actually go down. In Figure 2, we see the same effect in the form of the right-hand downslope where adding more work actually decreases throughput even when there are otherwise-idle cores. But Figure 2 also lets us more clearly see the effect of contention before it gets that far: On the left-hand upslope, even as throughout is still rising, the rate of increase is slowing down as the curve begins to bend. That's a classic effect of growing contention.
The other basic penalty is oversubscription, or having more CPU-bound work ready to execute than we have available hardware. These samples were taken from test runs on a 24-core machine; sure enough, in Figure 1 we see a faint diagonal line where #producers + #consumers = 24, above which throughput is noticeably thinner; sometimes the result is even more dramatic. Similarly, in Figure 2 we see even the best algorithms can't scale beyond the available cores, and incur a penalty for trying to exceed that number by adding contention at least for CPU time and often also for other resources.
With these fundamentals in mind, let's consider a few specific costs that arise and impact scalability because of contention, oversubscription, and other effects.
Sources of Overhead, and Threads versus Pools
We incur a basic concurrency overhead from just expressing code in a parallel way. Consider the following toy example that performs three independent subcomputations to generate a result:
// Example 1: Sequential code in MyApp 1.0
//
int NetSales() {
int wholesale = CalcWholesale();
int retail = CalcRetail();
int returns = TotalReturns();
return wholesale + retail - returns;
}
Assume that this code is entirely CPU-bound, and that the subcomputations really are independent and free of side effects on each other. Then we can get the answer faster on more cores by running the subparts in parallel. Here's some pseudocode showing how to accomplish this using futures and the convenience of lambda functions, but with a common mistake:
// Example 2 (flawed): Naïve parallel code for MyApp 2.0?
//
int NetSales() {
// perform the subcomputations concurrently
future wholesale = new thread( [] { CalcWholesale(); } );
future retail = new thread( [] { CalcRetail(); } );
future returns = new thread( [] { TotalReturns(); } );
// now block for the results
return wholesale.value() + retail.value() - returns.value();
}
Seeing the words new Thread explicitly in code is often an indicator that code may not be as scalable as it could be. In most environments, it's much less efficient to spin up a new thread for each piece of work than to run it on a thread pool: First, spinning up a new thread and throwing it away again each time incurs substantially more overhead than giving work to an existing pool thread. Second, spinning up a number of threads and turning them loose to fight for available cores via the operating system scheduler can cause needless contention when we spin up more work than there are cores currently available, which can happen not only on low-core hardware but also on many-core hardware if the application or the system happens to be doing a lot of other work at that moment. Both situations are different kinds of oversubscription, and some threads will have to incur extra context-switching to interleave on an available core. Instead, sharing the core by running one after the other would be both more efficient and more cache-friendly.
Thread pools address these problems because a pool is designed to "rightsize" itself to the amount of hardware concurrency available on the machine. The pool will automatically try to maintain exactly one ready thread per core, and if there is more work than cores the pool naturally queues the work up and lets the machine concentrate on only as many tasks at a time as it has cores to perform. Here's pseudocode for a revised version that uses pools instead:
// Example 3 (better): Partly fixed parallel code for MyApp 2.0
//
int NetSales() {
// perform the subcomputations concurrently
future wholesale = pool.run( [] { CalcWholesale(); } );
future retail = pool.run( [] { CalcRetail(); } );
future returns = pool.run( [] { TotalReturns(); } );
// now block for the results
return wholesale.value() + retail.value() - returns.value();
}
The good news is that this can enable us to get the answer faster on more cores, at least up to three cores. But there are costs too: With today's thread pools, we typically pay a tax of two context switches when we ship work over to a pool thread and then ship the result back. How can we reduce this cost?
Reducing Context Switches
We can eliminate some of the context switches by being smarter about the last line that combines the results, because just calling .value() three times may wake up the calling thread twice only to immediately have to sleep again; instead, use a "wait for a group of futures" facility if your futures library has one, as a well-written version can eliminate the needless wakeups.
We can also avoid a switch by observing that the calling thread isn't going to do anything anyway besides wait for the results, and so it's often a good idea to keep the tail chunk of work and do it ourselves instead of needlessly idling the original thread.
Example 4 shows both techniques in action:
// Example 4: Improved parallel code for MyApp 2.0
//
int NetSales() {
// perform the subcomputations concurrently
future wholesale = pool.run( [] { CalcWholesale(); } );
future retail = pool.run( [] { CalcRetail(); } );
int returns = TotalReturns(); // keep the tail work
// now block for the results—-wait once, not twice
wait_all( wholesale, retail );
return wholesale.value() + retail.value() - returns;
}
Naturally, what matters most is not the total overhead, but how big it is compared to the total work. We want the cost of each chunk of work to be significantly larger than the cost of performing it asynchronously instead of synchronously.
The Cost of Unrealized Concurrency
A key cost in today's environments is the cost of unrealized concurrency. What's the cost of our parallel code compared to the sequential code in the case when the parallel algorithm actually ends up running sequentially? For example, what happens if our parallel-ready code executes on a machine with just one core, so that we don't actually get to realize the concurrency because the tasks end up running sequentially anyway (e.g., there's only one pool thread)? We've added the overhead to express concurrency, and we pay for that overhead even if we don't get to benefit from it on a particular system.
If Example 1 is the code we shipped in MyApp version 1.0 and Example 4 is what we'll ship in MyApp 2.0, then an existing customer with a legacy single-core machine may find that the new application is actually slower than the old one, even though the new application will run better when more cores are available. To mitigate this on low- and single-core machines, we can reduce the overhead by adjusting granularity to use fewer and larger chunks of work, or even switch to a sequential implementation.
The Double-Edged Sword of Fine-Grainedness
Even in Example 4, the code only scales to at most three cores. Can we do better? Yes, if we can make the work more fine-grained, slicing the work to be done more finely into smaller chunks. One approach in Example 4 would be to apply similar techniques within CalcWholesale, CalcRetail, and TotalReturns to further decompose their work into concurrent subtasks. Even better is when we can exploit natural parallelism in algorithms (e.g., divide-and-conquer algorithms like quicksort) and data structures (e.g., trees and graphs) to subdivide our work in a way that scales with the amount of data.
But now we encounter a fundamental tension between scalability and overhead: The smaller the chunks of work, the more chunks we have and the more easily we can distribute them to utilize larger number of cores to get an answer faster—but the more the overhead per chunk starts to dominate in our performance costs.
Let's consider quicksort as a common example. Can you spot the performance flaws in this code?
// Example 5 (flawed, today): Naïve parallel quicksort,
// sorts left and right subranges in parallel
//
void ParallelQuicksort( Iterator first, Iterator last ) {
Iterator pivot = Partition( first, last );
future f1 = pool.run( [&]{ ParallelQuicksort( first, pivot ); } );
future f2 = pool.run( [&]{ ParallelQuicksort( pivot, last ); } );
wait_all( f1, f2 );
}
We can improve this in at least two ways. First, as noted earlier, we should take the tail chunk of work and do it ourselves to mitigate the overhead of shipping work to a pool thread in today's environments. Second, we can notice that most of the spun-off work will be at the leaves of the computation containing subranges of a few elements or even just one. Even in sequential code, it's typical to switch to something like bubble sort when the subrange's size falls below a threshold; similarly in parallel code, for small ranges we should switch to a sequential sort. Example 6 incorporates both of these changes:
// Example 6: An improved parallel quicksort, still
// sorts subranges in parallel but more efficiently
//
void ParallelQuicksort( Iterator first, Iterator last ) {
if( distance(first,last) <= threshold ) {
SequentialSort( first, last );
} else {
Iterator pivot = Partition( first, last );
future f1 = pool.run( [&]{ ParallelQuicksort( first, pivot ); } );
ParallelQuicksort( pivot, last );
f1.wait();
}
}
In general, we want to slice the work as finely as possible, but not to the point where the work is comparable in size to the overhead.
Forward-Looking Note: Work Stealing
Future runtime systems will significantly drive down all of these costs, including the overhead per chunk and the cost of unrealized concurrency, to the point where we will often be able to ignore it and blithely write code like Example 5 without worrying about its performance most of the time.
The basic idea is to use work stealing whereby default "potentially asynchronous work" is actually not shipped elsewhere, but rather queued up to be executed on the original thread. Only if another core runs out of work and sees that our thread has waiting queued work to be performed, will the work be "stolen" and efficiently shipped to the other thread. The idea is to drive down the cost of unrealized concurrency by only actually incurring the overhead of running on another core if it's worth doing at that particular instant—on this hardware and with this amount of other work currently occupying the machine; and the very next execution of the very same function on the very same hardware might not steal, if all the cores are already busy. Sample current and upcoming technologies that feature work stealing runtimes include: Intel Threading Building Blocks; Microsoft's Parallel Patterns Library, Task Parallel Library, and PLINQ; Java 7's Fork/Join framework; and the granddaddy of them all, Cilk, which popularized the technique among implementers.
On Deck
To understand your code's scalability, first identify the key variables that affect the workload, and then measure throughput for workloads with different combinations of those variables. Look for scalability, and how it hits the contention and oversubscription barriers. Prefer thread pools (today) and work stealing (tomorrow).
Next month, we're going to apply the tools we discussed this month to analyze the performance impact of specific optimizations on concrete code. Fasten your seat belts. Sphere: Related Content
Understanding parallel performance. How do you know when good is good enough?
Herb is a bestselling author and consultant on software development topics, and a software architect at Microsoft. He can be contacted at www.gotw.ca.
--------------------------------------------------------------------------------
Let's say that we've slickly written our code to apply divide-and-conquer algorithms and concurrent data structures and parallel traversals and all our other cool tricks that make our code wonderfully scalable in theory. Question: How do we know how well we've actually succeeded? Do we really know, or did we just try a couple of tests on a quad-core that looked reasonable and call it good? What key factors must we measure to understand our code's performance, and answer not only whether our code scales, but quantify how well under different circumstances and workloads? What costs of concurrency do we have to take into account?
This month, I'll summarize some key issues we need to keep in mind to accurately analyze the real performance of our parallel code. I'll list some basic considerations, and then some common costs. Next month, I have a treat in store: We'll take some real code and apply these techniques to analyze its performance in detail as we successively apply a number of optimizations and measure how much each one actually buys us, under what conditions and in what directions, and why.
Fundamentals
To understand our code's scalability, we need to know what to measure and what to look for in the results. First, identify the workload variables: What are the key different kinds of work and/or data? For example, we may want to measure how well a producer-consumer system performs with varying numbers of producer threads and consumer threads, or measure a container by how well it scales when holding data items of different sizes.
Second, use stress tests to measure throughput, or the total amount of work accomplished per unit time, while varying each of these dimensions so that we can measure their relative impact. Look for scalability trends, the change in throughput as we add hardware resources: Can we effectively use more cores to either get the answer faster or to get more work done?
Figures 1 and 2 show two useful visualization tools that will help us understand our code's parallel performance. Figure 1 shows a sample scatter graph that charts throughput results for a selected algorithm against different numbers of two kinds of workers: producer threads and consumer threads. The larger the bubble, the greater the throughput. We can get a sense of how this particular algorithm scales, and in what directions, by examining how and where throughput grows and shrinks. In this example, we have good scalability up to a total of about 15 threads before we start to peak and realize no further gains, and we can see that scalability is better when there are somewhat more producers than consumers in the system.
Figure 1: Sample graph for measuring scalability of the same algorithm for different workloads.
Figure 2 directly compares different candidate algorithms running the same kind of workload. Peak throughput occurs, naturally enough, at the peak of each curve. Scalability shows up directly as the left-hand ascending side of the curve; the steeper it is, and the farther to the right that it goes before topping out, the more scalable our code will be. Here, the blue algorithm demonstrates the horror of negative scalability; it actually manages to get less work done using additional cores, which probably won't earn us our next raise.
Figure 2: Sample graph for measuring scalability of alternative algorithms for the same workload.
But both figures also let us see two basic penalties. Contention arises when different workers interfere with each other by fighting for resources, such as contending for mutexes, cache space, or cache lines via false sharing. In the most extreme case, adding a new worker might actually cost more in contention than it adds in extra work, resulting in less total work being done, and we can see this extreme effect in both graphs: In Figure 1, in several directions we reach areas where adding more workers makes total throughput actually go down. In Figure 2, we see the same effect in the form of the right-hand downslope where adding more work actually decreases throughput even when there are otherwise-idle cores. But Figure 2 also lets us more clearly see the effect of contention before it gets that far: On the left-hand upslope, even as throughout is still rising, the rate of increase is slowing down as the curve begins to bend. That's a classic effect of growing contention.
The other basic penalty is oversubscription, or having more CPU-bound work ready to execute than we have available hardware. These samples were taken from test runs on a 24-core machine; sure enough, in Figure 1 we see a faint diagonal line where #producers + #consumers = 24, above which throughput is noticeably thinner; sometimes the result is even more dramatic. Similarly, in Figure 2 we see even the best algorithms can't scale beyond the available cores, and incur a penalty for trying to exceed that number by adding contention at least for CPU time and often also for other resources.
With these fundamentals in mind, let's consider a few specific costs that arise and impact scalability because of contention, oversubscription, and other effects.
Sources of Overhead, and Threads versus Pools
We incur a basic concurrency overhead from just expressing code in a parallel way. Consider the following toy example that performs three independent subcomputations to generate a result:
// Example 1: Sequential code in MyApp 1.0
//
int NetSales() {
int wholesale = CalcWholesale();
int retail = CalcRetail();
int returns = TotalReturns();
return wholesale + retail - returns;
}
Assume that this code is entirely CPU-bound, and that the subcomputations really are independent and free of side effects on each other. Then we can get the answer faster on more cores by running the subparts in parallel. Here's some pseudocode showing how to accomplish this using futures and the convenience of lambda functions, but with a common mistake:
// Example 2 (flawed): Naïve parallel code for MyApp 2.0?
//
int NetSales() {
// perform the subcomputations concurrently
future
future
future
// now block for the results
return wholesale.value() + retail.value() - returns.value();
}
Seeing the words new Thread explicitly in code is often an indicator that code may not be as scalable as it could be. In most environments, it's much less efficient to spin up a new thread for each piece of work than to run it on a thread pool: First, spinning up a new thread and throwing it away again each time incurs substantially more overhead than giving work to an existing pool thread. Second, spinning up a number of threads and turning them loose to fight for available cores via the operating system scheduler can cause needless contention when we spin up more work than there are cores currently available, which can happen not only on low-core hardware but also on many-core hardware if the application or the system happens to be doing a lot of other work at that moment. Both situations are different kinds of oversubscription, and some threads will have to incur extra context-switching to interleave on an available core. Instead, sharing the core by running one after the other would be both more efficient and more cache-friendly.
Thread pools address these problems because a pool is designed to "rightsize" itself to the amount of hardware concurrency available on the machine. The pool will automatically try to maintain exactly one ready thread per core, and if there is more work than cores the pool naturally queues the work up and lets the machine concentrate on only as many tasks at a time as it has cores to perform. Here's pseudocode for a revised version that uses pools instead:
// Example 3 (better): Partly fixed parallel code for MyApp 2.0
//
int NetSales() {
// perform the subcomputations concurrently
future
future
future
// now block for the results
return wholesale.value() + retail.value() - returns.value();
}
The good news is that this can enable us to get the answer faster on more cores, at least up to three cores. But there are costs too: With today's thread pools, we typically pay a tax of two context switches when we ship work over to a pool thread and then ship the result back. How can we reduce this cost?
Reducing Context Switches
We can eliminate some of the context switches by being smarter about the last line that combines the results, because just calling .value() three times may wake up the calling thread twice only to immediately have to sleep again; instead, use a "wait for a group of futures" facility if your futures library has one, as a well-written version can eliminate the needless wakeups.
We can also avoid a switch by observing that the calling thread isn't going to do anything anyway besides wait for the results, and so it's often a good idea to keep the tail chunk of work and do it ourselves instead of needlessly idling the original thread.
Example 4 shows both techniques in action:
// Example 4: Improved parallel code for MyApp 2.0
//
int NetSales() {
// perform the subcomputations concurrently
future
future
int returns = TotalReturns(); // keep the tail work
// now block for the results—-wait once, not twice
wait_all( wholesale, retail );
return wholesale.value() + retail.value() - returns;
}
Naturally, what matters most is not the total overhead, but how big it is compared to the total work. We want the cost of each chunk of work to be significantly larger than the cost of performing it asynchronously instead of synchronously.
The Cost of Unrealized Concurrency
A key cost in today's environments is the cost of unrealized concurrency. What's the cost of our parallel code compared to the sequential code in the case when the parallel algorithm actually ends up running sequentially? For example, what happens if our parallel-ready code executes on a machine with just one core, so that we don't actually get to realize the concurrency because the tasks end up running sequentially anyway (e.g., there's only one pool thread)? We've added the overhead to express concurrency, and we pay for that overhead even if we don't get to benefit from it on a particular system.
If Example 1 is the code we shipped in MyApp version 1.0 and Example 4 is what we'll ship in MyApp 2.0, then an existing customer with a legacy single-core machine may find that the new application is actually slower than the old one, even though the new application will run better when more cores are available. To mitigate this on low- and single-core machines, we can reduce the overhead by adjusting granularity to use fewer and larger chunks of work, or even switch to a sequential implementation.
The Double-Edged Sword of Fine-Grainedness
Even in Example 4, the code only scales to at most three cores. Can we do better? Yes, if we can make the work more fine-grained, slicing the work to be done more finely into smaller chunks. One approach in Example 4 would be to apply similar techniques within CalcWholesale, CalcRetail, and TotalReturns to further decompose their work into concurrent subtasks. Even better is when we can exploit natural parallelism in algorithms (e.g., divide-and-conquer algorithms like quicksort) and data structures (e.g., trees and graphs) to subdivide our work in a way that scales with the amount of data.
But now we encounter a fundamental tension between scalability and overhead: The smaller the chunks of work, the more chunks we have and the more easily we can distribute them to utilize larger number of cores to get an answer faster—but the more the overhead per chunk starts to dominate in our performance costs.
Let's consider quicksort as a common example. Can you spot the performance flaws in this code?
// Example 5 (flawed, today): Naïve parallel quicksort,
// sorts left and right subranges in parallel
//
void ParallelQuicksort( Iterator first, Iterator last ) {
Iterator pivot = Partition( first, last );
future f1 = pool.run( [&]{ ParallelQuicksort( first, pivot ); } );
future f2 = pool.run( [&]{ ParallelQuicksort( pivot, last ); } );
wait_all( f1, f2 );
}
We can improve this in at least two ways. First, as noted earlier, we should take the tail chunk of work and do it ourselves to mitigate the overhead of shipping work to a pool thread in today's environments. Second, we can notice that most of the spun-off work will be at the leaves of the computation containing subranges of a few elements or even just one. Even in sequential code, it's typical to switch to something like bubble sort when the subrange's size falls below a threshold; similarly in parallel code, for small ranges we should switch to a sequential sort. Example 6 incorporates both of these changes:
// Example 6: An improved parallel quicksort, still
// sorts subranges in parallel but more efficiently
//
void ParallelQuicksort( Iterator first, Iterator last ) {
if( distance(first,last) <= threshold ) {
SequentialSort( first, last );
} else {
Iterator pivot = Partition( first, last );
future f1 = pool.run( [&]{ ParallelQuicksort( first, pivot ); } );
ParallelQuicksort( pivot, last );
f1.wait();
}
}
In general, we want to slice the work as finely as possible, but not to the point where the work is comparable in size to the overhead.
Forward-Looking Note: Work Stealing
Future runtime systems will significantly drive down all of these costs, including the overhead per chunk and the cost of unrealized concurrency, to the point where we will often be able to ignore it and blithely write code like Example 5 without worrying about its performance most of the time.
The basic idea is to use work stealing whereby default "potentially asynchronous work" is actually not shipped elsewhere, but rather queued up to be executed on the original thread. Only if another core runs out of work and sees that our thread has waiting queued work to be performed, will the work be "stolen" and efficiently shipped to the other thread. The idea is to drive down the cost of unrealized concurrency by only actually incurring the overhead of running on another core if it's worth doing at that particular instant—on this hardware and with this amount of other work currently occupying the machine; and the very next execution of the very same function on the very same hardware might not steal, if all the cores are already busy. Sample current and upcoming technologies that feature work stealing runtimes include: Intel Threading Building Blocks; Microsoft's Parallel Patterns Library, Task Parallel Library, and PLINQ; Java 7's Fork/Join framework; and the granddaddy of them all, Cilk, which popularized the technique among implementers.
On Deck
To understand your code's scalability, first identify the key variables that affect the workload, and then measure throughput for workloads with different combinations of those variables. Look for scalability, and how it hits the contention and oversubscription barriers. Prefer thread pools (today) and work stealing (tomorrow).
Next month, we're going to apply the tools we discussed this month to analyze the performance impact of specific optimizations on concrete code. Fasten your seat belts. Sphere: Related Content
Fundamental Concepts of Parallel Programming
By Richard Gerber and Andrew Binstock , June 01, 2010
Parallel programming requires designers to rethink the idea of process flow
Richard Gerber has worked on numerous multimedia projects, 3D libraries, and computer games for Intel. Andrew Binstock is the principal analyst at Pacific Data Works and author of "Practical Algorithms for Programmers". They are the authors of Programming with Hyper-Threading Technology.
--------------------------------------------------------------------------------
Parallel programming makes use of threads to enable two or more operations to proceed in parallel. The entire concept of parallel programming centers on the design, development, and deployment of threads within an application and the coordination between threads and their respective operations. This article examines how to break up traditional programming tasks into chunks that are suitable for threading. It then demonstrates how to create threads, how these threads are typically run on multiprocessing systems, and how threads are run using HT Technology.
Designing for Threads
Developers unacquainted with parallel programming are generally comfortable with traditional programming models such as single-threaded declarative programs and object-oriented programming (OOP). In both cases, a program begins at a defined point -- such as main() -- and works through a series of tasks in succession. If the program relies on user interaction, the main processing instrument is a loop in which user events are handled. From each allowed event -- a button click, for example -- an established sequence of actions is performed that ultimately ends with a wait for the next user action.
When designing such programs, developers enjoy a relatively simple programming world because only one thing is happening at any given moment. If program tasks must be scheduled in a specific way, it's because the developer chooses a certain order to the activities, which themselves are designed to flow naturally into one another. At any point in the process, one step generally flows into the next, leading up to a predictable conclusion, based on predetermined parameters -- the job completed -- or user actions.
Moving from this model to parallel programming requires designers to rethink the idea of process flow. Now, they must try to identify which activities can be executed in parallel. To do so, they must begin to see their programs as a series of discrete tasks with specific dependencies between them. The process of breaking programs down into these individual tasks is known as decomposition. Decomposition comes in three flavors: functional, data, and a variant of functional decomposition, called producer/consumer. As you shall see shortly, these different forms of decomposition mirror different types of programming activities.
Functional Decomposition
Decomposing a program by the functions it performs is called "functional decomposition", also called "task-level parallelism". It is one of the most common ways to achieve parallel execution. Using this approach, individual tasks are catalogued. If two of them can run concurrently, they are scheduled to do so by the developer. Running tasks in parallel this way usually requires slight modifications to the individual functions to avoid conflicts and reflect that these tasks are no longer sequential.
If discussing gardening, functional decomposition would suggest that gardeners be assigned tasks based on the nature of the activity: If two gardeners arrived at a client's home, one might mow the lawn while the other weeded. Mowing and weeding are separate functions broken out as such. To accomplish them, the gardeners would make sure to have some coordination between them, so that the weeder is not sitting in the middle of a lawn that needs to be mowed.
In programming terms, a good example of functional decomposition is word processing software, such as Microsoft Word. When a very long document is opened, the user can begin entering text right away. While the user is doing this, document pagination occurs in the background, as can readily be seen by the quickly increasing page count that appears in the status bar. Text entry and pagination are two separate tasks that, in the case of Word, Microsoft has broken out by function and run in parallel. Had it not done this, the user would be obliged to wait for the entire document to be paginated before being able to enter any text. Many of you will recall that this wait was common on early PC word processors.
Producer/Consumer
Producer/consumer (P/C) is so common a form of functional decomposition that it is best examined by itself. Here, the output of one task, the producer, becomes the input to another, the consumer. The important aspects of P/C are that both tasks are performed by different threads and the second one -- the consumer -- cannot start until the producer finishes some portion of its work.
Using the gardening example, one gardener prepares the tools -- puts gas in the mower, cleans the shears, and other similar tasks -- for both gardeners to use. No gardening can occur until this step is mostly finished, at which point the true gardening work can begin. The delay caused by the first task creates a pause for the second task, after which both tasks can continue in parallel. In computer terms, this particular model occurs frequently.
In common programming tasks, P/C occurs in several typical scenarios. For example, programs that must rely on the reading of a file are inherently in a P/C scenario: the results of the file I/O become the input to the next step, which might well be threaded. However, that step cannot begin until the reading is either complete or has progressed sufficiently for other processing to kick off. Another common programming example of P/C is parsing: an input file must be parsed, or analyzed semantically, before the back-end activities, such as code generation in a compiler, can begin.
The P/C model has several interesting dimensions absent in the other decompositions:
•The dependence created between consumer and producer can cause formidable delays if this model is not implemented correctly. A performance-sensitive design seeks to understand the exact nature of the dependence and diminish the delay it imposes. It also aims to avoid situations in which consumer threads are idling while waiting for producer threads.
•In the ideal scenario, the hand-off between producer and consumer is completely clean, as in the example of the file parser. The output is context-independent and the consumer has no need to know anything about the producer. Many times, however, the producer and consumer components do not enjoy such a clean division of labor, and scheduling their interaction requires careful planning.
•If the consumer is finishing up while the producer is completely done, one thread remains idle while other threads are busy working away. This issue violates an important objective of parallel processing, which is to balance loads so that all available threads are kept busy. Because of the logical relationship between these threads, it can be very difficult to keep threads equally occupied in a P/C model.
Data Decomposition
Data decomposition, also known as "data-level parallelism", breaks down tasks by the data they work on, rather than by the nature of the task. Programs that are broken down via data decomposition generally have many threads performing the same work, just on different data items. For example, consider a program that is recalculating the values in a large spreadsheet. Rather than have one thread perform all the calculations, data decomposition would suggest having two threads, each performing half the calculations, or n threads performing 1/nth the work.
If the gardeners used the principle of data decomposition to divide their work, they would both mow half the property and then both weed half the flower beds. As in computing, determining which form of decomposition is more effective depends a lot on the constraints of the system. For example, if the area to mow is so small that it could not warrant two mowers, it would be better done by just one gardener -- functional decomposition -- and data decomposition could be applied to other task sequences, such as when the mowing is done and both gardeners begin weeding in parallel.
Implications of Different Decompositions
Different decompositions provide different benefits. If the goal, for example, is ease of programming and tasks can be neatly partitioned functionally, then functional decomposition is more often than not the winner. Data decomposition adds some additional code-level complexity to tasks, so it is reserved for cases where the data is easily divided and performance is important.
However, the most common reason for threading an application is performance. And here the choice of decompositions is more difficult. In many instances, the choice is dictated by the problem domain: some tasks are much better suited to one type of decomposition. But some tasks have no clear bias. Consider for example, processing images in a video stream. In formats with no dependency between frames, programmers have a choice of decompositions. Should they choose functional, in which one thread does decoding, another color balancing, and so on, or data decomposition, in which each thread does all the work on one frame and then moves on to the next? To return to the analogy of the gardeners, the query would take this form: If two gardeners need to mow two lawns and weed two flower beds, how should they proceed? Should one gardener only mow -- that is, they choose functional -- or should both gardeners mow together then weed together?
In some cases -- for instance when a resource constraint exists, such as only one mower -- the answer emerges quickly. In others where each gardener has a mower, the answer comes only through careful analysis of the constituent activities. In the case of the gardeners, functional decomposition looks better, because the start-up time for mowing is saved if only one mower is in use. Ultimately, the right answer for parallel programming is determined by careful planning and testing. The empirical approach plays a more significant role in design choices in parallel programming than it does in standard single-threaded programming.
As mentioned previously, P/C situations are often unavoidable, but nearly always detrimental to performance. The P/C relation runs counter to parallel programming because it inherently makes two activities sequential. This sequential aspect occurs twice in a P/C situation: once at the start, when the consumer thread is idling as it waits for the producer to produce some data, and again at the end, when the producer thread has completed its work and is idling as it waits for the consumer thread to complete. Hence, developers who recognize a P/C relation are wise to try to find another solution that can be parallelized. Unfortunately, in many cases, this cannot be done. Where P/C cannot be avoided, developers should work to minimize the delay caused by forcing the consumer to wait for the producer. One approach is to shorten the activity required before the consumer can start up. For example, if the producer must read a file into a buffer, can the consumer be launched after the first read operation rather than waiting for the entire file to be read? By this means, the latency caused by the producer can be diminished. Table 1 summarizes these forms of decomposition.
Challenges
The use of threads enables performance to be significantly improved by allowing two or more activities to occur simultaneously. However, developers cannot fail to recognize that threads add a measure of complexity that requires thoughtful consideration to navigate correctly. This complexity arises from the inherent fact that more than one activity is occurring in the program. Managing simultaneous activities and their possible interaction leads to confronting four types of problems:
•Synchronization is the process by which two or more threads coordinate their activities. For example, one thread waits for another to finish a task before continuing.
•Resource limitations refer to the inability of threads to work concurrently due to the constraints of a needed resource. For example, a hard drive can only read from one physical location at a time, which deprives threads of the ability to use the drive in parallel.
•Load balancing refers to the distribution of work across multiple threads so that they all perform roughly the same amount of work.
•Scalability is the challenge of making efficient use of a larger number of threads when software is run on more-capable systems. For example, if a program is written to make good use of four processors, will it scale properly when run on a system with eight processors?
Each of these issues needs to be handled carefully to maximize application performance.
--------------------------------------------------------------------------------
This article is based on material found in book Programming with Hyper-Threading Technology by Andrew Binstock and Richard Gerber. Sphere: Related Content
Parallel programming requires designers to rethink the idea of process flow
Richard Gerber has worked on numerous multimedia projects, 3D libraries, and computer games for Intel. Andrew Binstock is the principal analyst at Pacific Data Works and author of "Practical Algorithms for Programmers". They are the authors of Programming with Hyper-Threading Technology.
--------------------------------------------------------------------------------
Parallel programming makes use of threads to enable two or more operations to proceed in parallel. The entire concept of parallel programming centers on the design, development, and deployment of threads within an application and the coordination between threads and their respective operations. This article examines how to break up traditional programming tasks into chunks that are suitable for threading. It then demonstrates how to create threads, how these threads are typically run on multiprocessing systems, and how threads are run using HT Technology.
Designing for Threads
Developers unacquainted with parallel programming are generally comfortable with traditional programming models such as single-threaded declarative programs and object-oriented programming (OOP). In both cases, a program begins at a defined point -- such as main() -- and works through a series of tasks in succession. If the program relies on user interaction, the main processing instrument is a loop in which user events are handled. From each allowed event -- a button click, for example -- an established sequence of actions is performed that ultimately ends with a wait for the next user action.
When designing such programs, developers enjoy a relatively simple programming world because only one thing is happening at any given moment. If program tasks must be scheduled in a specific way, it's because the developer chooses a certain order to the activities, which themselves are designed to flow naturally into one another. At any point in the process, one step generally flows into the next, leading up to a predictable conclusion, based on predetermined parameters -- the job completed -- or user actions.
Moving from this model to parallel programming requires designers to rethink the idea of process flow. Now, they must try to identify which activities can be executed in parallel. To do so, they must begin to see their programs as a series of discrete tasks with specific dependencies between them. The process of breaking programs down into these individual tasks is known as decomposition. Decomposition comes in three flavors: functional, data, and a variant of functional decomposition, called producer/consumer. As you shall see shortly, these different forms of decomposition mirror different types of programming activities.
Functional Decomposition
Decomposing a program by the functions it performs is called "functional decomposition", also called "task-level parallelism". It is one of the most common ways to achieve parallel execution. Using this approach, individual tasks are catalogued. If two of them can run concurrently, they are scheduled to do so by the developer. Running tasks in parallel this way usually requires slight modifications to the individual functions to avoid conflicts and reflect that these tasks are no longer sequential.
If discussing gardening, functional decomposition would suggest that gardeners be assigned tasks based on the nature of the activity: If two gardeners arrived at a client's home, one might mow the lawn while the other weeded. Mowing and weeding are separate functions broken out as such. To accomplish them, the gardeners would make sure to have some coordination between them, so that the weeder is not sitting in the middle of a lawn that needs to be mowed.
In programming terms, a good example of functional decomposition is word processing software, such as Microsoft Word. When a very long document is opened, the user can begin entering text right away. While the user is doing this, document pagination occurs in the background, as can readily be seen by the quickly increasing page count that appears in the status bar. Text entry and pagination are two separate tasks that, in the case of Word, Microsoft has broken out by function and run in parallel. Had it not done this, the user would be obliged to wait for the entire document to be paginated before being able to enter any text. Many of you will recall that this wait was common on early PC word processors.
Producer/Consumer
Producer/consumer (P/C) is so common a form of functional decomposition that it is best examined by itself. Here, the output of one task, the producer, becomes the input to another, the consumer. The important aspects of P/C are that both tasks are performed by different threads and the second one -- the consumer -- cannot start until the producer finishes some portion of its work.
Using the gardening example, one gardener prepares the tools -- puts gas in the mower, cleans the shears, and other similar tasks -- for both gardeners to use. No gardening can occur until this step is mostly finished, at which point the true gardening work can begin. The delay caused by the first task creates a pause for the second task, after which both tasks can continue in parallel. In computer terms, this particular model occurs frequently.
In common programming tasks, P/C occurs in several typical scenarios. For example, programs that must rely on the reading of a file are inherently in a P/C scenario: the results of the file I/O become the input to the next step, which might well be threaded. However, that step cannot begin until the reading is either complete or has progressed sufficiently for other processing to kick off. Another common programming example of P/C is parsing: an input file must be parsed, or analyzed semantically, before the back-end activities, such as code generation in a compiler, can begin.
The P/C model has several interesting dimensions absent in the other decompositions:
•The dependence created between consumer and producer can cause formidable delays if this model is not implemented correctly. A performance-sensitive design seeks to understand the exact nature of the dependence and diminish the delay it imposes. It also aims to avoid situations in which consumer threads are idling while waiting for producer threads.
•In the ideal scenario, the hand-off between producer and consumer is completely clean, as in the example of the file parser. The output is context-independent and the consumer has no need to know anything about the producer. Many times, however, the producer and consumer components do not enjoy such a clean division of labor, and scheduling their interaction requires careful planning.
•If the consumer is finishing up while the producer is completely done, one thread remains idle while other threads are busy working away. This issue violates an important objective of parallel processing, which is to balance loads so that all available threads are kept busy. Because of the logical relationship between these threads, it can be very difficult to keep threads equally occupied in a P/C model.
Data Decomposition
Data decomposition, also known as "data-level parallelism", breaks down tasks by the data they work on, rather than by the nature of the task. Programs that are broken down via data decomposition generally have many threads performing the same work, just on different data items. For example, consider a program that is recalculating the values in a large spreadsheet. Rather than have one thread perform all the calculations, data decomposition would suggest having two threads, each performing half the calculations, or n threads performing 1/nth the work.
If the gardeners used the principle of data decomposition to divide their work, they would both mow half the property and then both weed half the flower beds. As in computing, determining which form of decomposition is more effective depends a lot on the constraints of the system. For example, if the area to mow is so small that it could not warrant two mowers, it would be better done by just one gardener -- functional decomposition -- and data decomposition could be applied to other task sequences, such as when the mowing is done and both gardeners begin weeding in parallel.
Implications of Different Decompositions
Different decompositions provide different benefits. If the goal, for example, is ease of programming and tasks can be neatly partitioned functionally, then functional decomposition is more often than not the winner. Data decomposition adds some additional code-level complexity to tasks, so it is reserved for cases where the data is easily divided and performance is important.
However, the most common reason for threading an application is performance. And here the choice of decompositions is more difficult. In many instances, the choice is dictated by the problem domain: some tasks are much better suited to one type of decomposition. But some tasks have no clear bias. Consider for example, processing images in a video stream. In formats with no dependency between frames, programmers have a choice of decompositions. Should they choose functional, in which one thread does decoding, another color balancing, and so on, or data decomposition, in which each thread does all the work on one frame and then moves on to the next? To return to the analogy of the gardeners, the query would take this form: If two gardeners need to mow two lawns and weed two flower beds, how should they proceed? Should one gardener only mow -- that is, they choose functional -- or should both gardeners mow together then weed together?
In some cases -- for instance when a resource constraint exists, such as only one mower -- the answer emerges quickly. In others where each gardener has a mower, the answer comes only through careful analysis of the constituent activities. In the case of the gardeners, functional decomposition looks better, because the start-up time for mowing is saved if only one mower is in use. Ultimately, the right answer for parallel programming is determined by careful planning and testing. The empirical approach plays a more significant role in design choices in parallel programming than it does in standard single-threaded programming.
As mentioned previously, P/C situations are often unavoidable, but nearly always detrimental to performance. The P/C relation runs counter to parallel programming because it inherently makes two activities sequential. This sequential aspect occurs twice in a P/C situation: once at the start, when the consumer thread is idling as it waits for the producer to produce some data, and again at the end, when the producer thread has completed its work and is idling as it waits for the consumer thread to complete. Hence, developers who recognize a P/C relation are wise to try to find another solution that can be parallelized. Unfortunately, in many cases, this cannot be done. Where P/C cannot be avoided, developers should work to minimize the delay caused by forcing the consumer to wait for the producer. One approach is to shorten the activity required before the consumer can start up. For example, if the producer must read a file into a buffer, can the consumer be launched after the first read operation rather than waiting for the entire file to be read? By this means, the latency caused by the producer can be diminished. Table 1 summarizes these forms of decomposition.
Challenges
The use of threads enables performance to be significantly improved by allowing two or more activities to occur simultaneously. However, developers cannot fail to recognize that threads add a measure of complexity that requires thoughtful consideration to navigate correctly. This complexity arises from the inherent fact that more than one activity is occurring in the program. Managing simultaneous activities and their possible interaction leads to confronting four types of problems:
•Synchronization is the process by which two or more threads coordinate their activities. For example, one thread waits for another to finish a task before continuing.
•Resource limitations refer to the inability of threads to work concurrently due to the constraints of a needed resource. For example, a hard drive can only read from one physical location at a time, which deprives threads of the ability to use the drive in parallel.
•Load balancing refers to the distribution of work across multiple threads so that they all perform roughly the same amount of work.
•Scalability is the challenge of making efficient use of a larger number of threads when software is run on more-capable systems. For example, if a program is written to make good use of four processors, will it scale properly when run on a system with eight processors?
Each of these issues needs to be handled carefully to maximize application performance.
--------------------------------------------------------------------------------
This article is based on material found in book Programming with Hyper-Threading Technology by Andrew Binstock and Richard Gerber. Sphere: Related Content
Other Voices: An HTML5 Primer
By Michael Mullany, June 03, 2010
It's easy to get lost in the welter of HTML5-related standards
With Google and Apple strongly supporting HTML5 as the solution for rich applications for the Internet, it's become the buzzword of the month -- particularly after Google I/O. Given its hot currency, though, it's not surprising that the term is starting to become unhinged from reality. Already, we're starting to see job postings requiring "HTML5 experience," and people pointing to everything from simple JavaScript animations to CSS3 effects as examples of HTML5. Just as "AJAX" and "Web 2.0" became handy (and widely misused) shorthand for "next-generation" web development in the mid-2000's, HTML5 is now becoming the next overloaded term. And although there are many excellent resources out there describing details of HTML5, including the core specification itself, they are generally technical and many of them are now out of synch with the current state of the specs. So, I thought a primer on HTML5 might be in order.
HTML5 Core vs. The HTML5 Family
When many folks say "HTML5" (particularly when this is followed with "will replace Flash"), they mean (or at least I think they mean), the broad collection of next-generation technologies that are now being implemented in the Webkit-based browsers (Safari and Chrome), Opera and Firefox. Some of these (like CS S3) were never part of the HTML5 standards process, and some of these (like web workers) were originally part of the spec but were spun out separately. We think the right way to refer to this collection is "the HTML5 Family." The family members of HTML5 (like all families) are in very different stages of maturity and implementation. Some are fully implemented in latest revision browsers, some may never see the light of day, and some will become altered beyond recognition before they show up in the mainstream. As mentioned before, the core W3C HTML5 spec is just one part of the collection of related technologies. I list the following specs as members of the HTML5 Family (more or less):
•The HTML5 spec
•Cascading Style Sheets Version 3 (CSS3)
•Web Workers
•Web Storage
•Web SQL Database
•Web Sockets
•Geolocation
•Microdata
•Device API and File API
The Core HTML5 Spec
The central thrust of the core HTML5 spec is to evolve HTML from the XML-centric approach of the early 2000's that had poor traction among browser makers and developers. HTML5 substantially changes many aspects of the language, although most changes have not resulted in new features visible to most end-users. These "user-invisible" changes include a new content model, accessibility features and browsing contexts. In many cases, HTML5 allows what is currently done with styling, JavaScript or server workarounds to be done in HTML. This results in cleaner, human-readable HTML. Today's blizzard of div tags is replaced with meaningful markup like nav and aside. For example, HTML5 adds semantic tags for common content elements: One specific example is a special form field for email addresses. Another specific example is new markup for menus and navigation sections. For forms, HTML5 adds support for PUT and DELETE form actions, which will simplify server side processing. It also provides native support for adding form elements dynamically, which currently has to be done in JavaScript.
For users, the highest impact change in HTML5 is the addition of audio and video tags and a standard 2D bitmap drawing format (canvas). HTML5 audio and video tags allow playback without the use of plugins, and Canvas allows rich 2D bitmapped graphics.
There are many other features in the HTML5 spec, including a drag-and-drop API, cross-document messaging, persistent content caching directives, and user-editable content. Support for them is still being added to the latest browser revisions. Some parts may still end up being discarded before final implementation.
Finally, HTML5 removes many presentational markup elements that littered earlier HTML specs, like center and font. It also disallows direct table styling, and instead, requires the use of CSS. Frames are also officially eliminated.
CSS3
A lot of what people think is HTML5, is actually CSS3, which is itself a collection of sub-specifications. These are in various states of completion and browser implementation. For example, CSS Animations and CSS Transitions are sub-specs that provide rich dynamic 2D animations and effects for elements. CSS 3D and 2D Transformations provide animations for boxed content. The CSS3 spec family also includes standards for richer layout control, borders and backgrounds (the highly desired "rounded corners" ). It also includes more niche capabilities such as Ruby (not the language, Ruby with a small "r" means visual hints for meaning or pronunciation often used in ideogram based languages), aural style sheets and scrolling marquees.
Web Workers
Web Workers let an application spawn tasks for the browser to work on in the background without freezing the execution of the main application. There are a few types of workers that can be created with slightly different behavior. The intent of web workers is to give application developers the ability to specify what tasks within the application are parallelizeable (in the small), so that the browser can better schedule work for the rapidly increasing core count of today's (and tomorrow's) multi-core processors.
Web Storage and Web SQL
Web Storage is one of the more exciting parts of the HTML5 Family. Web Storage allows a page to store string data in a key-value pair database, specific to that domain. There are two varieties of Web Storage, the first is sessionStorage, that persists data only for a single session (think of it as a more functional cookie storage mechanism). The second is localStorage which allows a domain to store data locally across browser sessions (and system reboots). When you add localStorage to the cache manifest from the main HTML5 spec, you have the ability to run an offline application. The Web Storage spec is itself separate from the Web SQL Database spec which provides for a full SQL-addressable database that is accessible offline. Although varieties of this spec are in implementation by browser makers, the standardization process is blocking on the need for a second interoperable implementation that is not based on SQLite (which all the current versions are.)
Web Sockets
The Web Sockets protocol is in the first stage of the standards process and has also been submitted as an IETF draft because it is a networking protocol. It defines a non-http-based asynchronous client/server protocol that can be used in place of the current AJAX methods for asynchronous server communication. It uses an initial http: request to bootstrap the new protocol.
And all the others…
Geolocation is a simple spec that provides a built-in a geolocation object that scripts can query. It also provides methods for defining location cache freshness requirements. This is fairly non-controversial and already in new browsers. File API allows single and multiple file uploads from the user desktop. It's unclear exactly who will support this, but there doesn't seem to be much confusion about what it's supposed to do. Microdata is a mechanism to allow communities of interest to mark up content with semantic tags (for example, tags that identify an address or a resume.) It doesn't specify what these semantic tags are, just how they should be implemented. Device APIs that allow web browser access to devices such as cameras, BlueTooth etc, are still an early work in progress. These hope to define standardized access to native hardware and sensitive data from web applications. Highest priority are a camera API, and APIs for contact list, SMS history etc. on mobile devices. From Google I/O it appears that Google is going to ship something sooner rather than later that allows camera access from a Chrome web application, but there have been no further details on this.
HTML5 Summed Up
It's easy to get lost in the welter of standards enumerated above. But stepping back you should get the sense that the HTML5 Family authors are on a mission to make web applications as powerful as native applications when it comes to user interface richness, offline capability and hardware access. Since HTML5 family apps will be deployed on the web, they'll have the added benefits that the web has always brought, which are:
• A universal client that works cross platform
• Easy searchability and indexing (including deep linking)
• The ability to trivially include third party services and mashups
• Zero hassle deployment and updating (after all, it's just on the web)
We're excited by our initial HTML5-based development, and we eagerly await these new features as they are implemented and stabilized in the latest browsers.
Recomended links about HTML5 Book: Deploying HTML5 Sphere: Related Content
It's easy to get lost in the welter of HTML5-related standards
With Google and Apple strongly supporting HTML5 as the solution for rich applications for the Internet, it's become the buzzword of the month -- particularly after Google I/O. Given its hot currency, though, it's not surprising that the term is starting to become unhinged from reality. Already, we're starting to see job postings requiring "HTML5 experience," and people pointing to everything from simple JavaScript animations to CSS3 effects as examples of HTML5. Just as "AJAX" and "Web 2.0" became handy (and widely misused) shorthand for "next-generation" web development in the mid-2000's, HTML5 is now becoming the next overloaded term. And although there are many excellent resources out there describing details of HTML5, including the core specification itself, they are generally technical and many of them are now out of synch with the current state of the specs. So, I thought a primer on HTML5 might be in order.
HTML5 Core vs. The HTML5 Family
When many folks say "HTML5" (particularly when this is followed with "will replace Flash"), they mean (or at least I think they mean), the broad collection of next-generation technologies that are now being implemented in the Webkit-based browsers (Safari and Chrome), Opera and Firefox. Some of these (like CS S3) were never part of the HTML5 standards process, and some of these (like web workers) were originally part of the spec but were spun out separately. We think the right way to refer to this collection is "the HTML5 Family." The family members of HTML5 (like all families) are in very different stages of maturity and implementation. Some are fully implemented in latest revision browsers, some may never see the light of day, and some will become altered beyond recognition before they show up in the mainstream. As mentioned before, the core W3C HTML5 spec is just one part of the collection of related technologies. I list the following specs as members of the HTML5 Family (more or less):
•The HTML5 spec
•Cascading Style Sheets Version 3 (CSS3)
•Web Workers
•Web Storage
•Web SQL Database
•Web Sockets
•Geolocation
•Microdata
•Device API and File API
The Core HTML5 Spec
The central thrust of the core HTML5 spec is to evolve HTML from the XML-centric approach of the early 2000's that had poor traction among browser makers and developers. HTML5 substantially changes many aspects of the language, although most changes have not resulted in new features visible to most end-users. These "user-invisible" changes include a new content model, accessibility features and browsing contexts. In many cases, HTML5 allows what is currently done with styling, JavaScript or server workarounds to be done in HTML. This results in cleaner, human-readable HTML. Today's blizzard of div tags is replaced with meaningful markup like nav and aside. For example, HTML5 adds semantic tags for common content elements: One specific example is a special form field for email addresses. Another specific example is new markup for menus and navigation sections. For forms, HTML5 adds support for PUT and DELETE form actions, which will simplify server side processing. It also provides native support for adding form elements dynamically, which currently has to be done in JavaScript.
For users, the highest impact change in HTML5 is the addition of audio and video tags and a standard 2D bitmap drawing format (canvas). HTML5 audio and video tags allow playback without the use of plugins, and Canvas allows rich 2D bitmapped graphics.
There are many other features in the HTML5 spec, including a drag-and-drop API, cross-document messaging, persistent content caching directives, and user-editable content. Support for them is still being added to the latest browser revisions. Some parts may still end up being discarded before final implementation.
Finally, HTML5 removes many presentational markup elements that littered earlier HTML specs, like center and font. It also disallows direct table styling, and instead, requires the use of CSS. Frames are also officially eliminated.
CSS3
A lot of what people think is HTML5, is actually CSS3, which is itself a collection of sub-specifications. These are in various states of completion and browser implementation. For example, CSS Animations and CSS Transitions are sub-specs that provide rich dynamic 2D animations and effects for elements. CSS 3D and 2D Transformations provide animations for boxed content. The CSS3 spec family also includes standards for richer layout control, borders and backgrounds (the highly desired "rounded corners" ). It also includes more niche capabilities such as Ruby (not the language, Ruby with a small "r" means visual hints for meaning or pronunciation often used in ideogram based languages), aural style sheets and scrolling marquees.
Web Workers
Web Workers let an application spawn tasks for the browser to work on in the background without freezing the execution of the main application. There are a few types of workers that can be created with slightly different behavior. The intent of web workers is to give application developers the ability to specify what tasks within the application are parallelizeable (in the small), so that the browser can better schedule work for the rapidly increasing core count of today's (and tomorrow's) multi-core processors.
Web Storage and Web SQL
Web Storage is one of the more exciting parts of the HTML5 Family. Web Storage allows a page to store string data in a key-value pair database, specific to that domain. There are two varieties of Web Storage, the first is sessionStorage, that persists data only for a single session (think of it as a more functional cookie storage mechanism). The second is localStorage which allows a domain to store data locally across browser sessions (and system reboots). When you add localStorage to the cache manifest from the main HTML5 spec, you have the ability to run an offline application. The Web Storage spec is itself separate from the Web SQL Database spec which provides for a full SQL-addressable database that is accessible offline. Although varieties of this spec are in implementation by browser makers, the standardization process is blocking on the need for a second interoperable implementation that is not based on SQLite (which all the current versions are.)
Web Sockets
The Web Sockets protocol is in the first stage of the standards process and has also been submitted as an IETF draft because it is a networking protocol. It defines a non-http-based asynchronous client/server protocol that can be used in place of the current AJAX methods for asynchronous server communication. It uses an initial http: request to bootstrap the new protocol.
And all the others…
Geolocation is a simple spec that provides a built-in a geolocation object that scripts can query. It also provides methods for defining location cache freshness requirements. This is fairly non-controversial and already in new browsers. File API allows single and multiple file uploads from the user desktop. It's unclear exactly who will support this, but there doesn't seem to be much confusion about what it's supposed to do. Microdata is a mechanism to allow communities of interest to mark up content with semantic tags (for example, tags that identify an address or a resume.) It doesn't specify what these semantic tags are, just how they should be implemented. Device APIs that allow web browser access to devices such as cameras, BlueTooth etc, are still an early work in progress. These hope to define standardized access to native hardware and sensitive data from web applications. Highest priority are a camera API, and APIs for contact list, SMS history etc. on mobile devices. From Google I/O it appears that Google is going to ship something sooner rather than later that allows camera access from a Chrome web application, but there have been no further details on this.
HTML5 Summed Up
It's easy to get lost in the welter of standards enumerated above. But stepping back you should get the sense that the HTML5 Family authors are on a mission to make web applications as powerful as native applications when it comes to user interface richness, offline capability and hardware access. Since HTML5 family apps will be deployed on the web, they'll have the added benefits that the web has always brought, which are:
• A universal client that works cross platform
• Easy searchability and indexing (including deep linking)
• The ability to trivially include third party services and mashups
• Zero hassle deployment and updating (after all, it's just on the web)
We're excited by our initial HTML5-based development, and we eagerly await these new features as they are implemented and stabilized in the latest browsers.
Recomended links about HTML5 Book: Deploying HTML5 Sphere: Related Content
Suscribirse a:
Entradas (Atom)