January 23, 2018

MORE is NOT always BETTER

Parallel machines with several dozen cores have become a common tool for solving various engineering design problems. Being a powerful integration platform for automation, data analysis and optimization, pSeven provides a set of possibilities, both at the internal algorithmic core level and at the external GUI level, to utilize the performance of your machines to the full extent. The Blog articles HPC Made Easy, Parallelization Made Easy, and Parallelization in pSeven: Optimizer Batch Mode, published on our website, provide the essential information on how to properly configure the parallelization capabilities of pSeven.

It should be stressed that naive parallelization may cause significant overall performance degradation, especially when the required degree of parallelism is high. Many conditions should be taken into account to get the best performance from a parallelized application, primarily the features of the problem to be solved and the particular technical solutions used to parallelize it. In this respect, the availability of hyper-threading and the details of the operating system's scheduler implementation are of crucial importance. The prime goal of this Tech Tip is to provide a few recommendations to make the use of pSeven parallelization capabilities as effective as possible.

All experiments below were run on an Intel Xeon 5670, 2.93 GHz, 2 CPUs (6 cores each), 12 physical cores in total, under Windows Server 2012 R2 unless specified otherwise.

Test 1

As an example, we consider the problem of building an approximation model (surrogate model) with 18 inputs and 272 responses on a user-provided training dataset consisting of 1000 sampled designs. The following workflow was created to solve the problem.

We parallelize the model building process at the algorithmic level by setting the GTApprox/MaxParallel option, which limits the number of OpenMP threads used when building a model.
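Since GTApprox/MaxParallel governs OpenMP threads, the same effect can be reproduced outside pSeven for any OpenMP-based library. A minimal sketch, assuming the standard OMP_NUM_THREADS control (the value 12 is just an example matching this machine's physical core count):

```python
import os

# OpenMP-based libraries read OMP_NUM_THREADS when they create their
# thread pool, so set it before importing/using the numerical library.
# (In pSeven itself the thread count is set via GTApprox/MaxParallel.)
os.environ["OMP_NUM_THREADS"] = "12"  # match the physical core count

print(os.environ["OMP_NUM_THREADS"])
```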

The results are summarized in the following table:

#   Hyper-threading            MaxParallel   Duration, min
1   ON (24 logical cores)      24            219
2   ON (24 logical cores)      12            209 (↓5%)
3   OFF (12 physical cores)    12            176 (↓20%)

In the 1st experiment, with hyper-threading enabled, performance deteriorates significantly, even though both physical CPUs (12 logical cores each) remain fully loaded. To explain this, note a few aspects of the test. First, pSeven here uses low-level algorithmic parallelization via the OpenMP library, which implies that all threads run in parallel in a shared-memory model. Second, because of hyper-threading, each physical core is presented to the operating system as two logical ones, so the resources of a physical core are shared between two logical units. As a result, computationally intensive floating-point workloads running on logical cores have little chance to improve their performance. In other words, hyper-threading is useful when one logical core performs complex floating-point operations while the second runs less computationally intensive work.

NOTE: To monitor CPU performance in Windows, open Task Manager and switch to the Performance tab. To watch per-core load, click the Open Resource Monitor link at the bottom of the Performance tab and open the CPU tab.

The 2nd experiment runs 12 parallel threads with hyper-threading enabled. One might expect the OS scheduler to distribute these 12 threads across both CPUs (12 physical cores). However, in this case only a single CPU (12 logical cores) is fully loaded, while the remaining half of the available resources stays idle. According to an article on the official Microsoft website, the scheduler takes physical locality into account for better performance [1], which explains the observed load pattern. It is amusing to note, however, that performance still improves slightly (5%) despite the threads being executed on a single CPU.

The 3rd experiment runs 12 parallel threads with hyper-threading completely disabled. These settings provide the maximal performance gain (up to 20%), because the number of computationally intensive tasks running in parallel now matches the available resources exactly. Thus, turning hyper-threading off significantly improves performance for this problem.

An evident conclusion is that the number of parallel threads should not exceed the number of physical cores, regardless of whether hyper-threading is enabled. Otherwise, the threads compete for resources and increase the total execution time.
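This rule can be encoded as a small guard. The sketch below is illustrative only (capped_threads is a hypothetical helper, not a pSeven API); note that os.cpu_count() reports logical processors, so with 2-way hyper-threading the physical core count is roughly half of it:

```python
import os

def capped_threads(requested, physical_cores):
    """Limit worker threads to the physical core count: oversubscribed
    floating-point-heavy threads only compete for core resources."""
    return max(1, min(requested, physical_cores))

# os.cpu_count() reports *logical* processors; assuming 2-way
# hyper-threading, halve it to estimate the physical core count.
logical = os.cpu_count() or 1
physical_estimate = max(1, logical // 2)

print(capped_threads(24, 12))  # -> 12
```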

Test 2

Here we consider the same problem of building a single approximation model with 18 inputs and 272 responses on the same data samples as in Test 1. The difference is that we apply another approach and incorporate additional information. A specific feature of this problem is that all 272 responses are independent and of similar complexity. It is therefore reasonable to build all 272 approximation models in parallel and then combine them into a single model with 272 outputs. The following workflow is created to solve the problem.

The Composite block contains a single ApproxBuilder block, which builds a model with 18 inputs and 1 response.

The Composite block, which wraps an instance of ApproxBuilder, creates several parallel processes (controlled by the Max number of parallel instances option) for its nested ApproxBuilder block and automatically distributes data samples between them. Each ApproxBuilder instance receives the same input data sample and a unique data sample for the corresponding response.
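The data-distribution pattern of the Composite block can be sketched in plain Python. This is a toy stand-in, not the pSeven API: each "model" here is just the mean of its response column, and ProcessPoolExecutor plays the role of the parallel instances.

```python
from concurrent.futures import ProcessPoolExecutor

def build_one(args):
    """Build one single-output 'model' (a stand-in for an ApproxBuilder
    run). Here the 'model' is just the per-response mean -- a toy."""
    inputs, response = args
    return sum(response) / len(response)

def build_all(inputs, responses, max_instances):
    """Give each worker the shared input sample and one response column,
    mirroring how the Composite block feeds its nested ApproxBuilder."""
    jobs = [(inputs, column) for column in responses]
    with ProcessPoolExecutor(max_workers=max_instances) as pool:
        return list(pool.map(build_one, jobs))

if __name__ == "__main__":
    x = [[0.0], [1.0]]                    # shared training inputs
    y_columns = [[1.0, 3.0], [2.0, 6.0]]  # independent response columns
    print(build_all(x, y_columns, max_instances=2))  # -> [2.0, 4.0]
```

In the real workflow the per-response models are finally merged into a single model with 272 outputs; here the list of per-column results stands in for that merge.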

The results of experiments are summarized in the following table:

#   Hyper-threading            MaxParallel   Duration, min
1   ON (24 logical cores)      24            45
2   ON (24 logical cores)      12            72 (↑60%)
3   OFF (12 physical cores)    12            58 (↑30%)

The 1st experiment, with 24 parallel processes of one thread each, demonstrates the maximal gain in overall execution time. In this case both CPUs (24 logical cores) remain fully loaded (this is essential, and we elaborate on it below). Therefore, enabling hyper-threading appears to make sense for this particular application. How can this be, given formally the same number of workers as in the previous example (Test 1)? The ultimate reason is that parallelization is now performed at a higher level of the algorithmic hierarchy: the level of each (independent) model building. The most intensive operations in different processes are then likely to be separated in time, allowing an even load distribution between the available physical cores. By contrast, in the previous example the threads were tied together at the lowest possible algorithmic level and hence constantly competed for computational resources.

In the 2nd experiment, only one CPU (12 logical cores) is loaded due to the Windows scheduler behavior mentioned above [1]. Although the total execution time is significantly (60%) lower than in Test 1, it is instructive to compare this experiment with the first one. Formally, the load of one physical CPU remains the same, so one could expect the 2nd experiment to last approximately 2 * 45 = 90 minutes (or, equivalently, the 1st experiment to take 72 / 2 = 36 minutes). This is not the case for a rather simple reason: the 1st experiment loads all the available resources, leaving no room for other (and inevitable) system activities. Therefore, its time (45 minutes) is slightly worse than the naive estimate (36 minutes).

The 3rd experiment is essentially the same as the 2nd one, but with hyper-threading disabled. Not surprisingly, its performance is better, at the cost of a constant full machine load.

Thus, hyper-threading provides better performance for several processes running in parallel, rather than for a single process with several threads as in Test 1. The approach of Test 2 decreases the total duration of model construction by up to 80% compared to Test 1. Therefore, the generic rule for specifying the Max number of parallel instances and MaxParallel options can be stated as follows:

(Max number of parallel instances) * (MaxParallel) <= (the number of physical / logical cores).
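A quick sanity check of this rule (check_parallel_config is a hypothetical helper, not a pSeven option):

```python
def check_parallel_config(max_instances, max_parallel, core_count):
    """True if the combined degree of parallelism fits the machine:
    (parallel instances) * (threads per instance) <= (cores)."""
    return max_instances * max_parallel <= core_count

# 12 physical cores: 12 single-threaded instances fit...
print(check_parallel_config(12, 1, 12))   # -> True
# ...but 12 instances of 12 OpenMP threads each oversubscribe the machine.
print(check_parallel_config(12, 12, 12))  # -> False
```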

Finally, let us stress that some versions of Windows run a single application on only one CPU, regardless of the values of the Max number of parallel instances and MaxParallel options, when the number of available logical processors exceeds 64. This unfortunate behavior, related to processor groups, is discussed on the official Microsoft website [1].

We faced this problem on an Intel machine with 2 CPUs (22 cores each): 44 physical cores, or 88 logical cores with hyper-threading enabled. The only way to force the Windows scheduler to place processes on the second CPU and use the full performance of this machine was to disable hyper-threading completely.

As a result, we obtained a 25x performance gain: the duration of model construction decreased from ~5000 seconds (the Test 1 approach with hyper-threading enabled) to ~200 seconds (the Test 2 approach with hyper-threading disabled).

Many software vendors recommend turning off hyper-threading, especially for computationally intensive applications, for the reasons discussed above. For example, this recipe is widely suggested for ANSYS [2].

Resources

  1. Microsoft. Processor Groups. URL: https://msdn.microsoft.com/ru-ru/library/windows/desktop/dd405503
  2. SimuTech Group. SimuTech Support. ANSYS Hardware Information. URL: https://www.simutechgroup.com/support/ansys-resources/ansys-hardware-support
