Balancing act: optimise for scaling or efficiency?
When we parallelise and optimise computational simulation codes we always have choices to make: the type of parallel model to use (distributed memory, shared memory, PGAS, single-sided, etc), whether the algorithm needs to be changed, and which parallel functionality to use (loop parallelisation, blocking or non-blocking communications, collective or point-to-point messages, etc).
From a user's perspective, for the scientist doing their research, I imagine all they really care about is getting accurate science (for some definition of accurate, but that's a whole different blog post ;) ) as quickly as possible. However, in the real world (where, unfortunately, we have to exist) there are more issues to consider. Primarily, in this context, users have a limited amount of resources (money or allocated compute time) and they'd like their science done as quickly as possible, but probably also as efficiently as possible, and these two requirements/desires aren't always in alignment.
The cost of scaling
For example, if you look at Figure 1, which shows the runtime of a code I've been optimising, you can see that using higher numbers of MPI processes reduces the runtime, ie the science is completed more quickly. However, if you look at the parallel efficiency line on the graph you can see that the reduction in runtime becomes more costly the higher the number of MPI processes I use.
Figure 1: Runtime and parallel efficiency for an application.
The cost of parallelising the code across more MPI processes, or the fact that scaling to higher process counts leaves less work per process, means the application becomes less efficient. Which process count, ie which efficiency point on that graph, a user chooses when running simulations will depend on how quickly they need the results, how many resources they are willing to use, and how many other simulations they need to do.
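As a quick aside on how that parallel efficiency line is typically calculated: it's just the measured speedup divided by the ideal speedup, relative to some baseline run. Here's a minimal Python sketch of that calculation; the runtimes in it are invented purely for illustration and are not the data behind Figure 1.

```python
# Minimal sketch: computing speedup and parallel efficiency relative to a
# baseline run. These runtimes are invented for illustration only; they are
# not the data behind Figure 1.
runtimes = {
    24:  1000.0,  # seconds on 24 MPI processes (baseline)
    48:   520.0,
    96:   280.0,
    192:  160.0,
    384:  100.0,
}

base_procs = min(runtimes)        # smallest process count measured
base_time = runtimes[base_procs]  # its runtime is the reference

for procs in sorted(runtimes):
    speedup = base_time / runtimes[procs]
    # Ideal speedup is the increase in process count over the baseline,
    # so efficiency is the achieved speedup divided by that ideal.
    efficiency = speedup / (procs / base_procs)
    print(f"{procs:5d} processes: speedup {speedup:5.2f}, "
          f"efficiency {efficiency:6.1%}")
```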
Variability
Note that this is not an absolute efficiency graph, in the sense that it does not represent a fixed efficiency profile for the application. It merely shows how efficient this application was on a particular HPC system for a particular input file/use case. The efficiency can, and likely will, change if I run it on different systems, use different compilers or MPI libraries, or run different simulations. This is something that can sometimes catch users out: less experienced users may simply run everything at a fixed core or process count because they've seen that count described as efficient for one particular simulation with the code they are using.
Indeed, one of the best ways to get efficient utilisation of HPC resources is to educate users in the efficiency characteristics of their applications and how to assess them for new test cases. They can then make more sensible decisions about the core or node counts to use when running jobs, and therefore make more efficient use of their resources.
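As a rough sketch of what that assessment might look like in practice, the snippet below picks the largest process count from a small scaling test that still meets a chosen parallel-efficiency threshold. The 70% threshold and the measurements are hypothetical, just to illustrate the idea.

```python
def choose_process_count(runtimes, min_efficiency=0.7):
    """Return the largest process count whose parallel efficiency,
    measured relative to the smallest run, meets the threshold.

    `runtimes` maps process count -> runtime (seconds) for a short,
    representative test case of the simulation the user wants to run.
    """
    base_procs = min(runtimes)
    base_time = runtimes[base_procs]
    best = base_procs
    for procs in sorted(runtimes):
        efficiency = (base_time / runtimes[procs]) / (procs / base_procs)
        if efficiency >= min_efficiency:
            best = procs
    return best

# Illustrative measurements from a hypothetical scaling test.
scaling_test = {24: 1000.0, 48: 520.0, 96: 280.0, 192: 160.0, 384: 100.0}
print(choose_process_count(scaling_test))  # -> 192 with the 70% threshold
```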
Figure 2: ARCHER resource usage vs number of cores per job.
This efficiency/scaling trade-off is also why we don't see lots of very large application runs on our HPC systems, ie we don't see lots of users trying to use all 118,080 cores on ARCHER, or even 50% of them. As you can see from Figure 2, which outlines the size of jobs run on ARCHER (albeit with some out-of-date data), we have many more applications being run at 5,000-10,000 cores, ie 5%-10% of the overall system, than at higher core counts.
Optimisation choices
This means we have choices to make when optimising parallel applications for users. Do we improve scaling, to allow large simulations to be undertaken, or do we improve performance for the simulations currently being run? Do we look at where performance degrades when scaling up, or at where performance can be improved even when the scaling is good?
Of course, the answer is either and both. Improving either is good, both is better, and it will be very much dependent on what is currently most important for the user(s) you are working with. Do they have simulations that are taking too long at the moment? Improving scaling should help with that. Do they have lots of simulations to undertake, but they aren't large simulations? Improving current performance is probably more sensible.
And always bear in mind that when you optimise one part of the code, other performance issues will emerge, and other things to be tackled in the future will come to the fore.
Exascale choices
However, this question isn't just important for software optimisation. It's also important when considering next-generation HPC systems: Exascale-class machines. Over the next few years Europe, the USA, China, and Japan, amongst others, will be pushing forward with the Exascale computing agenda, aiming to build a system that can sustain an ExaFlop/s of calculations (10^18 floating-point operations per second), or can store an Exabyte of data, or whatever else your definition of Exascale is.
But, should we be creating hardware optimised for scale or for efficiency? Should we be focusing on hardware that can enable very large simulations to run efficiently, or can run lots of smaller (but still big by today's standards) jobs efficiently? Do we care about throughput of science on these machines or enabling simulations that cannot currently be performed?
This, of course, is a complicated and difficult decision. It depends not only on user requirements but on policy and politics. However, I would currently say that I've seen more demand for additional computing capacity than for very large simulations. I see more overall benefit in lots of smaller simulations than in a small number of very large simulations.
This could just be because I'm not talking to the right users; maybe there are simulations out there that, if we could do them, would significantly change our understanding of the world, or the things we can do with science and technology. I've been told that we need Exascale systems to do full weather simulations at high resolution (ie 1km grids) for the whole globe. I'm sure that's true, but what I've not heard yet is what science or understanding this will enable, ie what difference society would see from it.
Likewise I know, because I've worked in this domain before, that doing proper plasma simulations that couple ion and electron scales in the same simulation will require very large amounts of computing resources. However, again, it's not yet clear (at least to me) what insights this will bring (I guess it's a hard question to answer, an unknown unknown as it were). Whereas it's easy to see the benefit of enabling more of the research we currently do.
The counter-argument is that such a machine is evolutionary, rather than revolutionary. It enables incremental science rather than groundbreaking, field-defining research. This is true, but I guess the groundbreaking and revolutionary is always going to be more risky.
Revolution... and incremental change
The best outcome, I think, is for both to happen. For me to be able to eat my cake and have it as well. So this blog post is really a plea for some joined-up thinking between countries and organisations doing Exascale research, so that we get both revolutionary and incremental machines, and resources that can enable both types of outcome. After all, it looks like there will be at least four different Exascale machines/programmes, so there is room for plurality.
It'd also be nice to see some work done on estimating the benefits for science from the very large simulations that Exascale systems could enable, although I do appreciate it's not an easy thing to do.
As always, I'd be interested in your comments on this. Please feel free to email me or contact me on Twitter, contact details below.