First of all, I would like to thank all the readers for the interesting comments, both here and on reddit. I would also like to correct a mistake I made in the previous post, and add some more details.
Going over my notes again, S. Borkar from Intel spoke of a current use of 8B transistors and foresaw 100B transistors by 2015, so my figure of 1T was off by one order of magnitude. According to him, a majority of these will go to cache, so we can foresee 100MB of cache on such real estate. Cache is typically implemented as SRAM (for those who don’t know, SRAM requires 6 transistors/bit while DRAM requires 1 transistor/bit, not counting the access logic). So a 100MB cache, i.e. 800Mb, still accounts for only 6 × 800M = 4.8B transistors. To give you an idea of complexity: the Montecito, which had 26MB of cache and was probably one of the most complex Intel architectures (an EPIC/VLIW design), required on the order of 100M transistors for its logic.
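The back-of-the-envelope arithmetic above can be checked in a few lines (assuming the 6T/1T cell figures and decimal megabytes):

```python
def cache_transistors(cache_bytes, transistors_per_bit=6):
    """Transistor budget for a cache of the given size, ignoring access logic."""
    return cache_bytes * 8 * transistors_per_bit

# 100 MB of cache = 800 Mb -> 4.8B transistors as 6T SRAM,
# versus 800M transistors if it could be built as 1T DRAM.
print(cache_transistors(100 * 10**6))     # 4800000000
print(cache_transistors(100 * 10**6, 1))  # 800000000
```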
As was said before, clock speed will no longer increase and may even decrease, so we need parallelism. And to deal with that parallelism we need a communication interface. Traditionally this used to be a bus, but nowadays a lot of research is going into the NoC, or network-on-chip. The number of papers on NoC at this year’s conference was quite staggering (number-to-be-filled-in). The reasons for NoC are both obvious and non-obvious. Among the obvious reasons: a bus is a shared resource, and as your number of masters goes up, contention only increases, and with it congestion. With a NoC you alleviate this pressure a bit by making communication cheaper between processing elements that are close to one another. A less obvious reason was presented in one of the papers at the conference, namely that a NoC gives you an easy way of dealing with variability, by only having to care about (timing) variability along a link and not across the whole system. This idea was applied specifically to the clock tree, though I think it’s a good idea in general. In a worst-case scenario, where variability really becomes an issue and global clock synchronization becomes all but impossible, the NoC could act as an asynchronous interface tying together the different synchronous processing elements (cores with their local communication assist and their local memory). I am certain there are many other reasons why NoCs are a good solution, as indicated by the plethora of papers that have appeared in the last 4-5 years.
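To make the locality argument concrete, here is a toy model of a 2D-mesh NoC with dimension-ordered (XY) routing; the mesh topology and routing policy are my own assumptions for illustration, not taken from any of the papers:

```python
def xy_hops(src, dst):
    """Hop count under dimension-ordered (XY) routing on a 2D mesh NoC."""
    (sx, sy), (dx, dy) = src, dst
    return abs(dx - sx) + abs(dy - sy)

# Neighbours pay a single hop; only far-apart cores pay the full Manhattan
# distance -- whereas on a bus, every transfer contends for one shared medium.
print(xy_hops((0, 0), (1, 0)))  # 1
print(xy_hops((0, 0), (3, 3)))  # 6
```

The same structure is what makes the variability argument work: timing only has to be closed per link (per hop), not across the whole die.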
Typically, at the root of your network (if you implement a tree-like NoC) there will be an interface to the shared memory. Now obviously you want to use burst transfers to get the data you’re working on into your local scratchpad. However, I don’t see this working once you get to manycore (> 100 cores). So really, we can postulate only three possibilities (readers are welcome to suggest alternatives).
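A toy cost model (all numbers made up for illustration) shows both why bursts help and why the shared root interface stops scaling:

```python
def transfer_cycles(words, setup=20, per_beat=1):
    """Toy cost model: each transfer pays a fixed setup cost plus one cycle per beat."""
    return setup + words * per_beat

# One 64-word burst amortizes the setup cost over the whole transfer...
print(transfer_cycles(64))      # 84 cycles
# ...while 64 single-word transfers pay the setup cost 64 times.
print(64 * transfer_cycles(1))  # 1344 cycles

def per_core_share(root_bandwidth, n_cores):
    """The root interface is shared: its bandwidth is split over all cores."""
    return root_bandwidth / n_cores
```

With, say, 25.6 units of root bandwidth, 4 cores get 6.4 each, but 128 cores get only 0.2 each: bursts cannot buy back a root interface divided a hundred ways.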
- Few tasks per application:
Applications may simply not become so complex that we need to scale much further for a single application. When that is the case, it becomes easier to partition your memory space to support multiple applications. Of course, one could then argue that eventually we won’t need better chips, as a user is unlikely to run more than, say, 10 applications on his embedded device. To those of you who would claim that applications will require ever more compute power, I refer you to the fact that clock speeds won’t scale anymore; in other words, that is not a valid postulate.
Note that I made some bold assumptions in the first and second cases of the above list: I assumed an idealistic operating system. In the end, since applications will be more dynamic and will use dynamic memory management, they still go through the operating system to ask for memory allocations. As such, they are still communicating in a sense, because how one application allocates and frees memory affects another application. Since I didn’t want to get too technical, I ignored this problem and assumed it solved. Suffice it to say that even when you have non-communicating applications, there will still be contention for, and congestion on, shared resources.
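The point that “non-communicating” applications still interact through the allocator can be illustrated with a toy bump allocator (a deliberate simplification of my own; real allocators free and fragment memory, which only makes the coupling worse):

```python
class SharedPool:
    """Toy bump allocator shared by all applications (no freeing, for brevity)."""
    def __init__(self, size):
        self.size, self.used = size, 0

    def alloc(self, n):
        if self.used + n > self.size:
            return None  # out of memory: the pool is a shared, contended resource
        addr, self.used = self.used, self.used + n
        return addr

pool = SharedPool(1024)
a = pool.alloc(900)  # application A allocates aggressively...
b = pool.alloc(200)  # ...so application B's request fails (None), even though
                     # A and B never exchange a single message.
```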
I would like to come back to my previous post and make some things clear. While I very much enjoy using higher-level languages such as Haskell and Scheme, and fancy most languages with a sense of purity (for instance Smalltalk), I was not advocating for them when I said that C is no longer the proper model of computation. I would definitely advocate Haskell or Scheme for tool development, but I will not enter that discussion, as I think it’s mostly a matter of personal preference. True, I do believe I am more efficient in Haskell, but I prefer sticking to the idea of “live and let live”. Debating endlessly over which language is ideal for developing tools and applications is, in my opinion, very much like beating a dead horse.
That being said, I do think what we need is a programming model that addresses some of the issues I mentioned. Someone on IRC remarked that I was just pointing out problems and not suggesting solutions. I admit that this is true. However, while that may not be appropriate in a language war, where different solutions exist, it is definitely appropriate in this context, for there is no solution yet. It was a sentiment I also sensed among the other attendees of the DATE conference. Does this mean we should drop C? Absolutely not: it still serves its purpose of accurately describing the behaviour on a single core. What is required on top of it is a model of concurrency, and while such models have been suggested in the past, the pi-calculus being a famous one, critics might say that they never got picked up. Well, for one, the push back then was nothing like the push now. Right now we are facing the issues of the memory wall and parallelism, and we must overcome them. That being said, I do not think that simply reviving a concurrency model of that era would suffice. One very big difference between then and now is the aforementioned memory wall. Memory accesses cost, a lot in fact. So what I think is required (and I can’t remember whether I explicitly mentioned this in my last post) is a concurrency model that takes data into account: data coherency, data movement, data replication. All these issues exist in embedded systems nowadays, but they are not modeled; the respective examples are cache coherency, process migration, and local copies in scratchpad memory.
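To sketch what I mean by a concurrency model that makes data explicit, here is a toy channel of my own invention (the names and the crude byte-counting proxy are assumptions, not an existing API) in which sending replicates the message, so movement and replication show up in the model instead of hiding behind shared memory:

```python
import copy
import queue
import threading

class Channel:
    """Toy channel where communication is explicit data movement: send()
    deep-copies the message (a 'local copy in scratchpad') and counts the
    bytes moved, so data transfer is visible in the model."""
    def __init__(self):
        self.q = queue.Queue()
        self.bytes_moved = 0

    def send(self, msg):
        self.q.put(copy.deepcopy(msg))   # replication, not shared state
        self.bytes_moved += len(repr(msg))  # crude proxy for transfer size

    def recv(self):
        return self.q.get()

ch = Channel()
data = [1, 2, 3]

def worker():
    local = ch.recv()  # the consumer works on its own copy
    local.append(4)    # mutation does not leak back to the producer

t = threading.Thread(target=worker)
t.start()
ch.send(data)
t.join()
print(data)  # still [1, 2, 3]: no hidden sharing, all movement was explicit
```

Coherency in such a model becomes a policy question (when do replicas get refreshed?) rather than something the hardware silently does behind your back.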
As a final note, I would like to point out that the above ideas are meant as postulations. I would love to hear your views on this.