
Memory bottlenecks

First of all, I would like to thank all the readers for the interesting comments, both here as well as on reddit. And I would like to correct a mistake I made in the previous post, as well as add some more details.

Going over my notes again, S. Borkar from Intel spoke of a current use of 8B transistors and a foreseen 100B transistors by 2015, so my figure of 1T was off by one order of magnitude. According to him, a majority of this will go to cache, so we can foresee 100MB of cache on such real estate. Cache is typically implemented as SRAM (for those that don’t know, SRAM requires 6 transistors/bit while DRAM requires 1 transistor/bit, not counting the access logic). So when you take into account a 100MB cache, which is 800Mb, that’s still only 4.8B transistors. To give you an idea of complexity: the Montecito, which had 26MB of cache and was probably one of the most complex Intel architectures (a VLIW-style architecture), required on the order of 1.7B transistors, most of them in that cache.
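As a quick sanity check on the arithmetic above, here is a small Python sketch. It only encodes the per-bit figures quoted in the text (6 transistors/bit for SRAM, 1 for DRAM, access logic ignored); the function name and the use of decimal megabytes are assumptions made for illustration.

```python
# Back-of-the-envelope transistor count for on-chip cache.
SRAM_TRANSISTORS_PER_BIT = 6   # figure from the text; access logic not counted
DRAM_TRANSISTORS_PER_BIT = 1   # figure from the text; access logic not counted


def cache_transistors(megabytes, transistors_per_bit):
    """Transistors needed for a cache of the given size (decimal MB assumed)."""
    bits = megabytes * 1_000_000 * 8
    return bits * transistors_per_bit


sram = cache_transistors(100, SRAM_TRANSISTORS_PER_BIT)
dram = cache_transistors(100, DRAM_TRANSISTORS_PER_BIT)
budget = 100_000_000_000  # the 100B-transistor figure quoted above

print(f"100 MB as SRAM: {sram / 1e9:.1f}B transistors")
print(f"100 MB as DRAM: {dram / 1e9:.1f}B transistors")
print(f"SRAM cache share of a 100B budget: {sram / budget:.1%}")
```

Under these assumptions the 100MB SRAM cache takes under 5% of the 100B budget, which is why the cache alone does not explain where the transistors go.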

Due to what was said before, that clock speed will no longer increase and may even decrease, we need parallelism. And to deal with that parallelism we need a communication interface. Traditionally this used to be a bus, but nowadays a lot of research is going into NoC, or network-on-chip. The number of papers on NoC this year was quite staggering (number-to-be-filled-in). The reasons for NoC are both obvious and non-obvious. One of the obvious reasons is that a bus is a shared resource, and as the number of masters goes up, contention only increases, and with it congestion. With a NoC you alleviate this pressure a bit by enabling easier communication between processing elements that are close to one another. A less obvious reason was presented by one of the papers at the conference: a NoC gives you an easy way of dealing with variability, because you only have to care about variability (in timing) along a link and not across the whole system. This idea was applied more specifically to the clock tree, though I think it’s a good idea in general. In a worst-case scenario, where variability really becomes an issue and clock synchronisation becomes increasingly impossible, the NoC could act as an asynchronous interface that ties together the different synchronous processing elements (cores with their local communication assist and their local memory). I am certain there are many other reasons why NoCs are a good solution, as indicated by the plethora of papers that have appeared in the last 4-5 years.
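The contention argument can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not a model of any real bus or NoC: it assumes fair arbitration, a bus that serves one transfer per cycle, and a best-case NoC in which all traffic stays on private neighbour-to-neighbour links. Both function names are hypothetical.

```python
def bus_share(masters, demand):
    """Throughput each master gets on a single shared bus.

    demand is the fraction of cycles (0..1) in which a master wants to
    transfer. The bus serves one transfer per cycle, so its total capacity
    of 1.0 is split fairly between the masters.
    """
    return min(demand, 1.0 / masters)


def local_link_share(demand):
    """Throughput per master when every neighbour pair has its own link and
    all traffic is neighbour-to-neighbour (best case for a NoC)."""
    return min(demand, 1.0)


for n in (2, 8, 32, 128):
    print(f"{n:4d} masters: bus {bus_share(n, 0.25):.4f}, "
          f"local links {local_link_share(0.25):.4f}")
```

In this model a master that wants the bus a quarter of the time is satisfied with 2 masters, but gets only half of what it asks for with 8 and a small fraction with 128, while the local-link figure stays flat. Real NoC traffic is of course not purely local, so the truth sits somewhere between the two columns.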

Typically, at the root of your network (if you implement a tree-like NoC) there will be an interface to the shared memory. Now obviously you want to use burst transfers to get the data you’re working on into your local scratchpad. However, I don’t see this working when you get to manycore (> 100 cores). So really, we can postulate only three possibilities (readers are welcome to suggest alternatives).

  • Few tasks per application: Applications will not become so complex that we need to scale much further for a single application. When that is the case, it becomes easier to partition your memory space to enable multiple applications. Of course, one could then argue that eventually we won’t need better chips, as a user is unlikely to run more than, say, 10 applications on an embedded device. To those of you who will claim that applications will require more compute power, I refer you to the fact that clock speeds won’t scale anymore; in other words, this is not a valid postulate.

  • Many tasks per application, but not communicating a lot: 3D stacking will save the day and give us memory interfaces that are 1 kbit wide or more, completely removing the shared memory bottleneck. Or so it is stated. However, I am somewhat doubtful of this option. For this to be the case, there really are two options. Either your different tasks are talking to different memories, in which case you could in essence be doing the computation on your local memory. Granted, the higher bandwidth will allow you to fill your local memory more easily when you’re processing some kind of stream (video, 3D video, etc.). So in a sense this is different from the previous case, as it is possible to have more tasks than in the previous postulate, but they’re not really communicating a lot. Or they are communicating a lot, and then you’d have high contention and congestion on that piece of memory, no matter how wide your bandwidth.
  • Many tasks per application that also communicate a lot: Well, this is probably the worst-case scenario. The question, as already posed in the first item, is whether this will ever happen. Precluding it without evidence would be improper, however, and therefore I think it’s interesting to look at this last case. For if this last case can be solved, then by subsumption so can the previous cases. If you have a lot of communicating tasks, and we stick to the shared memory model, then no matter what your bandwidth is, you’ll have issues. The issue in this case is one of latency. If processor 1 needs to communicate with processor 2 and it’s using the shared memory for this, then on a bus that is 1024 bits wide it doesn’t matter whether you’re communicating 1 bit or 1024 bits: each message costs a full memory access. So the more your processing elements communicate, the more latency you will have, no matter what your bandwidth is. Granted, there are more efficient ways of communicating directly between processors. Network processors, for instance, have FIFO buffers between processors and even special registers for communicating with their neighbouring processors. Hardware-wise, there are many features there to enable communication. And a NoC will only help this by having more, smaller busses, resulting in less contention. However, if we return to the discussion of the previous blog post, the C model is one based on shared memory. You could argue that C can call native functions that enable all the above features. But, and I think this is the crucial point that I should’ve made more explicit in my previous post, application developers are not system developers. The communication medium between these two groups is the programming model. If the programming model is not rich enough to capture all possible parallelism and communication, it will be very difficult for the system developer to extract this and then map it to the proper hardware parts. And therefore you risk, in the worst case, sticking to shared memory, where you end up with contention and congestion on the shared memory that is used for communicating.

    Note that I made some bold assumptions in the first and second cases of the above list. I assumed that we’d have an ideal operating system. In the end, since applications will be more dynamic and will use dynamic memory management, they still go through the operating system to ask for memory allocations. As such, they’re still communicating in a sense, because how one application allocates and frees memory affects another application. Since I didn’t want to get too technical, I ignored this problem and assumed it solved. Suffice it to say that even when you have non-communicating applications, there will still be contention for, and congestion on, shared resources.
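The latency point in the third case can be sketched in a few lines of Python. This is a deliberately simplified model, assuming every shared-memory access pays the same full latency and that there are no bursts or pipelining; the 100 ns figure and the function name are hypothetical and chosen only for illustration.

```python
import math


def transfer_time_ns(payload_bits, bus_width_bits, access_latency_ns):
    """Time to pass one message through shared memory.

    Assumes one bus word per access and that every access pays the full
    latency (no bursts, no pipelining).
    """
    accesses = max(1, math.ceil(payload_bits / bus_width_bits))
    return accesses * access_latency_ns


LATENCY_NS = 100  # hypothetical access latency, for illustration only

for payload in (1, 1024):
    for width in (32, 1024):
        t = transfer_time_ns(payload, width, LATENCY_NS)
        print(f"{payload:5d}-bit message on a {width:4d}-bit bus: {t} ns")
```

In this model the wide bus helps the 1024-bit message (one access instead of 32), but a 1-bit message costs the same single access on either bus. An application that exchanges many small messages therefore pays the latency once per message, however wide the interface is.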

    I would like to come back to my previous post and make some things clear. While I very much enjoy using higher-level languages such as Haskell and Scheme, and fancy most languages with a sense of purity (for instance Smalltalk), I was not advocating for them by saying C is no longer the proper model of computation. I would definitely advocate Haskell or Scheme for tool development, but I will not enter that discussion as I think it’s mostly a matter of personal preference. True, I do believe I am more efficient in Haskell, but I prefer sticking to the idea of “live and let live”. Debating endlessly over which language is ideal for developing tools and applications is, in my opinion, very much like beating a dead horse.

    That being said, I do think what we need is a programming model that addresses some of the issues I mentioned. Someone on IRC mentioned that I was just pointing out problems and not suggesting solutions. I admit that this is true. However, while that may not be appropriate in a language war, where there are different solutions, I think it is definitely appropriate in this context, for there is no solution yet. It was a sentiment I also felt among the other attendees of the DATE conference. Does this mean we should drop C? Absolutely not; it still serves its purpose of accurately describing the behaviour on a single core. What is required on top of this is a model of concurrency. Such models have been suggested in the past, with the Pi-calculus being a very famous one, and critics might say that they never got picked up. Well, for one, the push back then was not like the push now. Right now we are facing the issues of the memory wall and parallelism, and we must overcome them. That being said, I do not think that simply reinventing a concurrency model like those invented in the 70s would suffice. One very big difference between then and now is the aforementioned memory wall. Memory accesses cost, a lot in fact. So what I think is required, and now I can’t remember if I explicitly mentioned this in my last post, is a concurrency model that takes data into account: data coherency, data movement, data replication. All these issues exist in embedded systems nowadays but are not modeled. Examples are, respectively, cache coherency, process migration, and local copies in a scratchpad.
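To give a feel for what "taking data into account" could look like at the programming-model level, here is a minimal Python sketch. It is not a proposal for the actual model, only an illustration: the `Channel` class, its `bytes_moved` counter and the scratchpad list are all hypothetical names. The one idea it shows is that communication goes through an explicit FIFO, so every data movement and every local copy is visible to whoever has to map the program onto hardware.

```python
import queue
import threading


class Channel:
    """Hypothetical FIFO link between two processing elements.

    Every send is an explicit, countable data movement.
    """

    def __init__(self, depth=4):
        self._fifo = queue.Queue(maxsize=depth)
        self.bytes_moved = 0  # only the producer thread updates this

    def send(self, payload):
        self.bytes_moved += len(payload)
        self._fifo.put(payload)   # blocks while the FIFO is full

    def receive(self):
        return self._fifo.get()   # blocks while the FIFO is empty


def producer(ch, blocks):
    for block in blocks:
        ch.send(block)
    ch.send(b"")                  # empty payload marks end of stream


def consumer(ch, scratchpad):
    while True:
        block = ch.receive()
        if not block:
            break
        scratchpad.append(block)  # local copy: explicit data replication


ch = Channel()
scratchpad = []
blocks = [bytes([i]) * 64 for i in range(8)]

t1 = threading.Thread(target=producer, args=(ch, blocks))
t2 = threading.Thread(target=consumer, args=(ch, scratchpad))
t1.start()
t2.start()
t1.join()
t2.join()

print(f"{ch.bytes_moved} bytes moved, {len(scratchpad)} blocks in scratchpad")
```

Contrast this with two threads sharing a global array: there the same data movement happens, but it is implicit in loads and stores, and nothing in the program tells the system developer where the communication is.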

    As a final note, I would like to point out that the above ideas are meant as postulations. I would love to hear your views on this.
