blog.poucet.org

April, 2007:

Memory bottlenecks

First of all, I would like to thank all the readers for the interesting comments, both here and on reddit. I would also like to correct a mistake I made in the previous post, as well as add some more details.

Going over my notes again, S. Borkar from Intel spoke of a current use of 8B transistors and a foreseen 100B transistors by 2015, so my figure of 1T was off by one order of magnitude. According to him, a majority of this will go to cache, so we can foresee 100MB of cache on such real estate. Cache is typically implemented as SRAM (for those who don't know: SRAM requires 6 transistors per bit while DRAM requires 1 transistor per bit, not counting the access logic). So a 100MB cache, or 800Mb, still only accounts for 4.8B transistors. To give you an idea of complexity: the Montecito, which had 26MB of cache and was probably one of the most complex Intel architectures (a VLIW-style EPIC architecture), required on the order of 100M transistors.
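
As a quick sanity check of those numbers, here is a back-of-the-envelope sketch in Scheme (the 6-transistors-per-bit figure is the one quoted above; the helper name is mine):

(define (sram-transistors megabytes)
  ;; MB -> bits (decimal mega) -> transistors, at 6 transistors/bit,
  ;; ignoring the access logic
  (* megabytes 8 1000 1000 6))

(display (sram-transistors 100)) (newline) ; => 4800000000, i.e. 4.8B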

Due to what was said before, that clock speeds will no longer increase and may even decrease, we need parallelism. And to deal with that parallelism we need a communication interface. Traditionally this used to be a bus, but nowadays a lot of research is going into the NoC, or network-on-chip. The number of papers on NoC at this year's conference was quite staggering (number-to-be-filled-in). The reasons for NoC are both obvious and non-obvious. Among the obvious reasons is that a bus is a shared resource, and as the number of masters goes up, contention only increases, and therefore so does congestion. With a NoC you alleviate this pressure a bit by enabling easier communication between processing elements that are close to one another. A less obvious reason was presented by one of the papers at the conference: a NoC gives you an easy way of dealing with variability, since you only have to care about (timing) variability along a link and not across the whole system. That idea was specifically applied to the clock tree, though I think it is a good idea in general. In a worst-case scenario, where variability really becomes an issue and clock synchronization becomes increasingly impossible, the NoC could act as an asynchronous interface tying together the different synchronous processing elements (cores with their local communication assist and their local memory). I am certain there are many other reasons why NoCs are a good solution, as indicated by the plethora of papers that have appeared in the last 4-5 years.

Typically at the root of your network (if you implement a tree-like NoC) there will be an interface to the shared memory. Obviously you want to use burst transfers to get the data you are working on into your local scratchpad. However, I don't see this working once you get to manycore (> 100 cores). So really, we can postulate only three possibilities (readers are welcome to suggest alternatives).

  • Few tasks per application: Applications may simply never become so complex that a single application needs to scale much further. In that case it becomes easier to partition your memory space to support multiple applications. Of course, one could then argue that eventually we won't need better chips at all, since a user is unlikely to run more than, say, 10 applications on his embedded device. To those who claim that applications will keep demanding more compute power: since clock speeds won't scale anymore, that extra power can only come from more parallelism, which is just another way of saying this postulate does not hold.

  • Many tasks per application, but not communicating a lot: 3D stacking will save the day and give us 1Kbit or more of bus width, completely removing the shared-memory bottleneck. Or so it is claimed. I am somewhat doubtful of this option, because for it to hold there are really two cases. Either your different tasks are talking to different memories, in which case you could in essence be doing the computation on your local memory anyway. Granted, the higher bandwidth will let you fill your local memory more easily when you are processing some kind of stream (video, 3D video, etc.). So in a sense this differs from the previous case in that you can have more tasks than the previous postulate allows, but they are not really communicating much. If they were, you would be back to high contention and congestion on that piece of memory, no matter how wide your interface.
  • Many tasks per application that also communicate a lot: This is probably the worst-case scenario. The question, as already posed in the first item, is whether it will ever happen. Ruling it out without evidence would be improper, however, so I think it is interesting to look at this last case; if it can be solved, then by subsumption, so can the previous cases. If you have a lot of communicating tasks and we stick to the shared-memory model, then no matter what your bandwidth is, you will have issues. The issue in this case is one of latency. If processor 1 needs to communicate with processor 2 through shared memory, then with a 1024-bit-wide bus it does not matter whether you are communicating 1 bit or 1024 bits: each transfer pays the same round-trip latency. So the more your processing elements communicate, the more latency you accumulate, no matter what your bandwidth is (a toy model of this argument follows after the note below). Granted, there are more efficient ways of communicating directly between processors. Network processors, for instance, have FIFO buffers between processors and even special registers for communicating with their neighbouring processors. Hardware-wise, many features exist to enable communication, and a NoC only helps, by providing more, smaller buses and hence less contention. However, if we return to the discussion of the previous post, the C model is one based on shared memory. You could argue that C can call native functions that expose all of the above features. But, and I think this is the crucial point that I should have made more explicit in my previous post: application developers are not system developers. The communication medium between these two groups is the programming model. If the programming model is not rich enough to capture all possible forms of parallelism and communication, it will be very difficult for the system developer to extract them and map them onto the proper hardware. And so you risk, in the worst case, being stuck with shared memory as the communication mechanism, with all the contention and congestion that brings.
Note that I made some bold assumptions in the first and second cases of the above list: I assumed an idealized operating system. In reality, since applications will become more dynamic and rely on dynamic memory management, they still go through the operating system to request memory. As such, they are still communicating in a sense: how one application allocates and frees memory affects another. Since I did not want to get too technical, I ignored this problem and assumed it solved. Suffice it to say that even with non-communicating applications, there will still be contention for, and congestion on, shared resources.
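
To make the latency argument in the last bullet concrete, here is a toy model in Scheme. The bus width and access latency are illustrative assumptions of mine, not measured figures.

(define bus-width-bits 1024)        ; assumed bus width
(define access-latency-cycles 100)  ; assumed round trip to shared memory

(define (transfer-cycles payload-bits)
  ;; total cycles = fixed access latency + cycles to stream the payload
  (+ access-latency-cycles
     (ceiling (/ payload-bits bus-width-bits))))

;; Exchanging 1 bit costs almost exactly as much as exchanging 1024 bits,
;; so N small messages cost roughly N * 100 cycles regardless of bus width.
(display (transfer-cycles 1))    (newline) ; => 101
(display (transfer-cycles 1024)) (newline) ; => 101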

I would like to come back to my previous post and make some things clear. While I very much enjoy using higher-level languages such as Haskell and Scheme, and am fond of most languages with a sense of purity (for instance Smalltalk), I was not advocating for them by saying that C is no longer the proper model of computation. I would certainly advocate Haskell or Scheme for tool development, but I will not enter that discussion, as I think it is mostly a matter of personal preference. True, I do believe I am more efficient in Haskell, but I prefer sticking to the idea of “live and let live”. Debating endlessly over which language is ideal for developing tools and applications is, in my opinion, very much like beating a dead horse.

That being said, I do think what we need is a programming model that addresses some of the issues I mentioned. Someone on IRC remarked that I was just pointing out problems and not suggesting solutions. I admit that is true. However, while that may not be appropriate in a language war, where different solutions exist, it is definitely appropriate in this context, for there is no solution yet. That was a sentiment I also sensed among the other attendees of the DATE conference. Does this mean we should drop C? Absolutely not; it still serves its purpose of accurately describing behaviour on a single core. What is required on top of it is a model of concurrency, and while such models have been suggested in the past, the Pi-calculus being a famous one, critics might say they never got picked up. Well, for one, the push back then was nothing like the push now: we are facing the memory wall and the need for parallelism, and we must overcome them. That being said, I do not think a simple revival of a concurrency model invented in the 70s would suffice. One very big difference between then and now is the aforementioned memory wall: memory accesses cost, and a lot in fact. So what I think is required (I cannot remember whether I mentioned this explicitly in my last post) is a concurrency model that takes data into account: data coherency, data movement, data replication. All of these issues exist in embedded systems today but are not modeled; examples are, respectively, cache coherency, process migration, and local copies in scratchpads.

As a final note, I would like to point out that the above ideas are meant as postulations. I would love to hear your view on this.

Observations from DATE 2007

Well, I just had a wonderful week in Nice at the DATE’07 conference. For those who don't know, it is one of the biggest EDA conferences (this year 5000 attendees, with representatives from all major companies in the field).

Here are some observations I made as I was attending:

• Parallelism is a must. The era of scaling clock speed is over. S. Borkar from Intel showed some very interesting graphs regarding scaling in general. In particular, for clock speed, in the early days there used to be a quadratic increase in clock speed for every reduction in technology size. That then went down to a linear increase. Nowadays there is no increase: clock speeds cannot be raised as you go to a smaller technology size, and it is even foreseeable that in the future clock speeds will decrease with further reduction in technology size. He mentioned, however, that the latter is not an issue: as long as you can double the amount of money you get for a certain-sized chip with each generation, you are in business. Additionally, we are looking at 1T transistors in the near future (by 2015). This means that the only way to increase speed is through parallelism, and indeed already today we see multicore devices, which not long from now will become manycore devices (a term used to indicate > 100 cores). So obviously there are two issues in making this work. One is that you cannot get a speedup as long as you do not have parallel software, and here Amdahl's Law is a very good back-of-the-envelope calculation for the speedup parallelism can give you: 1/(%seq + (1 - %seq)/N), where %seq is the fraction of sequential code and N is the number of cores. Looked at another way, we can compute the number of cores required to reach a speedup of X given a certain fraction of sequential code: N = (1 - %seq)/(1/X - %seq). Obviously, for a speedup of 4 with 0% sequential code we need 4 cores, but at a typical value of 20% we already need 16 cores. For a speedup of 10, no number of cores suffices at 20% sequential code, and even at 7% we need 31 cores. So obviously the amount of sequential code should be reduced to the absolute minimum (a small sketch of this arithmetic follows after this list).
• Transactional memory, although very cool, doesn't scale. There was a paper on a hardware implementation: the authors replaced the L1 data cache with their own cache which, on top of the caching, provided the transactional-memory support. This allowed them to run different pieces of code in parallel, and only during the final commit cycle would there be a check for conflicts among the memory addresses that were read or written (a toy illustration of this commit-time check also follows after the list). Disregarding the fact that their cache-hit penalty was 13 cycles (as opposed to a 2-cycle hit for a normal L1 data cache), the problem is that they serialize all the commit blocks (and incur recomputation whenever there is a conflict). The authors claimed it is possible to parallelize these commit blocks, but another professor I talked to said it would require too much complexity. Hence my conclusion: transactional memory is very nice when you want to draw coarse-grained parallelism boundaries with few possibilities of conflict (and then not have to worry about proper locking), but it completely fails once you go to more fine-grained parallelism. Combining that with the previous point, it is obvious we need something new in this field; the question is: what?
• Compilers should cost money, more so than they do today. Companies are starting to realize that compilers are underpriced. This is a serious problem, as more and more of the complexity in embedded devices resides in software, not hardware. If you consider that a typical EDA tool will set you back around 100K euros while a compiler costs only about 8K euros, it is clear there is a gap. Tensilica, for instance, mentioned that the only way they are able to keep working in this area is by folding the cost of their compiler research into the price tag of their EDA tools. But there was a consensus that compilers should cost more. This of course raises the question of open source. That question was answered by one of the panelists, though I forget whom. Basically, if you look at GCC, it does not do a lot of very fancy things; it lags behind state-of-the-art compilation techniques. Therefore, if a company finds some novel optimizing transformation, it can keep its edge by selling it in its tools, and once such optimizations become more mainstream, they can be open-sourced.
• C no longer fits the model of computation. Well, this is not per se my observation, as I have been a fan of less mainstream languages for a while, but it is finally being noted even in the embedded community. The issue is that when you go to multicore, as a developer you are writing software that may contain different forms of parallelism, yet you force yourself to express it in a sequential language, only to have the compiler do the difficult task of re-extracting that parallelism. What is necessary is a new paradigm that allows people to work with parallelism at the data level as well as at the functional and task levels. Prof. Ian Phillips, Principal Staff Engineer, was very adamant about this. He said that people nowadays try to tackle just enough to be competitive: if the competitor can handle, say, 2 out of 10 cores, people will aim for 3 out of 10. Instead, there should be a long-term vision with a clear roadmap to get to 10 out of 10, and if such a path exists, then he, as a hardware engineer, will be more than glad to extend his platform to offer the required hooks and pulls to enable this new model of computation.
• Platforms are no longer reliable. It used to be the case that a computer would work as designed, and once some transistors stopped working it would stop working completely. This rather black-and-white boundary no longer holds in future technologies. We can expect much higher soft-error rates and very gradual decay of transistors. Additionally, because the components are so small relative to the dopant atoms injected to create the semiconductor, variability is becoming a serious issue: it is no longer possible to get an accurate view of the delay or power of an individual transistor. This results in highly variable timing of the critical path in a circuit, which can even change at runtime due to temperature. It therefore becomes impossible to set a clock rate based on a worst-case execution time specified at hardware design time. Already we see chip vendors selling the same design at different clock rates because of this timing issue. So what next? As transistors become more variable and unreliable, there will be a need for fault-tolerance models at the hardware level. A good boundary for this seems to be the core, especially in manycore systems. At runtime, the operating system can detect the different clock speeds of the different cores as well as possible failures, and then decide how to allocate tasks to the cores.
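
Referring back to the first item, here is a small Scheme sketch of the Amdahl's Law arithmetic quoted there (the formulas are from the text; the helper names are mine):

(define (speedup seq n)
  ;; seq = sequential fraction (e.g. 0.2 for 20%), n = number of cores
  (/ 1 (+ seq (/ (- 1 seq) n))))

(define (cores-needed seq x)
  ;; cores needed for a target speedup x; unreachable when 1/x <= seq
  (if (<= (/ 1 x) seq)
      'unreachable
      (/ (- 1 seq) (- (/ 1 x) seq))))

(display (speedup 0.2 16))       (newline) ; => ~4
(display (cores-needed 0.0 4))   (newline) ; => 4.0
(display (cores-needed 0.2 4))   (newline) ; => ~16
(display (cores-needed 0.2 10))  (newline) ; => unreachable
(display (cores-needed 0.07 10)) (newline) ; => ~31

And, referring back to the second item, a toy software-only illustration of the commit-time conflict check; this is my own simplification, not the hardware scheme from the paper:

(define (overlap? addrs other)
  ;; #t when the two address lists share at least one address
  (cond ((null? addrs) #f)
        ((member (car addrs) other) #t)
        (else (overlap? (cdr addrs) other))))

(define (must-abort? reads writes committed-writes)
  ;; a committing transaction conflicts if anything it read or wrote
  ;; was also written by an already-committed transaction
  (or (overlap? reads committed-writes)
      (overlap? writes committed-writes)))

(display (must-abort? '(16 42) '(7) '(16))) (newline) ; => #t (recompute)
(display (must-abort? '(40 42) '(7) '(16))) (newline) ; => #f (commit)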

I think that for all the above reasons, right now is a very exciting time. In the 70s there were efforts to work on parallel and distributed systems, though at the time they did not take off. Faced with the difficulties of today, the only way forward is to re-examine these issues; failing to do so will lead to a stall in technology, so there is definitely a very big push to make it work.

Delimited continuations revisited

I thought it would be interesting to return to delimited continuations and explain what they are.

First of all, there is the word ‘continuation’. A continuation is, simply put, the remainder of a computation. Take for instance this simple piece of code:


(begin
  (print 1)
  (return)    ; pseudo-code: return early to the caller
  (print 2))  ; never reached

Now, the (print 2) is never reached. If, however, one used continuations, one could capture the remainder of the computation after the return, store it somewhere before returning, and invoke it at a later date.


(begin
  (print 1)
  (call/cc
    (lambda (x)
      (set! globalx x)  ; x is the continuation: everything after this call/cc
      (return)))        ; pseudo-code early exit, as before
  (print 2))

If one later invokes globalx, one is returned to where we left off and the computation continues, printing 2. So in essence a continuation captures the return path from where it was called, all the way to the end of the program (or in Scheme, to the top level). Notice that it captures the return path in a dynamic sense, not a static one: if it is captured inside a loop, it will return to that specific loop iteration.

Secondly, there is the concept of ‘delimited’. If one thinks of a continuation, one can see that it has a beginning, namely where call/cc is called, but no end (the end is the end of the program). One can therefore represent it as a half-closed interval [), where the start is the return path from call/cc and the end is the end of the program. One might argue that the end of the program can be seen as the end, but in the presence of continuations this is not very well defined. Delimited continuations, however, are delimited by a certain block, and this block has a very definite end, namely the end of that block. One can therefore represent them as a fully closed interval []. This also explains why one needs two continuations to emulate delimited continuations.
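
To make this concrete, here is a minimal sketch of a delimited continuation, assuming Racket and its racket/control library, which provides the reset/shift operators (the helper saved-k and the concrete numbers are my own; the post itself does not prescribe any particular implementation):

#lang racket
(require racket/control)

(define saved-k #f)

;; The continuation captured by shift is delimited by the enclosing
;; reset: it is just "(+ 1 [])", nothing outside the reset.
(display
 (+ 10
    (reset
     (+ 1 (shift k
            (begin (set! saved-k k)
                   5))))))
(newline)               ; the shift body's value, 5, becomes the value
                        ; of the reset, so this prints 15

(display (saved-k 100)) ; resuming runs only "(+ 1 [])" => 101
(newline)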
