Well, I just had a wonderful week in Nice at the DATE’07 conference. For those who don’t know, it’s one of the biggest EDA conferences (this year there were 5000 attendees, with representatives from all the major companies in the field).
Here are some observations I made as I was attending:
- Parallelism is a must. The era of scaling clock speed is over. S. Borkar from Intel showed some very interesting graphs regarding scaling in general. For clock speed in particular, in the early days there used to be a quadratic increase in clock speed for every reduction in technology size. That then went down to a linear increase. Nowadays there is no increase: clock speeds cannot be raised as you go to a smaller technology size, and it is even foreseeable that clock speeds will decrease with further reductions in technology size. He mentioned, however, that the latter is not an issue. As long as you can double the amount of money you get for a certain-sized chip with each generation, you’re in business. Additionally, we’re looking at 1T transistors in the near future (by 2015). This means that the only way to increase speed is through parallelism, and indeed already today we can see multicore devices, which before long will become manycore devices (a term used to indicate > 100 cores). So obviously there are two issues to make this work. One is that you cannot get a speedup as long as you don’t have parallel software, and here Amdahl’s Law is a very good back-of-the-envelope calculation for the speedup parallelism can get you: speedup = 1/(%seq + (1 - %seq)/N), where %seq is the fraction of the code that is sequential and N is the number of cores. Looked at another way, we can compute the number of cores required to reach a speedup of X given a certain fraction of sequential code: N = (1 - %seq)/(1/X - %seq). Obviously, for a speedup of 4 with 0% sequential code we need 4 cores, but if we go to a typical value of 20%, we need 16 cores. For a speedup of 10, we cannot even get a value for the number of cores in the case of 20% (the formula turns negative, because the speedup can never exceed 1/%seq = 5), and in the case of 7% we need 31 cores. So obviously the amount of sequential code should be reduced to the absolute minimum.
- Transactional memory, although very cool, doesn’t scale. There was a paper on a hardware implementation in which they replaced the L1 data cache with their own cache, which on top of the caching also provided the transactional memory support. This allowed them to parallelize different pieces of code, and only during the final commit cycle would there be a check as to whether there were conflicts in the memory addresses being read or written. Disregarding the fact that their cache-hit penalty was 13 cycles (as opposed to a 2-cycle hit for a normal L1 data cache), the problem is that they are serializing all the commit blocks (and incurring recomputation in case there’s a conflict). When I talked to the authors, they claimed it is possible to parallelize these commit blocks, but another professor I talked to said that it would require too much complexity. Hence my conclusion: transactional memory is very cool when you want to draw coarse-grained parallelism boundaries with few possibilities of conflicts (and then not have to worry about proper locking), but it completely fails once you go to more fine-grained parallelism. Combining that idea with the previous point, it is obvious we need something new in this field. The question is... what?
- Compilers should cost money, more so than they do today. Companies are starting to realize that compilers are underpriced. This causes a serious problem as more and more of the complexity in embedded devices resides in software, not hardware. If you consider that a typical EDA tool will set you back 100K euros while you’ll only pay 8K euros for a compiler, it is clear there is a gap. Tensilica, for instance, mentioned that the only way they are able to continue working in that area is by shifting the cost of their compiler research into the price tag of their EDA tools. But there was a consensus that compilers should cost more. This of course raises the question of Open Source. This question was answered by one of the people on the panel, though I forgot who. Basically, if you look at GCC, it does not do a lot of very fancy things. As such, it’s running behind in comparison to state-of-the-art compilation techniques. Therefore, if a company finds some novel optimization transformation, they can keep their edge by selling it in their tools, and once these optimizations become more mainstream, they can then be open-sourced.
- C no longer fits the model of computation. Well, this is not per se my observation, as I’ve been a fan of less mainstream languages for a while, but it is finally being noticed even in the embedded community. The issue is that when you go to multicore, as a developer, you’re writing software which might contain different kinds of parallelism, but you’re forcing yourself to write it in a sequential language, only to have the compiler do the difficult task of re-extracting this parallelism. What is necessary is a new paradigm that allows people to work with parallelism at the data level as well as at the functional level and task level. Principal Staff Engineer Prof. Ian Phillips was very adamant about this. He said that people nowadays try to tackle just enough to be competitive. If the competitor can handle, for instance, 2 out of 10 multicores, people will try for 3 out of 10 multicores. Instead, there should be a long-term vision so that there is a clear roadmap to get to 10 out of 10 multicores, and if such a path exists, then he, as a hardware engineer, will be more than glad to extend his platform to offer the required hooks and pulls to enable this new model of computation.
- Platforms are no longer reliable. It used to be the case that a computer would work as designed, and once some transistors stopped working, it’d stop working completely. This rather black-and-white boundary will no longer hold in future technologies. We can expect to see much higher soft-error rates and a very gradual decay of transistors. Additionally, because the components are so small relative to the atoms injected to create the semiconductors, variability is also becoming a serious issue, as it is no longer possible to get an accurate view of the time or power that a transistor will take. This results in very variable timing of the critical path in a circuit, which can even change at runtime due to temperature. Therefore it becomes impossible to set a clock rate based on a worst-case execution time specified when the hardware is designed. Already we see chip vendors selling different chips at different clock rates due to this timing issue. So what next? As transistors become more and more variable and unreliable, there will be a need for fault-tolerance models at the hardware level. A good boundary for this seems to be the core level, especially in many-core systems. At runtime, the operating system can detect the different clock speeds of the different cores as well as possible failures, and then determine how to allocate tasks to the different cores.
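As a quick sanity check of the Amdahl’s Law arithmetic in the parallelism observation above, here is a minimal Python sketch. The function names `amdahl_speedup` and `cores_needed` are my own choice for illustration, not something from the talk, and the sequential part is passed as a fraction (0.2 for 20%).

```python
from typing import Optional


def amdahl_speedup(seq_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's Law: 1 / (seq + (1 - seq) / N)."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / cores)


def cores_needed(seq_fraction: float, target_speedup: float) -> Optional[float]:
    """Cores required for a target speedup: N = (1 - seq) / (1/X - seq).

    Returns None when the target cannot be reached with any number of
    cores, i.e. when 1/X - seq is zero or negative.
    """
    denominator = 1.0 / target_speedup - seq_fraction
    if denominator <= 0.0:
        return None
    return (1.0 - seq_fraction) / denominator


if __name__ == "__main__":
    # The (sequential fraction, target speedup) pairs discussed in the text.
    for seq, target in [(0.0, 4), (0.2, 4), (0.2, 10), (0.07, 10)]:
        n = cores_needed(seq, target)
        if n is None:
            print(f"seq={seq:.0%}, speedup {target}: not reachable")
        else:
            print(f"seq={seq:.0%}, speedup {target}: about {round(n, 2)} cores")
```

The `None` case is the interesting one: with 20% sequential code the speedup is capped at 5 no matter how many cores you throw at it, which is why a speedup of 10 has no answer.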
I think for all the above reasons, right now is a very exciting time. In the 70s there were efforts to work on parallel and distributed systems, though at that time they did not take off. Faced with the difficulties of today, the only way to move forward is by reexamining these issues, and failing to do that will lead to a stall in technology, so there is definitely a very big push to make it work.