
Observations from DATE 2007

Well, I just had a wonderful week in Nice at the DATE’07 conference. For those who don’t know it, DATE is one of the biggest EDA conferences (this year 5000 attendees, with representatives from all major companies in the field).

Here are some observations I made as I was attending:

  • Parallelism is a must. The era of scaling clock speed is over. S. Borkar from Intel showed some very interesting graphs regarding scaling in general. In particular, in the early days there used to be a quadratic increase in clock speed for every reduction in technology size. That then went down to a linear increase. Nowadays there is no increase at all: clock speeds cannot be raised as you go to a smaller technology size, and it is even foreseeable that in the future clock speeds will decrease with further reductions in technology size. He mentioned, however, that the latter is not an issue: as long as you can double the amount of money you get for a certain-sized chip with each generation, you’re in business. Additionally, we’re looking at a trillion (1T) transistors per chip in the near future (by 2015). This means that the only way to increase speed is through parallelism, and indeed already today we see multicore devices, which not so long from now will become manycore devices (a term used for more than 100 cores). Obviously there are two issues to make this work. One is that you cannot get a speedup as long as you don’t have parallel software, and here Amdahl’s Law is a very good back-of-the-envelope calculation for the speedup parallelism can give you: 1/(%seq + (1 − %seq)/N), where %seq is the fraction of sequential code and N is the number of cores. Looked at another way, we can compute the number of cores required to reach a speedup of X given a certain fraction of sequential code: N = (1 − %seq)/(1/X − %seq). For a speedup of 4 with 0% sequential code we need 4 cores, but if we go to a typical value of 20%, we need 16 cores. For a speedup of 10, no number of cores suffices at 20% sequential code, and even at 7% we need 31 cores (the first sketch after this list reproduces these numbers). So obviously the fraction of sequential code should be reduced to the absolute minimum.
  • Transactional memory, although very cool, doesn’t scale. There was a paper on a hardware implementation: they replaced the L1 data cache with their own cache which, on top of the caching, provided the transactional memory support. This allowed them to parallelize different pieces of code, and only during the final commit cycle would there be a check for conflicts between the memory addresses being read or written (the second sketch after this list illustrates that commit-time check). Disregarding the fact that their cache-hit penalty was 13 cycles (as opposed to a 2-cycle hit for a normal L1 data cache), the problem is that they serialize all the commit blocks (and incur recomputation whenever there is a conflict). The authors claimed it is possible to parallelize these commit blocks, but another professor I talked to said it would require too much complexity. Hence my conclusion: transactional memory is very cool when you want to draw coarse-grained parallelism boundaries with few possibilities of conflict (and then not have to worry about proper locking), but it completely fails once you go to more fine-grained parallelism. Combining that idea with the previous point, it is obvious we need something new in this field; the question is: what?
  • Compilers should cost money, more so than they do today. Companies are starting to realize that compilers are underpriced. This is a serious problem, as more and more of the complexity in embedded devices resides in software, not hardware. If you consider that a typical EDA tool will set you back 100K euros while you’ll only pay around 8K euros for a compiler, it is clear there is a gap. Tensilica, for instance, mentioned that the only way they are able to keep working on that part is by folding the cost of compiler research into the price tag of their EDA tools. But there was a consensus that compilers should cost more. This of course raises the question of open source. That question was answered by one of the people on the panel, though I forget whom. Basically, if you look at GCC, it does not do a lot of very fancy things; as such, it is running behind state-of-the-art compilation techniques. Therefore, if a company finds a novel optimization, it can keep its edge by selling it in its tools, and once such optimizations become more mainstream, they can be open-sourced.
  • C no longer fits the model of computation. Well, this is not per se my observation, as I’ve been a fan of less mainstream languages for a while, but it is finally being noted even in the embedded community. The issue is that when you go to multicore, as a developer you are writing software that may contain different kinds of parallelism, yet you force yourself to express it in a sequential language, only for the compiler to do the difficult task of re-extracting that parallelism. What is needed is a new paradigm that allows people to work with parallelism at the data level as well as at the functional and task level. Prof. Ian Phillips, a Principal Staff Engineer, was very adamant about this. He said that people nowadays try to tackle just enough to be competitive: if the competition can handle, say, 2 out of 10 cores, people will aim for 3 out of 10. Instead, there should be a long-term vision with a clear roadmap to get to 10 out of 10 cores, and if such a path exists, then he, as a hardware engineer, will be more than glad to extend his platform to offer the required hooks and pulls to enable this new model of computation.
  • Platforms are no longer reliable. It used to be the case that a computer would work as designed, and once some transistors stopped working, it would stop working completely. This rather black-and-white boundary no longer holds in future technologies. We can expect much higher soft-error rates and a very gradual decay of transistors. Additionally, because the components are now so small relative to the atoms injected to dope the semiconductor, variability is becoming a serious issue: it is no longer possible to get an accurate view of the timing or power of an individual transistor. This results in very variable timing of the critical path in a circuit, which can even change at runtime due to temperature. It therefore becomes impossible to fix a clock rate based on a worst-case execution time specified at hardware design time. Already we see chip vendors selling otherwise identical chips at different clock rates because of this timing issue. So what next? As transistors become more variable and unreliable, there will be a need for fault-tolerance models at the hardware level. A good boundary for this seems to be the core level, especially in manycore systems. At runtime, the operating system can detect the different clock speeds of the different cores, as well as possible failures, and then decide how to allocate tasks to the cores.
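
To make the back-of-the-envelope numbers in the first bullet easy to reproduce, here is a minimal Python sketch of Amdahl’s Law and its inversion (the function names are my own, purely for illustration):

    # Amdahl's Law: overall speedup for a sequential fraction `seq` on `cores` cores.
    def speedup(seq, cores):
        return 1.0 / (seq + (1.0 - seq) / cores)

    # Inverse: how many cores are needed to reach a target speedup?
    # Returns None when the sequential part alone caps the speedup below the target.
    def cores_needed(seq, target):
        denom = 1.0 / target - seq
        if denom <= 0:
            return None
        return (1.0 - seq) / denom

    print(cores_needed(0.0, 4))    # 4    : 4x with no sequential code needs 4 cores
    print(cores_needed(0.2, 4))    # 16   : 4x with 20% sequential code needs 16 cores
    print(cores_needed(0.2, 10))   # None : 10x is unreachable at 20% sequential code
    print(cores_needed(0.07, 10))  # ~31  : 10x at 7% sequential code needs 31 cores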
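
For the transactional-memory point, the commit-time check described above boils down to comparing read and write sets. The following toy Python sketch is my own illustration of that general idea only, not the hardware design of the paper (referenced in the comments below):

    # Toy commit-time conflict detection for transactional memory.
    # Each transaction tracks the addresses it reads and writes; commits are
    # processed one at a time (exactly the serialization that hurts scaling),
    # and a transaction aborts if something it touched was written by a
    # transaction that committed after it started.

    class Transaction:
        def __init__(self, start_stamp):
            self.start_stamp = start_stamp
            self.read_set = set()
            self.write_set = set()
            self.staged = {}          # buffered writes, published only on commit

        def read(self, addr, memory):
            self.read_set.add(addr)
            return self.staged.get(addr, memory.get(addr, 0))

        def write(self, addr, value):
            self.write_set.add(addr)
            self.staged[addr] = value

    commit_log = []                   # (commit_stamp, write_set) of committed transactions
    clock = 0

    def try_commit(txn, memory):
        global clock
        for stamp, writes in commit_log:
            if stamp > txn.start_stamp and writes & (txn.read_set | txn.write_set):
                return False          # conflict: caller must re-execute the transaction
        clock += 1
        commit_log.append((clock, set(txn.write_set)))
        memory.update(txn.staged)     # publish buffered writes
        return True

Starting a transaction would record the current value of clock as its start_stamp; on a False return the block is simply re-executed, which is the recomputation cost mentioned above.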

I think that for all the above reasons, right now is a very exciting time. In the 70s there were already efforts to work on parallel and distributed systems, though at the time they did not take off. Faced with the difficulties of today, the only way to move forward is by re-examining these issues; failing to do that will lead to a stall in technology, so there is definitely a very big push to make it work.


17 Comments

  1. Bert says:

    Hello Christophe,

    Thank you for the overview. Interesting times indeed.

    Have you, by any chance, already taken a look at parallel languages such as Erlang or Fortress?

  2. Hello Bert,

    Are you by any chance the same Bert I studied with? If so, hello.

    I have not taken a look at Fortress, and my look at Erlang was only very brief. Personally I do not have a strong background in parallel languages, though if I understand it correctly, Erlang is based on message passing. I remember one phrase from the conference, though I forget by whom, saying that message passing is a very low-level abstraction model for concurrency. That being said, there were some papers that used message passing as a model for concurrency, but I will have to look back at the proceedings to remember which papers they were.

    What I see overall as the problem is that most concurrency specifications focus on the computation and not on the data. Hence it is hard to talk about data, and issues such as copying from one private scratchpad to another, or considering what the effects of cache-coherency are.

  3. One extra note is that I mostly looked at the software side of things, but I’m certain other people can point you to the interesting things (and there were many) that were said regarding the hardware side.

  4. Anonymous says:

    your comments about gcc are moronic, it costs enormous amounts of time and resources.

  5. Well I’m glad you made that comment anonymously…

    Anyways, I have no doubt that the development of GCC costs serious amounts of time and resources. However, I doubt those resources could also cover additional optimizations required by the embedded community.

  6. John D Jones III says:

    “Anyways, I have no doubt that the development of GCC costs serious amounts of time and resources. However, I doubt those resources could also cover additional optimizations required by the embedded community.”

    Like what? GCC is the number one compiler available; more applications are built using GCC as the compiler than any other. GCC has consistently kept up with current platform needs and provided optimizations that make even the oldest applications run better and faster. It also compiles most any non-interpreted language, Java, C, C++, ObjC, Ada95…. and it does it faster and with fewer problems than any other compiler available. And where do you get this, compilers need to cost more?! I don’t know of any CFO who thinks spending more money for any product is a ‘good’ thing. They want a product that does the job right, not one that costs more. I disagree with everything you’ve said here. Please consider the big picture before making such statements. C is still relevant and up to any new hardware/platform available now, or anytime in the future.

  7. Hannes Bender says:

    Hi Christophe,
    Intel is cooperating with IBM in creating a free JVM named Harmony right now. Do you think that managed code (JVM, CLR) is more likely to run those massively parallel systems in the future than traditional native-compiled software, aka C/C++? I always wondered what interest Intel has in pushing Java and the JVM forward, so this might be the main reason.

  8. Jeremy says:

    I work for a large telecomms concern – working on platforms for 3G mobile phones – and I can certainly confirm that one of the major problems we face right now is that we are seeing ever-increasing issues due to use of multiple cores.

    Currently the approach is to ‘debug out’ the issues, but it’s becoming increasingly clear to everyone that this isn’t really viable.

    Unfortunately, a 2,000,000 line legacy of C and C++ doesn’t get changed to something else overnight – and in practice only Erlang (from among the next-generation languages) is sufficiently robust and mature to consider for embedded applications – and it’s not a good fit for the lower layers of a 3G stack (which is where we see many issues).

    By the way, for those who praise GCC to the hills (and it’s a very fine x86 compiler), I can only comment that it is far inferior, in terms of generated code quality, to the ARM compilers (which are pretty expensive) on the ARM platform. In addition, the ARM compiler actually works on the Windows platform, whereas producing a GCC cross-compiler for Windows is a very tricky task (look at what Crosstool does on Linux!).

  9. Well,

    I didn’t expect my blog to get this much attention, so I will respond to some of the comments. Note first of all that these were my observations based on what people said. Nothing that I’m saying here is specifically new or my intellectual property; I was just summarising in my own words.

    @John D Jones: I think that we are targeting different communities. I’m looking at this from the embedded devices world, specifically mp3 players, cellphones, etc. The problem is that there you -must- split up your application over several cores to get any decent performance. This is something that is typically done manually, or semi-automatically, and most importantly -very- painfully. You must choose where to map your data, on which memory hierarchy element; which task goes where; how they synchronise; etc. When you get a multimedia application from a standards group, it is not written like that, and it takes a lot of effort for the embedded developers to make these changes. Notice how I said ‘embedded developer’. One of the chairs at the conference mentioned that there are millions of software developers but only a couple of hundred thousand embedded developers. So that is a terrible bottleneck. As Jeremy rightly points out, GCC is targeted towards compatibility; it is not targeted towards good performance and, more importantly, good energy consumption on embedded devices. Take any chip and its vendor’s compiler will do better than GCC (taken with a grain of salt, as I don’t know -all- chips, but in general it holds).

    @Hannes Bender: To be honest, I don’t know what to think of Java. Unless a lot of the computationally expensive blocks are offloaded to C again, I doubt it would compete. The devices I am talking about are not multicore general-purpose PCs but embedded devices. There you typically have a power budget of … 1W (at most). So you really have to be close to the platform, at least for your computationally expensive parts. In fact, some hardware vendors will even offer in-hardware implementations of standard blocks that can then be invoked like a function call (for instance a JPEG transform, an IDCT, an FFT or some such). I think that Java might be better off in ‘bigger’ parallel systems (like servers and such), though there are a lot of other languages competing for that space.

  10. Winterstream says:

    Hi Christophe,

    I think your comments about compilers having to cost more and at the same time requiring better support for parallelism conflict a bit.

    The reason I’m saying this is that better languages, with better concurrency support, will reduce most of the work a compiler needs to do to help parallelize something.

    Yes, an optimizing compiler is complex to write, but GCC can blow an ARM compiler out of the water with additional optimization techniques and perhaps a custom code generation back-end (i.e. not a generated one).

    And I’m sorry, how vastly superior can one compiler be over another? A 2X improvement (which I doubt)? The word “vastly” hardly applies.

    I realize a lot of “pragmatic programmers” love to slag off open source software (I know many).

    However, open software helps drive the world in such a fundamental way. Embedded development would be in a much worse state, were it not for GCC.

    As for the choice of Erlang for the low layers of a 3G stack – why isn’t it a good fit? From my telecoms experience, C++ caused me much unnecessary pain.

  11. Anonymous says:

    Could you cite the paper you refer to in your second point?

  12. @Wynand: I think you bring up a good point, and this is where points of view differ between companies. In fact this was also a point of discussion at the panel on compiler technology. The question really is: where do we put all the tool work for parallelization and mapping? Is it part of the compiler, or a separate tool on top? Personally I think that is just an issue of syntax, and in the end I consider it part of the ‘compiler’. The main issue you have nowadays is that your application developers are writing applications in C/C++ and have a good understanding of the implicit parallelism of the application but less of an understanding of the hardware. Then you have the embedded developers who must split and map this application onto multiple cores and multiple memory hierarchies and try to re-extract the parallelism that got lost once the application was written in C. You are right in saying that the actual low-end compiler for one processor could reuse GCC. However, what is missing is everything on top of that: something to get you a good partitioning of your application and a good mapping of your data onto memory. And in fact, for that last part, even GCC does not perform very well, though I know there is work being done on improving that for single-core by putting the polyhedral model into GCC.

    As for the reference on the hardware implementation of transactional memory (which, despite my remarks, I still think is a very cool idea), you can find it among Christos Kozyrakis’ publications: “ATLAS: A Chip-Multiprocessor with Transactional Memory Support,” Njuguna Njoroge, Jared Casper, Sewook Wee, Yuriy Teslyar, Daxia Ge, Christos Kozyrakis, Kunle Olukotun. Proceedings of the Conference on Design, Automation and Test in Europe (DATE), Nice, France, April 2007.

  13. John D Jones III says:

    OK, I concede a bit: my experience with embedded applications isn’t as vast as with other platforms. GCC is built for compatibility, and I can see where that would end up requiring more computing power in the end, especially for battery-powered devices. But I still don’t believe that any compiler or program being more expensive would make compiling embedded applications any ‘better’. My point was that GCC, being free, compared to say Visual Studio which costs quite a bit, is in the end still a better compiler: faster, targeting more platforms, and free, and using either MinGW or LCC it can even be used to port most UNIX code to Windows. I do apologize for misreading the main topic of your post.

  14. @John D Jones: I definitely agree with you that making something more expensive is not going to make it better. And I agree also that GCC is a great tool. But the problem is that it is just not a profitable position to create a compiler for an embedded system and still pay for the novel research (with an emphasis on novel, as that is definitely required for multicore solutions) with just (for instance) 8000 euros. That’s why the compiler field is so small: there is no money in it, while on the other side companies are willing to pay 100K for EDA tools. Perhaps, as a colleague of mine mentioned, the best idea is just to rename the embedded compiler (along with the parallelisation and all that) to “EDA tool”.

    I do believe that GCC will always have its place, and that is, as you rightly pointed out, compatibility. However, you also need more advanced compilers that perform specialized optimisations, and you need to pay the people who invent those optimisations. And once those optimisations are done by most compilers and the strategic advantage is gone, nothing stops those IP blocks from being open-sourced.

  15. Jim says:

    GCC is better (best?) on x86 because that’s where the majority (if not effectively ‘all’, given the penetration of non-x86 architectures) of the “sockets” are, and therefore where the effort went.

    Even then, the developers of GCC have no access to Intel’s “Yellow Books”, so GCC can never be as optimal as a well-engineered compiler from people who ‘do’.

    Until quite recently, it hasn’t mattered much. Intel (and AMD) would release a clock-speed boost into the market fast enough that it wasn’t worth the application (and linux) developer’s time to pressure the gcc developers to generate better code.

    If not multicore, then manycore will blow this scenario out of the water, and gcc (and friends) won’t matter (or rather, will coast down slowly.)

    I spoke to RMS about some of this recently, and he was … unimpressed. He views himself as a political activist now, not a hacker. He doesn’t know that the advance of technology is about to sweep his “free software” movement under the rug of history.

    However, this same sea-change in computing technology will also likely leave the linux kernel (and Windows) under the same rug. Linux (and Xen) will enjoy a brief rise in popularity, as a way to take advantage of multi-core (via virtualization), but, due to communication overhead, running this on 100s of cores is going to be a waste.

    I (re)-predict the rise of functional programming, and functional compounding, as it gets quite possible to do speculative execution in the absence of side-effects. If nothing else, you don’t have to ‘undo’ them, so you can simply throw the results away if they’re wrong, or not needed.

    I don’t see any reason why a company that invents a great manycore CPU wouldn’t also need to implement the toolchain for it.

    They will, most certainly, give it away.

    I also find it quite likely that the toolchain will be built… in Lisp.

  16. rektide says:

    re: transactional memory, i dont see that a hardware solution offers any compelling advantages over building a really good runtime for yourself. to me, transactional memory is more about opportunistic concurrency and knowing what programs are doing than it is transactions per se.

    the root idea to me is generating an evidentiary SSA-form log for each execution context or job, which is a pretty easy trick with some basic shadow logging of your state or program flow.

    i agree that after a very long field day, cs is finally going to be able to pursue some very interesting very technical problems.

  17. Kim says:

    Thanks for the information.
