
Open64 is a free, open-source optimizing compiler for the Itanium and x86-64 microprocessor architectures. It derives from the SGI compilers for the MIPS R10000 processor, known as MIPSPro. It was initially released in 2000 as GNU GPL software under the name Pro64. The following year, the University of Delaware adopted the project and renamed the compiler Open64. It now serves mainly as a research platform for compiler and computer-architecture research groups. Open64 supports Fortran 77/95 and C/C++, as well as the shared-memory programming model OpenMP. It can perform high-quality interprocedural analysis, data-flow analysis, data-dependence analysis, and array-region analysis. Development has ceased, although other projects can still use its source code.
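
As a rough illustration of the kind of code Open64 targets, the sketch below shows a small OpenMP C program together with a possible compile line. The driver name opencc and the -O3, -mp (OpenMP), and -ipa (interprocedural analysis) flags follow the AMD x86 Open64 packaging; other forks may differ, so treat the invocation as an assumption rather than a definitive recipe.

    /* Minimal OpenMP C example of the kind Open64 compiles.
       Possible invocation (assumes the AMD x86 Open64 driver and flags):
           opencc -O3 -mp -ipa saxpy.c -o saxpy                          */
    #include <stdio.h>

    #define N 1000000
    static float x[N], y[N];

    int main(void) {
        const float a = 2.0f;
        /* shared-memory parallel loop via OpenMP, which Open64 supports */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
        printf("y[0] = %f\n", y[0]);
        return 0;
    }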

The infrastructure

Its major components are the front ends for C/C++ (using GCC) and Fortran 77/90 (using the CraySoft front end and libraries), the interprocedural analyzer (IPA), the loop nest optimizer (LNO), the global optimizer (WOPT), and the code generator (CG). Although initially written for a single architecture, Open64 has proven able to generate efficient code for CISC, RISC, and VLIW architectures, including MIPS, x86, IA-64, and ARM.
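
To make the division of labor concrete, the sketch below shows the classic loop nest that LNO is designed to restructure (by interchange, tiling, or unrolling, for example) before WOPT and CG run. Which transformations are actually applied depends on the target and on compiler flags, so the comments describe typical behavior rather than what any particular Open64 build emits.

    /* Dense matrix multiply: the canonical loop nest handled by the loop
       nest optimizer (LNO).  At high optimization levels LNO may
       interchange, tile, and unroll these loops, and IPA may inline
       callers; the exact choices are target- and flag-dependent. */
    void matmul(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)          /* innermost reduction */
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }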

Intermediate representation

Open64 uses a hierarchical intermediate representation (IR), named WHIRL, with five main levels; it serves as the common interface among all front-end and back-end components.
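
As a conceptual illustration, the sketch below follows a single array store as it descends through the WHIRL levels (Very High, High, Mid, Low, and Very Low WHIRL). The level names follow the Open64 documentation; the lowered form shown is a paraphrase in C, not actual WHIRL dump syntax.

    /* Conceptual view of WHIRL lowering for one statement (not real dump
       output).  Front ends emit Very High WHIRL; each phase consumes the
       level it needs and lowers the program further toward code generation. */
    #define N 64
    static double a[N][N];

    void store(int i, int j, double v) {
        /* Very High / High WHIRL: kept as a multi-dimensional ARRAY
           operation, the form IPA and LNO work on. */
        a[i][j] = v;

        /* Mid / Low WHIRL (conceptually): the same store after lowering
           to explicit address arithmetic, the form WOPT sees. */
        *((double *)a + (long)i * N + j) = v;
    }
    /* Very Low WHIRL maps such operations onto target-level operations
       consumed by the code generator (CG). */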

Versions

The original version of Open64, released in 2002, lacked MIPSPro's advanced software-pipelining code generator and had only a rudimentary code generator for Itanium. The complete original MIPSPro compiler, including that code generator, is available under a commercial license as the Blackbird compiler from Reservoir Labs. The Showdown Paper documents the code generator that was not included in Open64. Tilera's compiler for its 64-core TILE64 chip is based on Blackbird.

Open64 exists in many forks, each with different features and limitations. The "classic" Open64 branch is the Open Research Compiler (ORC), which produces code only for Itanium (IA-64) and was funded by Intel. The ORC effort ended in 2003; the current official branch (which originated from the Intel ORC project) is managed by Hewlett-Packard and the University of Delaware's Computer Architecture and Parallel Systems Laboratory (CAPSL).

Other important branches include the compilers from Tensilica and the AMD x86 Open64 Compiler Suite.[1]

Nvidia is also using an Open64 fork to optimize code in its CUDA toolchain.[2]

Open64 is used as the backend for the HPE NonStop OS compilers on the x86-64 platform.[3]

Open64 releases

Version Release date
5.0 2011-11-11
4.2.4 2011-04-12
4.2.3 2010-04-09
4.2.1 2008-12-08
4.2 2008-10-01
4.1 2007-12-03
4.0 2007-06-15
3.1 2007-04-13
3.0 2006-11-22
2.0 2006-10-02
1.0 2006-09-22
0.16 2003-07-07
0.15 2002-11-30
0.14 2002-03-04
0.13 2002-01-10

AMD x86 Open64 releases

Version Release date
4.5.2.1 2013-03-28
4.5.2 2012-08-08
4.5.1 2011-12-19
4.2.4 2010-06-29
4.2.3.2 2010-05-17
4.2.3.1 2010-01-29
4.2.3 2009-12-11
4.2.2.3 2009-11-23
4.2.2.2 2009-08-31
4.2.2.1 2009-06-03
4.2.2 2009-04-24

Current development projects

Open64 is also used in a number of research projects, such as Unified Parallel C (UPC) compiler work and speculative multithreading work at various universities. The 2010 Open64 Developers Forum describes projects done at Absoft, AMD, Chinese Academy of Sciences, Fudan University, HP, National Tsing Hua University, Nvidia, Tensilica, Tsinghua University, and the University of Houston.[4] The Chinese Academy of Sciences ported Open64 to the Loongson II platform.[5]
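
For context, the snippet below is a minimal Unified Parallel C kernel of the sort a UPC front end has to accept. It uses only standard UPC constructs (shared arrays, upc_forall, MYTHREAD, THREADS) and is not tied to any specific Open64-based implementation.

    /* Minimal UPC example: each thread updates the shared-array elements
       it has affinity to, then thread 0 reports the thread count. */
    #include <upc.h>
    #include <stdio.h>

    #define N 1024                       /* elements per thread */
    shared int a[N * THREADS];           /* legal with static or dynamic THREADS */

    int main(void) {
        int i;
        upc_forall(i = 0; i < N * THREADS; i++; &a[i])  /* owner-computes loop */
            a[i] = MYTHREAD;
        upc_barrier;
        if (MYTHREAD == 0)
            printf("ran with %d UPC threads\n", THREADS);
        return 0;
    }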

AMD has extended and productized Open64 with optimizations for multi-core x86 processors and multi-threaded code development.[6] AMD supports Open64 as a complementary compiler to GCC.[7]

The University of Houston's OpenUH project, which is based on Open64, released a new version of its compiler suite in November 2015.[8]

See also

References

  1. ^ "x86 Open64 Compiler Suite". AMD. Archived from the original on 13 November 2013. Retrieved 12 November 2013.
  2. ^ NVIDIA’s Experience with Open64
  3. ^ "John Reagan Interview on LLVM, part 2". ecubesystems.com. 2019-05-01. Archived from the original on 2020-11-25. Retrieved 2020-12-21.
  4. ^ "2010 Open64 Developers Forum, August 25, 2010". Archived from the original on June 12, 2010. Retrieved September 6, 2010.
  5. ^ Open64 on MIPS: porting and enhancing Open64 for Loongson II
  6. ^ Nigel Dessau, AMD CMO (June 22, 2009). "Sweet Suite, blog posting". Archived from the original on 2010-09-06.
  7. ^ "AMD Open64 download page". Archived from the original on 2013-03-13. Retrieved 2012-11-13.
  8. ^ OpenUH downloads page

External links
