image package works slower when compiled with dart2native compared to JIT #39367
Comments
It is a common misconception - which comes up often enough that we should probably add it to an FAQ (/cc @mit-mit). AOT and JIT compilation have different performance trade-offs.
JIT has access to an accurate runtime profile of your application (including information about which parts of the code are hot, which classes are allocated and which receiver types are seen by each individual call site). Using this information JIT can speculate and produce very good machine code tailored for how your program is actually running. This speculation does not even have to be correct for arbitrary inputs to your program - because JIT can always fall back to a slower version dynamically. That is why JIT usually gives you very good peak performance. However you have to pay for this with startup and warmup latency - which is visible when you need to run a lot of code before your application puts the first pixel on the screen.
AOT has a different story - it does not actually know how your code will run. It has to look at the application as a whole, run various global analyses and try to recover information that JIT gets by observing the execution. It can't speculate - it has to produce code that is guaranteed to work. Sometimes AOT can figure things out, sometimes it falls short of following the flow of types through the program and has to produce generic and rather inefficient code.
You might ask here: wait, isn't Dart statically typed? Why do we even need any sort of global analyses? The answer to this question is: yes, Dart is statically typed, but static types don't necessarily give you enough information to produce good code. Take for example a variable
This just scratches the surface of the problem - in reality the situation is even more complex. That said we do try to bring the difference between AOT and JIT down as much as possible where it matters. |
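To make the point about static types concrete, here is a minimal sketch. It is not taken from the thread; the function name `sum` and the sample values are illustrative assumptions. The static type `List<int>` is satisfied by several different runtime classes, so the annotation alone does not tell a compiler which implementation of `values[i]` will be called.

```dart
import 'dart:typed_data';

/// Sums the elements through the `List<int>` interface.
/// `values[i]` is an interface call: its target depends on the
/// runtime class of `values`, not on the static type annotation.
int sum(List<int> values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

void main() {
  // Four objects with the same static type but different runtime classes.
  final growable = <int>[1, 2, 3];
  final fixed = List<int>.filled(3, 1);
  final bytes = Uint8List.fromList([1, 2, 3]);
  final unmodifiable = List<int>.unmodifiable([1, 2, 3]);

  for (final list in [growable, fixed, bytes, unmodifiable]) {
    // runtimeType names are implementation details and may differ
    // between SDK versions and compilation modes.
    print('${list.runtimeType}: ${sum(list)}');
  }
}
```

A JIT can observe which of these classes actually reaches `sum` and specialise for it, while an AOT compiler has to prove it by analysis or keep the generic call.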
Updating FAQ in dart-lang/site-www#2098 |
Yep, I understand all of that. And for complex functions, I would expect the JIT might be faster. But for very simple functions, just sheer number crunching, AOT ought to be faster, since it doesn't need to do any of the run-time analyses or optimizations that the JIT needs to do. And once you enter a very long loop, you know the data-type of the list before you start processing it, so at the very least, that should be faster? It isn't:
I'd expect
The code is accurately type-hinted here and here to avoid e.g. type-checking list elements, so AOT really ought to be faster at least for this case, I think? There should be enough static information available in this case for an AOT compiler to at least beat the JIT on a tight closed loop with well-known types?
I'm not trying to be poignant here, but if AOT is going to be consistently slower than JIT, why even compile to a native binary in the first place? Wouldn't it be more efficient to compile to bytecode and link the JIT run-time into the executable? Wouldn't it be considerably less work and maintenance, too? You have the bytecode compiler and JIT run-time available anyhow - I'm sure maintaining a cross-platform binary back-end for the language is a pretty substantial effort. Beyond producing stand-alone executables, what value proposition does
I was (am) excited about being able to produce stand-alone executables, but maybe compiling to native binaries isn't the best or simplest approach? If you could simply embed the JIT engine in a stand-alone executable instead, we'd have the same portability, ease of deployment, better performance, access to reflection, etc. without any further ado.
Perhaps the main benefit of a native binary over an embedded VM approach would be the smaller file size - but is that very important in this day and age? My example with a web server that resizes images comes out around 8 megabytes anyway. I don't know how big an embedded JIT would be, but the bytecode likely would be a few hundred KB, so for many common use-cases, I suspect an embedded VM might even be comparable in terms of size? |
To get peak performance you need to know data-type of the list at compile time - knowing that list type is invariant of the loop could theoretically help, but you would still have some sort of virtual dispatch in the loop itself. [Note that type annotation
Again it is not a straightforward comparison. JIT for example has the chance to speculate on bitwidth of the numbers involved. AOT has to be conservative and prove things. In general we do have a problem that our AOT compiler does not produce the best numeric code for tight loops (especially with integers) and this is something that we plan to eventually fix. I looked at the code generated for
As I have indicated before - we would really like to bring AOT performance as close to the JIT performance as possible. We are working on it continuously. It takes time because it is not a trivial problem. It is much easier to make a fast JIT for a language like Dart than a fast AOT, especially if you take certain additional constraints like code size into account. (Dart AOT was originally created for mobile devices - so every byte counts). The reason to use AOT in the first place is low latency startup and good performance (it might not be as high as JIT performance in all cases, but it is still good enough for many kinds of applications). Also you can use AOT in places where you can't use JIT (e.g. iOS). If you don't care about startup latency and care about peak performance - then you should certainly use JIT at the moment. |
Just found this issue; the same thing is happening here.
My code is nearly 80 lines, fully type annotated (no dynamics) and makes use of const / final variables, const constructors and fixed-length lists where possible. Also, the part of the code where most of the time is spent is a switch (this is normally optimized into a jump table by some compilers). I think more consideration should be given to optimizations, especially since it took a while to compile the Dart code AOT (I thought part of that was due to optimizations?). I will leave the code here, in case you can use it for further improvements, |
There is a ray of hope in this sentence and just because of this I am switching to Dart AOT for full-stack (back-end, front-end) development. Having said that, I think we can try to learn from other statically typed AOT-compiled languages like Go, Rust, Crystal, and Nim. |
The issue is not about learning other languages. I can code in almost 20. The thing is that, as far as I can see, AOT compilation is only meant for startup-sensitive apps. However, most people expect run-time performance rather than startup performance. |
@ConsoleTVs you can replace |
Great to hear! This could make the AOT version run faster, taking |
In this new serverless world of AWS Lambda, startup performance, run-time performance and CPU/memory efficiency are all important. |
What are you trying to prove? |
Nothing. |
May I ask why g++ is able to produce so much faster code with AOT, beating almost every JIT in existence? Yet Dart, with all those static types and ahead-of-time information, becomes incompetent in front of JIT. The languages using JIT actually have their complex logic (and often their core library) implemented in a statically typed, AOT-compiled language. And the false propaganda I have seen here, right after Dart's AOT release, is that JIT is faster than AOT. Even Java's JIT cannot beat C/C++'s AOT. |
@thomasb892 Because of this: We currently lose some type information in the backend to produce good code |
@thomasb892
Because g++ is compiling C++, which is a much lower-level language. Imagine you write something like this in C++:
#include <vector>
struct S {
  int f;
};
int foo(int a, std::vector<int>& b, S* p) {
  return a + b[0] + p->f;
}
When a C++ compiler compiles this function it does not have to worry that
In Dart none of this is true.
class S {
  final int f;
  S(this.f);
}
int foo(int a, List<int> b, S p) {
  return a + b[0] + p.f;
}
// Any class can implement the implicit interface of S.
class SImpl implements S {
  get f => throw "Hahaha";
}
and so on and so forth. So comparing Dart AOT to C++ does not really help. Compiling C++ is easier. (As a sidenote: even a C++ compiler can be assisted by PGO, e.g. you can get significant performance improvements from relayouting binaries or doing profile-guided devirtualization - which highlights pure AOT's shortcomings). |
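The Dart half of the example above can be turned into a small runnable sketch. The constructor, the class name `ThrowingS` and the use of `StateError` are assumptions added for illustration. It shows that `p.f` in `foo` is a getter invocation on an interface, not a guaranteed field load, because any class may implement the implicit interface of `S`.

```dart
class S {
  final int f;
  const S(this.f);
}

/// Implements the implicit interface of S and replaces the field
/// with a getter that runs arbitrary code (here: it throws).
class ThrowingS implements S {
  @override
  int get f => throw StateError('f is a getter here, not a field load');
}

/// From the static types alone, `p.f` could be a field load or a call
/// into user code, and `b[0]` could be any List<int> implementation.
int foo(int a, List<int> b, S p) => a + b[0] + p.f;

void main() {
  print(foo(1, [2], const S(3))); // prints 6

  try {
    foo(1, [2], ThrowingS());
  } on StateError catch (e) {
    print('getter ran user code: ${e.message}');
  }
}
```

An AOT compiler can only treat `p.f` as a plain load if whole-program analysis shows that no class like `ThrowingS` can reach the call.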
@mraleph I think Dart was supposed to get strict nulls soon? Which should address that issue at least. |
Yes, we definitely plan on doing VM perf optimisations once we have null safety landed. |
@mit-mit any plans for AOT perf optimisation? Flutter uses AOT on iOS, and it is also possible on Android. And for serverless, startup speed, memory usage and runtime perf are all important. |
@windrunner414 we are continuously working on improving performance of AOT code. If you have some specific code in mind which you think runs slow please file a separate issue. Then we can take a look and suggest if we can do something on our side to make the code faster or if the code could be changed to make it faster. |
@mraleph the raw number crunching performed by the image library I mentioned in this issue is definitely a good candidate? It ought to perform better with AOT, as it's all statically typed and, well, this is what CPUs do best. Getting close to bare-metal performance ought to be possible. 😄 |
@mindplay-dk while in general we want to improve performance of working with numbers, I would say that using pure Dart ports of image manipulation routines does not make sense to me - if performance is important - instead I'd recommend calling some native library to do the image manipulation (you can sandbox it if you are worried about vulnerabilities). |
List b is only supposed to be passed by reference. So it can be a pointer. Therefore it can be null. In C/C++ they are mostly pointers otherwise it's slow. We could use everything as pointers. Also that OOP code, even C++ does it. And does it rather fast. Dart AOT has a lot of potential. |
When the new null safety lands, we can know that values are never null. For nullable objects, a check could be enforced: you can't call anything on them unless you check for null first.
If we call s(anyImpl) and pass the *S1, I think we can know what type anyImpl is, unless it's dynamic. If it's dynamic, checking the runtimeType is necessary, but if not, this step can be skipped. |
There are possibly more tricks one could use to speed up AOT because of Dart being very similar to Java. Android shifted from Dalvik VM(JIT) to ART runtime(AOT) and it has only been faster ever since. Maybe we could learn from ART. |
@thomasb892
I am not sure I understand what you are trying to say here. Yes, That's where C++ differs a lot from Dart - variables of primitive types can't ever be
Yes, there are tricks to speed up AOT. If you actually look through the git history you will discover that we are constantly applying new ones :) Note that these days ART does not actually use a simple AOT - since Android N it actually uses profile-guided AOT which is driven by profiles collected at runtime. You don't compile the whole app on installation - instead you run the application in a JIT and then use some background process to recompile hot parts of your application based on the profile information. Since Android 8 this profile information contains among other things inline cache states - which allows "AOT" (I'd rather call it asynchronous JIT though) to perform speculative optimisations. Also as I have said before: when compiling Java you don't face all the same challenges that you face when compiling Dart - for example Java
It is true, though it must be clarified that initially most applications would be run in hybrid opt-in/opt-out mode in which you can actually violate non-nullability promises. Only if your application is fully opted in (no dependencies are opted-out) and you are running in strong checking mode you can be sure that
Yeah, I know how C++ implements inheritance. I am not sure why are you bringing it up here though. Notice that original example with It's an interesting question whether there is a lot of performance sensitive code like that to begin with. Leaving that aside (assuming for example this sort of code was important and we wanted to apply this technique), I can see at least few challenges applying it:
|
@mraleph We can know at compile time if it might be a getter / setter, and let S and any implementation of S to have a getter&setter, not just int f. It may improve performance but u are right, there are many challenges and it's complicated |
Correct me if I'm wrong. Dart is now a truly statically typed language (beginning with Dart 2.0) and is being used to develop mobile apps. Flutter is a cross-platform framework with native performance. To better compete with native apps (written in Java/Kotlin/Swift), high performance is important. [1] State of Valhalla. The Road to Valhalla (https://cr.openjdk.java.net/~briangoetz/valhalla/sov/01-background.html) |
@hooluupog Feature Request for value types is better raised at dart-lang/language, because it is a language design decision. We have discussed adding value types for many years now - and so far there have been much higher priority issues to tackle. |
@mraleph Okay, got it. |
Forgive me if I’m getting this wrong but it sounds like it comes down to losing type metadata for the sake of file sizes? Are there other performance issues that this is important for? I can understand why this would be important for mobile apps and for dart2js but for serverside apps and cli’s, performance would be much more important than file size IMO. |
No, I am not sure which part of this thread made you think this way. It is true though that we take code size into consideration - which impacts for example our inlining decisions (AOT inlining is much less aggressive than JIT inlining as a result), but that's a somewhat separate topic. |
Forgive but this is kind of off topic: For a backend server application like aqueduct, is it recommended to deploy to production in AOT or JIT on say google's cloud run (semi serverless)? @DEVisions has been adding AOT support to aqueduct here |
That would depend on a bunch of factors, such as how frequently your backend spins down/up, what kind of code it runs, etc. I'd recommend doing some benchmarking for your particular workload. |
@sjapps It is true that - as @mit-mit Michael said - some stress testing would be needed on both AOT (native) and JIT (non-native), as your application behavior and JIT optimizations may sometimes respond better than the native version. Indeed, startup time and memory usage may favor your needs and expectations. This is applicable to all other similar platforms, such as Java (more specifically, look for Quarkus with GraalVM). Oh, and the lovely AOT capability of Aqueduct has been added by the Aqueduct Team and @joeconwaystk himself. I am starting to invest time in it as I would love to contribute, and in this particular case I was just a messenger and gave back some feedback. 😊 |
I tested a simple fib and puppeteer program in Dart.
Test case: Fib
num fib(num n) {
  if (n <= 1) return 1;
  return fib(n - 1) + fib(n - 2);
}
void main() {
  print(fib(46));
}
Running the compiled, jit, aot, kernel and vm version,
Verdict: The compiled and AOT versions are 1.7x slower than the vm, jit and kernel versions.
Test case: Puppeteer
This will run a headless chromium and close it after creating a new page.
import 'package:puppeteer/puppeteer.dart';
void main() async {
  var browser = await puppeteer.launch(headless: true);
  var myPage = await browser.newPage();
  await browser.close();
}
Running it:
Verdict: the compiled and aot versions are much faster, and the vm version is 10x slower. |
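For anyone who wants to reproduce this kind of comparison, a minimal timing sketch follows. It is an assumption-laden illustration, not the commenter's harness: the helper `time`, the names `fibNum` and `fibInt`, and the input size 32 are all made up here. It also times an `int`-typed variant next to the `num`-typed one, on the assumption that the wider `num` type is part of what makes the AOT code more generic; no particular result is claimed.

```dart
/// Same recursion as in the comment above, typed with `num`:
/// each `+` and `<=` must handle both int and double.
num fibNum(num n) => n <= 1 ? 1 : fibNum(n - 1) + fibNum(n - 2);

/// Hypothetical variant typed with `int` for comparison.
int fibInt(int n) => n <= 1 ? 1 : fibInt(n - 1) + fibInt(n - 2);

/// Runs [body] once and returns the elapsed wall-clock time.
Duration time(void Function() body) {
  final stopwatch = Stopwatch()..start();
  body();
  stopwatch.stop();
  return stopwatch.elapsed;
}

void main() {
  const n = 32; // small enough to finish quickly in either mode
  print('num: ${time(() => fibNum(n))}');
  print('int: ${time(() => fibInt(n))}');
}
```

Running the same file once with `dart` (JIT) and once after `dart compile exe` (AOT) gives the two numbers to compare; a single run is noisy, so repeating each measurement several times is advisable.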
Would it be possible for the Dart VM to write JIT profiles to disk while a program is running, possibly when an option is passed to Dart or when a specific function is called and then use those profiles when compiling in AOT mode? |
I am going to rename the issue from "Dart native slower than Dart VM?" to a more concrete "image package works slower when compiled with dart2native compared to JIT".
Since 2019 we have improved many things in the Dart AOT compiler (including TFA precision and better local optimizations). If I take the original benchmark which made @mindplay-dk file this issue, I see that we went from 80% speed difference to 11% speed difference (measuring on
$ dart compile exe bin/main.dart
$ bin/main.exe
Time taken: 5822!
$ dart bin/main.dart
Time taken: 5203!
I have taken a quick look at the difference and the code quality has improved significantly, for example we are now unboxing
Now it seems that a lot of the difference originates from the JIT's ability to speculatively specialise for specific receiver types, e.g.
class InputBuffer {
  List<int> buffer;
  // ...
In JIT mode we figure out that
If I replace
$ bin/main.exe
Time taken: 5682!
I think there are other cases like this. I looked at the flame graph and I can clearly see places where we perform virtual dispatch to array methods (including typed array methods) instead of properly handling these cases inline. I think at least some of these cases can be attributed to potential issues in the compilation pipeline (I will file bugs for those), but some of these (like
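A rough sketch of the kind of declaration change this comment points at, under the assumption that the field is narrowed from the `List<int>` interface to a concrete typed-data type such as `Uint8List` from `dart:typed_data`. The class names and the `checksum` method are hypothetical and this is not the image package's actual code; whether a given SDK version inlines the element access is not guaranteed.

```dart
import 'dart:typed_data';

/// Hypothetical variant A: the field has the interface type, so
/// `buffer[i]` may resolve to any List<int> implementation at run time.
class GenericInputBuffer {
  final List<int> buffer;
  GenericInputBuffer(this.buffer);

  int checksum() {
    var total = 0;
    for (var i = 0; i < buffer.length; i++) {
      total = (total + buffer[i]) & 0xFFFFFFFF;
    }
    return total;
  }
}

/// Hypothetical variant B: the field has a narrower static type, which
/// gives an AOT compiler more to work with without speculation.
class TypedInputBuffer {
  final Uint8List buffer;
  TypedInputBuffer(this.buffer);

  int checksum() {
    var total = 0;
    for (var i = 0; i < buffer.length; i++) {
      total = (total + buffer[i]) & 0xFFFFFFFF;
    }
    return total;
  }
}

void main() {
  final bytes = Uint8List.fromList(List<int>.generate(256, (i) => i));
  // Both variants compute the same value; only the static type differs.
  print(GenericInputBuffer(bytes).checksum());
  print(TypedInputBuffer(bytes).checksum());
}
```

The two classes behave identically, so the change is purely a hint to the compiler; timing both under `dart compile exe` is the only way to tell whether it matters for a given workload.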
@Keithcat1 this is possible though a question arises how to use these profiles. They can be used to guide inlining decisions, but they will not help to fully close the gap between JIT and AOT because AOT will not be able to apply these profiles speculatively, it will need to keep the fallback case around - which will degrade the code quality. |
I imagined it would work similar to the way Clang does it with -fprofile-generate and -fprofile-use |
I was (am) pretty excited about the dart2native announcement, and decided to test it.
Where I would really expect this to shine, is with the sort of heavy number crunching that generally makes scripting languages fall short or delegate the heavy work to C.
So I installed the image package version 2.1.8, and wrote a very basic script:
And a basic console front-end:
I'm feeding it a big photo of 5760 x 3840 px, and as you can see, I'm using presumably the most expensive Interpolation algo available.
Run this with the VM:
Let me interject here and say, this is by far the fastest I've ever seen any scripting language resize an image of this size - this library is single-threaded, so that is really incredibly fast! Kudos on delivering probably one of the fastest scripting language VMs ever created! 🤩
But (obviously?) I was expecting this to be even faster when compiled to a native binary.
So I built it:
And ran it:
Almost 80% slower?
I ran both many times, and the results are pretty consistent.
I also pulled up a CPU monitor, and it does look like the Dart VM uses more CPU power - I see a spike on two CPU cores, whereas with the compiled binary, I see a spike only on a single core. Presumably the code runs single-threaded on the Dart VM, and the second CPU core spike is due to the VM making optimizations or doing garbage collection on the fly or something?
Anyhow, this result is more than a little surprising to me. 🤔
Note that I'm using the 64-bit Windows build of the with Dart VM version 2.6.1 (Mon Nov 11 13:12:24 2019 +0100) - perhaps this isn't fully optimized for Windows yet?
Or perhaps the compiler has not been optimized for raw number crunching yet? I suppose the VM has been around for a lot longer and the native compiler is still very new, so maybe the VM has optimizations that the native compiler doesn't have yet?