Very low FPS #4
Comments
Hi, I am afraid the bottleneck is the graphics pipeline. I base this on the subreddit /r/macgaming and on other GitHub issues that describe similar problems with 32-bit software using DirectX 9. If I understand the underlying problem correctly, it is simply due to several translation layers stacked on top of each other. DirectX 9 is translated to OpenGL calls by WineD3D. Unfortunately, OpenGL has been deprecated since macOS Mojave. Whatever is in modern macOS is another translation layer that turns those OpenGL calls into Metal. A solution would be a modern DirectX 9 to Metal translation layer, but as far as I am aware there is none currently. All the best |
Hi, does this file work for you? Do you have any FPS problems? Is there no way we can make a workaround? I was so happy to see it booting and working, but 30fps is not really playable. Thanks again! |
Hi, Apple silicon user with an iMac and M1 processor here, who tested getting your fix up and running with CrossOver. I was really happy to see I could start the WOTLK 3.3.5a client without error 132, and joining the world after character selection is possible. Sadly there is a big performance hit: the game is unplayable at a resolution of 2560x1440 in my setup, even with minimal graphics settings. What works much better, sadly, is starting the WOTLK 3.3.5a client via Parallels or VMware Fusion. The best setup I achieved so far was using the MoP 5.4.8 client, but of course this is a 64-bit client, which works natively without an error in CrossOver. The issue is the 32-bit architecture of clients up to 3.3.5a. It would be nice if performance could be increased, but so far the stuttering every second makes it unplayable. |
Hi once again, As I previously mentioned, this is an issue caused by the enormous number of layers stacked on top of each other. Winerosetta can't and won't fix the performance issue. Someone has to build a translation layer that translates DirectX 9 directly to the Metal API. |
First of all, thanks for this project, it's really cool to see the old clients running on Apple Silicon without having to use a VM. Sadly, I can confirm it's not really playable, at least on low-end chips: This is 1.12.1 running on an M1 MacBook Air (8C/8C CPU/GPU, 16GB RAM, macOS Sequoia 15.1.1) in a Windows 10 bottle (using Whisky) with VanillaFixes 1.5.2 and all the graphics settings at default, except with the resolution set to 800p, using no addons. I also tried using OpenGL (as instructed here) but that didn't improve the performance (it seems to be the same). I assume it would work with playable performance on more current/higher-end chips; I'll report back if I get the chance to test that. Disclaimer: I didn't do extensive research here and I might be totally wrong about this potentially being a viable (or even technically possible) solution, but I thought it might be worth mentioning: I found dgVoodoo2 and was thinking that it could perhaps be utilized to translate DX9 to DX11/12 and then be able to use D3DMetal, with hopefully improved performance over the current |
I don't think this works because all those DX11/12 translation layers only work for 64 bit meanwhile this version of the game is 32 bit. Currently there is no solution to this problem I am afraid. |
Ah, right, thanks; I somehow didn't consider that. |
Stacking abstraction layers on top of each other is generally not a problem, and it is not the fundamental problem here. Just imagine how many layers of indirection sit between Metal (or any other API) and the actual GPU. The problem is that the translation layers in place are not optimized for performance, in this case Apple's OpenGL driver. No matter whether you select DirectX 9 or OpenGL in WoW, it will always go through Apple's OpenGL driver. In the case of DirectX, Wine does the translation to OpenGL. This is a trivial one since DirectX 9 and OpenGL are very similar, so it is unlikely to cost a lot of performance. However, going to something like Vulkan or Metal requires some clever engineering (read: pipeline caching) to make it performant. Apple didn't bother. This is also why almost all translation layers to modern APIs (Vulkan, DX12, Metal) only target higher DirectX versions.
Faster hardware doesn't help either. I have a maxed-out M4 Max and even there the FPS are unplayable. The CPU is probably stalling the GPU all the time by creating render pipelines ad hoc.
To further prove my point that indirection itself is not the problem: even on Windows, the game is slow without any tweaks. If you use DXVK (DX9 -> VK) the performance goes up dramatically. Like A LOT. I went from 30fps to 400fps in Stormwind on my RTX 4070. So we actually increased performance by adding a level of indirection. Or did we? Maybe Nvidia implemented their DX9 driver on top of their Vulkan or DX12 driver. Nobody knows. All we know is that DXVK is amazing and not a naive translation like Apple's driver.
So can we use DXVK on macOS? Sadly, no. While Vulkan is available on macOS via MoltenVK, it only implements Vulkan up to 1.2, while DXVK requires Vulkan 1.3 (at least for a working DX9 translation). The development of that feature seems a bit stale, as does MoltenVK as a whole, to be honest. However, there is no fundamental reason why 1.3 can't work on top of Metal.
I am convinced that if we get that support we will see a similar dramatic improvement as on Windows. That is how inefficient those legacy drivers are. So if you can help out there at all, it would be great.
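To make the "pipeline caching" point concrete, here is a minimal sketch of the idea, not how DXVK or Apple's driver actually implement it. `RenderState`, `Pipeline` and `PipelineCache` are hypothetical names invented for illustration; the only point is that the expensive pipeline object is built once per distinct state instead of on every draw call.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical subset of the state that a DX9-style draw call implies.
struct RenderState {
    std::uint32_t vertex_shader_id;
    std::uint32_t pixel_shader_id;
    std::uint8_t blend_mode;
    std::uint8_t depth_test;

    bool operator==(const RenderState& o) const {
        return vertex_shader_id == o.vertex_shader_id &&
               pixel_shader_id == o.pixel_shader_id &&
               blend_mode == o.blend_mode && depth_test == o.depth_test;
    }
};

// Combines the fields into one hash value so the state can be a map key.
struct RenderStateHash {
    std::size_t operator()(const RenderState& s) const {
        std::size_t h = std::hash<std::uint32_t>{}(s.vertex_shader_id);
        auto mix = [&h](std::size_t v) {
            h ^= v + 0x9e3779b9u + (h << 6) + (h >> 2);
        };
        mix(std::hash<std::uint32_t>{}(s.pixel_shader_id));
        mix(s.blend_mode);
        mix(s.depth_test);
        return h;
    }
};

// Stand-in for an expensive, compiled pipeline object.
struct Pipeline {
    std::uint64_t handle;
};

class PipelineCache {
public:
    // Returns the cached pipeline for this state and "compiles" a new one
    // only the first time a given state is seen.
    const Pipeline& get(const RenderState& state) {
        auto it = cache_.find(state);
        if (it == cache_.end()) {
            ++compile_count_;  // the slow path a naive driver hits every time
            it = cache_.emplace(state, Pipeline{compile_count_}).first;
        }
        return it->second;
    }

    std::uint64_t compile_count() const { return compile_count_; }

private:
    std::unordered_map<RenderState, Pipeline, RenderStateHash> cache_;
    std::uint64_t compile_count_ = 0;
};
```

With a cache like this, a frame that reuses the same handful of states pays the compile cost only once per state, which is the kind of ad hoc pipeline creation suspected above of stalling the CPU.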
This wasn't a constraint of the translation layer itself AFAIK. It was that wine64 couldn't run 32-bit programs. Since Apple dropped 32-bit support we could only use wine64, while Linux users could still run the normal wine on their 64-bit operating system. But this feature was added to wine in the meantime. However, I just downloaded the latest game porting toolkit 2.0 beta from Apple and it does indeed only contain 64-bit DLLs for DirectX. That is a bummer. Maybe they will add 32-bit versions later, because there is nothing stopping them from doing it now. So yes, this is kind of where the dgVoodoo2 tool currently fails for me, too. I can't make it use Apple's DX12 -> Metal translation, and it falls back to wine's translation layer, which goes to Vulkan for DX >= 11. However, I would be surprised if dgVoodoo2 came close to the performance of DXVK. |
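For anyone who wants to check the DLLs in a toolkit themselves: whether a DLL is 32-bit or 64-bit is recorded in the `Machine` field of its PE header. Below is a rough sketch that reads this field from a file already loaded into memory. `read_le`, `PeArch` and `pe_architecture` are names made up for this example; only the header offsets and machine constants come from the PE format.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Reads an n-byte little-endian integer at `off`, or nullopt if out of range.
static std::optional<std::uint32_t> read_le(const std::vector<std::uint8_t>& buf,
                                            std::size_t off, std::size_t n) {
    if (off > buf.size() || n > buf.size() - off) return std::nullopt;
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v |= static_cast<std::uint32_t>(buf[off + i]) << (8 * i);
    }
    return v;
}

enum class PeArch { X86_32, X86_64, Arm64, Unknown };

// Returns the target architecture of a PE image (EXE/DLL),
// or nullopt if the buffer does not look like a PE file.
std::optional<PeArch> pe_architecture(const std::vector<std::uint8_t>& image) {
    // The DOS header starts with the bytes "MZ".
    auto mz = read_le(image, 0, 2);
    if (!mz || *mz != 0x5A4D) return std::nullopt;

    // e_lfanew at offset 0x3C holds the file offset of the PE signature.
    auto pe_off = read_le(image, 0x3C, 4);
    if (!pe_off) return std::nullopt;

    // The signature is "PE\0\0", followed directly by the COFF header.
    auto sig = read_le(image, *pe_off, 4);
    if (!sig || *sig != 0x00004550) return std::nullopt;

    // The first COFF header field is the 16-bit Machine value.
    auto machine = read_le(image, static_cast<std::size_t>(*pe_off) + 4, 2);
    if (!machine) return std::nullopt;

    switch (*machine) {
        case 0x014C: return PeArch::X86_32;  // IMAGE_FILE_MACHINE_I386
        case 0x8664: return PeArch::X86_64;  // IMAGE_FILE_MACHINE_AMD64
        case 0xAA64: return PeArch::Arm64;   // IMAGE_FILE_MACHINE_ARM64
        default:     return PeArch::Unknown;
    }
}
```

Reading a DLL into a `std::vector<std::uint8_t>` and passing it to `pe_architecture` should be enough to tell whether a given DirectX DLL could be loaded by the 32-bit game client at all.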
@Lifeisawful Thanks for experimenting further! Are you already using VanillaFixes? Because under “normal circumstances” (running the game directly on x86_64 Windows and not using any translation layers) and even when running the game through a Windows 11 ARM VM via Parallels on macOS, it does result in large performance boosts, even without the optional DXVK that comes bundled with it; the tests I posted here before were also done with VanillaFixes. Granted, I haven’t done any testing without it, so I’m not sure which impact it has in this scenario. |
As in your own custom DX9 to Metal translation layer? Did you publish the code anywhere, yet? Would be very interested to look into it.
AFAIK all it does is make the game use a more precise timer. Not sure how relevant that is under Wine. |
Doesn't this only mean the application can't keep up with the display refresh, no matter where the bottleneck is? Sorry if this is a dumb question. I thought the low GPU load would just tell us that it is a CPU bottleneck. I mean, this was expected. How do you derive from these numbers that it is not bottlenecked by the OpenGL driver, which creates new pipeline states all the time?
Is this actually used by the application to a degree that it is problematic? I am wondering if there are some performance counters to check which x87 instructions were executed. But I assume that if this is the only FPU available there must be quite a lot of multiplication going on, even if I would have assumed most of it is offloaded to the GPU.
AFAIK Windows on ARM is doing exactly that (accepting the precision hit that comes with using NEON). But I am really curious how you are planning to modify Rosetta2 in such a sophisticated way, since it is closed source. |
Apple explains it better than I ever could. I was not discussing the wined3d screenshot but rather my custom metal backend for the game.
The game is compiled using an ancient version of MSVC which defaulted to use x87 instructions for floating point operations. For example when the game engine performs culling: It calculates which units and what map chunks are to be rendered. This is a heavy burden on Apple Silicon.
Rosetta2 is split into many components. It has a Here is a list of those exported symbols which are part of the x87 emulator:
void x87_init(rosetta::runtime::library::X87State *);
void x87_state_from_x86_float_state(rosetta::runtime::library::X87State *, rosetta::runtime::X86FloatState64 const*);
void x87_state_to_x86_float_state(rosetta::runtime::library::X87State const*, rosetta::runtime::X86FloatState64 *);
void x87_pop_register_stack(rosetta::runtime::library::X87State *);
void x87_f2xm1(rosetta::runtime::library::X87State *);
void x87_fabs(rosetta::runtime::library::X87State *);
void x87_fadd_ST(rosetta::runtime::library::X87State *, unsigned int, unsigned int, bool);
void x87_fadd_f32(rosetta::runtime::library::X87State *, unsigned int);
void x87_fadd_f64(rosetta::runtime::library::X87State *, unsigned long long);
void x87_fbld(rosetta::runtime::library::X87State *, unsigned long long, unsigned long long);
void x87_fbstp(rosetta::runtime::library::X87State const*);
void x87_fchs(rosetta::runtime::library::X87State *);
void x87_fcmov(rosetta::runtime::library::X87State *, unsigned int, unsigned int);
void x87_fcom_ST(rosetta::runtime::library::X87State *, unsigned int, unsigned int);
void x87_fcom_f32(rosetta::runtime::library::X87State *, unsigned int, bool);
void x87_fcom_f64(rosetta::runtime::library::X87State *, unsigned long long, bool);
void x87_fcomi(rosetta::runtime::library::X87State *, unsigned int, bool);
void x87_fcos(rosetta::runtime::library::X87State *);
void x87_fdecstp(rosetta::runtime::library::X87State *);
void x87_fdiv_ST(rosetta::runtime::library::X87State *, unsigned int, unsigned int, bool);
void x87_fdiv_f32(rosetta::runtime::library::X87State *, unsigned int);
void x87_fdiv_f64(rosetta::runtime::library::X87State *, unsigned long long);
void x87_fdivr_ST(rosetta::runtime::library::X87State *, unsigned int, unsigned int, bool);
void x87_fdivr_f32(rosetta::runtime::library::X87State *, unsigned int);
void x87_fdivr_f64(rosetta::runtime::library::X87State *, unsigned long long);
void x87_ffree(rosetta::runtime::library::X87State *, unsigned int);
void x87_fiadd(rosetta::runtime::library::X87State *, int);
void x87_ficom(rosetta::runtime::library::X87State *, int, bool);
void x87_fidiv(rosetta::runtime::library::X87State *, int);
void x87_fidivr(rosetta::runtime::library::X87State *, int);
void x87_fild(rosetta::runtime::library::X87State *, long long);
void x87_fimul(rosetta::runtime::library::X87State *, int);
void x87_fincstp(rosetta::runtime::library::X87State *);
void x87_fist_i16(rosetta::runtime::library::X87State const*);
void x87_fist_i32(rosetta::runtime::library::X87State const*);
void x87_fist_i64(rosetta::runtime::library::X87State const*);
void x87_fistt_i16(rosetta::runtime::library::X87State const*);
void x87_fistt_i32(rosetta::runtime::library::X87State const*);
void x87_fistt_i64(rosetta::runtime::library::X87State const*);
void x87_fisub(rosetta::runtime::library::X87State *, int);
void x87_fisubr(rosetta::runtime::library::X87State *, int);
void x87_fld_STi(rosetta::runtime::library::X87State *, unsigned int);
void x87_fld_constant(rosetta::runtime::library::X87State *, rosetta::translator::x87::X87Constant);
void x87_fld_fp32(rosetta::runtime::library::X87State *, unsigned int);
void x87_fld_fp64(rosetta::runtime::library::X87State *, unsigned long long);
void x87_fld_fp80(rosetta::runtime::library::X87State *, rosetta::runtime::library::X87Float80);
void x87_fmul_ST(rosetta::runtime::library::X87State *, unsigned int, unsigned int, bool);
void x87_fmul_f32(rosetta::runtime::library::X87State *, unsigned int);
void x87_fmul_f64(rosetta::runtime::library::X87State *, unsigned long long);
void x87_fpatan(rosetta::runtime::library::X87State *);
void x87_fprem(rosetta::runtime::library::X87State *);
void x87_fprem1(rosetta::runtime::library::X87State *);
void x87_fptan(rosetta::runtime::library::X87State *);
void x87_frndint(rosetta::runtime::library::X87State *);
void x87_fscale(rosetta::runtime::library::X87State *);
void x87_fsin(rosetta::runtime::library::X87State *);
void x87_fsincos(rosetta::runtime::library::X87State *);
void x87_fsqrt(rosetta::runtime::library::X87State *);
void x87_fst_STi(rosetta::runtime::library::X87State *, unsigned int, bool);
void x87_fst_fp32(rosetta::runtime::library::X87State const*);
void x87_fst_fp64(rosetta::runtime::library::X87State const*);
void x87_fst_fp80(rosetta::runtime::library::X87State const*);
void x87_fsub_ST(rosetta::runtime::library::X87State *, unsigned int, unsigned int, bool);
void x87_fsub_f32(rosetta::runtime::library::X87State *, unsigned int);
void x87_fsub_f64(rosetta::runtime::library::X87State *, unsigned long long);
void x87_fsubr_ST(rosetta::runtime::library::X87State *, unsigned int, unsigned int, bool);
void x87_fsubr_f32(rosetta::runtime::library::X87State *, unsigned int);
void x87_fsubr_f64(rosetta::runtime::library::X87State *, unsigned long long);
void x87_fucom(rosetta::runtime::library::X87State *, unsigned int, unsigned int);
void x87_fucomi(rosetta::runtime::library::X87State *, unsigned int, bool);
void x87_fxam(rosetta::runtime::library::X87State *);
void x87_fxch(rosetta::runtime::library::X87State *, unsigned int);
void x87_fxtract(rosetta::runtime::library::X87State *);
void x87_fyl2x(rosetta::runtime::library::X87State *);
void x87_fyl2xp1(rosetta::runtime::library::X87State *);
At first glance, instructions like |
Ahh okay, we would still be trapping into the runtime for such instructions but replace them with a more performant version. So you think even something like |
Sure. Right now it takes many instructions to perform the same math you can do with one instruction. Less precise tho, but fine for WoW. It seems like an easy to grasp fix until Apple does a move. |
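To illustrate the trade-off being discussed, here is a small sketch of an x87-style register stack backed by plain 64-bit doubles, so each operation maps to a single native floating-point instruction. `FastX87` and its members are hypothetical and simplified; the real `X87State` layout and handler behaviour are not shown in this thread, and status flags, tags and exceptions are ignored entirely.

```cpp
#include <array>
#include <cmath>

// Hypothetical, simplified stand-in for an x87 register file.
struct FastX87 {
    std::array<double, 8> st{};  // 64-bit doubles instead of 80-bit extended values
    unsigned top = 0;            // physical index of ST(0)

    // Maps the stack-relative register ST(i) to its physical slot.
    double& reg(unsigned i) { return st[(top + i) & 7]; }

    // FLD: push a value, the previous ST(0) becomes ST(1).
    void fld(double v) {
        top = (top - 1) & 7;
        reg(0) = v;
    }

    // FADD/FMUL ST(dst), ST(src): one native operation each.
    void fadd_st(unsigned dst, unsigned src) { reg(dst) += reg(src); }
    void fmul_st(unsigned dst, unsigned src) { reg(dst) *= reg(src); }

    // FSQRT: replaces ST(0) with its square root.
    void fsqrt() { reg(0) = std::sqrt(reg(0)); }

    // FSTP: pop ST(0) and return it.
    double fstp() {
        double v = reg(0);
        top = (top + 1) & 7;
        return v;
    }
};

// The price of the shortcut: a double keeps 53 significand bits while x87
// extended precision keeps 64, so an addend of 2^-60 next to 1.0 is lost.
bool small_addend_is_lost() {
    FastX87 fpu;
    fpu.fld(std::ldexp(1.0, -60));
    fpu.fld(1.0);
    fpu.fadd_st(0, 1);  // ST(0) = 1.0 + 2^-60, rounded to double
    return fpu.fstp() == 1.0;
}
```

For culling and similar game math, losing those low bits is usually harmless, which is the "less precise, but fine for WoW" argument made above.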
Nice. How were you able to understand the |
The layout of the structure became clear after studying the X87 instruction set documentation. I managed to achieve better performance improvements. Here's a comparison: Below is a scene using Rosetta's unoptimized X87 instruction handlers: And here's the same scene using optimized (less precise) X87 instruction handlers: |
I'm quite interested in how you're modifying libRosettaRuntime, but I'm not finding any evidence of anyone else trying such a thing. I haven't really been able to find any documentation or anything on modifying rosetta at all for that matter. Like athei mentioned, are you reverse engineering the struct? Another question, does this modify Rosetta system-wide for this faster x87 instruction set? Or is it somehow containerized for this specific application? Thanks! |
I initially searched around on the internet and stumbled upon some interesting resources by people who have been exploring Rosetta2, which were very useful to me:
The appendix in ProjectChampollion was the clue I needed to get going.
My current tooling launches the application of your choice and applies the modification at the start up. It requires SIP to be disabled tho. |
Impressive! How did you find the exports on |
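Try: This section is consumed by
#include <cstdint>

// One exported symbol: its address and its name.
struct Export {
    void* address;
    const char* name;
};
// Export table; Export is declared first so this compiles as written.
struct Exports {
    uint64_t version; // 0x1560000000000
    const Export* x87_exports;
    uint64_t x87_export_count;
    const Export* runtime_exports;
    uint64_t runtime_export_count;
}; |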
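Assuming the layout posted above is accurate, looking up a handler by name is just a linear scan over the x87 table. The sketch below repeats the two structs so it compiles on its own; `find_x87_export` is a hypothetical helper, and how to obtain a pointer to the real table inside the runtime is not covered here.

```cpp
#include <cstdint>
#include <cstring>

// Structs as posted in this thread.
struct Export {
    void* address;
    const char* name;
};

struct Exports {
    std::uint64_t version;
    const Export* x87_exports;
    std::uint64_t x87_export_count;
    const Export* runtime_exports;
    std::uint64_t runtime_export_count;
};

// Hypothetical helper: returns the address registered for `name` in the
// x87 export table, or nullptr if no entry with that name exists.
void* find_x87_export(const Exports& table, const char* name) {
    for (std::uint64_t i = 0; i < table.x87_export_count; ++i) {
        const Export& entry = table.x87_exports[i];
        if (entry.name != nullptr && std::strcmp(entry.name, name) == 0) {
            return entry.address;
        }
    }
    return nullptr;
}
```

A replacement handler would presumably be installed by finding the original address this way and redirecting it, but that step depends on tooling that has not been published in this thread.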
Thanks. I expected |
Okay, for the record: each frame created on the screen lowers the FPS. I don't know why. It seems unable to process the layers of game frames in parallel; what it does there and how, I don't understand. But even under Parallels, every new frame lowers the FPS. |
What do you mean by that? |
You can forget about my comment. |
What is CreateFrame()? Is it a Lua function? How is it going with the x87 emulation replacement? I would love if you could share what you have already. Even if it is just the setup/tooling to hook the librosetta. |
I've been implementing unit tests to validate expected behavior. Time is also a constraint. While I can enter a game and mostly play WoW just fine, there is a bug that lets you clip through certain objects. Yes, I'd love to publish the code, but first I need to polish it. I moved fast throwing together a prototype, not caring about code quality. |
I understand that you don't want to publish anything that is in a rough state. But in case you stop working on it, please consider just releasing it no matter the state. Others can build on your work then :) It's just too often that life happens with hobby projects and then the progress is lost. |
Also I'm just unreasonably curious about these modifications and would love to take a look at the source code 😀 |
Same :) |
Thank you for your hard work, I just set up everything and have lowest fps too. |
Hi all, With this configuration I got a 30 fps on my Mac Mini M2 I hope this was helpful. Lifeisawful -> thanks for your great work! |
@Lifeisawful how is it coming along? Still interested in this project myself! |
@Lifeisawful dude, where are you? We need you here! |
Currently lacking the time to do some serious work on the rosetta stuff. I've thrown my current working state onto a repository for someone else to expand upon it. It's not ready for the end user. Click Here. |
Good afternoon,
I'm so happy with this fix! I was finally able to run it on my MacBook M2, but I seem to be getting 30fps at most on the lowest settings.
With the actual Classic WOTLK we had a few months back, I was easily getting up to 70FPS.
Is it because of the whole workaround that it asks so much from the system?
I'm very new to all this, so pardon my poor explanation!
Thanks in advance