-
Notifications
You must be signed in to change notification settings - Fork 116
MSSQL support #588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
As far as I am aware, nobody is actively working on it, but SQL Server support would be welcome. The catch I think would be that FreeTDS is GPL licensed and so development of an ADBC driver using it wouldn't be able to happen here. |
Now that we can develop drivers in Go and use them in Python/R/C++/etc, we could use https://github.com/microsoft/go-mssqldb (the official Microsoft MSSQL client) which is BSD licensed. |
That ia good to hear.
…On Thu, Nov 2, 2023, 7:20 PM David Li ***@***.***> wrote:
Now that we can develop drivers in Go and use them in Python/R/C++/etc, we
could use https://github.com/microsoft/go-mssqldb (the official Microsoft
MSSQL client) which is BSD licensed.
—
Reply to this email directly, view it on GitHub
<#588 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAMZCR5DJCCHJI6L3J7ESUDYCOQITAVCNFSM6AAAAAAW7FDIJKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJQG43TKNZYGU>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Not to hijack this, but could we also work on Sybase support at the same time? Both products support FreeTDS. The initial version of MS-SQL was a Unix port of Sybase for Windows.. SQL Server 6.0 is released in 1995, marking the end of collaboration with Sybase; Sybase would continue developing their own variant of SQL Server, Sybase Adaptive Server Enterprise, independently of Microsoft. SQL Server 2005, released in 2005, finishes the complete revision of the old Sybase code into Microsoft code. SQL Server 2017, released in 2017, adds Linux support for these Linux platforms: Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Ubuntu & Docker Engine.[4] |
Since the community of Microsoft database users would probably be more comfortable in C#, I'd wonder if we could also write the driver in C#.
As with Go, a native C# client already exists: https://www.nuget.org/packages/microsoft.data.sqlclient / https://github.com/dotnet/SqlClient It is not fully clear to me what the consequences would be of the language choice. |
@CurtHagenlocher correct me if I'm wrong but in theory we can package C#/.NET drivers as native DLLs/SOs/DYLIBs too so that could work (with possibly the same caveats as Go, where on some platforms the runtime may conflict if there are multiple drivers using .NET). (I'm not looking forward to setting up that packaging build, but...) |
My preference would probably be to use the Go library as a base since we already have the packaging set up and so on. But that's also personal preference haha. I think the hardest part will probably be having some sort of integration testing because I don't know of any option other than actually having an MSSQL DB instance to test against. |
MSSQL is easy to test! There's a docker container and a free tier. |
Oooo! Didn't know that! Then that seems like a good candidate... |
Apologies in advance; this reply is going to be all over the place... It's definitely possible in principle to take C# source and use "AOT" compilation to produce a native dynamic library. The main blocker for doing this with ADBC is a gap in the C# Arrow libraries. This gap is addressed by apache/arrow#40992, which I haven't yet had time to analyze. Once that's done, the C# implementations of Spark and Databricks ADBC drivers can be AOT-compiled and consumed using the C APIs. But this underscores the point that not all C# code is compatible with AOT compilation. The Google code we're using in the C# BigQuery implementation, for instance, uses Reflection in a way that is not. Now I'd vaguely expect the latest C# SqlClient implementation to be AOT-compatible due to its relative importance in the ecosystem -- but pretty much the only way to determine this conclusive is to try it. From a perhaps overly-selfish perspective, the only personal benefit I would see from an MSSQL ADBC driver is if it offered better performance than using SqlClient via its ADO.NET API. But because there's no way to build an ADBC driver on top of SqlClient without using its ADO.NET API, there's not actually a way to achieve better performance. And in fact, if the ultimate consumer is more .NET code then it will likely be somewhat worse in that the strings would be converted from UTF-16 to UTF-8 for the Arrow API and then back to UTF-16 for the .NET consumer. (It's possible that this would be an inefficiency no matter how it's implemented if the TDS protocol specifies that "wide" strings are UTF-16.) Now just because I don't get any value out of this doesn't mean it's not worthwhile to have it. After all, one of the goals of this project is to establish ADBC as a more widely-used standard, and that requires drivers. But if the driver is strictly worse than what you get with other technologies, then at best it's a stopgap to help bootstrap an ecosystem rather than something valuable in its own right. To that end, I think it would be more interesting to build a generic ADO.NET-to-ADBC wrapper -- just like we have the ADBC-to-ADO.NET wrapper already in the codebase. Such a thing could be used not only with SqlClient but with Oracle's ADO.NET driver, Teradata's ADO.NET driver, the MySql ADO.NET driver, npgsql, etc. It would be the C# equivalent of an ADBC wrapper on top of ODBC, but without all that sketchy C code making people nervous ;). And if the underlying driver happened to support AOT compilation, then so would the combination of the driver and the wrapper. This isn't to say that I'm not interested in an MSSQL ADBC driver; far from it! Like the other row-oriented database protocols, ADO.NET is pretty inefficient in the inner loop. To this, it often adds overhead in terms of boxing and other taxes. But of course, you can't fix that by building on the outside, only by changing the internals. I'd previously considered forking the SqlClient code and changing it so that we can fill Arrow structures directly instead of going through an extra layer. I'd also considered investigating the use of the "tiberius" crate to build a Rust-based ADBC driver. (I both prefer the Rust language to Go and appreciate its lack of a runtime for use in interop scenarios.) But I've also got several lifetimes of side projects I'd like to do and something of a problem staying focused on any one of them, so ... . It's also occurred to me that my "real" goal isn't so much "ADBC everywhere" as it is "Arrow everywhere", and ADBC is more of a means to an end. From that perspective, one could argue that it's sufficient (and probably lower-cost) to modify existing database APIs to support Arrow instead of using a new API. So perhaps the C# Arrow project could define the following:
And then SqlClient or any other ADO.NET provider could be modified such that the |
That's reasonable to me.
I also slightly prefer this but we do have the infra already set up for Go-based drivers. On the other hand this and/or an ODBC wrapper might be a good excuse to set up the same for Rust. (I suppose it depends on whether tiberius or go-mssqldb is more maintained/complete, though I just noticed that go-mssqldb has gone 2 years without development...)
That would also be desirable, and one of the things I'd like to do someday is refactor the Dremio JDBC driver to support (1) working with ADBC more generally, not just Flight SQL and (2) having a way to get the Arrow data even while using the JDBC APIs |
So maybe the go driver is a better place to start from, since it will always be able to produce AOT compiled binaries?
There are tools which exclusively support ADBC drivers. So even to have it available for compatibility reasons could be good enough.
That's a lot of ifs :) Is the reason why you'd not consider the go driver pure personal preference? |
The cost of having the Go runtime in the same process as the CLR doesn't outweigh (for me) the benefit of a single API. (And it wouldn't even be a single API because there would still be other connectors based on ADO.NET.) There's also the problem described in #1841 to consider. |
I do think that continuing to build Go-based drivers on undefined behaviour should be considered.
Just a note that I'm happy to review Rust code and that I'm hoping to upstream some of our recent work in the next month or two. There are some sorely needed helpers that need writing (e.g., for options and catalogs) but the packaging is dirt easy and arrow-rs has great helpers to make the tests easy. |
I did give Rust a shot recently FWIW, but given the comments in #2694 I couldn't figure out a way to make the lifetimes more relaxed to match what Felipe was requesting there (tiberius ties the result set's lifetime to the connection's and I couldn't figure out a way to defer that check to runtime; tiberius essentially forces you into doing the same) |
Rust also has its own runtime issues to contend with (right now the Datafusion driver for instance unilaterally embeds its own copy of Tokio and it's not clear to me if Rust wouldn't end up with the same symbol clash/runtime clash issues as Go and C++ in the end) |
This probably requires some
There's quite a bit of Rust Python packaging as prior art with respect to symbol clashes (I'm not aware of this being a problem)...as far as I know multiple tokio runtimes are not undefined behaviour (more like...supported, but creating more than one incurs some overhead). (None of this should dissuade anybody from writing any Go! Just an excited Rust user over here 🙂 ) |
Mixing environments and interacting over a C API is always going to have elements of inefficiency and/or danger: multiple allocators, multiple thread pools, conflicting imported symbols, etc., each unaware that the others exist. The one I worry about the most -- perhaps without sufficient grounds -- is having multiple garbage-collected heaps in the same process. This is because the one I have the most experience with (in the .NET CLR), the signals used to drive garbage collection haven't typically "played well with others". I don't know how Go's GC works. If it's more like Python where the primary signal for deallocation is a refcount going to zero and the formal GC is largely "just" to clean up cycles, then my concern may be less well founded. |
Go's garbage collector is a concurrent mark-and-sweep. Rather than maintaining refcounts, the GC goroutine will periodically look at the current heap size (with various settings configurable via env vars) and then trace the live heap objects to mark things that are reachable, and then do a sweep to clean up anything that is unreachable. You can see it all explained here. Rather than being "signaled" there's just a goroutine that concurrently runs (and is scheduled by the runtime) doing GC work when there is work to do (or isn't scheduled/is off when there's no GC-related work to do). The cost of this is based on Go-owned heap memory. While the Go drivers, when building the shared-library objects, utilize an allocator that utilizes C-owned memory to pass data through the C interface (i.e. the go runtime doesn't know anything about the Arrow data or anything we construct with the Allocator, and therefore the Go GC doesn't scan it). This is to ensure it's safe to pass the memory through the FFI boundaries without fighting the Go garbage collector, or building up a large memory base that it has to manage (as the lifetime of the memory will be managed by the consumer calling the While there's definitely some questions surrounding multiple Go-based drivers in the same process that need to be addressed (the same issue would arise with any drivers that require protobuf/gRPC, or Rust async runtimes), I don't see any larger issue beyond this. |
I think my main concern is the undefined behaviour bit...it happens to work today but I don't believe there's any guarantee it will continue to work? |
I was reading the documentation and open issues to check on MSSQL implementation. Let me know if there is a plan in place to support it.
Thanks
The text was updated successfully, but these errors were encountered: