Skip to content

MSSQL support #588

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
zacqed opened this issue Apr 15, 2023 · 20 comments
Open

MSSQL support #588

zacqed opened this issue Apr 15, 2023 · 20 comments

Comments

@zacqed
Copy link

zacqed commented Apr 15, 2023

I was reading the documentation and open issues to check on MSSQL implementation. Let me know if there is a plan in place to support it.

Thanks

@lidavidm
Copy link
Member

As far as I am aware, nobody is actively working on it, but SQL Server support would be welcome. The catch I think would be that FreeTDS is GPL licensed and so development of an ADBC driver using it wouldn't be able to happen here.

@lidavidm
Copy link
Member

lidavidm commented Nov 2, 2023

Now that we can develop drivers in Go and use them in Python/R/C++/etc, we could use https://github.com/microsoft/go-mssqldb (the official Microsoft MSSQL client) which is BSD licensed.

@zacqed
Copy link
Author

zacqed commented Nov 2, 2023 via email

@davlee1972
Copy link

davlee1972 commented Jan 19, 2024

Not to hijack this, but could we also work on Sybase support at the same time?
https://github.com/SAP/go-ase

Both products support FreeTDS.

The initial version of MS-SQL was a Unix port of Sybase for Windows..

SQL Server 6.0 is released in 1995, marking the end of collaboration with Sybase; Sybase would continue developing their own variant of SQL Server, Sybase Adaptive Server Enterprise, independently of Microsoft.

SQL Server 2005, released in 2005, finishes the complete revision of the old Sybase code into Microsoft code.

SQL Server 2017, released in 2017, adds Linux support for these Linux platforms: Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Ubuntu & Docker Engine.[4]

@sdebruyn
Copy link

sdebruyn commented Apr 11, 2025

Since the community of Microsoft database users would probably be more comfortable in C#, I'd wonder if we could also write the driver in C#.

Currently, new drivers can be written in C#, C/C++, Go, and Java. A driver written in C/C++ or Go can be used from either of those languages, as well as C#, Python, R, and Ruby. (C# can experimentally export drivers to the same set of languges as well.) The Rust API definitions for ADBC are still under development, but we plan for them to be on par with C#, C/C++, and Go in this respect.

It is so far preferable to write new drivers in Go. C/C++ have had issues with dependencies and in particular some not uncommon dependencies in that ecosystem tend to cause conflicts when loaded into Python processes and elsewhere. (For example, the ADBC Flight SQL driver was originally written in C++ but would have conflicted with the grpcio and pyarrow Python packages.) It also tends to be easier for us to package and distribute additional Go libraries than it is for C/C++.

As with Go, a native C# client already exists: https://www.nuget.org/packages/microsoft.data.sqlclient / https://github.com/dotnet/SqlClient

It is not fully clear to me what the consequences would be of the language choice.

@lidavidm
Copy link
Member

@CurtHagenlocher correct me if I'm wrong but in theory we can package C#/.NET drivers as native DLLs/SOs/DYLIBs too so that could work (with possibly the same caveats as Go, where on some platforms the runtime may conflict if there are multiple drivers using .NET). (I'm not looking forward to setting up that packaging build, but...)

@zeroshade
Copy link
Member

My preference would probably be to use the Go library as a base since we already have the packaging set up and so on.

But that's also personal preference haha. I think the hardest part will probably be having some sort of integration testing because I don't know of any option other than actually having an MSSQL DB instance to test against.

@lidavidm
Copy link
Member

MSSQL is easy to test! There's a docker container and a free tier.

@zeroshade
Copy link
Member

Oooo! Didn't know that! Then that seems like a good candidate...

@CurtHagenlocher
Copy link
Contributor

Apologies in advance; this reply is going to be all over the place...

It's definitely possible in principle to take C# source and use "AOT" compilation to produce a native dynamic library. The main blocker for doing this with ADBC is a gap in the C# Arrow libraries. This gap is addressed by apache/arrow#40992, which I haven't yet had time to analyze. Once that's done, the C# implementations of Spark and Databricks ADBC drivers can be AOT-compiled and consumed using the C APIs.

But this underscores the point that not all C# code is compatible with AOT compilation. The Google code we're using in the C# BigQuery implementation, for instance, uses Reflection in a way that is not. Now I'd vaguely expect the latest C# SqlClient implementation to be AOT-compatible due to its relative importance in the ecosystem -- but pretty much the only way to determine this conclusive is to try it.

From a perhaps overly-selfish perspective, the only personal benefit I would see from an MSSQL ADBC driver is if it offered better performance than using SqlClient via its ADO.NET API. But because there's no way to build an ADBC driver on top of SqlClient without using its ADO.NET API, there's not actually a way to achieve better performance. And in fact, if the ultimate consumer is more .NET code then it will likely be somewhat worse in that the strings would be converted from UTF-16 to UTF-8 for the Arrow API and then back to UTF-16 for the .NET consumer. (It's possible that this would be an inefficiency no matter how it's implemented if the TDS protocol specifies that "wide" strings are UTF-16.)

Now just because I don't get any value out of this doesn't mean it's not worthwhile to have it. After all, one of the goals of this project is to establish ADBC as a more widely-used standard, and that requires drivers. But if the driver is strictly worse than what you get with other technologies, then at best it's a stopgap to help bootstrap an ecosystem rather than something valuable in its own right.

To that end, I think it would be more interesting to build a generic ADO.NET-to-ADBC wrapper -- just like we have the ADBC-to-ADO.NET wrapper already in the codebase. Such a thing could be used not only with SqlClient but with Oracle's ADO.NET driver, Teradata's ADO.NET driver, the MySql ADO.NET driver, npgsql, etc. It would be the C# equivalent of an ADBC wrapper on top of ODBC, but without all that sketchy C code making people nervous ;). And if the underlying driver happened to support AOT compilation, then so would the combination of the driver and the wrapper.

This isn't to say that I'm not interested in an MSSQL ADBC driver; far from it! Like the other row-oriented database protocols, ADO.NET is pretty inefficient in the inner loop. To this, it often adds overhead in terms of boxing and other taxes. But of course, you can't fix that by building on the outside, only by changing the internals. I'd previously considered forking the SqlClient code and changing it so that we can fill Arrow structures directly instead of going through an extra layer. I'd also considered investigating the use of the "tiberius" crate to build a Rust-based ADBC driver. (I both prefer the Rust language to Go and appreciate its lack of a runtime for use in interop scenarios.) But I've also got several lifetimes of side projects I'd like to do and something of a problem staying focused on any one of them, so ... .

It's also occurred to me that my "real" goal isn't so much "ADBC everywhere" as it is "Arrow everywhere", and ADBC is more of a means to an end. From that perspective, one could argue that it's sufficient (and probably lower-cost) to modify existing database APIs to support Arrow instead of using a new API. So perhaps the C# Arrow project could define the following:

interface IDbArrowCommand
{
    Task<IArrowArrayStream> ExecuteReaderAsync(CancellationToken cancellationToken);
    Task<IArrowArrayStreams> ExecuteMultipleReaderAsync(CancellationToken cancellationToken);
}

interface IArrowArrayStreams
{
    Task<IArrowArrayStream> GetNextResult(CancellationToken cancellationToken);
}

And then SqlClient or any other ADO.NET provider could be modified such that the DbCommand implementation also implements IDbArrowCommand, and a consumer could check for that support by doing a cast. And at that point, an ADBC wrapper arguably becomes more trivial to implement.

@lidavidm
Copy link
Member

To that end, I think it would be more interesting to build a generic ADO.NET-to-ADBC wrapper -- just like we have the ADBC-to-ADO.NET wrapper already in the codebase.

That's reasonable to me.

I'd also considered investigating the use of the "tiberius" crate to build a Rust-based ADBC driver. (I both prefer the Rust language to Go and appreciate its lack of a runtime for use in interop scenarios.)

I also slightly prefer this but we do have the infra already set up for Go-based drivers. On the other hand this and/or an ODBC wrapper might be a good excuse to set up the same for Rust. (I suppose it depends on whether tiberius or go-mssqldb is more maintained/complete, though I just noticed that go-mssqldb has gone 2 years without development...)

It's also occurred to me that my "real" goal isn't so much "ADBC everywhere" as it is "Arrow everywhere", and ADBC is more of a means to an end. From that perspective, one could argue that it's sufficient (and probably lower-cost) to modify existing database APIs to support Arrow instead of using a new API.

That would also be desirable, and one of the things I'd like to do someday is refactor the Dremio JDBC driver to support (1) working with ADBC more generally, not just Flight SQL and (2) having a way to get the Arrow data even while using the JDBC APIs

@sdebruyn
Copy link

sdebruyn commented May 5, 2025

But this underscores the point that not all C# code is compatible with AOT compilation. The Google code we're using in the C# BigQuery implementation, for instance, uses Reflection in a way that is not. Now I'd vaguely expect the latest C# SqlClient implementation to be AOT-compatible due to its relative importance in the ecosystem -- but pretty much the only way to determine this conclusive is to try it.

So maybe the go driver is a better place to start from, since it will always be able to produce AOT compiled binaries?

From a perhaps overly-selfish perspective, the only personal benefit I would see from an MSSQL ADBC driver is if it offered better performance than using SqlClient via its ADO.NET API.

There are tools which exclusively support ADBC drivers. So even to have it available for compatibility reasons could be good enough.

And if the underlying driver happened to support AOT compilation, then so would the combination of the driver and the wrapper.

That's a lot of ifs :)

Is the reason why you'd not consider the go driver pure personal preference?

@CurtHagenlocher
Copy link
Contributor

Is the reason why you'd not consider the go driver pure personal preference?

The cost of having the Go runtime in the same process as the CLR doesn't outweigh (for me) the benefit of a single API. (And it wouldn't even be a single API because there would still be other connectors based on ADO.NET.) There's also the problem described in #1841 to consider.

@paleolimbot
Copy link
Member

There's also the problem described in #1841 to consider.

I do think that continuing to build Go-based drivers on undefined behaviour should be considered.

I both prefer the Rust language to Go and appreciate its lack of a runtime for use in interop scenarios.

Just a note that I'm happy to review Rust code and that I'm hoping to upstream some of our recent work in the next month or two. There are some sorely needed helpers that need writing (e.g., for options and catalogs) but the packaging is dirt easy and arrow-rs has great helpers to make the tests easy.

@lidavidm
Copy link
Member

lidavidm commented May 5, 2025

I did give Rust a shot recently FWIW, but given the comments in #2694 I couldn't figure out a way to make the lifetimes more relaxed to match what Felipe was requesting there (tiberius ties the result set's lifetime to the connection's and I couldn't figure out a way to defer that check to runtime; tiberius essentially forces you into doing the same)

@lidavidm
Copy link
Member

lidavidm commented May 5, 2025

Rust also has its own runtime issues to contend with (right now the Datafusion driver for instance unilaterally embeds its own copy of Tokio and it's not clear to me if Rust wouldn't end up with the same symbol clash/runtime clash issues as Go and C++ in the end)

@paleolimbot
Copy link
Member

I couldn't figure out a way to make the lifetimes more relaxed to match what Felipe was requesting there

This probably requires some unsafe {} or unsafe impl or maybe a separate method that some drivers can choose not to implement until we figure it out. Feel free to put up a PR and I can try to fight with the compiler to make something work 🙂

it's not clear to me if Rust wouldn't end up with the same symbol clash/runtime clash issues as Go and C++ in the end

There's quite a bit of Rust Python packaging as prior art with respect to symbol clashes (I'm not aware of this being a problem)...as far as I know multiple tokio runtimes are not undefined behaviour (more like...supported, but creating more than one incurs some overhead).

(None of this should dissuade anybody from writing any Go! Just an excited Rust user over here 🙂 )

@CurtHagenlocher
Copy link
Contributor

Mixing environments and interacting over a C API is always going to have elements of inefficiency and/or danger: multiple allocators, multiple thread pools, conflicting imported symbols, etc., each unaware that the others exist. The one I worry about the most -- perhaps without sufficient grounds -- is having multiple garbage-collected heaps in the same process. This is because the one I have the most experience with (in the .NET CLR), the signals used to drive garbage collection haven't typically "played well with others". I don't know how Go's GC works. If it's more like Python where the primary signal for deallocation is a refcount going to zero and the formal GC is largely "just" to clean up cycles, then my concern may be less well founded.

@zeroshade
Copy link
Member

Go's garbage collector is a concurrent mark-and-sweep. Rather than maintaining refcounts, the GC goroutine will periodically look at the current heap size (with various settings configurable via env vars) and then trace the live heap objects to mark things that are reachable, and then do a sweep to clean up anything that is unreachable.

You can see it all explained here. Rather than being "signaled" there's just a goroutine that concurrently runs (and is scheduled by the runtime) doing GC work when there is work to do (or isn't scheduled/is off when there's no GC-related work to do). The cost of this is based on Go-owned heap memory. While the Go drivers, when building the shared-library objects, utilize an allocator that utilizes C-owned memory to pass data through the C interface (i.e. the go runtime doesn't know anything about the Arrow data or anything we construct with the Allocator, and therefore the Go GC doesn't scan it).

This is to ensure it's safe to pass the memory through the FFI boundaries without fighting the Go garbage collector, or building up a large memory base that it has to manage (as the lifetime of the memory will be managed by the consumer calling the release callback.

While there's definitely some questions surrounding multiple Go-based drivers in the same process that need to be addressed (the same issue would arise with any drivers that require protobuf/gRPC, or Rust async runtimes), I don't see any larger issue beyond this.

@paleolimbot
Copy link
Member

I don't see any larger issue beyond this.

I think my main concern is the undefined behaviour bit...it happens to work today but I don't believe there's any guarantee it will continue to work?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants