-
-
Notifications
You must be signed in to change notification settings - Fork 361
perf: Gzip model serialization #2103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Option to zip/unzip when serializaing/deserializing
GZIP Serialization option.
Gzip serialization
Gzip serialization
* @throws IOException | ||
* if some IO error occurs | ||
*/ | ||
void save(Factory f, OutputStream out) throws IOException; | ||
void save(Factory f, OutputStream out, boolean zipIt) throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it wouldn't be interesting to propose the compression by default.
Then to keep the old signature of those methods, and to add a boolean zipIt
and a new setter in the API for deactivating it.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! I'll try to do it tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! In Spoon, we care a lot about backward compatibility, so we generally prefer to add new methods than changing the signature of existing ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to use an enum rather than a boolean. This allows to add further compression types (e.g. lzma) in future.
Could you at least add a new unit test where you call the serializer with the compression: I don't see in your code any usage of save() or load() with the true value. |
Sure, I'll see how to do that. |
+1, excellent idea. an enum is also better for readability.
|
I am thinking about the following enum, would it be the right place?
Not changing the signature is indeed the way to go. So I assume that I should add the option in StandardEnvironment (and its interface), for example:
Then, it's easy for SerializationModelStreamer#save, as it has the factory as argument, thus access to the environment. However, for SerializationModelStreamer#load, we don't have access to the environment. Is there a way to get it as a singleton? I am stucked here :) |
Would it be possible to detect the type of serialized stream by some header? |
The idea is good perse @pvojtechovsky, but I was not able to find a way for now. For Gzip, the few first bit of the inputstream can tell you, but I was not able to make it work for the moment (see e.g., https://www.programcreek.com/java-api-examples/?class=java.util.zip.GZIPInputStream&method=GZIP_MAGIC ). For now I just try/catch ZipException, as we have just two different types of serialization. Checking just the header is OK if we change the compression technique, but if we propose also to use other serialization libraries, try/catch should be ok, but this is not very formal. |
I guess it should be possible to create buffered input stream then to read e.g. first 8 bytes and detect what kind of stream it is. Then reset the content of the stream and to create appropriate stream wrapper (e.g. GZip) and then do it again. |
/** | ||
* Different types of serialization. | ||
*/ | ||
public enum SerializationType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a difference between serialization and compression. Serialization is a mapping to another format, for instance, JSON, XML, and so on whereas compression is a method to reduce the number of bytes requirevd to represent the same entity. The name of you Enum suggests different serialization strategies, though, it also contains different compression techniques. I don't think that Spoon will ever use a serialization technique other than the Java default serialization as, due to the complexity of the model, it is hard to develop and even harder to maintain. The compression technique, on the other hand, may vary.
tl;dr
I suggest to rename this Enum to CompressionType with: NONE and GZIP as values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking to a lot of other serialization techniques I arrived to the same conclusion. Keep basic java serialization; propose different compression types. Given my own (yet short) experience of spoon (working with >5GB uncompressed serialized factory), it's enough. Other types of serialization require a lot of work, and as you said, very hard to maintain.
I did as suggested, it should also be easy to add new compression types such as LZMA.
Can now use either no or GZIP compression for factory serialization. Other compression type can be easily added in the future.
Can now use either no or GZIP compression for factory serialization. Other compression type can be easily added in the future.
As suggested, we now have:
I hope that it's not too late for next spoon release :) For other types of compression, there are two interesting things
I'll do some tests on the very large model I am working with to see if it's worth it, and will create a new PR if it's positive. |
|
||
@Test | ||
public void testGZipCompression() throws IOException { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to specify explicitely that you're using GZIP compression type (e.g. to set it in the environment): the default value might evolve in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right! I'll do that tonight after work.
It's a really good job @mehdi-kaytoue! I only have a small comment, but I think we should be able to include it in the next release ;) |
Thanks! I like your comments, very instructing! |
All cases of compression type are tested now: The default value (set in the Environment, that is GZIP for now), explicit NONE and explicit GZIP. + typo correction in a comment.
API changes: 2 (Detected by Revapi) Old API: fr.inria.gforge.spoon:spoon-core:jar:6.3.0-20180627.195904-178 / New API: fr.inria.gforge.spoon:spoon-core:jar:6.3.0-SNAPSHOT
|
@surli, do you proceed with merging? |
Congrats and thank you for this first PR!
|
…els (#2165) Since #2094 and #2103 we can [de]serialize the model (CtFactory) with standard java serialization in a file, or a gzip file to save space. Choice was made to write things so that we can add new compression types easily. In this PR, I propose to use Apache Common Compressor API (+other files needed for some compression types such as LZMA). Main changes in src/main/java/spoon/support/SerializationModelStreamer.java and its junit test case src/test/java/spoon/test/serializable/ModelStreamerTest.java and of course pom.xml. I don't know what you think about adding a new lib, but this lib allows very easily (i) to test what compressor is used for an inputstream (except 7zip btw), (ii) to decompress, (iii) to compress. Looking forward discussing it with you guyz :) Cheers, Mehdi
See #2094. ( my first PR, please be nice with me ;) )
Mehdi