Skip to content

perf: Gzip model serialization #2103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 28, 2018
Merged

perf: Gzip model serialization #2103

merged 11 commits into from
Jun 28, 2018

Conversation

mehdi-kaytoue
Copy link

See #2094. ( my first PR, please be nice with me ;) )
Mehdi

Mehdi Kaytoue added 4 commits June 23, 2018 22:37
Option to zip/unzip when serializaing/deserializing
GZIP Serialization option.
Gzip serialization
Gzip serialization
@surli surli changed the title Gzip model serialization to save space while keeping runtime equivalent perf: Gzip model serialization Jun 25, 2018
* @throws IOException
* if some IO error occurs
*/
void save(Factory f, OutputStream out) throws IOException;
void save(Factory f, OutputStream out, boolean zipIt) throws IOException;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it wouldn't be interesting to propose the compression by default.
Then to keep the old signature of those methods, and to add a boolean zipIt and a new setter in the API for deactivating it.

WDYT?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I'll try to do it tonight.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! In Spoon, we care a lot about backward compatibility, so we generally prefer to add new methods than changing the signature of existing ones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to use an enum rather than a boolean. This allows to add further compression types (e.g. lzma) in future.

@surli
Copy link
Collaborator

surli commented Jun 25, 2018

Could you at least add a new unit test where you call the serializer with the compression: I don't see in your code any usage of save() or load() with the true value.

@mehdi-kaytoue
Copy link
Author

Sure, I'll see how to do that.

@monperrus
Copy link
Collaborator

monperrus commented Jun 25, 2018 via email

@mehdi-kaytoue
Copy link
Author

mehdi-kaytoue commented Jun 25, 2018

I am thinking about the following enum, would it be the right place?

package spoon.support;

/**
 *	Different types of serialization.
 */
public enum SerializationType {

	STANDARD("Java"),
	STANDARD_GZIP("Java, GZIP compression");
	
	private String description;
	
	private SerializationType(String description) {
		this.description = description;
	}
	
	public String getDescription() {
		return this.description;
	}
}

Not changing the signature is indeed the way to go. So I assume that I should add the option in StandardEnvironment (and its interface), for example:

public class StandardEnvironment implements Serializable, Environment {
/* ..... */
private SerializationType serializationType = SerializationType.STANDARD_GZIP;
/* ..... */
@Override
	public SerializationType getSerializationType() {
		return serializationType;
	}

	@Override
	public void setSerializationType(SerializationType serializationType) {
		this.serializationType = serializationType;
	}
}

Then, it's easy for SerializationModelStreamer#save, as it has the factory as argument, thus access to the environment.

However, for SerializationModelStreamer#load, we don't have access to the environment. Is there a way to get it as a singleton?

I am stucked here :)

@pvojtechovsky
Copy link
Collaborator

However, for SerializationModelStreamer#load, we don't have access to the environment. Is there a way to get it as a singleton?

Would it be possible to detect the type of serialized stream by some header? save would first add the header and then would continue with binary data depending on the SerializationType. If yes, then load method doesn't need extra parameters and will automatically use the deserialization type which fits to the current stream content.

Mehdi Kaytoue added 4 commits June 25, 2018 22:23
Support for different file format for serializing Factory.
Added a junit test.
Cleaning
Cleaning
Licence
@mehdi-kaytoue
Copy link
Author

The idea is good perse @pvojtechovsky, but I was not able to find a way for now. For Gzip, the few first bit of the inputstream can tell you, but I was not able to make it work for the moment (see e.g., https://www.programcreek.com/java-api-examples/?class=java.util.zip.GZIPInputStream&method=GZIP_MAGIC ).

For now I just try/catch ZipException, as we have just two different types of serialization.

Checking just the header is OK if we change the compression technique, but if we propose also to use other serialization libraries, try/catch should be ok, but this is not very formal.

@pvojtechovsky
Copy link
Collaborator

I guess it should be possible to create buffered input stream then to read e.g. first 8 bytes and detect what kind of stream it is. Then reset the content of the stream and to create appropriate stream wrapper (e.g. GZip) and then do it again.
Note: the header might be for example 'SpoonGZip` or anything shorter ... then it can continue with compressed part

/**
* Different types of serialization.
*/
public enum SerializationType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a difference between serialization and compression. Serialization is a mapping to another format, for instance, JSON, XML, and so on whereas compression is a method to reduce the number of bytes requirevd to represent the same entity. The name of you Enum suggests different serialization strategies, though, it also contains different compression techniques. I don't think that Spoon will ever use a serialization technique other than the Java default serialization as, due to the complexity of the model, it is hard to develop and even harder to maintain. The compression technique, on the other hand, may vary.

tl;dr
I suggest to rename this Enum to CompressionType with: NONE and GZIP as values.

Copy link
Author

@mehdi-kaytoue mehdi-kaytoue Jun 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking to a lot of other serialization techniques I arrived to the same conclusion. Keep basic java serialization; propose different compression types. Given my own (yet short) experience of spoon (working with >5GB uncompressed serialized factory), it's enough. Other types of serialization require a lot of work, and as you said, very hard to maintain.

I did as suggested, it should also be easy to add new compression types such as LZMA.

Mehdi Kaytoue added 2 commits June 26, 2018 21:37
Can now use either no or GZIP compression for factory serialization. Other compression type can be easily added in the future.
Can now use either no or GZIP compression for factory serialization. Other compression type can be easily added in the future.
@mehdi-kaytoue
Copy link
Author

mehdi-kaytoue commented Jun 26, 2018

As suggested, we now have:

  • A CompressionType (NONE, GZIP) enum
  • A getter/setter in [Standard]Environment for the compression type
  • SerializationModelStreamer#save compresses as chosen in the Environment
  • SerializationModelStreamer#load decompresses if needed (look at the first bits of the stream if its GZIP)
  • A JUNIT test for both NONE and GZIP compression

I hope that it's not too late for next spoon release :)

For other types of compression, there are two interesting things

I'll do some tests on the very large model I am working with to see if it's worth it, and will create a new PR if it's positive.


@Test
public void testGZipCompression() throws IOException {

Copy link
Collaborator

@surli surli Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to specify explicitely that you're using GZIP compression type (e.g. to set it in the environment): the default value might evolve in the future.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! I'll do that tonight after work.

@surli
Copy link
Collaborator

surli commented Jun 27, 2018

It's a really good job @mehdi-kaytoue! I only have a small comment, but I think we should be able to include it in the next release ;)

@mehdi-kaytoue
Copy link
Author

Thanks! I like your comments, very instructing!

All cases of compression type are tested now:  The default value (set in the Environment, that is GZIP for now), explicit NONE and explicit GZIP.

+ typo correction in a comment.
@spoon-bot
Copy link
Collaborator

API changes: 2 (Detected by Revapi)

Old API: fr.inria.gforge.spoon:spoon-core:jar:6.3.0-20180627.195904-178 / New API: fr.inria.gforge.spoon:spoon-core:jar:6.3.0-SNAPSHOT

Method was added to an interface.
Old none
New method Environment#getCompressionType()
Breaking binary: non_breaking,
Method was added to an interface.
Old none
New method Environment#setCompressionType(CompressionType)
Breaking binary: non_breaking,

@monperrus
Copy link
Collaborator

@surli, do you proceed with merging?

@monperrus
Copy link
Collaborator

monperrus commented Jun 28, 2018 via email

@surli surli mentioned this pull request Jul 4, 2018
surli pushed a commit that referenced this pull request Oct 25, 2018
…els (#2165)

Since #2094 and #2103 we can [de]serialize the model (CtFactory) with standard java serialization in a file, or a gzip file to save space. Choice was made to write things so that we can add new compression types easily. 

In this PR, I propose to use Apache Common Compressor API (+other files needed for some compression types such as LZMA).

Main changes in src/main/java/spoon/support/SerializationModelStreamer.java and its junit test case src/test/java/spoon/test/serializable/ModelStreamerTest.java and of course pom.xml.

I don't know what you think about adding a new lib, but this lib allows very easily (i) to test what compressor is used for an inputstream (except 7zip btw), (ii) to decompress, (iii) to compress.

Looking forward discussing it with you guyz :)
Cheers,
Mehdi
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants