
Speed and Memory Usage Questions #2


Closed
WholeCheese opened this issue Jul 25, 2016 · 5 comments

Comments

@WholeCheese

Hey Flinesoft! Thank you for posting CSVImporter! It has been a great learning experience for me. But I have some questions and observations.

First, what I was looking for: I recently worked on a Python app for a client to parse a 10 million line CSV file for importing into a MySQL database. I was pleased at how easy it was to write a Python script in less than a day to do this.

But then I got curious about how quickly I could do it in Swift. The first thing I noticed was that there is no NSCSV-like framework! So a search of GitHub was in order, which led me to CSVImporter, which led to learning about Carthage and dependent libraries, etc. It was a day of drinking from the fire hose! :-) I picked CSVImporter over the other CSV projects on GitHub because I wanted to use Swift, and because of your README's rationale for yet another CSV importer: "a really large CSV file".

OK, Carthage installed, FileKit and HandySwift plugged into my Xcode project, everything up and running (two days), and parsing my 10 million line CSV file.

Now onto the question/observation!

My Python CSV parser (which uses the "import csv" package) parses my 10 million line file in 7 seconds. My CSVImporter Swift application takes 14 minutes to do the same thing!

So I fired up Instruments and saw that the readValuesInLine function was spending all its time in NSRegularExpression. I moved the instantiation of startPartRegex, middlePartRegex, and endPartRegex out of the function and into the class init. This cut the time to just over 6 minutes. Good, but not great. Instruments now shows that stringByReplacingOccurrencesOfString is the time sink.
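The hoisting described above can be sketched like this (modern Swift syntax; the class shape and the regex patterns are illustrative placeholders, not CSVImporter's actual internals):

```swift
import Foundation

// Sketch only: compile the regexes once at init time instead of on every
// call to readValuesInLine. Patterns below are stand-ins for the real ones.
final class LineParser {
    private let delimiter: String
    private let startPartRegex: NSRegularExpression
    private let middlePartRegex: NSRegularExpression
    private let endPartRegex: NSRegularExpression

    init(delimiter: String = ",") {
        self.delimiter = delimiter
        startPartRegex = try! NSRegularExpression(pattern: "\\A\"[^\"]*\\z")
        middlePartRegex = try! NSRegularExpression(pattern: "\\A[^\"]*\\z")
        endPartRegex = try! NSRegularExpression(pattern: "\\A[^\"]*\"\\z")
    }

    func readValues(inLine line: String) -> [String] {
        // ...use the precompiled regexes here instead of re-creating them...
        return line.components(separatedBy: delimiter)
    }
}
```

Creating an NSRegularExpression compiles the pattern each time, so paying that cost once per parser rather than once per line is where the 14 → 6 minute improvement comes from.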

Any suggestions on how to speed this up?

I assume that the Python csv module is just a wrapper around good old libcsv written in C, and as such does not handle Unicode correctly, whereas Swift String does handle Unicode. This may be the problem: Swift string operations like stringByReplacingOccurrencesOfString are probably not optimal for parsing a 10 million line file.

Another observation is that my CSVImporter app's memory usage steadily increases as it runs and takes over 10 gigabytes of RAM before it is done. Your README claims that the file is read line by line instead of loading the entire string into memory. Are you sure about this? Your importLines function is doing:

for line in csvStreamReader {
    let valuesInLine = readValuesInLine(line)
    closure(valuesInLine: valuesInLine)
}

My understanding of the Swift Sequence protocol is that it loads all of its elements into memory in order to provide the "for _ in object" loop. This would certainly explain the ever-increasing memory.
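For what it's worth, a Sequence's iterator can in principle yield elements one at a time, so whether memory stays flat depends on how the stream reader is implemented, not on for-in itself. A minimal lazy line reader in modern Swift might look like this (illustrative, not the library's actual reader):

```swift
import Foundation

// A Sequence does NOT have to materialize all elements up front: its
// iterator can produce one line at a time from a buffered file handle.
final class LineReader: Sequence, IteratorProtocol {
    private let handle: FileHandle
    private var buffer = Data()
    private let chunkSize = 64 * 1024

    init?(path: String) {
        guard let handle = FileHandle(forReadingAtPath: path) else { return nil }
        self.handle = handle
    }

    func next() -> String? {
        while true {
            // Emit a line as soon as we find a newline in the buffer.
            if let range = buffer.range(of: Data([0x0A])) {
                let lineData = buffer.subdata(in: buffer.startIndex..<range.lowerBound)
                buffer.removeSubrange(buffer.startIndex..<range.upperBound)
                return String(data: lineData, encoding: .utf8)
            }
            // Otherwise read the next chunk from disk.
            let chunk = handle.readData(ofLength: chunkSize)
            if chunk.isEmpty {
                guard !buffer.isEmpty else { return nil }  // EOF
                defer { buffer.removeAll() }
                return String(data: buffer, encoding: .utf8)  // trailing line
            }
            buffer.append(chunk)
        }
    }
}

// for line in LineReader(path: "huge.csv")! { ... }  // bounded memory
```

With a reader like this, only one chunk plus one line is resident at a time, so multi-gigabyte growth would point at something else (e.g. accumulation or autoreleased temporaries) rather than the for-in loop.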

Oh, as to memory, your startImportingRecords appends all records onto importedRecords. I added my own "startReader" function that does not accumulate the records to avoid the huge array at the end. Even with this removed the app is still taking many gigabytes of RAM which convinces me even more that csvStreamReader is reading in the entire file.

Do you have any advice on how to make Swift comparable to Python for CSV parsing?

Thanks!

-Allan

@Jeehut
Member

Jeehut commented Jul 25, 2016

Hi Allan, just as a quick response before I go to sleep: thank you for your many investigations.

Please note that the goal of this library is to provide a solution that works well with large files, and I included some larger files in the performance tests to check this. I have to admit that I released this library as soon as it fit my needs: improving the initial load time of a large CSV file was enough for me, as I just needed a solution where I could show progress even for large files without any noticeable delay (which I solved by reading line by line).

When trying to differentiate this from other projects I assumed reading the file line by line would also save memory - which seems not to be true as you have figured out.

The fact that NSRegularExpression isn't nearly as fast as walking through the String yourself and parsing it manually doesn't surprise me, though. I think that in order to get speeds comparable to the library you used, manual parsing will be needed.

A libcsv-like implementation which parses a line of String would definitely help improve speed. I would also appreciate any memory improvements. We would need to replace the completion blocks with batch completion blocks, though, to make sure the results can be used in some way.
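The batch-completion idea mentioned here could be sketched like this (all names, including importRecords and batchSize, are made up for illustration, not the library's API):

```swift
// Hand results to the caller in fixed-size chunks instead of
// accumulating one huge array for the whole file.
func importRecords(lines: AnySequence<String>,
                   batchSize: Int = 1_000,
                   parse: (String) -> [String],
                   onBatch: ([[String]]) -> Void) {
    var batch: [[String]] = []
    batch.reserveCapacity(batchSize)
    for line in lines {
        batch.append(parse(line))
        if batch.count == batchSize {
            onBatch(batch)                          // caller consumes the chunk...
            batch.removeAll(keepingCapacity: true)  // ...and the buffer is reused
        }
    }
    if !batch.isEmpty { onBatch(batch) }  // flush the remainder
}
```

This keeps peak memory proportional to the batch size rather than the file size, while still giving the caller usable results (e.g. for inserting each batch into a database).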

I would be glad to see your changes and suggestions in the form of pull requests if you don't mind. Version 1 clearly only scratched the surface! ;)

@WholeCheese
Author

Dschee,

The increasing memory might be a memory leak. Instruments is telling me that the correctedLine.hasSuffix and correctedLine.hasPrefix calls in readValuesInLine are leaking. I've googled around for Swift String leak problems and don't see any smoking guns related to these two functions. Since I am still a Swift newbie, you might want to investigate this.

As for me submitting changes to your code, I am beginning to think that parsing large CSV files should not be done with the standard String functions and NSRegularExpression. Parsing the bare Unicode, much like libcsv does with ASCII, is probably the way to go.

Let me sleep on that, but practically speaking I can live with using Python for my CSV projects for now!

-Allan

[Attached screenshot: "memleak"]

@phoney
Contributor

phoney commented Aug 2, 2016

Hoisting the regexes is an obvious improvement to the readValuesInLine() method. They can simply be private let constants in the class. They could even be file-scope private let constants. There are also a number of dynamic strings in that method that should be hoisted. Strings like "\(delimiter)\"", "\"\(delimiter)", delimiter + delimiter, and so on only need to be calculated one time, not for every line.
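The string-hoisting suggestion can be sketched as follows (illustrative names; the exact delimiter-derived strings in CSVImporter may differ):

```swift
// Compute delimiter-derived strings once per parser instance,
// not once per parsed line.
final class Parser {
    private let delimiter: String
    private let quoteAfterDelimiter: String   // e.g. ",\"" for delimiter ","
    private let quoteBeforeDelimiter: String  // e.g. "\","
    private let doubleDelimiter: String       // e.g. ",,"

    init(delimiter: String = ",") {
        self.delimiter = delimiter
        self.quoteAfterDelimiter = "\(delimiter)\""
        self.quoteBeforeDelimiter = "\"\(delimiter)"
        self.doubleDelimiter = delimiter + delimiter
    }
}
```

Over 10 million lines, even a cheap string concatenation repeated per line adds up, so moving these to stored constants is a small change with a measurable payoff.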

@phoney
Contributor

phoney commented Aug 3, 2016

readValuesInLine() probably needs an autorelease pool. It builds a bunch of new strings and they are probably autoreleased, which might explain the large amount of memory used. I haven't tried it, but @WholeCheese you might try wrapping the entire contents of readValuesInLine() like this:

autoreleasepool {
    // the entire body of readValuesInLine()
}

@Jeehut
Member

Jeehut commented Aug 18, 2016

I think the questions asked here were answered and performance improvements added via #9 – closing this.

@Jeehut Jeehut closed this as completed Aug 18, 2016