completion: optimize the index engine for the scenario "complete on type name" #1846


Closed
testforstephen opened this issue Aug 19, 2021 · 2 comments · Fixed by #1902

Comments


testforstephen commented Aug 19, 2021

See a typical case of "complete on type name":

[screenshot: completion proposals shown after typing a type-name prefix]

Currently "complete on type name" is not very fast, and a major part of the time is spent in unit.codeComplete(). The codeComplete implementation calls the search engine to find the available types matching the typed prefix. Digging into the underlying index engine, I found that its bottleneck is IO. Here is the CPU profiling result for PatternSearchJob; you can see that most of the CPU time is spent reading index files from disk:

  • 86.9% of CPU time is spent reading document names from the index file.
  • 9.9% of CPU time is spent reading the typeDecls category from the index file.

[screenshot: CPU profiling call tree for PatternSearchJob]

This conclusion matches the metrics I observed by adding logpoints to the search engine source code. On my MacBook Pro (13-inch, 2016, Four Thunderbolt 3 Ports, 2.9 GHz Dual-Core Intel Core i5), I tested against the index file of the JDK library jrt-fs.jar (a 46 MB index file). PatternSearchJob takes more than 130 ms (IO time is not stable; sometimes it can be 200~300 ms) to get query results from this index file. Of that total, about 40 ms is spent reading the typeDecls category table and about 90 ms reading document names. Since each dependency has a separate index file, and a typical Spring Boot project may have more than 100 index files, the cumulative query time across all index files can be slow even with a parallel search strategy.
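To make the cumulative cost concrete, here is a back-of-envelope model. The per-file latency, file count, and thread-pool size are illustrative assumptions based on the numbers above, not measurements of JDT itself:

```java
// Rough model of total query latency: N index files queried on a fixed-size
// worker pool, each query costing roughly the same per-file latency.
public class QueryCostEstimate {
	static long estimateMillis(int indexFiles, long perFileMillis, int parallelism) {
		// ceil(indexFiles / parallelism) "waves" of parallel per-file queries
		long waves = (indexFiles + parallelism - 1) / parallelism;
		return waves * perFileMillis;
	}

	public static void main(String[] args) {
		// ~100 index files at ~130 ms each on 8 threads
		System.out.println(estimateMillis(100, 130, 8)); // prints 1690
	}
}
```

Even with 8-way parallelism, the assumed numbers put a single completion request well over a second, which is why reducing per-file IO matters.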

Proposed optimizations:

  • Reduce the frequency of FileInputStream.open
    The profiling shows that the FileInputStream.open operation under EntryResult.getDocumentNames() takes 72.6% of CPU time, because EntryResult.getDocumentNames() is called at high frequency. For example, searching types with the prefix S can produce an EntryResult list of 2k+ entries. In the current implementation of DiskIndex.readDocumentName(), the input stream is opened and closed to read a single document name, so every call to EntryResult.getDocumentNames() likely triggers a FileInputStream.open. The optimization question is: can we read each index file only once to get the document names of all the EntryResults in that index?

SearchPattern.findIndexMatches():

```java
// From org.eclipse.jdt.core.search.SearchPattern
public void findIndexMatches(Index index, IndexQueryRequestor requestor, SearchParticipant participant, IJavaSearchScope scope, IProgressMonitor monitor) throws IOException {
	if (monitor != null && monitor.isCanceled()) throw new OperationCanceledException();
	try {
		index.startQuery();
		SearchPattern pattern = currentPattern();
		EntryResult[] entries = pattern.queryIn(index);
		if (entries == null) return;

		String containerPath = index.containerPath;
		char separator = index.separator;
		long start = System.currentTimeMillis();
		for (int i = 0, l = entries.length; i < l; i++) {
			if (monitor != null && monitor.isCanceled()) throw new OperationCanceledException();

			EntryResult entry = entries[i];
			SearchPattern decodedResult = pattern.getBlankPattern();
			decodedResult.decodeIndexKey(entry.getWord());
			if (pattern.matchesDecodedKey(decodedResult)) {
				// TODO (kent) some clients may not need the document names
				String[] names = entry.getDocumentNames(index);
				for (int j = 0, n = names.length; j < n; j++)
					acceptMatch(names[j], containerPath, separator, decodedResult, requestor, participant, scope, monitor);
			}
		}
		Util.verbose(index.getIndexLocation().fileName() + " - cook result: " + (System.currentTimeMillis() - start) + "ms");
	} finally {
		index.stopQuery();
	}
}
```

EntryResult.getDocumentNames():

```java
// From org.eclipse.jdt.internal.core.index.EntryResult
public String[] getDocumentNames(Index index) throws java.io.IOException {
	if (this.documentTables != null) {
		int length = this.documentTables.length;
		if (length == 1 && this.documentNames == null) { // have a single table
			Object offset = this.documentTables[0];
			int[] numbers = index.diskIndex.readDocumentNumbers(offset);
			String[] names = new String[numbers.length];
			for (int i = 0, l = numbers.length; i < l; i++)
				names[i] = index.diskIndex.readDocumentName(numbers[i]);
			return names;
		}

		for (int i = 0; i < length; i++) {
			Object offset = this.documentTables[i];
			int[] numbers = index.diskIndex.readDocumentNumbers(offset);
			for (int j = 0, k = numbers.length; j < k; j++)
				addDocumentName(index.diskIndex.readDocumentName(numbers[j]));
		}
	}

	if (this.documentNames == null)
		return CharOperation.NO_STRINGS;

	String[] names = new String[this.documentNames.elementSize];
	int count = 0;
	Object[] values = this.documentNames.values;
	for (int i = 0, l = values.length; i < l; i++)
		if (values[i] != null)
			names[count++] = (String) values[i];
	return names;
}
```

DiskIndex.readDocumentName():

```java
// From org.eclipse.jdt.internal.core.index.DiskIndex
synchronized String readDocumentName(int docNumber) throws IOException {
	if (this.cachedChunks == null)
		this.cachedChunks = new String[this.numberOfChunks][];

	int chunkNumber = docNumber / CHUNK_SIZE;
	String[] chunk = this.cachedChunks[chunkNumber];
	if (chunk == null) {
		boolean isLastChunk = chunkNumber == this.numberOfChunks - 1;
		int start = this.chunkOffsets[chunkNumber];
		int numberOfBytes = (isLastChunk ? this.startOfCategoryTables : this.chunkOffsets[chunkNumber + 1]) - start;
		if (numberOfBytes < 0)
			throw new IllegalArgumentException();
		this.streamBuffer = new byte[numberOfBytes];
		this.bufferIndex = 0;
		InputStream file = this.indexLocation.getInputStream();
		try {
			file.skip(start);
			if (file.read(this.streamBuffer, 0, numberOfBytes) != numberOfBytes)
				throw new IOException();
		} catch (IOException ioe) {
			this.streamBuffer = null;
			throw ioe;
		} finally {
			file.close();
			this.indexLocation.close();
		}
		int numberOfNames = isLastChunk ? this.sizeOfLastChunk : CHUNK_SIZE;
		chunk = new String[numberOfNames];
		try {
			readChunk(chunk, null, 0, numberOfNames);
		} catch (IOException ioe) {
			this.streamBuffer = null;
			throw ioe;
		}
		this.cachedChunks[chunkNumber] = chunk;
	}
	this.streamBuffer = null;
	return chunk[docNumber - (chunkNumber * CHUNK_SIZE)];
}
```
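The "read each index file only once" idea could be sketched as follows. The on-disk layout here (a count followed by UTF entries) is a simplified assumption, not the real DiskIndex chunk format; the point is the access pattern: one stream open and one sequential pass per query, after which every document-name lookup is served from memory instead of re-opening the file per chunk.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch (not a JDT class): batch-load all document names in a
// single sequential read, then answer readDocumentName() without any IO.
class BatchedDocumentNames {
	private final String[] names;

	BatchedDocumentNames(InputStream file) throws IOException {
		try (DataInputStream in = new DataInputStream(file)) {
			int count = in.readInt();          // assumed header: entry count
			this.names = new String[count];
			for (int i = 0; i < count; i++)
				this.names[i] = in.readUTF();  // one sequential pass, one open/close
		}
	}

	String readDocumentName(int docNumber) {
		return this.names[docNumber];          // no IO on the hot path
	}
}
```

Whether the extra memory is acceptable would depend on index size; for a 46 MB JDK index the name area is only a fraction of the file, while the per-query FileInputStream.open churn disappears entirely.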
  • Explore adding a cache to DiskIndex to avoid IO.
    The core index query class of JDT is DiskIndex, whose design appears to prioritize saving memory: it caches very little and has to re-read the index file for each query. That IO is expensive for large index files, such as the JDK's.
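A minimal sketch of the caching idea, assuming a bounded LRU of decoded document-name chunks that survives across queries so repeated completions against the same large index skip the disk. ChunkCache is illustrative; it is not a class in JDT:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded chunk cache: keyed by chunk number, evicts the
// least-recently-used chunk once maxChunks is exceeded, so memory use
// stays capped even for very large indexes.
class ChunkCache {
	private final int maxChunks;
	private final Map<Integer, String[]> chunks;

	ChunkCache(int maxChunks) {
		this.maxChunks = maxChunks;
		// accessOrder=true makes LinkedHashMap track least-recently-used order
		this.chunks = new LinkedHashMap<Integer, String[]>(16, 0.75f, true) {
			@Override
			protected boolean removeEldestEntry(Map.Entry<Integer, String[]> eldest) {
				return size() > ChunkCache.this.maxChunks;
			}
		};
	}

	synchronized String[] get(int chunkNumber) {
		return this.chunks.get(chunkNumber);
	}

	synchronized void put(int chunkNumber, String[] chunk) {
		this.chunks.put(chunkNumber, chunk);
	}
}
```

The existing cachedChunks field in DiskIndex already caches within a single query; the difference in this sketch is keeping a bounded cache alive between queries, trading a capped amount of heap for fewer file reads.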
@testforstephen
Contributor Author

The upstream issue: https://bugs.eclipse.org/bugs/show_bug.cgi?id=575562

// cc: @rgrunber

@fbricon
Contributor

fbricon commented Oct 11, 2021

Fixed with #1902.
