Nate/better retrieval #1677
Merged
9 commits
6f36445 deduplicatearray tests (sestinj)
040bff4 break out separate retrieval pipelines (sestinj)
938b32e IConfigHandler (sestinj)
dc774c0 Merge branch 'dev' into nate/better-retrieval (sestinj)
d30457f tests for codebase indexer (sestinj)
2ac88f0 better .continueignore for continue (sestinj)
7f4663e indexing fixes (sestinj)
0f28800 ignore .gitignore and .continueignore when indexing (sestinj)
442a925 retrieval pipeline improvements (sestinj)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,7 @@ | ||
**/*.run.xml | ||
archive/**/* | ||
extensions/vscode/models/**/* | ||
docs/docs/languages | ||
\*_/_.run.xml | ||
docs/docs/languages | ||
.changes/ | ||
.idea/ | ||
.vscode/ | ||
.archive/ | ||
**/*.scm |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,2 @@ | ||
extensions/vscode/continue_rc_schema.json | ||
extensions/vscode/continue_rc_schema.json | ||
**/.continueignore |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
import { | ||
BrowserSerializedContinueConfig, | ||
ContinueConfig, | ||
IContextProvider, | ||
IdeSettings, | ||
ILLM, | ||
} from "../index.js"; | ||
|
||
export interface IConfigHandler { | ||
updateIdeSettings(ideSettings: IdeSettings): void; | ||
onConfigUpdate(listener: (newConfig: ContinueConfig) => void): void; | ||
reloadConfig(): Promise<void>; | ||
getSerializedConfig(): Promise<BrowserSerializedContinueConfig>; | ||
loadConfig(): Promise<ContinueConfig>; | ||
llmFromTitle(title?: string): Promise<ILLM>; | ||
registerCustomContextProvider(contextProvider: IContextProvider): void; | ||
} |
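
The listener half of `IConfigHandler` (`onConfigUpdate` plus `reloadConfig`) can be illustrated with a small in-memory stand-in. This is only a sketch under stated assumptions: `InMemoryConfigHandler`, its `loader` constructor argument, and the reduced `ContinueConfig` shape are hypothetical and are not part of this PR; the real types come from `../index.js`.

```typescript
// Hypothetical, reduced stand-in for the real ContinueConfig type.
interface ContinueConfig {
  models: string[];
}

type ConfigListener = (newConfig: ContinueConfig) => void;

// Sketch covering only loadConfig, reloadConfig and onConfigUpdate.
class InMemoryConfigHandler {
  private listeners: ConfigListener[] = [];
  private config: ContinueConfig | undefined;

  // `loader` is an assumption: any async function that produces a config.
  constructor(private readonly loader: () => Promise<ContinueConfig>) {}

  onConfigUpdate(listener: ConfigListener): void {
    this.listeners.push(listener);
  }

  // Loads once, then serves the cached config.
  async loadConfig(): Promise<ContinueConfig> {
    if (!this.config) {
      this.config = await this.loader();
    }
    return this.config;
  }

  // Always reloads, then notifies every registered listener.
  async reloadConfig(): Promise<void> {
    const fresh = await this.loader();
    this.config = fresh;
    for (const listener of this.listeners) {
      listener(fresh);
    }
  }
}
```

Depending on an interface like this, rather than on a concrete handler, is presumably what lets the indexer tests in this PR substitute a lightweight fake.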
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
import { | ||
BranchAndDir, | ||
Chunk, | ||
EmbeddingsProvider, | ||
IDE, | ||
Reranker, | ||
} from "../../.."; | ||
import { LanceDbIndex } from "../../../indexing/LanceDbIndex"; | ||
import { retrieveFts } from "../fullTextSearch"; | ||
|
||
export interface RetrievalPipelineOptions { | ||
ide: IDE; | ||
embeddingsProvider: EmbeddingsProvider; | ||
reranker: Reranker | undefined; | ||
|
||
input: string; | ||
nRetrieve: number; | ||
nFinal: number; | ||
tags: BranchAndDir[]; | ||
filterDirectory?: string; | ||
} | ||
|
||
export interface IRetrievalPipeline { | ||
run(options: RetrievalPipelineOptions): Promise<Chunk[]>; | ||
} | ||
|
||
export default class BaseRetrievalPipeline implements IRetrievalPipeline { | ||
private lanceDbIndex: LanceDbIndex; | ||
constructor(protected readonly options: RetrievalPipelineOptions) { | ||
this.lanceDbIndex = new LanceDbIndex(options.embeddingsProvider, (path) => | ||
options.ide.readFile(path), | ||
); | ||
} | ||
|
||
protected async retrieveFts(input: string, n: number): Promise<Chunk[]> { | ||
return retrieveFts( | ||
input, | ||
n, | ||
this.options.tags, | ||
this.options.filterDirectory, | ||
); | ||
} | ||
|
||
protected async retrieveEmbeddings( | ||
input: string, | ||
n: number, | ||
): Promise<Chunk[]> { | ||
return this.lanceDbIndex.retrieve( | ||
input, | ||
n, | ||
this.options.tags, | ||
this.options.filterDirectory, | ||
); | ||
} | ||
|
||
run(): Promise<Chunk[]> { | ||
throw new Error("Not implemented"); | ||
} | ||
} |
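
In the diff above, `BaseRetrievalPipeline.run()` throws "Not implemented" at runtime and each subclass overrides it. A possible alternative, sketched below, is an abstract base class so that a missing `run()` becomes a compile-time error, together with a selector that picks a pipeline depending on whether a reranker is configured. Everything here is hypothetical: the `*Sketch` class names, `selectPipeline`, and the reduced option types are illustrations, and the PR does not show how the real call site chooses a pipeline.

```typescript
// Reduced, hypothetical stand-ins for the types imported in the diff.
interface Chunk {
  filepath: string;
  content: string;
}

interface Reranker {
  rerank(query: string, chunks: Chunk[]): Promise<number[]>;
}

interface PipelineOptions {
  input: string;
  nFinal: number;
  reranker: Reranker | undefined;
}

// Abstract base: subclasses that forget run() fail to compile.
abstract class PipelineSketch {
  constructor(protected readonly options: PipelineOptions) {}
  abstract run(): Promise<Chunk[]>;
}

class NoRerankerSketch extends PipelineSketch {
  async run(): Promise<Chunk[]> {
    return []; // retrieval omitted in this sketch
  }
}

class RerankerSketch extends PipelineSketch {
  async run(): Promise<Chunk[]> {
    if (!this.options.reranker) {
      throw new Error("No reranker provided");
    }
    return []; // retrieval and reranking omitted in this sketch
  }
}

// Assumed selection rule: use the reranker pipeline only when one exists.
function selectPipeline(options: PipelineOptions): PipelineSketch {
  return options.reranker
    ? new RerankerSketch(options)
    : new NoRerankerSketch(options);
}
```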
26 changes: 26 additions & 0 deletions
core/context/retrieval/pipelines/NoRerankerRetrievalPipeline.ts
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
import { Chunk } from "../../.."; | ||
import { deduplicateChunks } from "../util"; | ||
import BaseRetrievalPipeline from "./BaseRetrievalPipeline"; | ||
|
||
export default class NoRerankerRetrievalPipeline extends BaseRetrievalPipeline { | ||
async run(): Promise<Chunk[]> { | ||
const { input } = this.options; | ||
|
||
// Get all retrieval results | ||
const retrievalResults: Chunk[] = []; | ||
|
||
// Full-text search | ||
const ftsResults = await this.retrieveFts(input, this.options.nFinal / 2); | ||
retrievalResults.push(...ftsResults); | ||
|
||
// Embeddings | ||
const embeddingResults = await this.retrieveEmbeddings( | ||
input, | ||
this.options.nFinal / 2, | ||
); | ||
retrievalResults.push(...embeddingResults); | ||
|
||
const finalResults: Chunk[] = deduplicateChunks(retrievalResults); | ||
return finalResults; | ||
} | ||
} |
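
One detail in the pipeline above: `this.options.nFinal / 2` is fractional when `nFinal` is odd, and both searches receive the same non-integer limit. A minimal sketch of an integer split, followed by a de-duplication pass, is shown below. `splitBudget` and `dedupeByLocation` are hypothetical helpers; the PR's own `deduplicateChunks` is not shown in this diff, so the key used here (file path plus line range) and the `startLine`/`endLine` fields are assumptions.

```typescript
// Hypothetical reduced Chunk shape; field names are assumptions.
interface Chunk {
  filepath: string;
  startLine: number;
  endLine: number;
  content: string;
}

// Split nFinal into two whole-number budgets that add up to nFinal.
function splitBudget(nFinal: number): { fts: number; embeddings: number } {
  const fts = Math.floor(nFinal / 2);
  return { fts, embeddings: nFinal - fts };
}

// Keep the first occurrence of each (file, line range) pair.
function dedupeByLocation(chunks: Chunk[]): Chunk[] {
  const seen = new Set<string>();
  const unique: Chunk[] = [];
  for (const chunk of chunks) {
    const key = `${chunk.filepath}:${chunk.startLine}-${chunk.endLine}`;
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(chunk);
    }
  }
  return unique;
}
```

Because both sources can return the same chunk, the de-duplicated result may contain fewer than `nFinal` entries.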
125 changes: 125 additions & 0 deletions
core/context/retrieval/pipelines/RerankerRetrievalPipeline.ts
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
import { Chunk } from "../../.."; | ||
import { RETRIEVAL_PARAMS } from "../../../util/parameters"; | ||
import { deduplicateChunks } from "../util"; | ||
import BaseRetrievalPipeline from "./BaseRetrievalPipeline"; | ||
|
||
export default class RerankerRetrievalPipeline extends BaseRetrievalPipeline { | ||
private async _retrieveInitial(): Promise<Chunk[]> { | ||
const { input, nRetrieve } = this.options; | ||
|
||
// Get all retrieval results | ||
const retrievalResults: Chunk[] = []; | ||
|
||
// Full-text search | ||
const ftsResults = await this.retrieveFts(input, nRetrieve / 2); | ||
retrievalResults.push(...ftsResults); | ||
|
||
// Embeddings | ||
const embeddingResults = await this.retrieveEmbeddings(input, nRetrieve); | ||
retrievalResults.push( | ||
...embeddingResults.slice(0, nRetrieve - ftsResults.length), | ||
); | ||
|
||
const results: Chunk[] = deduplicateChunks(retrievalResults); | ||
return results; | ||
} | ||
|
||
private async _rerank(input: string, chunks: Chunk[]): Promise<Chunk[]> { | ||
if (!this.options.reranker) { | ||
throw new Error("No reranker provided"); | ||
} | ||
|
||
let scores: number[] = await this.options.reranker.rerank(input, chunks); | ||
|
||
// Filter out low-scoring results | ||
let results = chunks; | ||
// let results = chunks.filter( | ||
// (_, i) => scores[i] >= RETRIEVAL_PARAMS.rerankThreshold, | ||
// ); | ||
// scores = scores.filter( | ||
// (score) => score >= RETRIEVAL_PARAMS.rerankThreshold, | ||
// ); | ||
|
||
results.sort( | ||
(a, b) => scores[results.indexOf(a)] - scores[results.indexOf(b)], | ||
); | ||
results = results.slice(-this.options.nFinal); | ||
return results; | ||
} | ||
|
||
private async _expandWithEmbeddings(chunks: Chunk[]): Promise<Chunk[]> { | ||
const topResults = chunks.slice( | ||
-RETRIEVAL_PARAMS.nResultsToExpandWithEmbeddings, | ||
); | ||
|
||
const expanded = await Promise.all( | ||
topResults.map(async (chunk, i) => { | ||
const results = await this.retrieveEmbeddings( | ||
chunk.content, | ||
RETRIEVAL_PARAMS.nEmbeddingsExpandTo, | ||
); | ||
return results; | ||
}), | ||
); | ||
return expanded.flat(); | ||
} | ||
|
||
private async _expandRankedResults(chunks: Chunk[]): Promise<Chunk[]> { | ||
let results: Chunk[] = []; | ||
|
||
const embeddingsResults = await this._expandWithEmbeddings(chunks); | ||
results.push(...embeddingsResults); | ||
|
||
return results; | ||
} | ||
|
||
async run(): Promise<Chunk[]> { | ||
// Retrieve initial results | ||
let results = await this._retrieveInitial(); | ||
|
||
// Rerank | ||
const { input } = this.options; | ||
results = await this._rerank(input, results); | ||
|
||
// // // Expand top reranked results | ||
// const expanded = await this._expandRankedResults(results); | ||
// results.push(...expanded); | ||
|
||
// // De-duplicate | ||
// results = deduplicateChunks(results); | ||
|
||
// // Rerank again | ||
// results = await this._rerank(input, results); | ||
|
||
// TODO: stitch together results | ||
|
||
return results; | ||
} | ||
} | ||
|
||
// Source: expansion with code graph | ||
// consider doing this after reranking? Or just having a lower reranking threshold | ||
// This is VS Code only until we use PSI for JetBrains or build our own general solution | ||
// TODO: Need to pass in the expandSnippet function as a function argument | ||
// because this import causes `tsc` to fail | ||
// if ((await extras.ide.getIdeInfo()).ideType === "vscode") { | ||
// const { expandSnippet } = await import( | ||
// "../../../extensions/vscode/src/util/expandSnippet" | ||
// ); | ||
// let expansionResults = ( | ||
// await Promise.all( | ||
// extras.selectedCode.map(async (rif) => { | ||
// return expandSnippet( | ||
// rif.filepath, | ||
// rif.range.start.line, | ||
// rif.range.end.line, | ||
// extras.ide, | ||
// ); | ||
// }), | ||
// ) | ||
// ).flat() as Chunk[]; | ||
// retrievalResults.push(...expansionResults); | ||
// } | ||
|
||
// Source: Open file exact match | ||
// Source: Class/function name exact match |
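
`_rerank` above sorts `results` in place and looks up each score with `results.indexOf(...)` inside the comparator, which ties the result to how the engine handles the array during sorting and costs a linear scan per comparison. A sketch of a variant that pairs each chunk with its score before sorting is given below. `topByScore` is a hypothetical helper, not code from this PR, and it returns the best results first, whereas the code in the diff sorts ascending and keeps the last `nFinal` entries.

```typescript
// Pair each item with its score, sort by score descending, keep the top n.
function topByScore<T>(items: T[], scores: number[], nFinal: number): T[] {
  if (items.length !== scores.length) {
    throw new Error("items and scores must have the same length");
  }
  return items
    .map((item, i) => ({ item, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, nFinal)
    .map((pair) => pair.item);
}
```

The input arrays are left untouched, so a caller can still use the original order of `chunks` and `scores` afterwards.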
This has been commented out. Is there a reason for that?
I see the threshold is set to 0.3, which seems too high.
Setting it to 0.1 and uncommenting this section would at least remove chunks that are totally irrelevant.
Making rerankThreshold configurable in the reranker section of config.conf would also be better than a hard-coded value.
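
The suggestion in this comment could look roughly like the sketch below: a threshold read from settings, falling back to a default when none is given. `RerankerSettings`, `filterByRerankScore`, and `DEFAULT_RERANK_THRESHOLD` are hypothetical names; the only values taken from the discussion are the current threshold of 0.3 and the proposed 0.1.

```typescript
// Hypothetical settings shape for a reranker config section.
interface RerankerSettings {
  rerankThreshold?: number;
}

// 0.3 is the hard-coded value mentioned in the review comment.
const DEFAULT_RERANK_THRESHOLD = 0.3;

// Drop items whose score is below the threshold, keeping scores aligned.
function filterByRerankScore<T>(
  items: T[],
  scores: number[],
  settings: RerankerSettings = {},
): { items: T[]; scores: number[] } {
  const threshold = settings.rerankThreshold ?? DEFAULT_RERANK_THRESHOLD;
  const keptItems: T[] = [];
  const keptScores: number[] = [];
  items.forEach((item, i) => {
    const score = scores[i];
    if (score !== undefined && score >= threshold) {
      keptItems.push(item);
      keptScores.push(score);
    }
  });
  return { items: keptItems, scores: keptScores };
}
```

Returning the filtered scores alongside the items keeps the two arrays index-aligned for any later sorting step.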