Code search should look for the filename as well #32096

Closed
bsofiato opened this issue Sep 21, 2024 · 6 comments · Fixed by #32210
Labels
type/proposal The new feature has not been accepted yet but needs to be discussed first.

Comments

@bsofiato
Contributor

Feature Description

How do you guys feel about the code search feature taking the filenames into account?

Right now, Gitea only searches for the contents of the source files. I think the search should also look for the given criteria in the filenames (this is how Bitbucket and GitHub do it).

What do you guys think about it?

Screenshots

The screenshot below shows an excerpt of the GitHub docs that explains how the filename is also searched.

image

@bsofiato bsofiato added the type/proposal The new feature has not been accepted yet but needs to be discussed first. label Sep 21, 2024
@bsofiato
Contributor Author

Cool guys, gonna work on this PR

@bsofiato
Contributor Author

Guys, a little question.

Do you think that, given a query, both Bleve and Elasticsearch should return exactly the same results? Or can we be more lenient about what each search backend returns?

@lunny
Member

lunny commented Sep 28, 2024

> Guys, a little question.
>
> Do you think that, given a query, both Bleve and Elasticsearch should return exactly the same results? Or can we be more lenient about what each search backend returns?

I think we should feed the same rules to both engines, but we cannot control what they return.

@bsofiato
Contributor Author

bsofiato commented Oct 1, 2024

Guys, just an update on this.

I'm still working on this one. Right now, I'm creating a comprehensive test suite for the changes. It's gonna be a big PR :P

@bsofiato
Contributor Author

bsofiato commented Oct 2, 2024

Guys, a little question.

I'm updating the unit test for the search functionality. The test uses a fixture that sets up the user2/repo1 repository. May I change this particular repo to cover some other test cases, or is it used elsewhere?

@lunny
Member

lunny commented Oct 3, 2024

I suggest not changing this fixture repository much, because it's used everywhere. Maybe you can fork that repository and run the code search against the fork.

lunny pushed a commit that referenced this issue Oct 11, 2024
This is a large and complex PR, so let me explain in detail its changes.

First, I had to create new index mappings for Bleve and Elasticsearch as
the current ones do not support search by filename. This requires Gitea
to recreate the code search indexes (I do not know if this is a breaking
change, but I feel it deserves a heads-up).

I've used [this
approach](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-pathhierarchy-tokenizer.html)
to model the filename index. It allows us to efficiently search for both
the full path and the name of a file. Bleve, however, does not support
this out of the box, so I had to code a brand new [token
filter](https://blevesearch.com/docs/Token-Filters/) to generate the
search terms.
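
To make the idea concrete, here is a minimal sketch of the kind of terms a
reversed path-hierarchy analysis produces: one term per directory boundary,
from the full path down to the bare file name. The helper name `pathSuffixes`
is hypothetical and uses only the standard library; it is an illustration under
that assumption, not the token filter from the PR and not the Bleve API.

```go
package main

import (
	"fmt"
	"strings"
)

// pathSuffixes returns every suffix of path that starts at a directory
// boundary, ordered from the full path down to the bare file name.
// Hypothetical helper for illustration only; it is not Gitea's token filter.
func pathSuffixes(path string) []string {
	trimmed := strings.Trim(path, "/")
	if trimmed == "" {
		return nil
	}
	parts := strings.Split(trimmed, "/")
	terms := make([]string, 0, len(parts))
	for i := range parts {
		// Join the remaining segments so each term is a valid path suffix.
		terms = append(terms, strings.Join(parts[i:], "/"))
	}
	return terms
}

func main() {
	for _, term := range pathSuffixes("modules/indexer/code/search.go") {
		fmt.Println(term)
	}
}
```

Indexing all of these terms for a file means a query for either the full path
or just the file name can hit an exact term, which is why this shape suits
both kinds of lookup.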

I also did an overhaul in the `indexer_test.go` file. It now asserts the
order of the expected results (this is important since matches based on
the name of a file are more relevant than those based on its content).
I've added new test scenarios that deal with searching by filename. They
use a new repo included in the Gitea fixture.
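
As a rough illustration of what asserting the order means, the sketch below
ranks filename matches ahead of content-only matches and then compares the
result position by position. Everything here (`hit`, `rankHits`, `sameOrder`)
is hypothetical and uses plain substring matching; in Gitea the ordering comes
from the search backend's scoring, not from code like this.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// hit is a hypothetical indexed file: its path plus its contents.
type hit struct {
	Path    string
	Content string
}

// rankHits returns the paths of all hits matching query, with filename
// matches placed before content-only matches. The stable sort keeps the
// input order within each group.
func rankHits(query string, hits []hit) []string {
	type scored struct {
		path   string
		byName bool
	}
	var matched []scored
	for _, h := range hits {
		byName := strings.Contains(h.Path, query)
		if byName || strings.Contains(h.Content, query) {
			matched = append(matched, scored{path: h.Path, byName: byName})
		}
	}
	sort.SliceStable(matched, func(i, j int) bool {
		return matched[i].byName && !matched[j].byName
	})
	paths := make([]string, len(matched))
	for i, m := range matched {
		paths[i] = m.path
	}
	return paths
}

// sameOrder reports whether got and want hold the same paths in the same
// order, which is stricter than checking that the same paths are present.
func sameOrder(got, want []string) bool {
	if len(got) != len(want) {
		return false
	}
	for i := range want {
		if got[i] != want[i] {
			return false
		}
	}
	return true
}

func main() {
	hits := []hit{
		{Path: "docs/notes.md", Content: "see config for details"},
		{Path: "config.yaml", Content: "port: 3000"},
		{Path: "main.go", Content: "package main"},
	}
	got := rankHits("config", hits)
	fmt.Println(got, sameOrder(got, []string{"config.yaml", "docs/notes.md"}))
}
```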

The screenshot below depicts how Gitea shows the search results. It
shows results based on content in the same way as the current version
does. In matches based on the filename, the first seven lines of the
file contents are shown (BTW, this is how GitHub does it).


![image](https://github.com/user-attachments/assets/9d938d86-1a8d-4f89-8644-1921a473e858)
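
The seven-line preview for filename matches can be pictured with a small
sketch like the one below. The function name `firstLines` and its signature
are assumptions made for illustration; the PR's actual rendering code may
differ.

```go
package main

import (
	"fmt"
	"strings"
)

// firstLines returns at most the first n lines of content, joined by
// newlines. Hypothetical helper; not taken from the Gitea code base.
func firstLines(content string, n int) string {
	if n <= 0 {
		return ""
	}
	// Split into at most n+1 pieces so the tail stays unsplit and cheap.
	lines := strings.SplitN(content, "\n", n+1)
	if len(lines) > n {
		lines = lines[:n]
	}
	return strings.Join(lines, "\n")
}

func main() {
	content := "l1\nl2\nl3\nl4\nl5\nl6\nl7\nl8\nl9"
	fmt.Println(firstLines(content, 7))
}
```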

Resolves #32096

---------

Signed-off-by: Bruno Sofiato <[email protected]>
matera-bs pushed a commit to matera-ar/gitea that referenced this issue Oct 15, 2024
go-gitea#32210)

matera-bs pushed a commit to matera-ar/gitea that referenced this issue Oct 15, 2024
go-gitea#32210)

matera-bs pushed a commit to matera-ar/gitea that referenced this issue Oct 29, 2024
go-gitea#32210)

matera-bs pushed a commit to matera-ar/gitea that referenced this issue Dec 17, 2024
go-gitea#32210)

@go-gitea go-gitea locked as resolved and limited conversation to collaborators Jan 9, 2025

2 participants