Commit Graph

13 Commits

Author SHA1 Message Date
Bruno Sofiato f64fbd9b74
Updated tokenizer to better matching when search for code snippets (#32261)
This PR improves the accuracy of Gitea's code search. 

Currently, Gitea does not consider statements such as
`onsole.log("hello")` as hits when the user searches for `log`. The
culprit is how both ES and Bleve are tokenizing the file contents (in
both cases, `console.log` is a whole token).

In ES' case, we changed the tokenizer to
[simple_pattern_split](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html#:~:text=The%20simple_pattern_split%20tokenizer%20uses%20a,the%20tokenization%20is%20generally%20faster.).
In such a case, tokens are words formed by digits and letters. In
Bleve's case, it employs a
[letter](https://blevesearch.com/docs/Tokenizers/) tokenizer.

Resolves #32220

---------

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
2024-11-06 20:51:20 +00:00
wxiaoguang 5e6523aa57
Update go dependencies (#32389) 2024-10-31 12:05:54 +00:00
Bruno Sofiato 900ac62251
Allow code search by filename (#32210)
This is a large and complex PR, so let me explain in detail its changes.

First, I had to create new index mappings for Bleve and ElasticSerach as
the current ones do not support search by filename. This requires Gitea
to recreate the code search indexes (I do not know if this is a breaking
change, but I feel it deserves a heads-up).

I've used [this
approach](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-pathhierarchy-tokenizer.html)
to model the filename index. It allows us to efficiently search for both
the full path and the name of a file. Bleve, however, does not support
this out-of-box, so I had to code a brand new [token
filter](https://blevesearch.com/docs/Token-Filters/) to generate the
search terms.

I also did an overhaul in the `indexer_test.go` file. It now asserts the
order of the expected results (this is important since matches based on
the name of a file are more relevant than those based on its content).
I've added new test scenarios that deal with searching by filename. They
use a new repo included in the Gitea fixture.

The screenshot below depicts how Gitea shows the search results. It
shows results based on content in the same way as the current version
does. In matches based on the filename, the first seven lines of the
file contents are shown (BTW, this is how GitHub does it).


![image](https://github.com/user-attachments/assets/9d938d86-1a8d-4f89-8644-1921a473e858)

Resolves #32096

---------

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
2024-10-11 23:35:04 +00:00
wxiaoguang 6f7cd94a02
Fix bleve fuzziness (#30799)
Fix #30797
Fix #30317
2024-05-01 15:32:52 +03:00
Lunny Xiao 3f26fe2fa2
Use db.ListOptions directly instead of Paginator interface to make it easier to use and fix performance of /pulls and /issues (#29990)
This PR uses `db.ListOptions` instead of `Paginor` to make the code
simpler.
And it also fixed the performance problem when viewing /pulls or
/issues. Before the counting in fact will also do the search.

---------

Co-authored-by: Jason Song <i@wolfogre.com>
Co-authored-by: silverwind <me@silverwind.io>
2024-03-24 18:51:08 +00:00
6543 b9c57fb78e
Determine fuzziness of bleve indexer by keyword length (#29706)
also bleve did match on fuzzy search and the other way around. this also fix that bug.
2024-03-23 16:45:13 +01:00
6543 7fd0a5b276
Refactor to use optional.Option for issue index search option (#29739)
Signed-off-by: 6543 <6543@obermui.de>
2024-03-13 08:25:53 +00:00
6543 7fdc048153
Patch in exact search for meilisearch (#29671)
meilisearch does not have an search option to contorl fuzzynes per query
right now:
 - https://github.com/meilisearch/meilisearch/issues/1192
 - https://github.com/orgs/meilisearch/discussions/377
 - https://github.com/meilisearch/meilisearch/discussions/1096

so we have to create a workaround by post-filter the search result in
gitea until this is addressed.

For future works I added an option in backend only atm, to enable
fuzzynes for issue indexer too.
And also refactored the code so the fuzzy option is equal in logic to
code indexer


---
*Sponsored by Kithara Software GmbH*
2024-03-09 01:39:27 +00:00
KN4CK3R ad0a34b492
Add `io.Closer` guidelines (#29387)
Co-authored-by: Yarden Shoham <git@yardenshoham.com>
2024-02-25 13:05:23 +00:00
Jason Song 1e76a824bc
Refactor and enhance issue indexer to support both searching, filtering and paging (#26012)
Fix #24662.

Replace #24822 and #25708 (although it has been merged)


## Background

In the past, Gitea supported issue searching with a keyword and
conditions in a less efficient way. It worked by searching for issues
with the keyword and obtaining limited IDs (as it is heavy to get all)
on the indexer (bleve/elasticsearch/meilisearch), and then querying with
conditions on the database to find a subset of the found IDs. This is
why the results could be incomplete.

To solve this issue, we need to store all fields that could be used as
conditions in the indexer and support both keyword and additional
conditions when searching with the indexer.

## Major changes

- Redefine `IndexerData` to include all fields that could be used as
filter conditions.
- Refactor `Search(ctx context.Context, kw string, repoIDs []int64,
limit, start int, state string)` to `Search(ctx context.Context, options
*SearchOptions)`, so it supports more conditions now.
- Change the data type stored in `issueIndexerQueue`. Use
`IndexerMetadata` instead of `IndexerData` in case the data has been
updated while it is in the queue. This also reduces the storage size of
the queue.
- Enhance searching with Bleve/Elasticsearch/Meilisearch, make them
fully support `SearchOptions`. Also, update the data versions.
- Keep most logic of database indexer, but remove
`issues.SearchIssueIDsByKeyword` in `models` to avoid confusion where is
the entry point to search issues.
- Start a Meilisearch instance to test it in unit tests.
- Add unit tests with almost full coverage to test
Bleve/Elasticsearch/Meilisearch indexer.

---------

Co-authored-by: Lunny Xiao <xiaolunwen@gmail.com>
2023-07-31 06:28:53 +00:00
silverwind 88f835192d
Replace `interface{}` with `any` (#25686)
Result of running `perl -p -i -e 's#interface\{\}#any#g' **/*` and `make fmt`.

Basically the same [as golang did](2580d0e08d).
2023-07-04 18:36:08 +00:00
Jason Song 3c7f5ed7b5
Fix `Init` of db indexer (#25470)
Fix regression of #25174.

The `Init` of the db indexer should return true to indicate that the
index was opened/existed, or the indexer will try to populate the index
(not really populate, just fill the queue, `Index` method of the db
indexer is a dummy).
2023-06-23 14:39:52 +00:00
Jason Song 375fd15fbf
Refactor indexer (#25174)
Refactor `modules/indexer` to make it more maintainable. And it can be
easier to support more features. I'm trying to solve some of issue
searching, this is a precursor to making functional changes.

Current supported engines and the index versions:

| engines | issues | code |
| - | - | - |
| db | Just a wrapper for database queries, doesn't need version | - |
| bleve | The version of index is **2** | The version of index is **6**
|
| elasticsearch | The old index has no version, will be treated as
version **0** in this PR | The version of index is **1** |
| meilisearch | The old index has no version, will be treated as version
**0** in this PR | - |


## Changes

### Split

Splited it into mutiple packages

```text
indexer
├── internal
│   ├── bleve
│   ├── db
│   ├── elasticsearch
│   └── meilisearch
├── code
│   ├── bleve
│   ├── elasticsearch
│   └── internal
└── issues
    ├── bleve
    ├── db
    ├── elasticsearch
    ├── internal
    └── meilisearch
```

- `indexer/interanal`: Internal shared package for indexer.
- `indexer/interanal/[engine]`: Internal shared package for each engine
(bleve/db/elasticsearch/meilisearch).
- `indexer/code`: Implementations for code indexer.
- `indexer/code/internal`: Internal shared package for code indexer.
- `indexer/code/[engine]`: Implementation via each engine for code
indexer.
- `indexer/issues`: Implementations for issues indexer.

### Deduplication

- Combine `Init/Ping/Close` for code indexer and issues indexer.
- ~Combine `issues.indexerHolder` and `code.wrappedIndexer` to
`internal.IndexHolder`.~ Remove it, use dummy indexer instead when the
indexer is not ready.
- Duplicate two copies of creating ES clients.
- Duplicate two copies of `indexerID()`.


### Enhancement

- [x] Support index version for elasticsearch issues indexer, the old
index without version will be treated as version 0.
- [x] Fix spell of `elastic_search/ElasticSearch`, it should be
`Elasticsearch`.
- [x] Improve versioning of ES index. We don't need `Aliases`:
- Gitea does't need aliases for "Zero Downtime" because it never delete
old indexes.
- The old code of issues indexer uses the orignal name to create issue
index, so it's tricky to convert it to an alias.
- [x] Support index version for meilisearch issues indexer, the old
index without version will be treated as version 0.
- [x] Do "ping" only when `Ping` has been called, don't ping
periodically and cache the status.
- [x] Support the context parameter whenever possible.
- [x] Fix outdated example config.
- [x] Give up the requeue logic of issues indexer: When indexing fails,
call Ping to check if it was caused by the engine being unavailable, and
only requeue the task if the engine is unavailable.
- It is fragile and tricky, could cause data losing (It did happen when
I was doing some tests for this PR). And it works for ES only.
- Just always requeue the failed task, if it caused by bad data, it's a
bug of Gitea which should be fixed.

---------

Co-authored-by: Giteabot <teabot@gitea.io>
2023-06-23 12:37:56 +00:00