How to fix searching

Thomas De Schampheleire patrickdepinguin at gmail.com
Tue Sep 27 19:26:33 UTC 2016


Hi Dominik,

On Mon, Sep 26, 2016 at 6:55 PM, Dominik Ruf <dominikruf at gmail.com> wrote:
> Hi,
>
> there are basically 2 different kinds of searches in kallithea.
>
> 1. filtering revisions
> Mads mentioned 2 years ago that he plans to add some support for this
> https://bitbucket.org/conservancy/kallithea/issues/18/search-needs-to-be-improved
> 2. searching in multiple repositories (incl. fulltext searching in the
> files)
>
> I think the first point is pretty much straightforward. Git and Mercurial
> support filtering revisions. It basically 'only' needs to be implemented.
> :-)
>
> But the second one is more complicated.
> There are multiple problems with the current implementation.
>
> 1. For starters since 9c5f794df7cd the make-index command is broken. But
> that can be easily fixed.
> 2. What is not so easy to fix is the fact that indexing is currently
> incredibly slow.
> 3. The indexing is done periodically, it only indexes the tip revision at
> indexing time and the search results refer to the tip at search time.
> Therefore
>   a) you may get hits that are no longer valid
>   b) you may get no hits even though the string is present now
>   c) you can't search for things that have been removed
>
> I believe all this is solvable. I looked into the code and found a few
> places where the indexing can definitely be improved.
> But I don't have much experience with whoosh. So I'm not sure if it is even
> worth it to fix the current implementation, or if I should restart with solr
> or elastic search.
>
> My questions to you guys are:
>
> 1. Do you have experience with whoosh? Does it scale to gigabytes of data?
> 2. Would you even pull an implementation that requires installing solr? Note:
> I believe installation and setup of solr can be automated.
> 3. Or maybe you think the fulltext search should be dropped altogether.
>

I personally think that fulltext search on repositories, which
typically contain source code, has relatively little value.
Fulltext search engines like whoosh or solr are not aware of the
structure of source code, and thus have no advanced capabilities such
as searching only in identifiers or clicking through on symbols in the
search results. Real code browsers, like OpenGrok or LXR, do have such
features.
The few times that I actually use fulltext search on e.g. GitHub are
when I'm too lazy to clone the repo and use a grep-like tool to find
things myself. It definitely has some value, but not much.

With this in mind, I actually think there is much more value in fixing
the first type of search you highlight, i.e. filtering revisions.
Therefore, in my opinion we should prioritize 'just implementing' that
before looking at fulltext search.

Coming back to fulltext search:
- I have no specific experience with whoosh
- Regardless of the tool we'd use (whoosh, solr, ...), I think it
should always be optional. Kallithea should be installable without
search capabilities.
- It may be more useful to implement a flexible mechanism where
Kallithea offers searching but the backend is customizable, i.e. the
search term can be passed to whoosh, solr, or any other tool that the
user wants to configure. The tool would get the search term and
probably some other elements identifying the repo to search or
specific paths in the repo. Kallithea documentation can give some
examples of how to plug in known tools, but need not be concerned
with the entire range of tools available, nor pick one specific tool
that may not scale to a particular use case. The same mechanism could
even be used to hook code browsers like OpenGrok/LXR into the search
feature, rather than pure text search.
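
To make the last point a bit more concrete, here is a minimal sketch in
Python of what such a pluggable backend could look like. All names in
it (SearchBackend, SearchHit, make_backend, the "none" and "grep" keys)
are hypothetical and do not exist in Kallithea today; the in-memory
"grep" backend is only a stand-in for a real tool like whoosh or solr.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SearchHit:
    """One matching line (hypothetical result type)."""
    repo: str
    path: str
    line_no: int
    line: str


class SearchBackend(ABC):
    """Hypothetical interface every configurable search tool implements."""

    @abstractmethod
    def search(self, term: str, repo: Optional[str] = None,
               path_prefix: Optional[str] = None) -> List[SearchHit]:
        """Return hits for term, optionally limited to a repo and path."""


class NullBackend(SearchBackend):
    """Used when no search tool is configured: search stays optional."""

    def search(self, term, repo=None, path_prefix=None):
        return []


class InMemoryGrepBackend(SearchBackend):
    """Toy backend: plain substring match over files held in memory."""

    def __init__(self, files: Dict[str, Dict[str, str]]):
        # files maps repo name -> {path: file content}
        self.files = files

    def search(self, term, repo=None, path_prefix=None):
        hits = []
        for repo_name, tree in sorted(self.files.items()):
            if repo is not None and repo_name != repo:
                continue
            for path, content in sorted(tree.items()):
                if path_prefix and not path.startswith(path_prefix):
                    continue
                for no, line in enumerate(content.splitlines(), start=1):
                    if term in line:
                        hits.append(SearchHit(repo_name, path, no, line))
        return hits


# Assumed registry: the key would come from a (hypothetical) config setting.
BACKENDS = {"none": NullBackend, "grep": InMemoryGrepBackend}


def make_backend(name: str, **options) -> SearchBackend:
    """Instantiate the backend selected in the configuration."""
    return BACKENDS[name](**options)
```

With something like this, the UI would only ever call search() on
whatever make_backend() returns, and a whoosh, solr or OpenGrok adapter
would be just another entry in the registry.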

Best regards,
Thomas

