Issue #261: journal filtering doesn't work for repository name with group

Reported by:	Katsunori FUJIWARA
State:	resolved
Created on:	2016-12-28 11:43
Updated on:	2018-05-18 20:08

Description

Journal filtering with a condition "repository:foo/bar" can't pick up entries for repository "foo/bar" (= repository "bar" belonging to group "foo").

AFAIK, Whoosh library (at least 2.5.7) treats "repository:foo/bar" as "repository == foo AND repository == bar" unintentionally.

You can reproduce this unexpected(?) behavior of Whoosh library by Python script below. This is minimum part of building query from filter condition in Kallithea.

from whoosh.fields import (
    TEXT, ID, STORED, NUMERIC,
    BOOLEAN, Schema, FieldType, DATETIME
)
from whoosh.qparser.dateparse import DateParserPlugin
from whoosh.qparser.default import QueryParser

# from kallithea/lib/indexers/__init__.py
JOURNAL_SCHEMA = Schema(
    username=TEXT(),
    date=DATETIME(),
    action=TEXT(),
    repository=TEXT(),
    ip=TEXT(),
)

# from _journal_filter()@kallithea/controllers/admin/admin.py
qp = QueryParser('repository', schema=JOURNAL_SCHEMA)
qp.add_plugin(DateParserPlugin())

# with wildcard, treated as prefix matching
print repr(qp.parse(u'repository:foo/bar*'))
# without wildcard, treated as AND-ed conditions
print repr(qp.parse(u'repository:foo/bar'))

BTW, if you don't have group, of which name ends with "foo", suffix matching "repository:*foo/bar" may be used as an instant workaround.

Attachments

Comments

Comment by Mads Kiilerich, on 2017-01-02 23:45

Thanks for the report.

Do you have a suggestion on how to solve the problem?

Comment by Katsunori FUJIWARA, on 2017-01-03 10:02

Today, I just found the reason why "repository:foo/bar" is treated as AND-ed conditions unintentionally.

QueryParser.parse() uses WordNode to represent "foo/bar" part of the query
WordNode causes tokenization before evaluation of query
"/" of "foo/bar" is treated as delimiter
splitted "foo" and "bar" are AND-ed, because search term "foo bar" means "foo AND bar" by default
using WordNode for field value in query is embedded in QueryParser.parse()

(debug=True for QueryParser.parse() shows intermediate steps at parsing)

Therefore, there is no instant way (e.g. passing boolean flags) to avoid this issue, IMHO.

On the other hand, I also found that difference between WordNode and TextNode (base class of the former) is whether "tokenize" (of specified text) and "removestops" (= removing stop words) are enabled or not.

Then, I tried to implement custom filter plugin, which disables "tokenize" and "removestops" of WordNode before evaluation of query, like below (I followed implementation of WhitespacePlugin).

from whoosh.qparser import plugins, syntax

class TextnizeFilter(plugins.Plugin):
    """Filter to make WordNode in syntax tree work like TextNode
    """
    def filters(self, parser):
        return [(self._textnize, 600)]

    def _textnize(self, parser, group):
        newgroup = group.empty_copy()
        for node in group:
            if isinstance(node, syntax.GroupNode):
                # Textnize recursively
                newgroup.append(self._textnize(parser, node))
            else:
                if isinstance(node, syntax.WordNode):
                    # these disabling makes WordNode work like
                    # TextNode, because enabling these features
                    # is only difference between WordNode and
                    # TextNode
                    node.tokenize = False
                    node.removestops = False
                newgroup.append(node)
        return newgroup

qp.add_plugin(TextnizeFilter())

At least for "Journal filtering", nobody should expect that field value is tokenized and treated as AND-ed conditions, because valid fields in "Journal filtering" schema are "username", "date", "action", "repository", and "ip" (AND-ed conditions for these fields always causes "not matched" against any journal entry).

And QueryParser instance is always re-created at each journal filtering requests.

Therefore, disabling "tokenize" and "removestops" by custom plugin shouldn't cause any issue for "Journal filtering", IMHO.

I confirmed that adding this plugin to QueryParser treats "reository:foo/bar" as "repository == 'foo/bar'" as expected. But I strongly want to get this solution reviewed by Whoosh guru, because I'm newbie on whoosh.

Comment by Mads Kiilerich, on 2017-01-03 11:25

Nice analysis!

The help text describes briefly how the query language looks like. That makes it quite clear that it is a "query language" that will give exact results more than a "natural language query" that will try to guess what the user meant. Thus I don't think that automatic stemming is a good thing. Disabling it as you do is good.

We haven't touched whoosh much in Kallithea. I think you just became our guru :-) I think we should use your approach if it works for you and you prepare a PR with some test coverage.

(I never grokked or used filtering much. It seems a bit magic and quoting and escaping doesn't work as I would expect it to.)

Comment by Katsunori FUJIWARA, on 2017-01-03 16:48

OK, I'll issue pull request with this solution, after checking some additional points and writing tests !

Comment by Katsunori FUJIWARA, on 2017-01-11 10:09

While additional investigation, I found another (and simpler) solution for this issue!

diff -r 16234f629cfb -r 09709840bc18 kallithea/lib/indexers/__init__.py
--- a/kallithea/lib/indexers/__init__.py
+++ b/kallithea/lib/indexers/__init__.py
@@ -79,10 +79,10 @@

 # used only to generate queries in journal
 JOURNAL_SCHEMA = Schema(
-    username=TEXT(),
+    username=ID(),
     date=DATETIME(),
     action=TEXT(),
-    repository=TEXT(),
+    repository=ID(),
     ip=TEXT(),
 )

Using ID instead of TEXT can solve this issue easily, because the former avoids tokenization while building query from search terms.

BTW, ID field also avoid "removing stop words". This is useful, if user may use instant repository name like "a", "do", "does", and so on.

You'll notice that ID field also avoids lowercase-ing. But this doesn't violate current case-insensitive search policy, because LOWER-ing in actual SQL query is achieved by get_filterion() or so in kallithea/controllers/admin/admin.py.

This replacement should be safe with already existing DB instance, because:

JOURNAL_SCHEMA definition in kallithea/lib/indexers/init.py is used only to parse search terms, and to issue actual query
actual definition of DBMS table "user_logs" itself exists in kallithea/model/db.py (as UserLog class)

Using ID for "username" also avoids similar issues for searching with "username:" prefixed condition.

Comment by Thomas De Schampheleire, on 2018-05-18 20:08

Fixed with 9a773d2f022b (stable) and 73e3599971da

Kallithea issues archive