While looking for similar software to IDSgrep, to make the academic paper about it look more scientific with "meaningful comparisons," I found that Stanford NLP's Tregex matcher can do something we can't: it can search for an arbitrarily deep descendant of a node while placing restrictions on what kinds of nodes may occur on the ancestry path. For instance (they're matching TreeBank parses) "find a V nested inside any number of VPs nested inside an S." We can't do that; IDSgrep's only way of matching arbitrarily deeply is through the ... operator, which places no restrictions on the nodes in between. It might be possible to fake some cases of it with a not-anywhere-not construction, but I don't think it'd be easy in general. I suspect Tregex of being an NP-hard decision problem anyway, because of its ability to backtrack over variable assignments, so it's reasonable that it could be more powerful than IDSgrep's polynomial-time matching. But it nonetheless ought to be possible to support at least some queries of this type without increasing IDSgrep's power unreasonably.
Specific proposal: create a new command-line option that takes an IDSgrep matching pattern as its argument, and causes "match against this pattern" to become a user-defined matching predicate. That sounds innocent enough. What makes it a big deal is allowing such patterns to recurse. Then by setting #1 equal to the pattern |木*⿰#1? (for instance) it's possible to match a 木 nested inside any number of ⿰.
I think allowing this doesn't hurt the asymptotic complexity of pattern matching as long as we require that matching one of these user-defined patterns against a given haystack node must not recurse to matching the same user-defined pattern against the same haystack node again, only strict subtrees of it. That rules out paradoxical things like setting #1 equal to ! #1. If this feature is used at all then we should probably turn on memoization, because otherwise it seems possible to demand exponential work even without the ... and * operators. It would be possible, but probably not worthwhile, to compute exact bit vector filters by recursing through the user-defined patterns just deeply enough that the filter doesn't change; probably a better plan is to simply assign the match-anything filter to any user-defined predicate and not recurse into them at all.
While looking for similar software to IDSgrep, to make the academic paper about it look more scientific with "meaningful comparisons," I found that Stanford NLP's Tregex matcher can do something we can't: it can search for an arbitrarily deep descendant of a node while placing restrictions on what kinds of nodes may occur on the ancestry path. For instance (they're matching TreeBank parses) "find a V nested inside any number of VPs nested inside an S." We can't do that; IDSgrep's only way of matching arbitrarily deeply is through the ... operator, which places no restrictions on the nodes in between. It might be possible to fake some cases of it with a not-anywhere-not construction, but I don't think it'd be easy in general. I suspect Tregex of being an NP-hard decision problem anyway, because of its ability to backtrack over variable assignments, so it's reasonable that it could be more powerful than IDSgrep's polynomial-time matching. But it nonetheless ought to be possible to support at least some queries of this type without increasing IDSgrep's power unreasonably.
Specific proposal: create a new command-line option that takes an IDSgrep matching pattern as its argument, and causes "match against this pattern" to become a user-defined matching predicate. That sounds innocent enough. What makes it a big deal is allowing such patterns to recurse. Then by setting #1 equal to the pattern |木*⿰#1? (for instance) it's possible to match a 木 nested inside any number of ⿰.
I think allowing this doesn't hurt the asymptotic complexity of pattern matching as long as we require that matching one of these user-defined patterns against a given haystack node must not recurse to matching the same user-defined pattern against the same haystack node again, only strict subtrees of it. That rules out paradoxical things like setting #1 equal to ! #1. If this feature is used at all then we should probably turn on memoization, because otherwise it seems possible to demand exponential work even without the ... and * operators. It would be possible, but probably not worthwhile, to compute exact bit vector filters by recursing through the user-defined patterns just deeply enough that the filter doesn't change; probably a better plan is to simply assign the match-anything filter to any user-defined predicate and not recurse into them at all.