Ticket #33629

Syntax errors in CHISE and CJKVI databases

Opened: 2014-04-04 03:06 Last updated: 2014-04-04 03:07

Reporter:
Owner:
Type:
Status:
Open [Owner assigned]
Component:
Milestone:
(None)
Priority:
5 - Medium
Severity:
5 - Medium
Resolution:
File:

Details

Although we detect and filter some errors of this type, the CHISE and CJKVI databases compiled by the IDSgrep build process still contain some "entries" that are not single syntactically valid EIDSes. The cause is syntax errors in the original databases we take as input, and it shows up in the output as a discrepancy between the number of lines in a result set and the count reported by --statistics. Those two numbers should differ when the multi-line headers from the dictionaries are included in the results, but only then: all actual dictionary entries should be single-line. Lines are not special to the EIDS parser, so what usually happens is that a partial entry on one line consumes one or two entries on the following lines to make up its missing children, and the tree count ends up smaller than the line count.
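The effect can be illustrated with a toy model. The sketch below is not IDSgrep's EIDS parser; it assumes plain Unicode IDS strings in prefix notation, using only the twelve Ideographic Description Characters U+2FF0..U+2FFB, and treats every other non-space character as a leaf. The name count_trees and the sample data are made up for illustration.

```python
# Toy model only: plain Unicode IDS in prefix notation, not real EIDS syntax.
# U+2FF2 and U+2FF3 take three operands; the other IDCs in U+2FF0..U+2FFB take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = IDC_ARITY["\u2FF3"] = 3


def count_trees(text):
    """Count complete trees in text, skipping newlines like any other whitespace."""
    trees = 0
    need = 0  # operands still owed to the current tree; 0 means "between trees"
    for ch in text:
        if ch.isspace():
            continue  # line boundaries carry no meaning here
        if need == 0:
            need = 1  # this character starts a new tree
        need += IDC_ARITY.get(ch, 0) - 1
        if need == 0:
            trees += 1
    return trees


# Second line is a partial entry; it swallows the third line as its missing child.
sample = "\u2FF0\u65E5\u6708\n\u2FF0\u65E5\n\u660E\n\u2FF1\u5C71\u77F3\n"
print("lines:", sample.count("\n"), "trees:", count_trees(sample))
```

In this made-up sample there are four lines but only three trees, which is the same kind of mismatch described above.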

Since this is properly an issue with the input data, which we didn't write (IDSgrep is functioning correctly, given its specification and the bad data), and there's no real way to fix it short of creating our own replacement dictionary entries for the bad ones, it may not be top priority. It is a nuisance for speed tests, though, because we can't just count lines to count matches but must capture and sum the STATS lines. Filing it as a bug and not a hairy yak because we're already attempting to filter out bad data in input dictionaries, and that filtering has evidently failed in this case. Maybe consider a syntax-check feature that *makes* lines special to the EIDS parser and throws an error if a tree is incomplete at line end; then errors of this type could at least be detected during dictionary creation.
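A rough sketch of what such a line-aware check might look like, again under a simplifying assumption: it handles only plain Unicode IDS in prefix notation (IDCs U+2FF0..U+2FFB, everything else a leaf), not the full EIDS syntax. The function names missing_children and check_lines are hypothetical and do not exist in IDSgrep.

```python
# Toy model only: plain Unicode IDS in prefix notation, not real EIDS syntax.
# U+2FF2 and U+2FF3 take three operands; the other IDCs in U+2FF0..U+2FFB take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = IDC_ARITY["\u2FF3"] = 3


def missing_children(line):
    """Return 0 if line is exactly one complete tree, a positive count of
    operands still missing at line end, or -1 if text follows a complete tree."""
    need = 1  # a line is expected to hold exactly one tree
    for ch in line.strip():
        if need == 0:
            return -1  # extra content after the tree was already complete
        need += IDC_ARITY.get(ch, 0) - 1
    return need


def check_lines(lines):
    """Return (line_number, status) for every non-blank line that fails the check."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        status = missing_children(line)
        if status != 0:
            problems.append((number, status))
    return problems
```

Running check_lines over the lines of a dictionary at build time would flag each partial entry at its own line, instead of letting it absorb its neighbours silently.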

Ticket History (2/2 Histories)

2014-04-04 03:06 Updated by: mskala
  • New Ticket "Syntax errors in CHISE and CJKVI databases" created
2014-04-04 03:07 Updated by: mskala
  • Details Updated

Attachment File List

No attachments
