Revision | f83313ef3fe05c825da58b2d265d9e91986062f1 (tree) |
---|---|
Date | 2022-11-05 02:48:21 |
Author | Albert Mietus < albert AT mietus DOT nl > |
Committer | Albert Mietus < albert AT mietus DOT nl > |
Moved CCastle QuickNotes on tool-design into new section
@@ -5,7 +5,7 @@ | ||
5 | 5 | ========================= |
6 | 6 | |
7 | 7 | No computer language can exist without a set of :ref:`Castle-WorkshopTools`; at least a compiler is needed -- and a lot |
8 | -more. This chapter contains a growing number of notes, blogs & article on desinging them | |
8 | +more. This chapter contains a growing number of notes, blogs & articles on designing them. | |
9 | 9 | |
10 | 10 | |
11 | 11 | .. toctree:: |
@@ -13,6 +13,7 @@ | ||
13 | 13 | :titlesonly: |
14 | 14 | :glob: |
15 | 15 | |
16 | + */index | |
16 | 17 | * |
17 | 18 | |
18 | 19 |
@@ -0,0 +1,17 @@ | ||
1 | +.. _QN_Arpeggio: | |
2 | + | |
3 | +=================== | |
4 | +QuickNote: Arpeggio | |
5 | +=================== | |
6 | + | |
7 | +.. post:: | |
8 | + :category: CastleBlogs, rough | |
9 | + :tags: Grammar, PEG, DRAFT | |
10 | + | |
11 | + In this short QuickNote blog we give a bit of info on `Arpeggio <https://textx.github.io/Arpeggio/2.0/>`__; a Python |
12 | + package to implement a (PEG) parser. Eventually, it will be implemented in Castle -- like all |
13 | + :ref:`Castle-WorkshopTools`. To kickstart, we use Python and Python-packages. |
14 | + |
15 | + As Arpeggio is quite well `documented <https://textx.github.io/Arpeggio/2.0/>`__, this is a short note. |
16 | + | |
17 | +.. seealso:: :ref:`QN_PEGEN` |
@@ -0,0 +1,11 @@ | ||
1 | +================ | |
2 | +Some Quick Blogs | |
3 | +================ | |
4 | + | |
5 | +.. toctree:: | |
6 | + :maxdepth: 2 | |
7 | + :titlesonly: | |
8 | + :glob: | |
9 | + | |
10 | + * | |
11 | + |
@@ -0,0 +1,178 @@ | ||
1 | +.. include:: /std/localtoc.irst | |
2 | + | |
3 | +.. _QN_PEGEN: | |
4 | + | |
5 | +================ | |
6 | +QuickNote: PEGEN | |
7 | +================ | |
8 | + | |
9 | +.. post:: 2022/11/3 | |
10 | + :category: CastleBlogs, rough | |
11 | + :tags: Grammar, PEG | |
12 | + | |
13 | + To implement CCastle we need a parser, as part of the compiler. Eventually, that parser will be written in Castle. For |
14 | + now we kickstart it in Python, which has several packages that can assist us. As we would like to use a PEG-based one, there |
15 | + are a few options. `Arpeggio <https://textx.github.io/Arpeggio/2.0/>`__ is well known, and has some nice options -- |
16 | + but can’t handle `left recursion <https://en.wikipedia.org/wiki/Left_recursion>`__ -- like most PEG-parsers. |
17 | + |
18 | + Recently, Python itself switched to a PEG parser that supports `left recursion |
19 | + <https://en.wikipedia.org/wiki/Left_recursion>`__ (which is a recent development). That parser is also available as a |
20 | + package: `pegen <https://we-like-parsers.github.io/pegen/index.html>`__; but it is hardly documented. |
21 | + |
22 | +This blog is written to record some lessons learned while playing with it, and to serve as a kind of informal documentation. |
23 | + | |
24 | +.. seealso:: :ref:`QN_Arpeggio` | |
25 | + | |
26 | +Built-In Lexer |
27 | +============== | |
28 | + | |
29 | +Pegen is written specifically for Python and uses a specialized lexer, unlike most PEG-parsers, which use PEG for lexing too. Pegen |
30 | +uses the `tokenizer <https://docs.python.org/3/library/tokenize.html>`__ that is part of Python. This comes with some |
31 | +restrictions. |
32 | + |
33 | +This lexer --or tokenize(r), as Python calls it-- is used **both** to read the grammar (the peg-file) *and* to read the |
34 | +source-files that are parsed by the generated parser. |
35 | + | |
36 | +.. hint:: | |
37 | + | |
38 | + These restrictions apply when we use pegen as a module: ``python -m pegen ...``, which calls `simple_parser_main()`. |
39 | + |BR| |
40 | + They also apply when we use the parser-class in our own code -- that is, when importing pegen with ``from pegen.parser |
41 | + import Parser``. A bit more is possible then, as we can configure another (self-made) lexer; the interface is quite |
42 | + narrowly tied to Python, however. |
43 | + | |
44 | + | |
45 | +Tokens | |
46 | +------ | |
47 | + | |
48 | +The lexer will recognize some tokens that are specific to Python, like `INDENT` & `DEDENT`. Also, some generic tokens |
49 | +like `NAME` (which is an identifier) and `NUMBER` are known, and can be used to define the language. |
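These token kinds can be observed directly with Python's standard-library tokenizer (not part of the original note -- a quick stdlib sketch, independent of pegen):

```python
import io
import tokenize

# Tokenize a tiny, indented Python-like snippet.
src = "if x:\n    y = 1\n"
tokens = [(tokenize.tok_name[t.type], t.string)
          for t in tokenize.generate_tokens(io.StringIO(src).readline)]

# Python-specific tokens (INDENT/DEDENT) and generic ones (NAME/NUMBER) both appear.
print(tokens)
```

Note that keywords like ``if`` are just `NAME` tokens at this level; `INDENT` and `DEDENT` are emitted around the indented block.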
50 | + | |
51 | +Unfortunately, it will also find some tokens --typically operators-- that are *hardcoded* for Python, even when we would like to use |
52 | +them differently, possibly combined with other characters. In that case, the literal-strings as set |
53 | +in the grammar will not be found. |
54 | + | |
55 | +.. note:: | |
56 | + | |
57 | + Pegen speaks about *(soft)* **keywords** for all kinds of literal terminals, even when they are more like operators |
58 | + than *words*. |
59 | + | |
60 | +.. warning:: | |
61 | + | |
62 | + When the grammar defines (literal) terminals (or keywords) --especially for operators-- make sure the lexer will not | |
63 | + break them into predefined tokens! | |
64 | + |BR| | |
65 | + This will not give an error, but it does not work! | |
66 | + | |
67 | + .. code-block:: PEG | |
68 | + | |
69 | + Left_arrow_BAD: '<-' ## This is WRONG, as ``<`` is seen as a token. And so, `<-` is never found | |
70 | + Left_arrow_OKE: '<' '-' ## This is acceptable | |
71 | + | |
72 | + This *splitting* however results in 2 entries in the resulting tree -- unless one uses `grammar actions |
73 | + <https://we-like-parsers.github.io/pegen/grammar.html#grammar-actions>`__ to create one new “token”. |
74 | + | |
75 | +.. seealso:: https://docs.python.org/3/library/token.html, for an overview of the predefined tokens |
76 | + | |
77 | +.. tip:: | |
78 | + | |
79 | + A quick trick to see how a file is split into tokens: use ``python -m tokenize [-e] filename.peg``. |
80 | + |BR| |
81 | + Make sure you do not use string-literals that (e.g.) are composed of two tokens, like the above-mentioned ``<-``. |
82 | + | |
83 | + | |
84 | + | |
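The splitting behaviour from the warning above is easy to reproduce with the stdlib ``tokenize`` module: the lexer never yields a single ``<-`` operator, only the two predefined tokens ``<`` and ``-`` (a small sketch, not from the original note):

```python
import io
import tokenize

# '<' and '-' are both predefined Python operator tokens,
# so a would-be '<-' literal is split in two by the lexer.
src = "a <- b\n"
tokens = [(tokenize.tok_name[t.type], t.string)
          for t in tokenize.generate_tokens(io.StringIO(src).readline)]

assert ('OP', '<') in tokens and ('OP', '-') in tokens
assert ('OP', '<-') not in tokens  # never produced as one token
```

This is exactly why a grammar rule like ``Left_arrow_BAD: '<-'`` can never match.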
85 | +.. sidebar:: Reserved |
86 | + :class: localtoc | |
87 | + | |
88 | + - showpeek | |
89 | + - name | |
90 | + - number | |
91 | + - string | |
92 | + - op | |
93 | + - type_comment | |
94 | + - soft_keyword | |
95 | + - expect | |
96 | + - expect_forced | |
97 | + - positive_lookahead | |
98 | + - negative_lookahead | |
99 | + - make_syntax_error | |
100 | + | |
101 | +Rule names | |
102 | +---------- | |
103 | + | |
104 | +The *GeneratedParser* inherits from (and calls) the base ``pegen.parser.Parser`` class, and gets methods for all |
105 | +rule-names. This implies some names should not be used as rule-names (in any case) -- see the sidebar. |
106 | + | |
107 | + | |
108 | +Meta Syntax (issues) | |
109 | +==================== | |
110 | + | |
111 | +No: regexps | |
112 | +----------- | |
113 | + | |
114 | +PEGEN has **no** support for regular expressions, probably because it uses a custom lexer. |
115 | + | |
116 | +Unordered Group starts a comment | |
117 | +-------------------------------- | |
118 | + | |
119 | +PEGEN (or its lexer) uses the ``#`` to start a comment. This implies that an **unordered group** ``( sequence )#`` --as in |
120 | +`Arpeggio <https://textx.github.io/Arpeggio/2.0/grammars/#grammars-written-in-peg-notations>`__-- is not recognized. |
121 | + |
122 | +A workaround is to use another character, like ``@``, instead of the hash (``#``). |
123 | + | |
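This comment behaviour is inherited from Python's tokenizer, where everything after a ``#`` collapses into a single COMMENT token -- which can be checked with the stdlib directly (a sketch, not from the original note):

```python
import io
import tokenize

# The '#' after the group is taken as the start of a comment,
# so ')#' can never be matched as part of a grammar rule.
src = "( a b )# rest of the line\n"
tokens = [(tokenize.tok_name[t.type], t.string)
          for t in tokenize.generate_tokens(io.StringIO(src).readline)]
print(tokens)
```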
124 | + | |
125 | +Result/Output | |
126 | +============= | |
127 | + | |
128 | +cmd-tool | |
129 | +-------- | |
130 | + | |
131 | +The commandline tool ``python -m pegen ...`` only prints the parsed tree: a list (shown as ``[`` ... ``]``) with |
132 | +sub-lists and/or `TokenInfo` namedtuples. Each `TokenInfo` has 5 elements: a token type (an int and its enum-name), the |
133 | +token-string (that was parsed), the begin & end locations (line- & column-numbers), and the full line that is being |
134 | +parsed. |
135 | + |
136 | +No info about the matched grammar-rule (e.g. the rule-name) is shown; actually, that info is not part of the parsed-tree. |
137 | + | |
138 | +.. seealso:: This `structure is described <https://docs.python.org/3/library/tokenize.html?highlight=TokenInfo>`__ in |
139 | + the tokenize module, without mentioning its name: TokenInfo. |
140 | + | |
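The five fields of such a `TokenInfo` namedtuple can be inspected with the stdlib ``tokenize`` module directly (a small sketch, not from the original note):

```python
import io
import tokenize

# Take the first token of a one-line source.
tok = next(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))

# The five elements: type (int + enum-name), string, start, end, full line.
print(tok.type, tokenize.tok_name[tok.type])  # NAME
print(tok.string)                             # 'x'
print(tok.start, tok.end)                     # (line, column) pairs
print(tok.line)                               # 'x = 1\n'
```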
141 | +The parser | |
142 | +---------- | |
143 | + | |
144 | +The GeneratedParser (and/or its baseclass ``pegen.parser.Parser``) returns only (lists of) tokens from the tokenizer (an |
145 | +OO wrapper around tokenize) -- and so, the same TokenInfo objects as described above. |
146 | + | |
147 | +Stability | |
148 | +========= | |
149 | + | |
150 | +The current pegen package on `pypi <https://pypi.org/project/pegen/>`__ is V0.1.0 -- which already shows it is not |
151 | +mature. `That version on github <https://github.com/we-like-parsers/pegen/tree/v0.1.0>`__ is dated September 2021 (with 36 |
152 | +commits). The `current <https://github.com/we-like-parsers/pegen/tree/db7552dda0af6b27cbbb1230be116e8a56c49736>`__ |
153 | +version (Nov '22) has 20 more commits (56). |
154 | +|BR| |
155 | +It can be installed with ``pip install git+https://github.com/we-like-parsers/pegen``. |
156 | + | |
157 | +It is, however, not fully compatible. For example, ``pegen/parser.py::simple_parser_main()`` now expects an AST object (to |
158 | +print), not a list of TokenInfo. |
159 | + | |
160 | +.. tip:: | |
161 | + | |
162 | + The pegen package is **NOT** used inside the `(C)Python tool |
163 | + <https://github.com/python/cpython/tree/main/Tools/peg_generator>`__; the CPython version is heavily tied to other |
164 | + details of CPython, and it can also generate C-code. The pegen-package is based on it and more-or-less in sync; it can |
165 | + generate Python-code only, but does not depend on the compiler-implementation details. |
166 | + | |
167 | + .. seealso:: https://we-like-parsers.github.io/pegen/#differences-with-cpythons-pegen | |
168 | + | |
169 | + | |
170 | +Buggy current version | |
171 | +--------------------- | |
172 | + | |
173 | +The git version contains (at least) one bug. The function ``parser::simple_parser_main()``, which is called when using the |
174 | +generated file, uses the AST module to print (show) the result -- which simply does not work. |
175 | +|BR| |
176 | +Probably, that *default main* isn’t used a lot (also, I prefer to use -- and have used -- my own main). Still, it shows its |
177 | +immaturity. |
178 | + |