essence/idea of this repo
What's this repo all about?
Parser rules for terminal data, and maybe later on a reference parser and more goodies/helpers in various languages to get certain tasks around terminal data done.
How to tackle this?
I suggest starting in a top-down fashion:
- collect existing specification documents
  I'd love to have those documents in this repo to decouple references from the outside world. Sadly we cannot do that until we have a clear go from the copyright holders. Until then we have to fall back on external references, and anyone dealing with those documents should get their own copy. Edit: We are free to store ECMA documents.
- define some terminology used in the parser context
  Needed to avoid ambiguity. Example:
  - terminal sequence: a code or sequence of consecutive codes that is meaningful for terminals. This would include C0/C1 codes.
- high level terminal sequence type identification rules
  A first step should come up with high level identification rules for terminal sequences. Not sure yet about the notation, maybe some sort of EBNF will do.
  - PRINTABLES
    Any code that is not part of any other terminal sequence (ground state).
  - C0/C1
    These probably need their own identifiers, as composed sequences partially use them in their construction rules. Example: BEL = "\x07".
  - ESC type
    Should contain all allowed code combinations of ECMA-35 and ECMA-48 that are not C1 variants and not higher sequence types. Illegal combinations have to be detected somehow.
  - CSI type
    That is basically the only type that is nicely explained by ECMA-48. There are still gaps to be filled.
  - string/command types
    All of APC/DCS/PM/SOS follow the same top level rules as given by ECMA-48, but with different introducers.
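To make the type list above a bit more concrete, here is a minimal Python sketch of introducer-based identification. All names (`SEVEN_BIT`, `EIGHT_BIT`, `classify`) are made up for illustration. It only looks at the first one or two codes, so termination rules and illegal combinations are not handled, and treating DEL like a C0 code is an assumption of this sketch.

```python
ESC = "\x1b"

# Final byte after ESC for the 7-bit introducers (ECMA-48).
SEVEN_BIT = {"[": "CSI", "P": "DCS", "]": "OSC", "X": "SOS", "^": "PM", "_": "APC"}
# The same introducers as single 8-bit C1 codes.
EIGHT_BIT = {"\x9b": "CSI", "\x90": "DCS", "\x9d": "OSC",
             "\x98": "SOS", "\x9e": "PM", "\x9f": "APC"}


def classify(data: str, pos: int = 0) -> str:
    """Return the high level type of the sequence starting at data[pos]."""
    ch = data[pos]
    if ch == ESC:
        nxt = data[pos + 1] if pos + 1 < len(data) else ""
        if nxt in SEVEN_BIT:
            return SEVEN_BIT[nxt]
        if nxt and "\x40" <= nxt <= "\x5f":
            return "C1"          # 7-bit form of a C1 code, e.g. ESC D (IND)
        return "ESC"             # everything else that starts with ESC
    if ch in EIGHT_BIT:
        return EIGHT_BIT[ch]
    if "\x80" <= ch <= "\x9f":
        return "C1"
    if ch < "\x20" or ch == "\x7f":
        return "C0"              # DEL lumped in here (assumption)
    return "PRINTABLE"


print(classify("\x1b[1;2H"))   # CSI
print(classify("\x1b(B"))      # ESC
print(classify("\x07"))        # C0
```

Keeping the 7-bit and 8-bit tables separate matters: a plain "[" in ground state must stay printable and only becomes a CSI introducer after ESC.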
Even this high level handling is already problematic: a C0 code wrongly embedded in another sequence might stay meaningful, and it might or might not break that sequence. Following the notes on VT100.net, this is mainly caused by DEC's error recovery strategy. At this point we might already have to follow vendor specific rules, which will also bloat the rules a lot. It is also a major reason why pure regexp based parsing is hard to get right with terminal sequences.
Another problem comes from ECMA-35/ISO-2022, which allows dynamic remapping of codes. Here I suggest building the parser in a generic fashion based on Unicode code points, to avoid the complexity of rules that modify the parser itself. To still support ISO-2022 rules correctly, data has to be transcoded beforehand. Examples later on could also contain in-place UTF-8 "upcoding" rules to avoid the more expensive full stream transcoding.
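A small Python sketch of why regexps struggle here. The recovery behavior used below (most C0 codes get executed without leaving the sequence, while CAN, SUB and ESC abort it) is my reading of the DEC-style parser notes on VT100.net and should be treated as an assumption. The function name is hypothetical, and intermediates, private markers and ":" are simply rejected to keep it short.

```python
import re

# Naive CSI pattern: parameters followed by a final byte, no C0 handling.
CSI_RE = re.compile(r"\x1b\[[0-9;]*[@-~]")


def parse_csi_dec_style(data):
    """Parse one 7-bit CSI sequence at the start of data.

    Returns (executed_c0, params, final), or None if the sequence
    was aborted or uses features this sketch does not cover.
    """
    if not data.startswith("\x1b["):
        raise ValueError("expected 7-bit CSI introducer")
    executed, params = [], ""
    for ch in data[2:]:
        if ch in "\x18\x1a\x1b":      # CAN, SUB, ESC: sequence is abandoned
            return None
        if ch < "\x20":               # other C0: execute, stay in the sequence
            executed.append(ch)
        elif ch in "0123456789;":
            params += ch
        elif "\x40" <= ch <= "\x7e":  # final byte: dispatch
            return executed, params, ch
        else:                         # intermediates etc. not covered here
            return None
    return None                       # ran out of data before a final byte


data = "\x1b[1\n;2H"                  # cursor position with a stray LF inside
print(CSI_RE.match(data))             # None
print(parse_csi_dec_style(data))      # (['\n'], '1;2', 'H')
```

The regexp simply fails to see a sequence, while the state based version both executes the LF and still dispatches the CSI. A regexp could be extended to allow C0 codes anywhere, but it would still need a second pass to pull them out in order.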
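As a rough idea of what such an "upcoding" rule could look like, here is a Python sketch for the simplest case: an 8-bit stream where every byte value equals its code point (Latin-1 style). That precondition and the function name are assumptions for illustration; real ISO-2022 designations would need a proper transcoder in front.

```python
def upcode_latin1_to_utf8(raw: bytes) -> bytes:
    """Rewrite bytes >= 0x80 as two-byte UTF-8, leave 7-bit bytes alone."""
    out = bytearray()
    for b in raw:
        if b < 0x80:
            out.append(b)
        else:
            out.append(0xC0 | (b >> 6))     # leading byte: 0xC2 or 0xC3
            out.append(0x80 | (b & 0x3F))   # continuation byte
    return bytes(out)


raw = b"\x9b1;2H"                           # 8-bit CSI introducer + CUP
text = upcode_latin1_to_utf8(raw).decode("utf-8")
print(hex(ord(text[0])))                    # 0x9b
```

After this step the parser only ever sees code points, so the 8-bit CSI shows up as U+009B no matter how the stream was encoded on the wire.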
- subparsing rules
  Once we can sufficiently identify sequence types, we can dig into them. CSI is fairly well spec'ed out, APC/DCS/PM/SOS are not. Here we should discuss whether we want to propagate the de-facto standard of numerical OSC function identifiers, apply DEC's DCS format and so on. A general purpose parser should still have a global rule to catch any sequence that doesn't match those subschemes. We have basically nothing to learn from for APC, PM and SOS. I would treat those as SEP (someone else's problem) in the beginning.
- examples and reference implementations
  This could be done orthogonally to the parsing rules: once we have something fitting the bill there, we can implement helpers and examples for various things. The final "ring to bind 'em all" might be a full reference parser that can be used for a terminal emulator. The parser on VT100.net might be a good starting point for such a parser, but I am not sure about its copyright. Does anyone know?
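For the subparsing step, a helper could look roughly like the Python sketch below. It assumes the high level parser has already stripped introducer and terminator. The "number;text" OSC layout is the de-facto convention mentioned above, the function names are hypothetical, and returning `None` as identifier is just one possible way to express the catch-all rule.

```python
def parse_csi_params(params: str):
    """Split CSI parameters like '1;;3' into [1, None, 3].

    Empty fields become None (default value). Expects digits and ';' only.
    """
    return [int(p) if p else None for p in params.split(";")]


def parse_osc_payload(payload: str):
    """Split an OSC payload of the form 'Ps;Pt' into (number, text).

    Anything without a leading numeric identifier falls through to
    (None, payload), which acts as the catch-all for unknown subschemes.
    """
    ident, _, text = payload.partition(";")
    if ident.isascii() and ident.isdigit():
        return int(ident), text
    return None, payload


print(parse_csi_params("1;;3"))             # [1, None, 3]
print(parse_osc_payload("0;my title"))      # (0, 'my title')
print(parse_osc_payload("something else"))  # (None, 'something else')
```

Only the first ";" is treated as the separator, so the text part may carry further fields that a function specific handler has to split on its own.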
Of course there are a few more things to be sorted out upfront:
- repo strategy
  I would favor simple PR/MR handling here without further requirements (if you like git flow, use it, but it is not mandatory). Broader problems are usually easier to discuss in separate issues; discussion below MRs should be related to particular changes. master should only contain stuff that is well tested and in working condition. Also, own documents/writeups should only enter master after serious moderation and with good reference coverage. This is needed to not drift into a hand-waving-maybe state for something as crucial as a parser definition. There will be rough edges that cannot be verified with references. These will have to pass consensus among active terminal developers to enter master.
- license
  Any work placed here should be under an open license. We should discuss and clarify the licenses beforehand, as this cannot easily be undone later on. Once the licenses are in place, any contribution automatically falls under them, thus it is important to make sure that you don't violate third party rights before contributing anything. The group or the platform won't take responsibility for that (besides deleting questionable stuff).
- CoC
  Nothing besides the rules above and the rules already in place from the platform and the group itself. The group is mainly consensus driven, so content still might get deleted if it is found to be offensive or totally unrelated.
These are my basic thoughts regarding this repo. Happy to hear more thoughts/remarks/ideas.