BioLegato - Paper: "PCD-bioLegato Menu Language: Syntax and Semantic"
From Bioinformatics.Org Wiki
Note: Last vesion in DOC format above
PCD - bioLegato Menu Language: Syntax and Semantic
In computer science, the syntax of a programming language is the set of rules that define the combinations of symbols that are considered to be correctly structured programs in that language. The syntax of a language defines its surface form. Text-based programming languages are based on sequences of characters. The lexical grammar of a textual language specifies how characters must be chunked into tokens. Other syntax rules specify the permissible sequences of these tokens and the process of assigning meaning to these token sequences is part of semantics.
The syntactic analysis of source code usually entails the transformation of the linear sequence of tokens into a hierarchical syntax tree (abstract syntax trees are one convenient form of syntax tree). This process is called parsing, as it is in syntactic analysis inlinguistics . Tools have been written that automatically generate parsers from a specification of a language grammar written in Backus-Naur form, e.g., Yacc (yet another compiler compiler).
Syntax The term "syntax" refers to the structure of a programming language, in particular, the different series of symbols and words that make up the basic parts of the language. The most common way of specifying the syntax of a language is use a notation known as Backus-Naur Form.
Terminals & Nonterminals
Backus-Naur Form, BNF for short, is a notation used to describe grammars. The notation breaks down the grammar into a series of rules - which are used to describe how the languages lexical and syntactic structures are used to form different logical units.
The actual reserved words, symbols, etc... of the grammar are represented "terminals". In Backus-Naur Form, terminals are usually left without any special formatting or are simply delimited by single or double quotes. Examples include: if, while, '=' and identifier.
Syntactic rules are represented with a "nonterminal" - which are structure names. Typically, nonterminals are delimited by angle-brackets, but this is not always the case. Examples include <statement> and <exp>. Both terminals and nonterminals are referred to generically as "symbols".
The actual syntax of the grammar is specified by combining terminals and nonterminals into syntactic rules known as "productions". They have the following format: N ::= s where N is a nonterminal and s is a series of zero or more terminals and nonterminals. Different alternatives can be specified in Backus-Naur Form. For readability, often productions are grouped together and separated by a “pipe” symbol - which is read as the word “or”.
Basically, a production has the following properties.
- The production starts with a single nonterminal, which is the name of the structure being defined
- This nonterminal is followed by a ::= symbol which means “as defined as”. The ::= symbol is often used interchangeably with the symbol. They both have the same meaning.
- The symbol is followed by a series of terminals and nonterminals.
Deterministic Finite Automata
A Deterministic Finite Automaton is used to perform lexical analysis for the parser.
As the name implies, deterministic finite automata are deterministic. This means that from any given state there is only one path for any given input. In other words, there is no ambiguity in state transition. It is also complete which means there is one path from any given input. It is finite; meaning there is a fixed and known number of states and transitions between states. Finally, it is an automaton. The transition from state to state is completely determined by the input. The algorithm merely follows the correct branches based on the information provide, the algorithm jumps to (gotos) the appropriate state representing the reduced nonterminal. This simulates the shifting of a nonterminal in the LR state machine.
Finally, when the start symbol (the nonterminal used to represent the entire grammar) is reduced, the input is both complete and correct. At this point, parsing terminates. d.
A DFA is commonly represented with a graph. The term "graph" is used quite loosely by other scientific fields. Often, it is refers to a plotted mathematical function or graphical representation of data. In computer science terms, however, a "graph" is simply a collection of nodes connected by edges. The figure to the right is a simple Deterministic Finite Automata that recognizes common identifiers and numbers. For instance, assume that the input contains the text "gold". From State 1 (the initial state), the DFA moves to State 2 when the "g" is read. For the next three characters, "o", "l" and "d", the DFA continues to loop to State 2.
By design, the tokenizer attempts to match the longest series of characters possible before accepting a token. For example: if the tokenizer is reading the characters "count" from the source, it can match the first character "c" as an identifier. It would not be prudent for the tokenizer to report five separate identifiers: "c", "o", "u", "n" and "t".
The LALR algorithm is used to perform syntactic analysis for the parser.. Like Deterministic Finite Automata, the LALR algorithm is a simple state transition graph. Each state represents a different stage the system can be in as it parses a string.
Each state represents a point in the parse process where a number of tokens have been read from the source and rules are in different states of completion. Each production in a state of completion is called a "configuration" and each state is really a configuration set.
For each token received from the. lexer, the LR algorithm can take four different actions: Shift, Reduce, Accept and Goto.
For each state, the LR algorithm checks the next token on the input queue against all tokens that expected at that stage of the parse. If the token is expected, it is "shifted". This action represents moving the cursor past the current token. The token is removed form the input queue and pushed onto the parse stack. A reduce is performed when a rule is complete and ready to be replaced by the single nonterminal it represents. Essentially, the tokens that are part of the rule's handle - the right-hand side of the definition - are popped from the parse stack and replaced by the rule's nonterminal.
When a rule is reduced, the algorithm jumps to (gotos) the appropriate state representing the reduced nonterminal. This simulates the shifting of a nonterminal in the LR state machine.
Finally, when the start symbol (the nonterminal used to represent the entire grammar) is reduced, the input is both complete and correct. At this point, parsing terminates.
2. Formal Syntax
2.2.1. Syntax Rules
3. Operational Semantics
4. Application of the PCD - BioLegato Menu Language