c - How to implement a language interpreter without regular expressions? -


i attempting write interpreted programming language read in files , output bytecode-like format can executed virtual machine.

my original plan was:

  • begin loading contents of file in memory of interpreter,
  • read each line (where word defined being series of letters followed space or escape character),
  • using regular expressions, determine parameters of word,
  • for example, if myvalue = 5 then become if myvalue 5,
  • convert each of these words byte form (i.e. if become 0x09),
  • execute these bytes 1 one (as executor understand if or 0x09 followed 2 bytes.

i have been told regular expressions terrible way go doing this, i'm unsure if or bad way of implementing interpreted language.

this experience, don't mind if isn't performance friendly, benefit.

what best way of implementing interpreter, , there examples (written in plain old c)?

the reason why people tell regular expressions aren't best idea case because regular expressions take more time evaluate, , regular expression language has many limitations , quirks make unsuitable many applications.

you should aware knee-jerk reaction many programmers have (including myself), regardless of whether or not regular expressions suitable application. stems people trying regular expressions, such trying parse html.

many compilers use basic single-pass tokenization algorithm. tokenizer has basic idea of can used delimiter, how constants , identifiers should treated, etc. tokenizer iterate through input , emit string of tokens can parsed.

for basic application, such parsing tokens, relatively minor penalty using regular expressions isn't worry about. said, there peculiarities how regular expressions work might limit things tokenizer can do, though work can offloaded later point in compiler pipeline. tokenizer expected should representable standard regular expressions though.

it should noted when involve regular expressions directly in code, things can little hairy. have determine how regular expressions should applied input, input delineated, , on. you'll incur penalty of compiling , evaluating regular expressions.

there projects, such lex, make use of regular expressions because provide simple, concise way describe grammar, can make use of internally in whatever representation in chooses. handle of glue logic , merely have describe grammar should use via regular expressions.

when such generator used , can change regular expressions code represents expression means. if sees expression [0-9], can replace call isdigit, equivalent switch statement, or other representation. makes generated code more efficient inline use of regular expressions achieve.

so thought this: if want use regular expressions parse language, go way , create scanner description flex/lex generate tokenizer you. however, if you're writing entirely yourself, you'd better off logically simpler approach 1 described.

i thought fun write sample tokenizer doesn't use regular expressions, here is. wrote in c-like c++. c++ feature used standard vector , string, did in such way drop in c variants.

#include <vector> #include <ctype.h> #include <string>  typedef std::vector<std::string> string_list; typedef std::vector<long long > int_list; typedef std::vector<long double> float_list;  std::string substr(const char* value, size_t length){     std::string v;     v.resize(length);     memcpy(&v[0], value, length * sizeof(char));     return v; }  long long string_to_int(const char* value, size_t length){     return atoll(substr(value, length).c_str()); } long double string_to_float(const char* value, size_t length){     return atof(substr(value, length).c_str()); }   void int_list_add(int_list& list, long long value){     list.push_back(value); } void string_list_add(string_list& list, const char* value, size_t length){     list.push_back(substr(value, length)); } void float_list_add(float_list& list, long double value){     list.push_back(value); } size_t int_list_last(int_list& list){     return list.size(); } size_t string_list_last(string_list& list){     return list.size(); } size_t float_list_last(float_list& list){     return list.size(); }    typedef struct{     string_list identifiers;     string_list constants_string;     int_list constants_int;     float_list constants_float;     size_t id; } *state, state_value;  state tok_state_create(){     state ret = new state_value;     ret->id = 0;     return ret; } void tok_state_destroy(state t_state){     delete t_state; } const char* tok_state_read_identifier(state t_state, size_t id){     return t_state->identifiers[id - 1].c_str(); } const char* tok_state_read_string(state t_state, size_t id){     return t_state->constants_string[id - 1].c_str(); } long long tok_state_read_int(state t_state, size_t id){     return t_state->constants_int[id - 1]; } long double tok_state_read_float(state t_state, size_t id){     return t_state->constants_float[id - 1]; }    const char* punct_tokens[] = { "not token (dummy)", ".", ",", "<", "<<", ">", ">>", ";", "+", "-", "/", "*", "!", "%", "^", "&", "(", ")", "=", "==", "[", "]", "{", "}", "?", ":", "|", "||", "&&", "~", 0 };  const char* key_tokens[] = { "not token (dummy)", "if", "while", "do", "then", "end", 0 };  typedef enum{     tok_type_integer = 500,     tok_type_float,     tok_type_string,     tok_type_identifier,     tok_type_none } tok_type;  const char* get_token_from_id(size_t id){     if (id < 100){         return punct_tokens[id];     }     if (id < 200){         return key_tokens[id - 100];     }     if (id >= 500){         switch (id){         case tok_type_integer:      return "integer constant";         case tok_type_float:        return "float constant  ";         case tok_type_string:       return "string constant ";         case tok_type_identifier:   return "identifier      ";         case tok_type_none:         return "unknown         ";         default:             break;         }     }     return "not token (dummy)"; }  int is_identifier_char(char c){     if (isalpha(c) || c == '_'){         return 1;     }     return 0; }  size_t read_punct_token(const char* input, size_t size){     size_t max_len = 0;     size_t token_id = 0;     (size_t = 1; punct_tokens[i] != 0; ++i){         size_t len = strlen(punct_tokens[i]);         if (len > max_len && len <= size && strncmp(punct_tokens[i], input, len) == 0){             max_len = len;             if (i == 1 && size > 1 && isdigit(input[1])){                 return 0; //special case floats             }             token_id = i;         }     }     return token_id; }  size_t read_key_token(const char* input, size_t size){     size_t max_len = 0;     size_t token_id = 0;     (size_t = 1; key_tokens[i] != 0; ++i){         size_t len = strlen(key_tokens[i]);         if (len > max_len && len <= size && strncmp(key_tokens[i], input, len) == 0){             max_len = len;             token_id = + 100;         }     }     return token_id; }   size_t is_punct_token_char(char c){     (size_t = 1; punct_tokens[i] != 0; ++i){         if (punct_tokens[i][0] == c){             return 1;         }     }     return 0; }   void add_token(state t_state, tok_type type, const char* string, size_t length){     switch (type){     case tok_type_integer:         int_list_add(t_state->constants_int, string_to_int(string, length));         t_state->id = int_list_last(t_state->constants_int);         break;     case tok_type_float:         float_list_add(t_state->constants_float, string_to_float(string, length));         t_state->id = float_list_last(t_state->constants_float);         break;     case tok_type_string:         string_list_add(t_state->constants_string, string, length);         t_state->id = string_list_last(t_state->constants_string);         break;     case tok_type_identifier:         string_list_add(t_state->identifiers, string, length);         t_state->id = string_list_last(t_state->identifiers);         break;     default:         //do error here         break;     } }  size_t get_token(state t_state, char** input, size_t *size){     if (t_state->id != 0){         size_t id = t_state->id;         t_state->id = 0;         return id;     }     char* base = *input;     size_t padding = 0;     size_t length = 0;     tok_type type = tok_type_none;     while (*size > 0){         if (isspace(*base)){             base++;             (*size)--;         }         else{             break;         }     }      size_t tok = read_punct_token(base, *size);     if (tok){         size_t len = +strlen(get_token_from_id(tok));         *input = base + len;         *size -= len;         return tok;     }     tok = read_key_token(base, *size);     if (tok){         size_t len = +strlen(get_token_from_id(tok));         *input = base + len;         *size -= len;         return tok;     }      while (*size - length > 0){         if (length == 0 && type == tok_type_none){             if (is_identifier_char(*base)){                 type = tok_type_identifier;                 length++;             }             else if (*base == '"'){                 type = tok_type_string;                 padding = 1;                 base++;                 (*size)--;             }             else if (*base == '.' && *size > 1 && isdigit(base[1])){                 type = tok_type_float;             }             else if (isdigit(*base)){                 type = tok_type_integer;             }             else if (is_punct_token_char(*base)){                 tok = read_punct_token(base, *size);                 if (tok){                     size_t len = strlen(punct_tokens[tok]);                     *input += len;                     *size -= len;                     return tok;                 }                 else{                     //do error                 }             }         }         else{             if (!isspace(base[length]) || type == tok_type_string){                 switch (type){                 case tok_type_integer:                     if (isdigit(base[length])){                         length++;                         continue;                     }                     else if (base[length] == '.' || tolower(base[length]) == 'e'){                         type = tok_type_float;                         length++;                         continue;                     }                     break;                 case tok_type_float:                     if (isdigit(base[length]) || base[length] == '.' || base[length] == 'e'){                         length++;                         continue;                     }                     break;                 case tok_type_string:                     if (base[length] != '"'){                         length++;                         continue;                     }                     break;                 case tok_type_identifier:                     if (is_identifier_char(base[length])){                         length++;                         continue;                     }                     break;                 default:                     break;                 }             }             //we here if space or of switch cases didn't continue.             add_token(t_state, type, base, length);             *input = base + length + padding;             *size -= length + padding;             return type;         }     }     *input = base + length + padding;     *size -= length + padding;     return 0; }  int main(){     const char* input = "if(1+1==4)then print\"hi!\";end";     state s = tok_state_create();     size_t size = strlen(input);     size_t token;     size_t token_prev = 0;     printf("token\tmeaning\n\n");      while ((token = get_token(s, (char**)&input, &size)) != 0){         if (token_prev < 500){             if (token < 500){                 printf("%d\t%s\n", token, get_token_from_id(token));             }             else{                 printf("%d\t%s #", token, get_token_from_id(token));             }         }         else{             printf("%d\t", token);             switch (token_prev){             case tok_type_identifier: printf("%s\n", tok_state_read_identifier(s, token)); break;             case tok_type_string: printf("%s\n", tok_state_read_string(s, token)); break;             case tok_type_integer: printf("%d\n", tok_state_read_int(s, token)); break;             case tok_type_float: printf("%f\n", tok_state_read_float(s, token)); break;              }         }         token_prev = token;     }      tok_state_destroy(s); } 

this print:

token   meaning  101     if 16      ( 500     integer constant #1     1 8       + 500     integer constant #2     1 19      == 500     integer constant #3     4 17      ) 104     503     identifier       #1     print 502     string constant  #1     hi! 7       ; 105     end 

Comments

Popular posts from this blog

javascript - AngularJS custom datepicker directive -

javascript - jQuery date picker - Disable dates after the selection from the first date picker -