Tutorial CYFA - Creating Your First Assembler - The Language
So, if you're just joining in to this series, this is part 8 in a tutorial series about building an ARM assembler. I advise that you read the previous parts, otherwise you may not understand. You can find the full list at this page.

Ok, so in the last part, we wrote some helper methods for initialization and conversion. At this point, it's possible to fill our structures with instructions. Now, in this part, we're going to start the amazing fun that is language parsing.

I'm going to split the parsing up into parts. In this one, we're just going to build structures that define the opcodes, conditions, op-lists, registers, etc. We will later use these structures to help us parse the syntax and make sure your code is error free when you write your first assembly program. Doing this is vastly superior to straight parsing, because we can define the entire language up front, and have a single parse function that does it all for us, and that we don't have to modify every time we need to tweak something.

At this point, your project should look like this:
[Image: JYVyWcT.png]

That's good, we want to keep everything organized so that we can find it easily. Let's go ahead and create a file in /Headers called language.h. We'll write all of our code in here and worry about organizing it in the end (since most of this is coming straight from my brain and tbh I have no idea what code we're going to write just yet). Side note: it's a good thing that I didn't plan this as much as I would have in a production environment, because I can guarantee that by the time we finish this, there will be at least 5 good bugs that we will have to fix. This will also give you a good intro to hardware design and reverse engineering, and prevent people from leeching all these parts before it's done.

Ok, so you've created language.h. C::B will autogen some file contents, which is good, we want that. If you're using another IDE, it should look like this:
Code:
#ifndef LANGUAGE_H_INCLUDED
#define LANGUAGE_H_INCLUDED



#endif // LANGUAGE_H_INCLUDED

The first thing I want to define is an enumeration that defines the types of instructions that exist for this assembler. These will aid in knowing which structure in our union to fill. Remember, we are only supporting 3 distinct instruction classes:
  1. Data Processing
  2. Single Data Transfer
  3. Branch

Alright, stop. You probably failed that quiz (as did I). We already made that enumeration, it's in instruction.h. We're going to go ahead and leave that in place.
So, our next task is to define a structure that holds our registers. This is going to be pretty straightforward, since ARM uses a numbered sequence rather than x86's lettered sequence. Let's go ahead and define that structure:

So, we're going to need a string pointer for the register name, and an integer holding the register's number. Now, if you remember from our earlier parts, the registers are numbered 0-15. This means that we want to cap our data storage at that range. Although this won't actually make a difference, it's good practice and a good example of self-documenting code to limit this to a nibble (4 bits). We will also need to add stdint.h to this file, so that we have access to the exact-width integer types.

Code:
#include <stdint.h>

struct language_register
{
   char *reg_name;
   uint8_t reg_num : 4;
};
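To make that structure a bit more concrete, here's a rough sketch of how a table of registers could be searched once we start parsing. None of this is part of the project yet: the table contents, the NULL-name terminator, and the helpers `names_equal` and `find_register` are all assumptions made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

struct language_register
{
   char *reg_name;
   uint8_t reg_num : 4;
};

/* hypothetical table: only a few entries, terminated by a NULL name */
static struct language_register example_registers[] =
{
   { "R0", 0 },
   { "R1", 1 },
   { "R2", 2 },
   { "R15", 15 },
   { "PC", 15 }, /* alias for R15 */
   { NULL, 0 }
};

/* hypothetical helper: case-insensitive string equality */
static int names_equal(const char *a, const char *b)
{
   while (*a && *b)
   {
       if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
           return 0;
       a++;
       b++;
   }
   return *a == *b; /* equal only if both strings ended together */
}

/* returns the register number (0-15), or -1 if the name is unknown */
static int find_register(const char *name)
{
   size_t i;
   assert(name != NULL);
   for (i = 0; example_registers[i].reg_name != NULL; i++)
   {
       if (names_equal(example_registers[i].reg_name, name))
           return example_registers[i].reg_num;
   }
   return -1;
}
```

With this table, find_register("r1") gives 1 and find_register("R16") gives -1. Whether the real assembler should accept lowercase names is a design choice we haven't made yet.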

Cool. Now, let's define a structure that will hold our language syntax. To do that, we first have to define the individual tokens that will make up our language. This will be an enumeration.
Now, we know at this point, that our tokens are this:
  • mnemonic
  • register
  • constant
  • expression

So, before we really get into what those mean, let's just go ahead and create our enumeration:
Code:
enum language_token
{
   kTOKEN_UNDEF,
   kTOKEN_MNEMONIC,
   kTOKEN_REGISTER,
   kTOKEN_CONSTANT,
   kTOKEN_EXPRESSION
};

Ok, let me explain those. So, half of these are pretty straightforward, I'll go over those first.
A register is directly mapped to a CPU register, we will use our struct language_register to help parse these.
A constant is literally that, it's a constant value. We don't distinguish between bases in this enumeration (but we will).
A mnemonic is where things get neat. So, we need a way to parse both opcodes and conditional flags. The combination of these two is the mnemonic. We will split these later on.
Finally, an expression. This will be useful for memory accessors, or just plain old "somebody put an explicit calculation in the field". Expressions will either evaluate to another expression, or a constant.

Ok cool. Now that we've gone over that, let's go ahead and modify our enumeration to hold the derived types as well:
Code:
enum language_token
{
   kTOKEN_UNDEF,
   /* basic tokens */
   kTOKEN_MNEMONIC,
   kTOKEN_REGISTER,
   kTOKEN_CONSTANT,
   kTOKEN_EXPRESSION
   /* evaluated tokens */
   kTOKEN_OPCODE,
   kTOKEN_CONDITION,
   kTOKEN_ACCESSOR
};

Sweet. Now, we technically have all of the tools that we need to start building our language structure. Let's go ahead and start that. For this, we're going to have (to start)
  1. The name of the rule (for debug logging)
  2. An array of type enum language_token (to hold our ORDERED token syntax list)
Ok, so the code for that should look like this:
Code:
struct language_rule
{
   char *name;
   enum language_token *syntax;
};
Now, since we're dealing with arrays, we will need some form of NULL terminator. Right now, we don't have anything like that in our enumeration, so let's go ahead and add it. When you're done, the entire file should look like this:
Code:
#ifndef LANGUAGE_H_INCLUDED
#define LANGUAGE_H_INCLUDED

#include <stdint.h>

struct language_register
{
   char *reg_name;
   uint8_t reg_num : 4;
};

enum language_token
{
   kTOKEN_NULL = 0, /* terminator for language rule list */
   kTOKEN_UNDEF,
   /* basic tokens */
   kTOKEN_MNEMONIC,
   kTOKEN_REGISTER,
   kTOKEN_CONSTANT,
   kTOKEN_EXPRESSION
   /* evaluated tokens */
   kTOKEN_OPCODE,
   kTOKEN_CONDITION,
   kTOKEN_ACCESSOR
};

struct language_rule
{
   char *name;
   enum language_token *syntax;
};

#endif // LANGUAGE_H_INCLUDED
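Here's a small sketch of why the terminator matters: with kTOKEN_NULL at the end, any code that walks a syntax list knows where to stop without being told a length. The helper `syntax_length` and the example list are hypothetical, and the character token comes later, so it's left out here.

```c
#include <assert.h>
#include <stddef.h>

enum language_token
{
   kTOKEN_NULL = 0, /* terminator for language rule list */
   kTOKEN_UNDEF,
   /* basic tokens */
   kTOKEN_MNEMONIC,
   kTOKEN_REGISTER,
   kTOKEN_CONSTANT,
   kTOKEN_EXPRESSION,
   /* evaluated tokens */
   kTOKEN_OPCODE,
   kTOKEN_CONDITION,
   kTOKEN_ACCESSOR
};

/* hypothetical helper: number of tokens before the terminator */
static size_t syntax_length(const enum language_token *syntax)
{
   size_t count = 0;
   assert(syntax != NULL);
   while (syntax[count] != kTOKEN_NULL)
       count++;
   return count;
}

/* example list: MNEMONIC REGISTER CONSTANT, then the terminator */
static const enum language_token example_syntax[] =
{
   kTOKEN_MNEMONIC,
   kTOKEN_REGISTER,
   kTOKEN_CONSTANT,
   kTOKEN_NULL
};
```

Because kTOKEN_NULL is 0, a list that forgets the terminator will run off the end of the array, so every list has to end with it.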

Ok. So now we can make up some of the basic syntax, but we're still at a loss if the line should contain specific characters. To do that, we're going to have to define another enumeration value for a hard coded constant character. That one's pretty simple, just add
Code:
kTOKEN_CHARACTER
into your list somewhere. I'm going to put it above the basic tokens list, since these are something we're going to define.

Awesome. Now we come to the third thing that our structure needs. When we define our array of tokens, we're going to have a bunch of kTOKEN_CHARACTER entries in there indicating where we need spaces, commas, brackets, etc. But as it stands right now, we have no way of defining what those characters are. Since they will be used to determine how far we read when searching for a token, we're going to also need a character array to set those. Let's add that to our structure.
Code:
struct language_rule
{
   char *name;
   enum language_token *syntax;
   char *characters;
};
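To show how the two arrays are meant to line up, here's a sketch under the assumption that the Nth kTOKEN_CHARACTER in the syntax list maps to the Nth entry of characters. The helper `expected_character` and the example rule are hypothetical, and the enum is cut down to what the example needs.

```c
#include <assert.h>
#include <stddef.h>

enum language_token
{
   kTOKEN_NULL = 0, /* terminator */
   kTOKEN_CHARACTER,
   kTOKEN_REGISTER,
   kTOKEN_CONSTANT
};

struct language_rule
{
   char *name;
   enum language_token *syntax;
   char *characters;
};

/* returns the literal expected at syntax[index], or '\0' when that
   slot is not a kTOKEN_CHARACTER (or index is past the terminator) */
static char expected_character(const struct language_rule *rule, size_t index)
{
   size_t i;
   size_t seen = 0; /* character tokens passed so far */
   assert(rule != NULL && rule->syntax != NULL);
   for (i = 0; rule->syntax[i] != kTOKEN_NULL; i++)
   {
       if (i == index)
       {
           return rule->syntax[i] == kTOKEN_CHARACTER
               ? rule->characters[seen]
               : '\0';
       }
       if (rule->syntax[i] == kTOKEN_CHARACTER)
           seen++;
   }
   return '\0';
}

/* hypothetical rule: REGISTER ',' '#' CONSTANT */
static enum language_token example_syntax[] =
{
   kTOKEN_REGISTER,
   kTOKEN_CHARACTER,
   kTOKEN_CHARACTER,
   kTOKEN_CONSTANT,
   kTOKEN_NULL
};
static char example_characters[] = { ',', '#' };
static struct language_rule example_rule =
{
   "EXAMPLE",
   example_syntax,
   example_characters
};
```

Note that characters has no terminator of its own in this sketch. Its length is implied by how many kTOKEN_CHARACTER entries the syntax list holds, so the two must be kept in step by hand.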

Ok, so now we need the fourth thing. And this is where it gets a little tricky: we need to add a member to our structure that defines the type of instruction this rule refers to. As you remember, this is defined in instruction.h. This poses a bit of a design problem, because it IS used in that file. We can either leave it in place and include instruction.h in our file, or move it to our file and include that in instruction.h. For now, we're going to leave it in place, although we may need to move it later on. Let's go ahead and include instruction.h, and then add an entry in our structure for that. Afterwards, your file will look as follows:
Code:
#ifndef LANGUAGE_H_INCLUDED
#define LANGUAGE_H_INCLUDED

#include <stdint.h>
#include <instruction.h>

struct language_register
{
   char *reg_name;
   uint8_t reg_num : 4;
};

enum language_token
{
   kTOKEN_NULL = 0, /* terminator for language rule list */
   kTOKEN_UNDEF,
   kTOKEN_CHARACTER,
   /* basic tokens */
   kTOKEN_MNEMONIC,
   kTOKEN_REGISTER,
   kTOKEN_CONSTANT,
   kTOKEN_EXPRESSION
   /* evaluated tokens */
   kTOKEN_OPCODE,
   kTOKEN_CONDITION,
   kTOKEN_ACCESSOR
};

struct language_rule
{
   char *name;
   enum instruction_type type;
   enum language_token *syntax;
   char *characters;
};

#endif // LANGUAGE_H_INCLUDED

At this point, we also need to define what mnemonics we allow for this rule. This means both opcode and condition. We can do this one of two ways.
1. we can write one rule for each mnemonic
2. we can write one rule for a group of mnemonics
Just to make sure this code is shorter, we're going to go with option 2. This means that we need an enumeration that is properly segmented, to define all of our opcodes. Let's name that enum language_opcode and give it a prefix of kOPCODE_.
Go ahead and fill out our enum using the previous parts of this series as a guide.
Ok, here's my code. I know I included more opcodes than our guide historically has, I just want to make sure I covered everything I could.
Code:
enum language_opcode
{
   /* Data processing */
   kOPCODE_AND = 0x00001,
   kOPCODE_EOR = 0x00002,
   kOPCODE_SUB = 0x00004,
   kOPCODE_RSB = 0x00008,
   kOPCODE_ADD = 0x00010,
   kOPCODE_ADC = 0x00020,
   kOPCODE_SBC = 0x00040,
   kOPCODE_RSC = 0x00080,
   kOPCODE_TST = 0x00100,
   kOPCODE_TEQ = 0x00200,
   kOPCODE_CMP = 0x00400,
   kOPCODE_CMN = 0x00800,
   kOPCODE_ORR = 0x01000,
   kOPCODE_MOV = 0x02000,
   kOPCODE_BIC = 0x04000,
   kOPCODE_MVN = 0x08000,
   /* single data transfer */
   kOPCODE_LDR = 0x10000,
   kOPCODE_STR = 0x20000,
   /* branches */
   kOPCODE_B   = 0x40000,
   kOPCODE_BL  = 0x80000
};
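The reason every opcode gets its own bit is that a group of opcodes can then be stored in one integer and checked with a single AND. Here's a minimal sketch of that check. The helper `opcode_is_allowed` is hypothetical, and only a handful of the opcodes are repeated to keep it short.

```c
#include <assert.h>
#include <stdint.h>

/* excerpt of the enumeration above, same values */
enum language_opcode
{
   kOPCODE_SUB = 0x00004,
   kOPCODE_ADD = 0x00010,
   kOPCODE_MOV = 0x02000,
   kOPCODE_LDR = 0x10000
};

/* non-zero when `opcode` (exactly one flag) is part of the `allowed` mask */
static int opcode_is_allowed(uint32_t allowed, enum language_opcode opcode)
{
   uint32_t bit = (uint32_t)opcode;
   /* a single opcode must have exactly one bit set */
   assert(bit != 0 && (bit & (bit - 1)) == 0);
   return (allowed & bit) != 0;
}
```

For example, with a mask of kOPCODE_ADD | kOPCODE_SUB, asking about kOPCODE_ADD succeeds and asking about kOPCODE_MOV fails. Twenty opcodes fit comfortably in 32 bits, which is why a uint32_t is enough for the mask.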

Now, you may ask the question: "do we have to segment the condition enumeration?" Well, no. All instructions are conditional and all instructions can carry every condition, so we don't need to define allowed conditions at all. Now, let's go ahead and add an INTEGER argument for our structure that allows us to define the allowed instructions. When complete, your file will look like this. (sorry I keep posting this file, it's just super important that you get this one right).
Code:
#ifndef LANGUAGE_H_INCLUDED
#define LANGUAGE_H_INCLUDED

#include <stdint.h>
#include <instruction.h>

struct language_register
{
   char *reg_name;
   uint8_t reg_num : 4;
};

enum language_token
{
   kTOKEN_NULL = 0, /* terminator for language rule list */
   kTOKEN_UNDEF,
   kTOKEN_CHARACTER,
   /* basic tokens */
   kTOKEN_MNEMONIC,
   kTOKEN_REGISTER,
   kTOKEN_CONSTANT,
   kTOKEN_EXPRESSION
   /* evaluated tokens */
   kTOKEN_OPCODE,
   kTOKEN_CONDITION,
   kTOKEN_ACCESSOR
};

enum language_opcode
{
   /* Data processing */
   kOPCODE_AND = 0x00001,
   kOPCODE_EOR = 0x00002,
   kOPCODE_SUB = 0x00004,
   kOPCODE_RSB = 0x00008,
   kOPCODE_ADD = 0x00010,
   kOPCODE_ADC = 0x00020,
   kOPCODE_SBC = 0x00040,
   kOPCODE_RSC = 0x00080,
   kOPCODE_TST = 0x00100,
   kOPCODE_TEQ = 0x00200,
   kOPCODE_CMP = 0x00400,
   kOPCODE_CMN = 0x00800,
   kOPCODE_ORR = 0x01000,
   kOPCODE_MOV = 0x02000,
   kOPCODE_BIC = 0x04000,
   kOPCODE_MVN = 0x08000,
   /* single data transfer */
   kOPCODE_LDR = 0x10000,
   kOPCODE_STR = 0x20000,
   /* branches */
   kOPCODE_B   = 0x40000,
   kOPCODE_BL  = 0x80000
};

struct language_rule
{
   char *name;
   enum instruction_type type;
   enum language_token *syntax;
   uint32_t allowed_opcodes;
   char *characters;
};

#endif // LANGUAGE_H_INCLUDED

Cool. Now, just for shits and giggles, I'm going to write a quick rule that would define an ADD instruction:
Code:
struct language_rule add =
{
   "ADD_NO_CONDITION",
   kINSTRUCTION_DATA,
   {
       kTOKEN_OPCODE,   /* ADD */
       kTOKEN_CHARACTER,/* SPACE */
       kTOKEN_REGISTER, /* Rd */
       kTOKEN_CHARACTER,/* , */
       kTOKEN_REGISTER, /* Rs */
       kTOKEN_CHARACTER,/* , */
       kTOKEN_REGISTER  /* Rm */
   },
   kOPCODE_ADD,
   {
       ' ',
       ',',
       ','
   }
};
Note: DO NOT put this in your project, this is just for reference.
Ok, so I'm glad I tested this. I want to reorder the members so that the characters come immediately after the syntax. My new language_rule structure looks like this:
Code:
struct language_rule
{
   char *name;
   enum instruction_type type;
   enum language_token *syntax;
   char *characters;
   uint32_t allowed_opcodes;
};
and my new rule looks like this:
Code:
struct language_rule add =
{
   "ADD_NO_CONDITION",  /* .name */
   kINSTRUCTION_DATA,   /* .type */
   {                    /* .syntax */
       kTOKEN_OPCODE,   /* ADD */
       kTOKEN_CHARACTER,/* SPACE */
       kTOKEN_REGISTER, /* Rd */
       kTOKEN_CHARACTER,/* , */
       kTOKEN_REGISTER, /* Rs */
       kTOKEN_CHARACTER,/* , */
       kTOKEN_REGISTER  /* Rm */
   },
   {                    /* .characters */
       ' ',
       ',',
       ','
   },
   kOPCODE_ADD          /* .allowed_opcodes */
};
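One thing to be aware of with the reference rule: syntax and characters are pointers, and in C a brace list can't fill in a pointer member the way it fills in an array, so the rule as written is only a picture of the idea. Here's one way to get something that actually compiles, using named arrays that the rule points at. The array names, the added kTOKEN_NULL terminator, and the stand-in for enum instruction_type (the real one lives in instruction.h) are assumptions for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* stand-in for the enum in instruction.h */
enum instruction_type
{
   kINSTRUCTION_UNDEF,
   kINSTRUCTION_DATA
};

/* cut-down copies of the enums from language.h */
enum language_token
{
   kTOKEN_NULL = 0, /* terminator */
   kTOKEN_CHARACTER,
   kTOKEN_REGISTER,
   kTOKEN_OPCODE
};

enum language_opcode
{
   kOPCODE_ADD = 0x00010
};

struct language_rule
{
   char *name;
   enum instruction_type type;
   enum language_token *syntax;
   char *characters;
   uint32_t allowed_opcodes;
};

/* the token list lives in its own array, ending with the terminator */
static enum language_token add_syntax[] =
{
   kTOKEN_OPCODE,    /* ADD */
   kTOKEN_CHARACTER, /* SPACE */
   kTOKEN_REGISTER,  /* Rd */
   kTOKEN_CHARACTER, /* , */
   kTOKEN_REGISTER,  /* Rs */
   kTOKEN_CHARACTER, /* , */
   kTOKEN_REGISTER,  /* Rm */
   kTOKEN_NULL
};

/* one entry per kTOKEN_CHARACTER above, in order */
static char add_characters[] = { ' ', ',', ',' };

struct language_rule add =
{
   "ADD_NO_CONDITION",  /* .name */
   kINSTRUCTION_DATA,   /* .type */
   add_syntax,          /* .syntax */
   add_characters,      /* .characters */
   kOPCODE_ADD          /* .allowed_opcodes */
};
```

C99 compound literals would avoid the extra names, but separate arrays work on any C compiler. Like the reference rule, this is still not something to paste into the project.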

Now that all sounds pretty straightforward, but it has just occurred to me that we won't be able to sense end of line, and the following is NOT valid:
Code:
ADD  R0,R1,R2
Can you tell what's wrong? Probably not. There are two spaces between ADD and R0. We didn't define this, and we forced it to be a space rather than a tab or whatever. Let's go ahead and write our assembler to ignore whitespace. If we do this, we can remove the character token between opcode and register, and of course the space from our list. This would make our EXAMPLE instruction rule look like this:
Code:
struct language_rule add =
{
   "ADD_NO_CONDITION",  /* .name */
   kINSTRUCTION_DATA,   /* .type */
   {                    /* .syntax */
       kTOKEN_OPCODE,   /* ADD */
       kTOKEN_REGISTER, /* Rd */
       kTOKEN_CHARACTER,/* , */
       kTOKEN_REGISTER, /* Rs */
       kTOKEN_CHARACTER,/* , */
       kTOKEN_REGISTER  /* Rm */
   },
   {                    /* .characters */
       ',',
       ','
   },
   kOPCODE_ADD          /* .allowed_opcodes */
};
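Ignoring whitespace could look something like the sketch below. We haven't written the parser yet, so both helpers (`skip_whitespace` and `match_character`) are hypothetical, and treating only spaces and tabs as whitespace is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical helper: advance past spaces and tabs */
static const char *skip_whitespace(const char *cursor)
{
   assert(cursor != NULL);
   while (*cursor == ' ' || *cursor == '\t')
       cursor++;
   return cursor;
}

/* hypothetical helper: consume `expected`, allowing whitespace on
   both sides. Returns 1 and moves *cursor on success, otherwise
   returns 0 and leaves *cursor alone. */
static int match_character(const char **cursor, char expected)
{
   const char *p;
   assert(cursor != NULL && *cursor != NULL);
   p = skip_whitespace(*cursor);
   if (*p == '\0' || *p != expected)
       return 0;
   *cursor = skip_whitespace(p + 1);
   return 1;
}
```

With helpers like these, "ADD  R0,R1,R2" and "ADD R0 , R1 , R2" would both read the same way, because the gaps are swallowed before each token is looked at.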

Ok. So with that done, and with you a little bit smarter, we have to define a couple of other things before we end this part (since it's already too long).

First of all, we're going to rename language_register and then add a couple of other structures. The code is as follows:
Code:
struct language_parsing_register
{
   char *reg_name;
   uint8_t reg_num : 4;
};

struct language_parsing_opcode
{
   char *opcode_name;
   enum language_opcode opcode_value; /* DO NOT LOR THESE! */
};

struct language_parsing_condition
{
   char *condition_name;
   enum instruction_condition condition_value;
};

Ok, now we have a problem. We've used enum language_opcode before we've defined it. It's time to start organizing things. Go ahead and create a folder in /Headers called language, and add a file there called parsing.h. This file is currently blank:
Code:
#ifndef PARSING_H_INCLUDED
#define PARSING_H_INCLUDED



#endif // PARSING_H_INCLUDED

Go ahead and paste those three structures into this file. Don't worry about includes on this one.

Code:
#ifndef PARSING_H_INCLUDED
#define PARSING_H_INCLUDED

struct language_parsing_register
{
   char *reg_name;
   uint8_t reg_num : 4;
};

struct language_parsing_opcode
{
   char *opcode_name;
   enum language_opcode opcode_value; /* DO NOT LOR THESE! */
};

struct language_parsing_condition
{
   char *condition_name;
   enum instruction_condition condition_value;
};

#endif // PARSING_H_INCLUDED
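To give you an idea of where these structures are heading, here's a sketch of splitting a mnemonic into its opcode and condition using two lookup tables. Everything about it is an assumption: the tables are cut down to a few entries, the kCONDITION_ names are made up (the real enum instruction_condition lives in instruction.h), the empty suffix is treated as "always", and only uppercase input is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum language_opcode
{
   kOPCODE_ADD = 0x00010,
   kOPCODE_B   = 0x40000,
   kOPCODE_BL  = 0x80000
};

/* stand-in for instruction.h; names are hypothetical, values are
   the ARM condition field encodings */
enum instruction_condition
{
   kCONDITION_EQ = 0x0,
   kCONDITION_NE = 0x1,
   kCONDITION_LE = 0xD,
   kCONDITION_AL = 0xE
};

struct language_parsing_opcode
{
   char *opcode_name;
   enum language_opcode opcode_value;
};

struct language_parsing_condition
{
   char *condition_name;
   enum instruction_condition condition_value;
};

static struct language_parsing_opcode opcodes[] =
{
   { "ADD", kOPCODE_ADD },
   { "B",   kOPCODE_B },
   { "BL",  kOPCODE_BL },
   { NULL,  0 }
};

static struct language_parsing_condition conditions[] =
{
   { "EQ", kCONDITION_EQ },
   { "NE", kCONDITION_NE },
   { "LE", kCONDITION_LE },
   { "AL", kCONDITION_AL },
   { "",   kCONDITION_AL }, /* no suffix means "always" */
   { NULL, 0 }
};

/* returns 1 and fills both outputs when the mnemonic is an opcode
   name followed by a known condition suffix, otherwise returns 0 */
static int split_mnemonic(const char *mnemonic,
                          enum language_opcode *opcode,
                          enum instruction_condition *condition)
{
   size_t i, j;
   assert(mnemonic != NULL && opcode != NULL && condition != NULL);
   for (i = 0; opcodes[i].opcode_name != NULL; i++)
   {
       size_t length = strlen(opcodes[i].opcode_name);
       if (strncmp(mnemonic, opcodes[i].opcode_name, length) != 0)
           continue; /* opcode name is not a prefix */
       for (j = 0; conditions[j].condition_name != NULL; j++)
       {
           /* the rest of the mnemonic must be exactly one suffix */
           if (strcmp(mnemonic + length, conditions[j].condition_name) == 0)
           {
               *opcode = opcodes[i].opcode_value;
               *condition = conditions[j].condition_value;
               return 1;
           }
       }
   }
   return 0;
}
```

Requiring the leftover text to be a valid condition is what keeps B and BL apart: "BLE" comes out as B + LE, while "BLEQ" comes out as BL + EQ.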

Now, we don't have to do this part, but I think it will look neater if we do. Let's also make a file language/enumerations.h and of course, paste our enumerations inside that.
Code:
#ifndef ENUMERATIONS_H_INCLUDED
#define ENUMERATIONS_H_INCLUDED

enum language_token
{
   kTOKEN_NULL = 0, /* terminator for language rule list */
   kTOKEN_UNDEF,
   kTOKEN_CHARACTER,
   /* basic tokens */
   kTOKEN_MNEMONIC,
   kTOKEN_REGISTER,
   kTOKEN_CONSTANT,
   kTOKEN_EXPRESSION
   /* evaluated tokens */
   kTOKEN_OPCODE,
   kTOKEN_CONDITION,
   kTOKEN_ACCESSOR
};

enum language_opcode
{
   /* Data processing */
   kOPCODE_AND = 0x00001,
   kOPCODE_EOR = 0x00002,
   kOPCODE_SUB = 0x00004,
   kOPCODE_RSB = 0x00008,
   kOPCODE_ADD = 0x00010,
   kOPCODE_ADC = 0x00020,
   kOPCODE_SBC = 0x00040,
   kOPCODE_RSC = 0x00080,
   kOPCODE_TST = 0x00100,
   kOPCODE_TEQ = 0x00200,
   kOPCODE_CMP = 0x00400,
   kOPCODE_CMN = 0x00800,
   kOPCODE_ORR = 0x01000,
   kOPCODE_MOV = 0x02000,
   kOPCODE_BIC = 0x04000,
   kOPCODE_MVN = 0x08000,
   /* single data transfer */
   kOPCODE_LDR = 0x10000,
   kOPCODE_STR = 0x20000,
   /* branches */
   kOPCODE_B   = 0x40000,
   kOPCODE_BL  = 0x80000
};

#endif // ENUMERATIONS_H_INCLUDED

Now, our original language.h is pretty empty. It looks like this:
Code:
#ifndef LANGUAGE_H_INCLUDED
#define LANGUAGE_H_INCLUDED

#include <stdint.h>
#include <instruction.h>

struct language_rule
{
   char *name;
   enum instruction_type type;
   enum language_token *syntax;
   char *characters;
   uint32_t allowed_opcodes;
};

#endif // LANGUAGE_H_INCLUDED
but it's missing a few things. We need to include our files that we just created. Order IS important in this.
Code:
#include <language/enumerations.h>
#include <language/parsing.h>

Ok. Now we have the basics down. The next step we need to do is define all of the actual language rules. To make this simple, we're going to create a few constants, and place them in language/constants.h. These constants will create logical groups of our opcodes, so that we don't have to make a giant list of them when we do things. I'll just do this for you, since it's pretty straightforward. The entire file should look like this:
Code:
#ifndef CONSTANTS_H_INCLUDED
#define CONSTANTS_H_INCLUDED

static const uint32_t C_OPCODES_DP =
   kOPCODE_AND |
   kOPCODE_EOR |
   kOPCODE_SUB |
   kOPCODE_RSB |
   kOPCODE_ADD |
   kOPCODE_ADC |
   kOPCODE_SBC |
   kOPCODE_RSC |
   kOPCODE_TST |
   kOPCODE_TEQ |
   kOPCODE_CMP |
   kOPCODE_CMN |
   kOPCODE_ORR |
   kOPCODE_MOV |
   kOPCODE_BIC |
   kOPCODE_MVN;
static const uint32_t C_OPCODES_DT =
   kOPCODE_LDR |
   kOPCODE_STR;
static const uint32_t C_OPCODES_BR =
   kOPCODE_B   |
   kOPCODE_BL ;

#endif // CONSTANTS_H_INCLUDED

and we will include that in language.h as well. NOTE: we will be adding more of these, this is just the bare minimum.
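As a quick illustration of what these groups buy us, here's a sketch that works out the instruction class of a single opcode by checking it against each group. The function `classify_opcode` is hypothetical, the groups are cut down to a couple of opcodes each, and the enum instruction_type stand-in uses made-up names for the transfer and branch classes (the real enum is in instruction.h).

```c
#include <stdint.h>

/* stand-in for instruction.h; the TRANSFER and BRANCH names are guesses */
enum instruction_type
{
   kINSTRUCTION_UNDEF,
   kINSTRUCTION_DATA,
   kINSTRUCTION_TRANSFER,
   kINSTRUCTION_BRANCH
};

/* excerpt of enum language_opcode, same values */
enum language_opcode
{
   kOPCODE_ADD = 0x00010,
   kOPCODE_MOV = 0x02000,
   kOPCODE_LDR = 0x10000,
   kOPCODE_STR = 0x20000,
   kOPCODE_B   = 0x40000,
   kOPCODE_BL  = 0x80000
};

/* shortened versions of the groups in constants.h */
static const uint32_t C_OPCODES_DP =
   kOPCODE_ADD |
   kOPCODE_MOV;
static const uint32_t C_OPCODES_DT =
   kOPCODE_LDR |
   kOPCODE_STR;
static const uint32_t C_OPCODES_BR =
   kOPCODE_B   |
   kOPCODE_BL ;

/* maps one opcode flag to its instruction class */
static enum instruction_type classify_opcode(enum language_opcode opcode)
{
   uint32_t bit = (uint32_t)opcode;
   if (C_OPCODES_DP & bit)
       return kINSTRUCTION_DATA;
   if (C_OPCODES_DT & bit)
       return kINSTRUCTION_TRANSFER;
   if (C_OPCODES_BR & bit)
       return kINSTRUCTION_BRANCH;
   return kINSTRUCTION_UNDEF;
}
```

A rule that accepts every data processing opcode can then just use C_OPCODES_DP as its allowed_opcodes instead of listing sixteen flags.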

Finally, before I let you go, I'm going to have you create the basic list of instruction rules. We won't actually write any rules today, but we're going to make the list just so I don't forget how to do it right.
Go ahead and create /Sources/language/rules.c and paste this code in it:
Code:
#include <stdlib.h>
#include <language.h>

struct language_rule rules[] =
{
   { /* the NULL rule */
       NULL,
       kINSTRUCTION_UNDEF, /* no structure */
       { kTOKEN_NULL }, /* no instruction */
       NULL, /* no characters */
       0 /* none allowed */
   }
};

Ok, let's go ahead and try to build that
[Image: khFucCe.png]
Ok, looks like we forgot a comma here. Good. That will keep you on your toes so that you don't just breeze through this and copy paste all my code. Go ahead and add the missing comma after kTOKEN_EXPRESSION in enum language_token (if you didn't already catch my mistake) and try to build it again.

[Image: GtpXl9e.png]

Ok, so this isn't an error, and it's not technically wrong, but we're going to change it anyways just to get rid of that pesky warning.
Code:
struct language_rule rules[] =
{
   { /* the NULL rule */
       NULL,
       kINSTRUCTION_UNDEF, /* no structure */
       kTOKEN_NULL, /* no instruction */
       NULL, /* no characters */
       0 /* none allowed */
   }
};

Now, rebuild it
[Image: d82TT9O.png]
PERFECT!
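Just so the point of the NULL rule is clear, here's a sketch of how code could walk the list later on. It assumes the NULL rule always sits last and is recognized by its NULL name. The helper `count_rules`, the example rule, and the cut-down enums are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* stand-in for the enum in instruction.h */
enum instruction_type
{
   kINSTRUCTION_UNDEF,
   kINSTRUCTION_DATA
};

enum language_token
{
   kTOKEN_NULL = 0, /* terminator */
   kTOKEN_REGISTER,
   kTOKEN_OPCODE
};

struct language_rule
{
   char *name;
   enum instruction_type type;
   enum language_token *syntax;
   char *characters;
   uint32_t allowed_opcodes;
};

static enum language_token example_syntax[] =
{
   kTOKEN_OPCODE,
   kTOKEN_REGISTER,
   kTOKEN_NULL
};

/* one made-up rule followed by the NULL rule */
static struct language_rule example_rules[] =
{
   { "EXAMPLE_RULE", kINSTRUCTION_DATA, example_syntax, NULL, 0x00010 },
   { NULL, kINSTRUCTION_UNDEF, NULL, NULL, 0 } /* the NULL rule */
};

/* hypothetical helper: number of rules before the NULL rule */
static size_t count_rules(const struct language_rule *list)
{
   size_t count = 0;
   assert(list != NULL);
   while (list[count].name != NULL)
       count++;
   return count;
}
```

The same loop shape works for trying each rule against a line of assembly: keep going until the name is NULL, and if nothing matched by then, the line is a syntax error.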

Ok, before I let you go, I'll give you the ending project tree screenshot.
[Image: P9KCJnG.png]

Sweet. Don't forget to save all the files, save your project, good luck and see you next time, when we get to write all of the rules, and maybe make some changes to this.

Please don't forget to discuss and ask questions. Think you know how to do it better? Let me know, you're probably right!
