Introduction to Assembly for Linux (Intel 32 and 64 bit) 07-03-2013, 12:19 PM
I know that I am late with this and I still haven't included everything I wanted to, but here is the first part. I will include more parts once they are done.
Note: I tested the code in a 64 bit system. If there is someone using 32 bit and it doesn't work for some reason, tell me please.
Introduction to Assembly for Linux (Intel 32 and 64 bit)
"If you can’t do it in Fortran, do it in assembly language. If you can’t do it in assembly language, it isn’t worth doing." (Ed Post)
Since: late 1940s
Paradigms: imperative
Advantages:
- useful if code needs direct interaction with hardware
- useful if program needs extreme optimization
- useful if program needs precise timing
- useful if processor-specific instructions are needed, which are not implemented by a compiler
- useful for: reverse-engineering, device drivers, interrupt handlers, computer viruses, bootloaders, operating system programming
- useful to learn how the CPU works
Disadvantages:
- assembly is just human-readable machine code, so it is not advisable for general purpose programs
- assembly is hard to read, hard to write and hard to debug (compared to high level languages)
- it takes much longer to write a program in assembly than in a high level language
The very first programs were written in machine language. Machine language is a sequence of bytes that is stored in memory, fetched, interpreted and executed by the computer. Writing programs in machine language is very tedious. You need to know all the addresses of the data and the addresses of program branches, which depend on where the instructions are loaded into memory. Once you modify the data (adding, deleting) these addresses might change. This is why people introduced symbolic names and mnemonics to represent instructions and data (example: MOV). They are easier to remember and reduce or remove the need to calculate addresses.
So assembly (also abbreviated ASM) is only one step away from the actual machine language. Thus the resulting code is only usable for the specific architecture and operating system it was written for. So why would you want to learn or use assembly?
Look at the advantages mentioned above. Assembly is so close to the metal that you are able to optimize a lot and create programs with very good performance. The big BUT: Most people aren't skilled enough to outperform, e.g., C programs that were optimized and translated into machine language by a compiler. It is very hard to do so and it is not advisable to use assembly for general purpose applications. But in some cases assembly is still used today.
Knowledge of assembly is necessary if you want to dive into fields like reverse-engineering, because binaries sometimes can't be decompiled into a high-level language.
Assembly and C work well together. Assembly is able to call C library functions and C is able to embed assembly code directly. Operating systems are often written in both assembly and C.
Prerequisites:
You should understand:
Code:
binaries
hexadecimal
basic linux terminal commands
Your first assembly program:
In order to write an assembly program and create an executable out of it you will need:
Code:
Text editor
Assembler
Linker
Open a text editor of your choice (e.g. Nano, Vim, Emacs, Gedit, Notepad, ...) and write the following program. Save it as exit.nasm. Further down I will explain the code in detail, but for now we just want to create an executable program.
Code:
;Name: exit.nasm
;Purpose: Executes the exit system call
;Input: None
;Output: The exit status ($?)
segment .text
global _start
_start:
mov eax, 1 ;1 is exit syscall number
mov ebx, 5 ;the status value to return
int 80h ;execute system call
We will use the NASM assembler. Install it, e.g. on Ubuntu:
Code:
sudo apt-get install nasm
To assemble the file go to the folder where you saved exit.nasm and type in your terminal:
Code:
nasm -f elf exit.nasm
Afterwards you call the linker on your object file:
Code:
ld -m elf_i386 exit.o -o exit
This will create the executable exit in the folder you are in and you can run it via
Code:
./exit
To display the exit status returned by your program type:
Code:
echo $?
If everything went fine, a 5 is displayed.
Explanation:
Ok, that's great, but what did you actually do there?
First you assembled the file: The assembler took your assembly code, translated the mnemonics into opcodes, resolved the symbolic names into offsets and wrote the result, the so-called object code, into the file exit.o.
The linker, which you called with the command ld, takes one or more object files and possibly libraries and creates one executable out of them.
Now let's have a look at the program you wrote there.
Code:
;Name: exit.nasm
;Purpose: Executes the exit system call
;Input: None
;Output: The exit status
segment .text
global _start
_start:
mov eax, 1 ;1 is exit syscall number
mov ebx, 5 ;the status value to return
int 80h ;execute system call
The semicolon starts a comment. It is usual to comment almost every instruction, because assembly is hard to read without comments.
Instructions are either machine instructions, assembler instructions (directives) or macros.
segment is an instruction for the assembler. The .text segment is where the program instructions are put.
global _start tells the assembler to make the label _start known to the linker. _start is the entry point of your program. You can compare it to the main function in a C program (actually the main function of a C program is called by the _start function that the C library provides).
_start is a label.
The instruction
Code:
mov eax, 1
moves the constant 1 into a register called EAX (you will learn more about registers later). MOV is a mnemonic that stands for move.
Code:
mov ebx, 5
Here the constant 5 is moved into the register EBX.
Code:
int 80h
The mnemonic INT stands for interrupt. int 80h triggers the software interrupt 80h, which 32 bit Linux uses to make a system call. The kernel looks at the values you set in the registers and takes action according to them.
I wrote the meaning of the numbers and instructions in the comments right next to them, but how do you get to know them?
You need a system call reference, for example this one:
http://docs.cs.up.ac.za/programming/asm/...calls.html
Now have a look at the very first function sys_exit in the reference table. You can see that in order to call sys_exit, you have to put the constant 1 into EAX and an integer for the return code into EBX. This is exactly what we did here with the MOV instructions.
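If you want to experiment with the status value, here is a small sketch of the same program with a larger number in EBX. The file name exit261.nasm is made up for this example. On Linux the shell only gets the lowest 8 bits of the exit status, so values above 255 wrap around.

```nasm
;Name: exit261.nasm (hypothetical file name)
;Purpose: Shows that only the low 8 bits of the exit status reach the shell
;Build: nasm -f elf exit261.nasm
;       ld -m elf_i386 exit261.o -o exit261
segment .text
global _start
_start:
mov eax, 1 ;1 is exit syscall number
mov ebx, 261 ;261 = 256 + 5, only the low byte (5) survives
int 80h ;execute system call
```

Running ./exit261 followed by echo $? should display 5 again, just like the original program.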
Congratulations, that was your first assembly program. We will move on to write a hello world program in the next part.
Hello World Program:
This is part 2 of the assembly introduction where we will move on with writing a hello world program.
We want to print a string to the standard output, so we need to define the string. This is done in the .data section with the db directive, which declares an array of bytes. db stands for define byte (you can also declare data with dw -> define word). Every character of the string takes a single byte. The whole array has the name msg.
Code:
msg db 'Hello World!'
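To get a feeling for db and dw, here is a small sketch with a few more declarations. The names letters, same and number are made up for illustration and are not used in the hello world program.

```nasm
section .data
letters db 'A', 'B', 'C' ;three single bytes: 41h, 42h, 43h
same db 'ABC' ;exactly the same three bytes, written as a string
number dw 1234h ;one word (2 bytes), stored low byte first: 34h, 12h
```

The string form is just a shorter way to write a list of bytes, which is why we can append further bytes to a string with a comma.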
In addition we need to add the newline character. Have a look at this table: http://www.bobborst.com/tools/ascii-codes/
There you see that the linefeed is 0AH (the trailing H marks it as a hex value, 0A is 10 in decimal).
This is what we get:
Code:
section .data
msg db 'Hello World!', 0AH
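NASM accepts several spellings for the same number, so the linefeed could also be written differently. The following sketch (the names nl_hex, nl_0x and nl_dec are made up) declares the same byte three times:

```nasm
section .data
nl_hex db 0AH ;hex with trailing H, has to start with a digit (here 0)
nl_0x db 0x0A ;hex with 0x prefix
nl_dec db 10 ;decimal
```

Which form you use is a matter of taste. The leading 0 in 0AH matters, because AH on its own is the name of a register.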
For the body of the program (the .text section) we will keep the exit system call, but change the return code to 0 (for success).
Code:
mov ebx, 0
mov eax, 1
int 80h
The only thing missing is the system call to write our string to standard output. Let's look up the system call table again and look for a write function. You will see that sys_write does the job. But it might not be clear what the parameters are for.
Here is a complete reference of the system calls: http://asm.sourceforge.net/syscall.html
A description of sys_write can be found here: http://man7.org/linux/man-pages/man2/write.2.html
Looking into the system call table you will discover that sys_write is called by moving 4 into EAX. According to the description, EBX takes a file descriptor. What is a file descriptor? Have a look at this little table: https://en.wikipedia.org/wiki/File_descriptor
There you see:
Code:
0 stdin
1 stdout
2 stderr
So we need to pass 1 for the file descriptor, since we want to write to standard output (stdout).
This is what we have so far, but there are still two parameters missing:
Code:
mov eax, 4
mov ebx, 1
int 80h
Now we have to put the address of our message into ECX and the length of the message into EDX.
Code:
mov ecx, msg
mov edx, len
How do we get the length?
We create another entry in the .data section called len, where we compute the length of msg.
Code:
len equ $-msg
The $ sign means "address of here". Since len directly follows msg, this is the address of the byte right after the msg string.
msg is the starting address of our string. So by subtracting the start address from the end address we get the length of our string.
equ just creates a symbol whose value is the expression. The result of the expression has to be a constant value.
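Because $ always refers to the current position, len has to be defined directly after msg. The following sketch shows what happens otherwise. The names other and wrong are made up for this example.

```nasm
section .data
msg db 'Hello World!', 0AH ;12 characters + 1 linefeed = 13 bytes
len equ $-msg ;13: $ is the address right after the linefeed
other db 'abc' ;3 more bytes
wrong equ $-msg ;16: now the bytes of other are counted as well
```

equ itself does not reserve any memory, so len and wrong do not change the addresses of the data around them.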
Now we can complete our system call to sys_write:
Code:
mov eax, 4
mov ebx, 1
mov ecx, msg
mov edx, len
int 80h
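The file descriptor in EBX decides where the output goes. As a sketch, here is a variation that writes to standard error (2 in the table above) instead of standard output. The file name hellostderr.nasm and the message text are made up for this example.

```nasm
;File: hellostderr.nasm (hypothetical file name)
;nasm -f elf hellostderr.nasm
;ld -m elf_i386 hellostderr.o -o hellostderr
section .data
msg db 'Something went wrong', 0AH ;define string
len equ $-msg ;compute length of msg
section .text
global _start
_start:
mov eax, 4 ;sys_write
mov ebx, 2 ;file descriptor for stderr instead of stdout
mov ecx, msg ;pass address of string
mov edx, len ;pass string length
int 80h ;execute system call
mov eax, 1 ;sys_exit
mov ebx, 1 ;non-zero return code signals an error
int 80h ;execute system call
```

If you run it as ./hellostderr 2>/dev/null nothing should be displayed, because the shell throws away everything written to stderr.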
The whole hello world program looks like this:
Code:
;File: helloworld.nasm
;Purpose: prints Hello, World!
;
;nasm -f elf helloworld.nasm
;ld -m elf_i386 helloworld.o -o helloworld
;
;or
;
;nasm -f elf64 helloworld.nasm
;ld -m elf_x86_64 helloworld.o -o helloworld
section .data
msg db 'Hello, World!', 0AH ;define string
len equ $-msg ;compute length of msg
section .text
global _start
_start:
mov eax, 4 ;sys_write
mov ebx, 1 ;file descriptor for stdout
mov ecx, msg ;pass address of string
mov edx, len ;pass string length
int 80h ;execute system call
mov eax, 1 ;sys_exit
mov ebx, 0 ;success return code
int 80h ;execute system call
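The program above uses the 32 bit system call interface (int 80h), which a 64 bit Linux usually still supports. For comparison, here is a sketch of the same program with the native 64 bit interface. This goes beyond what the tutorial has covered so far: the syscall instruction, the registers RAX, RDI, RSI and RDX and the numbers 1 (write) and 60 (exit) are taken from the x86-64 Linux system call convention, and the file name is made up.

```nasm
;File: helloworld64.nasm (hypothetical file name)
;Purpose: prints Hello, World! using the 64 bit syscall instruction
;
;nasm -f elf64 helloworld64.nasm
;ld helloworld64.o -o helloworld64
section .data
msg db 'Hello, World!', 0AH ;define string
len equ $-msg ;compute length of msg
section .text
global _start
_start:
mov rax, 1 ;sys_write has number 1 on x86-64
mov rdi, 1 ;file descriptor for stdout
mov rsi, msg ;pass address of string
mov rdx, len ;pass string length
syscall ;execute system call
mov rax, 60 ;sys_exit has number 60 on x86-64
mov rdi, 0 ;success return code
syscall ;execute system call
```

Note that both the system call numbers and the registers differ from the int 80h version, so the two interfaces must not be mixed within one call.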
References: Introduction to 64 Bit Intel Assembly Language Programming for Linux - Ray Seyfarth (uses yasm)