Code fusion information-hiding algorithm based on PE file function migration

PE (portable executable) files are diverse, variable in size, structurally complex, and uniform in format, which makes them well suited as carriers for information hiding, especially when a large hiding capacity is required. This paper proposes an information-hiding algorithm based on PE file function migration. The algorithm uses a disassembly engine to disassemble the code section of a PE file, performs function recognition, and moves the complete code of system or user-defined functions to the last section of the PE file. It then hides information in the vacated code space. The hidden information is thereby combined with the main functions of the PE file and coupled with the key code of the program, which further enhances the concealment and attack resistance of the system.


Introduction
The PE file is the standard format for executable files in the Windows environment and one of the most important software formats on the Internet. The code section is the most important section of a PE file and makes up its main part: it stores the executable instruction code, including user-defined function code and statically linked library function code. Combining hidden information with program instruction code can effectively improve the concealment of executable-based information-hiding algorithms.
At present, PE-based information-hiding algorithms fall into three categories: methods based on the redundant space of PE files [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20], methods based on PE file data resources [21][22][23], and methods based on the PE file import table [24][25][26][27][28]. Existing PE file hiding algorithms have the following shortcomings. First, the redundant space of PE files is known to anyone familiar with the PE file format, and powerful PE analysis tools such as Stud_PE, PE Explorer, and LordPE are readily available, so hiding information in the inherent redundant space offers poor security. Second, the hiding space is too concentrated, so the hidden information is easily exposed and concealment is poor. Third, the structure of the PE file is transparent; when structural characteristics are used to hide information, the hidden information is destroyed once those characteristics are transformed. Fourth, the hidden information is not combined with program functionality: its coupling with the program instruction code is low, so it resists deletion, modification, filling, and other attacks poorly [29][30][31].
This paper proposes a highly concealed information-hiding algorithm based on the migration of functions in the PE file code section. Building on a thorough analysis of the characteristics of the code section, the algorithm enhances concealment and improves the system's attack resistance. First, a disassembly engine disassembles the PE file code section, and a function recognition algorithm identifies the standard functions in the program. The function modules are then migrated to the redundant space of the code section or to the last section of the PE file. In this way, the hidden information and the program instruction code are closely combined, which greatly improves the system's concealment and attack resistance.
The rest of the paper is organized as follows. Section 2 analyzes the structure of the PE file code section. Section 3 describes the proposed method. Section 4 presents the experimental results and discussion. Finally, Section 5 summarizes the paper.

Key data structures of code section
The code section of a PE file is generally named .text (or .code, depending on the compiler). Its property value is 0x60000020, which indicates that the section is executable and readable and contains instruction code. The code section is generally located right after the section table and is the first section of the PE file, ahead of the other sections. The section-table fields related to the code section are VirtualSize, VirtualAddress, PointerToRawData, and SizeOfRawData.
When the PE file is loaded, the Windows loader reads SizeOfRawData bytes (the entire code section) starting at the file offset given by PointerToRawData and maps them into memory.

Data organization of code section
The PointerToRawData field of the code section's table entry gives the offset of the code section in the disk file, and SizeOfRawData gives the disk space occupied by the code section after file alignment, since disk file space is generally aligned to 512 bytes. The space the code section occupies on disk (SizeOfRawData) is generally larger than the unaligned VirtualSize, and the difference between SizeOfRawData and VirtualSize is the size of the redundant space of the code section. In the code section, the redundant space is filled with zeros. Let $S_r$ denote the size of the redundant space of the section:

$$S_r = \mathrm{SizeOfRawData} - \mathrm{VirtualSize}$$
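As a minimal sketch of this computation, the following Python code parses one 40-byte section-table entry and evaluates the difference SizeOfRawData - VirtualSize. The function names and the header values are hypothetical, and clamping a negative difference to zero is an assumption of the sketch (a section whose VirtualSize exceeds its raw size has no slack on disk).

```python
import struct

# Layout of one 40-byte IMAGE_SECTION_HEADER entry (little-endian).
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")

def parse_section_header(raw: bytes) -> dict:
    """Decode one section-table entry into the fields used in this section."""
    (name, virtual_size, virtual_address, size_of_raw_data,
     pointer_to_raw_data, _relocs, _lines, _nrelocs, _nlines,
     characteristics) = SECTION_HEADER.unpack(raw[:SECTION_HEADER.size])
    return {
        "Name": name.rstrip(b"\x00").decode("ascii", errors="replace"),
        "VirtualSize": virtual_size,
        "VirtualAddress": virtual_address,
        "SizeOfRawData": size_of_raw_data,
        "PointerToRawData": pointer_to_raw_data,
        "Characteristics": characteristics,
    }

def redundant_space(section: dict) -> int:
    """S_r = SizeOfRawData - VirtualSize, clamped at 0 (assumption of this sketch)."""
    return max(0, section["SizeOfRawData"] - section["VirtualSize"])

# Hypothetical header: 0x1A30 bytes of code stored in 0x1C00 bytes on disk.
example = SECTION_HEADER.pack(b".text", 0x1A30, 0x1000, 0x1C00, 0x400,
                              0, 0, 0, 0, 0x60000020)
section = parse_section_header(example)
print(section["Name"], hex(redundant_space(section)))  # .text 0x1d0
```

With these hypothetical values, 0x1D0 (464) zero-filled bytes remain at the end of the section on disk.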

Determination of the entry address of program
In general, when the Windows loader loads a PE file, the code section is mapped at address 0x00401000 (the value of ImageBase plus the value of VirtualAddress). The AddressOfEntryPoint field of the IMAGE_OPTIONAL_HEADER32 structure gives the entry point of the program's executable code, that is, the RVA of the first instruction to be executed in the PE file. Adding the base address of the PE file in memory to this value gives the virtual address at which the entry function of the program starts at run time. For example, if the AddressOfEntryPoint value of a PE file is 0x0001120D and the image base is 0x00400000, the entry address of the program is 0x0041120D. Some programs that insert code into PE files modify this field to point to their own code and jump back to the original entry point after that code has executed. The IAT is located before the module entry point in the .text section (the import thunks that go through the IAT form a collection of jump instructions). When the Windows loader loads the executable into the address space of the process, the actual memory address of each imported function is determined and the IAT is filled in accordingly.
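The address arithmetic above can be checked with a short Python sketch. The helper names are hypothetical, the image base 0x00400000 is the value implied by the example in the text, and the section values passed to rva_to_file_offset are made up for illustration.

```python
def entry_virtual_address(image_base: int, address_of_entry_point: int) -> int:
    """Virtual address of the first executed instruction: ImageBase + RVA."""
    return image_base + address_of_entry_point

def rva_to_file_offset(rva: int, virtual_address: int, pointer_to_raw_data: int) -> int:
    """Map an RVA that lies inside a section to its offset in the disk file."""
    return rva - virtual_address + pointer_to_raw_data

IMAGE_BASE = 0x00400000  # image base implied by the example in the text

# Entry point from the text: RVA 0x0001120D.
print(hex(entry_virtual_address(IMAGE_BASE, 0x0001120D)))  # 0x41120d

# Hypothetical section: VirtualAddress 0x11000, PointerToRawData 0x400.
print(hex(rva_to_file_offset(0x0001120D, 0x11000, 0x400)))  # 0x60d
```

The second helper is the conversion needed whenever a function located by its RVA has to be read from or written to the disk file.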

Proposed method
In this section, we describe the proposed information-hiding scheme, which migrates PE file functions and then hides information in the original code space.
A PE file usually contains at least one code section, which holds the executable code. The functionality of a program is realized by executing the instructions in the code section. Hiding information in the code section, combined with the instruction code, can therefore effectively improve concealment and attack resistance. However, if information is hidden directly in the code section, part of it is decoded as highly abnormal instructions during disassembly, which easily arouses an attacker's suspicion. To improve resistance to disassemblers and other reverse-analysis tools, we convert the hidden information into instructions, disguise it as a function (functionalization), and embed it in the code section. The hidden information and the executable code of the PE file are thus integrated, which improves concealment. At the same time, to solve the problem of over-concentrated hidden information, we propose migrating one or more functionally independent modules (functions) of the executable code to the redundant space of the code section and hiding the information between normal function instruction code, so that the hidden information and the executable code are closely integrated. This further enhances the concealment and security of the system.

Disassembly algorithm
A disassembler first identifies the format of the executable file, distinguishes code from data, and determines the file offset of the entry point of the code section. It then applies lexical and syntax analysis, decodes the bytes according to the instruction format of the x86 architecture, and finally outputs the corresponding assembly instructions. Disassembly techniques are either static or dynamic. Static disassembly converts the target program into the corresponding assembly language program without executing it. Dynamic disassembly traces the execution of the target program and disassembles it as it runs. One advantage of static disassembly is that the entire target program can be processed at once, whereas dynamic disassembly can handle only the parts of the target program that are actually executed. Commonly used disassembly tools currently include IDA Pro, OllyDbg, Win32Dasm, SoftICE, and WinDbg.

Design and implementation of disassembler
The role of the disassembly engine is to translate machine code into assembly instructions. Developing an excellent disassembly engine requires an in-depth understanding of the machine instruction encoding of Intel's x86 architecture and a long development cycle. Common open-source disassembly engines include udis86, Proview, ade, and xde [28]. OllyDbg's own disassembly engine is also relatively powerful, but its instruction set is incomplete and it does not support MMX and SSE well.
We use Udis86 to build a disassembler. The main steps are as follows:
Step 1: Deploy the source and header files of the Udis86 disassembly engine to the system or directly into the project, and include the "udis86.h" header file.
Step 2: Define a Udis86 object (ud_t ud_obj) and initialize it: set the disassembly mode to 32 bits, set the syntax to the Intel instruction format, set the start address of the first instruction, and set the input source, which can be a memory buffer or, via ud_set_input_file, a file.
Step 3: Loop over the input source, disassembling all of its instructions.
Step 4: Analyze each instruction.
Step 5: Record the results of the instruction analysis.
When the resulting disassembler is applied to the system's WordPad program (write.exe), its output is identical to that of OllyDbg. A high-quality disassembler is the basis for subsequent function recognition.
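The decode loop of Steps 3 to 5 can be illustrated with a small Python sketch. It does not call Udis86: decode_one is a toy decoder for a handful of x86 opcodes (an assumption made only to keep the example self-contained), and linear_sweep is a hypothetical name for the loop that decodes, analyzes, and records each instruction in turn.

```python
import struct

def decode_one(code: bytes, pos: int, base: int):
    """Decode one instruction from a tiny, illustrative x86 subset.

    Returns (address, length, mnemonic, branch_target_or_None).
    Unknown bytes are emitted as one-byte 'db' pseudo-instructions.
    """
    addr = base + pos
    op = code[pos]
    if op in (0xE8, 0xE9) and pos + 5 <= len(code):      # call/jmp rel32
        rel = struct.unpack_from("<i", code, pos + 1)[0]
        return addr, 5, ("call" if op == 0xE8 else "jmp"), addr + 5 + rel
    if op == 0xEB and pos + 2 <= len(code):              # jmp rel8
        rel = struct.unpack_from("<b", code, pos + 1)[0]
        return addr, 2, "jmp", addr + 2 + rel
    if op == 0xC2 and pos + 3 <= len(code):              # ret imm16
        return addr, 3, "ret", None
    simple = {0x90: "nop", 0xC3: "ret", 0x55: "push ebp", 0x5D: "pop ebp"}
    if op in simple:
        return addr, 1, simple[op], None
    return addr, 1, "db", None

def linear_sweep(code: bytes, base: int):
    """Steps 3-5: loop over the input, decode, and record each instruction."""
    pos, listing = 0, []
    while pos < len(code):
        insn = decode_one(code, pos, base)
        listing.append(insn)
        pos += insn[1]          # advance by the decoded instruction length
    return listing

# push ebp; call to the pop ebp (skipping the nop); nop; pop ebp; ret
sample = bytes([0x55, 0xE8, 0x01, 0x00, 0x00, 0x00, 0x90, 0x5D, 0xC3])
listing = linear_sweep(sample, 0x401000)
for addr, length, mnemonic, target in listing:
    print(hex(addr), mnemonic, hex(target) if target is not None else "")
```

A real engine replaces decode_one with a complete x86 decoder; the recorded branch targets are what function recognition later relies on.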

Function identification and location
Application programs are usually developed with modular programming. Following a top-down approach, the program is broken down into many functionally independent modules, each of which is implemented by a function. To support complex functionality, the system provides a large number of library functions, including static link library functions and dynamic link library functions. Static link library functions include system library functions and dedicated library functions. During compiling and linking, their code is linked into the target code of the executable program, just like that of user-defined functions. For a dynamic link library function, the target code is not in the executable file but in a DLL file. According to statistics, library function code accounts on average for 50-90% of the target code of programs written in high-level languages [32].
To further improve concealment and integrate the hidden information with the program's key function code, we propose an information-hiding algorithm that migrates function code. The recognition and location of functions is the basis of this algorithm.
After the code section of the target program is disassembled, function start addresses follow from compilation principles and function-call conventions: the starting address of a function module is generally the value of the address expression after a CALL instruction. That is, if the assembly code contains the instruction CALL ADDR, there must be a function module starting at ADDR. A function module ends with a RET instruction. Since a function module may have multiple exits (multiple RET instructions), the end address of the function is determined, according to the characteristics of the function, by the following algorithm:
Function: Determine the end address of a function module.
Input: Starting address of the function module (F_begin).
Output: End address of the function module (F_end).
From the address expression after the CALL instruction and the above algorithm, the start address and the end address of a function module can be determined, and the length of the function's instruction code can be calculated.
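The body of the end-address algorithm is not reproduced in this text. The following Python sketch shows one plausible heuristic, which is an assumption and not necessarily the authors' procedure: scan forward from F_begin, remember the furthest forward jump target seen so far, and accept the first RET that no earlier jump skips over. The instruction table, the mnemonics, and the function names are hypothetical.

```python
def function_starts(insns: dict) -> list:
    """Every CALL target that was itself decoded is taken as a function start."""
    return sorted({t for (_, m, t) in insns.values() if m == "call" and t in insns})

def find_function_end(insns: dict, f_begin: int) -> int:
    """Return F_end, the address of the last byte of the function at f_begin.

    insns maps address -> (length, mnemonic, branch_target_or_None).
    Heuristic (an assumption of this sketch): a RET ends the function only if
    no earlier jump inside the function targets an address beyond that RET.
    """
    addr = f_begin
    furthest = f_begin          # highest address known to belong to the function
    while addr in insns:
        length, mnemonic, target = insns[addr]
        if mnemonic.startswith("j") and target is not None and target > addr:
            furthest = max(furthest, target)   # a forward jump extends the body
        if mnemonic == "ret" and addr >= furthest:
            return addr + length - 1
        addr += length
    raise ValueError("no terminating RET found")

# Hypothetical function with two exits: the RET at 0x1004 is skipped by the jz.
insns = {
    0x0FF0: (5, "call", 0x1000),
    0x1000: (1, "push ebp", None),
    0x1001: (2, "jz", 0x1005),
    0x1003: (1, "pop ebp", None),
    0x1004: (1, "ret", None),
    0x1005: (1, "pop ebp", None),
    0x1006: (1, "ret", None),
}
f_begin = function_starts(insns)[0]
f_end = find_function_end(insns, f_begin)
print(hex(f_begin), hex(f_end), f_end - f_begin + 1)  # 0x1000 0x1006 7
```

The heuristic misjudges functions that end in a jump to another function (tail calls), which is consistent with a simplified recognizer that ignores special functions.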
We tested this algorithm on notepad.exe and the Thunder program thunder.exe. The experiment shows that 59 functions can be effectively identified in notepad.exe and 4086 functions in thunder.exe, which meets the needs of function migration well (Table 1).
Because the purpose of function recognition in our system is to migrate functions for information hiding, we have simplified the function recognition algorithm: some special functions and functions with short code are ignored, which does not affect the effectiveness of the algorithm or of the information hiding. If the amount of information to hide is large, it can be hidden in an extended function area by extending the length of the migrated function. Alternatively, the information to be hidden can be functionalized and stored in the last section of the PE file, scattered between migrated functions.

Function migration
To combine the hidden information closely with the instruction code of the executable, we propose an information-hiding algorithm based on function migration: a function module of the target program is migrated to the last section, and the information is hidden in the storage area of the original function module. Function recognition is the basis for function migration. It locates each function and determines its file addresses in the disk file (start address and end address), its relative virtual addresses, and its length. Migration is then carried out by correcting the relevant instructions in the function module and rewriting the relevant property values of the PE file. At link time, the code section of the target program holds the code of the called static library functions first, followed by the code of the user-defined functions. To improve concealment, user-defined functions are preferred when selecting the functions to migrate. Let $\mathrm{OFFSET}_{old}$ and $\mathrm{OFFSET}_{new}$ denote the original and new offsets of a CALL instruction, respectively. $\mathrm{RVA}_{old}$ and $\mathrm{RVA}_{new}$ denote the relative virtual addresses of the CALL instruction before and after migration. $\mathrm{SECTION}_{oldsize}$ and $\mathrm{SECTION}_{newsize}$ denote the actual size of the section (VirtualSize) before and after the function is migrated, and $\mathrm{len}(P)$ denotes the length of the instruction code of function P.
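The correction of a CALL instruction can be sketched from the x86 rel32 encoding, in which the target equals the address of the CALL plus 5 plus the stored offset. Keeping the target fixed while the instruction moves from RVA_old to RVA_new gives OFFSET_new = OFFSET_old - (RVA_new - RVA_old). This relation is derived here from the instruction encoding rather than quoted from the paper, it applies to calls whose target lies outside the migrated function, and the function names and numbers below are hypothetical.

```python
import struct

CALL_LEN = 5  # opcode E8 plus a 4-byte signed displacement

def fix_call_offset(offset_old: int, rva_old: int, rva_new: int) -> int:
    """OFFSET_new such that the CALL still reaches its original target."""
    return offset_old - (rva_new - rva_old)

def patch_call(code: bytearray, pos: int, rva_old: int, rva_new: int) -> None:
    """Rewrite the rel32 field of the CALL at code[pos] in place."""
    if code[pos] != 0xE8:
        raise ValueError("not a CALL rel32 instruction")
    offset_old = struct.unpack_from("<i", code, pos + 1)[0]
    offset_new = fix_call_offset(offset_old, rva_old, rva_new)
    struct.pack_into("<i", code, pos + 1, offset_new)

# Hypothetical numbers: a CALL at RVA 0x1010 targeting RVA 0x2000 moves to RVA 0x9010.
call = bytearray(b"\xE8" + struct.pack("<i", 0x2000 - (0x1010 + CALL_LEN)))
patch_call(call, 0, rva_old=0x1010, rva_new=0x9010)
new_target = 0x9010 + CALL_LEN + struct.unpack_from("<i", call, 1)[0]
print(hex(new_target))  # 0x2000
```

A CALL whose target lies inside the migrated function moves together with its target and needs no correction.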
The main steps of the function migration algorithm are as follows:
Step 1: Locate the function with the function recognition algorithm and read the function module selected for migration into memory.
Step 2: Seek to the end of the last section (the PointerToRawData value of the last section plus its actual size, VirtualSize), write the starting address of the function to be migrated (this address is used to locate the function when extracting information), and then write the instruction code of the function to be migrated.
Step 3: Fix the address values of the CALL instructions inside the migrated function.
Step 4: Fix the size of the section and align its SizeOfRawData value to FileAlignment:
$$\mathrm{SECTION}_{newsize} = \mathrm{SECTION}_{oldsize} + \mathrm{len}(P) + 4 \qquad (3.2)$$
Step 5: Fix the image size of the PE file (SizeOfImage), aligned to the value of SectionAlignment.
Step 6: Change the section property to executable.
Step 7: Set the relocation table size to 0.
Step 8: Write a jump instruction at the beginning of the original function that jumps to the start address of the migrated function.
Migrating the function and fixing up the migrated code ensures that the migration does not affect the behavior of the program. The area occupied by the original function module can then be used for information hiding, and the hidden information is tightly coupled with the key code of the executable program, which effectively improves the concealment and security of the system (Fig. 2).
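The size bookkeeping of Steps 4 and 5 and the jump of Step 8 can be sketched in Python as follows. The helper names, the default alignment values (0x200 and 0x1000), and the example numbers are assumptions. The sketch uses the size relation of Step 4 (new size = old size + len(P) + 4) and the standard 5-byte JMP rel32 encoding.

```python
import struct

def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def grow_last_section(virtual_size: int, func_len: int,
                      file_alignment: int = 0x200) -> tuple:
    """Step 4: new VirtualSize and the SizeOfRawData aligned to FileAlignment."""
    # The extra 4 bytes hold the original start address of the function.
    new_virtual_size = virtual_size + func_len + 4
    return new_virtual_size, align_up(new_virtual_size, file_alignment)

def new_size_of_image(last_virtual_address: int, new_virtual_size: int,
                      section_alignment: int = 0x1000) -> int:
    """Step 5: end of the last section, aligned to SectionAlignment."""
    return align_up(last_virtual_address + new_virtual_size, section_alignment)

def jmp_stub(src_rva: int, dst_rva: int) -> bytes:
    """Step 8: a 5-byte JMP rel32 placed at src_rva that lands on dst_rva."""
    return b"\xE9" + struct.pack("<i", dst_rva - (src_rva + 5))

# Hypothetical case: a 0x40-byte function is appended to a last section that
# starts at RVA 0x9000 and currently has VirtualSize 0x350.
virtual_size, raw_size = grow_last_section(0x350, 0x40)
image_size = new_size_of_image(0x9000, virtual_size)
stub = jmp_stub(0x1010, 0x9000 + 0x350 + 4)   # migrated code follows the address cell
print(hex(virtual_size), hex(raw_size), hex(image_size), stub.hex())
```

In this example the migrated code starts at RVA 0x9354, right after the 4-byte cell that stores the original start address 0x1010.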

Information-hiding algorithm
Once function recognition and function migration are available, the information hiding itself is relatively simple. Its main steps are described below:
Input: Original carrier PE file P, information to be hidden M, public key pk.
Output: PE file P′ carrying the hidden information.
Step 1: Using the public key pk and the asymmetric encryption algorithm RSA, encrypt the information M to obtain the encrypted information M' = Encrypt(pk, M).
Step 2: Disassemble the code section of the original carrier PE file P with the disassembly engine.
Step 3: Apply the function recognition algorithm to the assembly code produced by Step 2, record the start address, end address, and length of each identified function, count the identified functions and the sum of their lengths, and number the functions in order of starting address.
Step 4: According to the length of the information to be hidden, move one function at a time, in ascending order of function number, to the end of the last section, writing the 4-byte start address of the original function module immediately before the migrated code.
Step 5: Write a jump instruction at the beginning of the original function that jumps to the beginning of the migrated function, and then write the length of the hidden information followed by the hidden information.
Step 6: Determine whether all the information has been hidden. If so, go to Step 7; otherwise, return to Step 4 and repeat the operation.
Step 7: Modify the size of the PE file section and the size of the image, and change the properties of the section to executable.
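The loop formed by Steps 4 to 6 can be sketched as a planning function that splits the encrypted payload across the vacated function bodies. This is a simplified Python illustration, not the authors' implementation: the name plan_embedding is hypothetical, the 4-byte width of the length field is an assumption (the text does not state it), and the payload is assumed to be already encrypted.

```python
import struct

JMP_LEN, LEN_FIELD = 5, 4   # LEN_FIELD width is an assumption of this sketch

def plan_embedding(functions, payload: bytes):
    """Assign payload chunks to vacated function bodies.

    functions: list of (start_rva, length), processed in order of start address.
    Returns a list of (start_rva, record_bytes); each record is the length
    field followed by the chunk, to be written right after the JMP stub.
    """
    plan, pos = [], 0
    for start, length in sorted(functions):
        room = length - JMP_LEN - LEN_FIELD
        if room <= 0:
            continue                      # too short to carry any payload
        if pos >= len(payload):
            break                         # Step 6: everything is hidden
        chunk = payload[pos:pos + room]
        plan.append((start, struct.pack("<I", len(chunk)) + chunk))
        pos += len(chunk)
    if pos < len(payload):
        raise ValueError("payload exceeds the capacity of the identified functions")
    return plan

# Hypothetical functions (start RVA, length) and a 25-byte payload.
functions = [(0x1000, 20), (0x1040, 8), (0x1080, 30)]
payload = b"A" * 25
plan = plan_embedding(functions, payload)
for start, record in plan:
    print(hex(start), len(record) - LEN_FIELD, "payload bytes")
```

Under these assumptions, a function of length len(P) carries at most len(P) - 9 payload bytes, so functions shorter than 10 bytes are skipped.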

Information extraction algorithm
Input: A PE file P' with hidden information, private key sk.
Output: Hidden information M.
Step 1: Move 4 bytes back from the end of the last section and record the current pointer position SectionAddr.
Step 2: With SectionAddr as the starting address, read the contents of 4 bytes as the address value Addr and check whether the unit at address Addr holds a JMP instruction. If not, decrease SectionAddr by 1 and continue scanning toward the beginning of the section.
Step 3: If the jump instruction jumps exactly to the location SectionAddr + 4, then SectionAddr + 4 is the start of a migrated function and the starting address of the original function is stored in the double-word unit at SectionAddr; go to Step 4. Otherwise, decrease SectionAddr by 1, read 4 bytes again, and continue scanning toward the beginning of the section.
Step 4: Read the starting address of the original function from the double-word unit at SectionAddr, skip the JMP instruction, read the length Len of the hidden information, and extract the piece of hidden information that is Len bytes long.
Step 5: Determine whether SectionAddr points to the beginning of the last section. If so, the reverse scan ends; otherwise, decrease SectionAddr by 4 and go to Step 2.
Step 6: Arrange the extracted pieces in reverse order of extraction to reassemble the encrypted information M'.
Step 7: Using the private key sk and the asymmetric encryption algorithm RSA, decrypt the encrypted information M' to obtain M = Decrypt(sk, M'), where M is the plaintext.
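The reverse scan of Steps 1 to 6 can be sketched in Python on a simplified model. The image is treated as a flat buffer indexed by RVA (ignoring the mapping between file offsets and RVAs), the length field is assumed to be 4 bytes wide, the scan moves back one byte at a time, and RSA decryption is left out. The function names and the demo image are hypothetical.

```python
import struct

def extract(image: bytes, last_start: int, last_end: int) -> bytes:
    """Recover the embedded payload by scanning the last section backward.

    [last_start, last_end) is the last section. Each migrated function is
    assumed to be preceded by the 4-byte RVA of its original location, where
    a JMP rel32, a 4-byte length, and the payload chunk were written.
    """
    pieces = []
    pos = last_end - 4
    while pos >= last_start:
        (orig,) = struct.unpack_from("<I", image, pos)
        if orig + 9 <= len(image) and image[orig] == 0xE9:
            (rel,) = struct.unpack_from("<i", image, orig + 1)
            if orig + 5 + rel == pos + 4:      # stub lands right after the address cell
                (length,) = struct.unpack_from("<I", image, orig + 5)
                pieces.append(image[orig + 9:orig + 9 + length])
        pos -= 1
    return b"".join(reversed(pieces))          # Step 6: restore embedding order

def demo_image() -> bytes:
    """Build a toy image with two migrated functions carrying b'abc' and b'de'."""
    img = bytearray(0x300)
    layout = [(0x10, 0x204, b"abc"), (0x40, 0x228, b"de")]  # (orig, migrated, chunk)
    for orig, moved, chunk in layout:
        struct.pack_into("<I", img, moved - 4, orig)         # address cell
        img[moved:moved + 0x20] = b"\x90" * 0x1F + b"\xC3"   # stand-in function body
        img[orig] = 0xE9                                     # JMP to the migrated code
        struct.pack_into("<i", img, orig + 1, moved - (orig + 5))
        struct.pack_into("<I", img, orig + 5, len(chunk))
        img[orig + 9:orig + 9 + len(chunk)] = chunk
    return bytes(img)

recovered = extract(demo_image(), 0x200, 0x248)
print(recovered)  # b'abcde'
```

The pieces are found last-first because the scan starts at the end of the section, which is why they are reversed before decryption.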

Results of the experiment
The PE files used in the experiment are of three types: applications shipped with the Windows operating system and located in the Windows system32 folder, such as notepad.exe, write.exe, and winmine.exe; common desktop applications, such as qq.exe, thunder.exe, WinRAR, and 360sd.exe; and applications written by the authors. From these, 200 PE files were randomly selected as test programs. In the experiment, the watermark information is embedded in all identifiable functions of each test program. The results of the experiment are as follows. As can be seen from Table 2, in general, the larger the file and the richer its functionality, the more functions are recognized and the greater the hiding capacity.

Covert analysis
The function migration method proposed in this paper moves the recognized function modules to the last section and hides information in the original function code area. The PE files carrying hidden information were uploaded to the www.virscan.org website for virus scanning (the website provides up to 37 antivirus engines), and the results show that the files are reported as normal. The method can thus resist detection by common antivirus software and analysis by static reverse-analysis tools.

Embedded capacity
The embedding capacity of the function migration method is related to the size of the PE file and the number of static library functions called in the file. In general, the larger the PE file and the more complex its functionality, the more static library functions are called, the more functions can be identified and migrated, and the greater the embedding capacity.

Anti-filling attack experiment
When information is hidden in the redundant space of a PE file, the hidden information is too centralized, the hiding location is easily exposed, the hiding capacity is small, and the hidden information can be destroyed by filling the known redundant space with all 0s or all 1s. Extending the last section of the PE file or adding a section solves the capacity problem, but, as with redundant-space hiding, the information is over-concentrated and the hiding location is easily disclosed. Moreover, because the hidden information is not integrated with the program's main functional code, filling with all 0s or all 1s forward from the end of the last section destroys the hidden information while the program still functions properly.
The function migration method hides the information in the original function code area by migrating the code of recognized system functions or user-defined functions to the last section of the PE file. Because the information is hidden in the code area of the original function modules, a filling attack with all 0s or all 1s forward from the end of the last section does not destroy the hidden information. Instead, it destroys the migrated function code, so that the program can no longer run. Take notepad.exe, which comes with the Windows operating system, as an example: after the function migration method is used to hide information, the program functions properly and the hidden information can be extracted.
Traditional methods hide the secret information in the redundant space, data resource segment, or import table of the PE file. They have several shortcomings: the redundant space is publicly known, the hiding space is too concentrated, the hidden information is easily destroyed, and the association between the hidden information and the key code of the program is loose. The method proposed in this paper overcomes these shortcomings. It fuses the secret information with the instruction code of the program through function migration and stores it in the code segment. The hidden information is scattered, so it is difficult for an adversary to tell secret information from instruction code, and the hidden information is coupled with the key code of the program. Once the secret information in the program is destroyed, the program no longer executes correctly. The method is therefore more secure and more resistant to attacks than the previous methods.

Conclusion and future work
In this paper, a large-capacity information-hiding algorithm based on function migration is presented. The PE file code section is disassembled by a disassembly engine, functions are recognized, and the code of the recognized functions is moved. The design implements an algorithm that hides information by migrating the code of identified static library functions or user-defined functions to the last section of the PE file. In this way, the hidden information is combined with the main functional code of the PE file and coupled with its key code, which further enhances the concealment and attack resistance of the system. The theoretical analysis and experimental results show that, compared with similar algorithms, the proposed algorithm integrates the information to be hidden with the program instruction code through function migration and offers large hiding capacity, good concealment, and strong resistance to attacks.