Patching a function name in an object file

Non-ASCII characters in C identifiers

According to cppreference.com, Unicode characters can appear in C identifiers using the \u and \U escape notation. In practice, it seems they can also be used directly in UTF-8 encoding: for example, the following UTF-8 file is accepted by both gcc 13 and clang 14:

#include <stdio.h>

void taille_fenêtre(int dx, int dy)
{
  fprintf(stderr, "::taille_fenêtre %i %i\n",dx, dy);
}

This file is compiled to an object file that will contain the (UTF-8 encoded) global symbol taille_fenêtre for the above function.

$ gcc -c accents.c
$ nm accents.o
                 U fprintf
                 U stderr
0000000000000000 T taille_fenêtre
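
If you want to see how the name is actually stored, the symbol names of an ELF object live in its string table (the .strtab section); a hex dump of that section should show the two UTF-8 bytes c3 aa where the ê sits:

$ readelf -x .strtab accents.o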

It does not seem to be possible, however, to write an identifier encoded in ISO 8859-1 (aka "Latin-1"). If I convert the above file to Latin-1 and try to compile it

$ recode utf8..latin1 accents.c
$ gcc -c accents.c
accents.c:3:16: error: stray '\352' in program
    3 | void taille_fen<ea>tre(int dx, int dy)
      |                ^~~~
accents.c:3:17: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'tre'
    3 | void taille_fen�tre(int dx, int dy)

gcc complains about a \352 byte in the program, which is the octal representation of the Latin-1 encoded ê character.1 The second line also shows its hexadecimal value ea. On the last line, the character is displayed as � since the terminal is configured for UTF-8 and can't display it. Of course this last character would be displayed correctly in a Latin-1 enabled terminal

    3 | void taille_fenêtre(int dx, int dy)

but the issue is that the compiler will refuse a function name containing a byte with the hex value EA.

Latin-1 characters in binary symbols

In the following, all commands will be run in a Latin-1 enabled terminal, so when you see a ê below, this is really a Latin-1 encoded ê, i.e. a byte with hexadecimal value EA.

The restriction on C identifiers apparently doesn't apply to object files: a symbol name there seems to be just an arbitrary sequence of bytes (in an ELF file, names are stored as NUL-terminated strings in the string table, so presumably anything except a NUL byte is acceptable).

I have an implementation, started long ago, of a toy programming language in which, for some reason, the identifiers use the Latin-1 encoding. It was originally only an interpreter, but I later added a module that generates code in LLVM intermediate representation, turning it into a compiler. LLVM IR has no problem with character encoding: here's an LLVM .ll file in which a function main calls a function taille_fen\EAtre that is declared earlier in the file. As we'll see, \EA is just a way to represent a byte of hex value EA, so the function name is actually taille_fenêtre, in Latin-1.

; ModuleID = 'algo-llvm'
source_filename = "algo-llvm"

@c = global i8 0

; Function Attrs: argmemonly nofree nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #0

declare void @read_char(i8*)

declare void @"taille_fen\EAtre"(i64, i64)

define void @main() {
entry:
  call void @"taille_fen\EAtre"(i64 200, i64 600)
  call void @read_char(i8* @c)
  ret void
}

attributes #0 = { argmemonly nofree nounwind willreturn }

This file can be compiled to assembly and finally to object code

$ llc-14 accents.ll 
$ as accents.s -o accents.o
$ nm accents.o 
                 U _GLOBAL_OFFSET_TABLE_
0000000000000000 B c
0000000000000000 T main
                 U read_char
                 U taille_fenêtre

that contains a main function and needs definitions for the symbols read_char and taille_fenêtre (where, again, the ê is the single byte 0xEA).
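
To double-check that it really is one byte and not a UTF-8 sequence, we can pipe the nm output through a hex dumper (xxd is just one convenient choice):

$ nm accents.o | grep taille_fen | xxd

The line for the symbol should show a lone ea between the 6e of fen and the 74 of tre.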

To execute this I need an object file providing definitions for read_char and taille_fenêtre, and I'd like to create it from C source.

Please note: I don't pretend in any way that what I'm going to do is a good idea. It most probably is not, in general. I could have mangled (or transcoded) the names at some point to avoid the issue, but that seemed boring, and I like to experiment! After all, LLVM IR supports this out of the box.

Patching the binary

Since the compiler will refuse to build an object file with the symbol I want, I can compile with a compiler-accepted symbol and then patch the object file. Assuming that in my source file the function is simply named taille_fenetre, with only ASCII characters, the object file will contain a definition for the symbol taille_fenetre. It is certainly possible to edit the object file to change that symbol. Changing the length of the name would certainly corrupt the file, but simply substituting one byte for another shouldn't break anything.
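
For concreteness, here is a minimal sketch of what such a libgraph.c could look like. The signatures are assumptions derived from the IR declarations above (i64 taken as long on x86-64, i8* as char*); the real library presumably does more than print a trace:

/* libgraph.c -- minimal sketch, not the real implementation */
#include <stdio.h>

/* ASCII-only name for now; the symbol is patched in the object file later */
void taille_fenetre(long dx, long dy)
{
  fprintf(stderr, "::taille_fenetre %li %li\n", dx, dy);
}

/* read one character from standard input into *c */
void read_char(char *c)
{
  int ch = getchar();
  *c = (ch == EOF) ? 0 : (char)ch;
}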

Actually, we can simply use sed for such a substitution, even though it is meant to edit text files. Here's a command to do that; the $(/bin/echo -e "\xEA") command substitution injects the raw 0xEA byte into the replacement text:

sed -i -e "s/taille_fenetre/taille_fen"$(/bin/echo -e "\xEA")"tre/g" libgraph.o

The full Makefile recipe can be:

libgraph.o: libgraph.c
	gcc -Wall -fPIC -c $<
	sed -i -e "s/taille_fenetre/taille_fen"$$(/bin/echo -e "\xEA")"tre/g" \
	    libgraph.o
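
After running the recipe, a quick nm should show the patched name, with the ê being the single byte 0xEA (so it displays as ê in a Latin-1 terminal and as a replacement character in a UTF-8 one):

$ nm libgraph.o | grep taille_fen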

Granted, this assumes that the only occurrences of the byte sequence spelling taille_fenetre in the file are indeed meant to represent that symbol, and not a string constant or, worse, some instructions whose binary encoding happens to produce that sequence of bytes. Did I say doing this may not be a terrific idea?
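
As an aside, a more surgical way to rename the symbol would probably be objcopy's --redefine-sym option, which rewrites only the symbol table entry instead of every matching byte sequence. I haven't tried it with a non-ASCII replacement name, so consider this an untested sketch:

objcopy --redefine-sym taille_fenetre="taille_fen"$(/bin/echo -e "\xEA")"tre" libgraph.o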

Last point: I spent an unexpected amount of time getting this sed command to work correctly in the Makefile, due to subtleties in the way make runs its recipes. This is described in this post.

Footnotes


1

clang is more explicit about what the problem is, although it could probably refrain from providing the 2nd and 3rd errors:

$ clang -c accents.c
accents.c:3:16: error: source file is not valid UTF-8
void taille_fen<EA>tre(int dx, int dy)
               ^
accents.c:3:6: error: variable has incomplete type 'void'
void taille_fen<EA>tre(int dx, int dy)
     ^
accents.c:3:16: error: expected ';' after top level declarator
void taille_fen<EA>tre(int dx, int dy)
               ^
               ;
3 errors generated.