Patching a function name in an object file
Non ASCII characters in C identifiers
According to cppreference.com, Unicode characters can appear in C
identifiers, using the \u and \U escape notation. In practice, it
seems they can also be used directly with UTF-8 encoding. For example,
the following UTF-8 file is accepted by both gcc 13 and clang 14:
#include <stdio.h>

void taille_fenêtre(int dx, int dy)
{
    fprintf(stderr, "::taille_fenêtre %i %i\n", dx, dy);
}
This file is compiled to an object file that will contain the (UTF-8
encoded) global symbol taille_fenêtre for the above function.
$ gcc -c accents.c
$ nm accents.o
                 U fprintf
                 U stderr
0000000000000000 T taille_fenêtre
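For comparison, here is a minimal sketch of the \u escape notation mentioned at the start of this section. It is not something tested in this post. The assumption is a C99-or-later compiler that accepts universal character names in identifiers. The int return value is added purely so that a call can be checked; the original function returns void.

```c
#include <stdio.h>

/* \u00EA is the universal character name for U+00EA (e with circumflex).
   Assumption: the compiler treats this spelling like the UTF-8 one. */
int taille_fen\u00EAtre(int dx, int dy)
{
    fprintf(stderr, "::taille_fen\u00EAtre %i %i\n", dx, dy);
    return dx * dy;   /* illustrative only, not in the original function */
}
```

The source file stays pure ASCII this way, which avoids depending on the encoding the editor or the terminal happens to use.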
It does not seem to be possible to use characters from the ISO 8859-1 (aka "Latin-1") encoding in a C identifier. If I convert the above file to Latin-1 encoding and try to compile it, I get:
$ recode utf8..latin1 accents.c
$ gcc -c accents.c
accents.c:3:16: error: stray '\352' in program
    3 | void taille_fen<ea>tre(int dx, int dy)
      |                ^~~~
accents.c:3:17: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'tre'
    3 | void taille_fen�tre(int dx, int dy)
gcc complains about a \352 byte in the program, which is the octal
representation of the Latin-1 encoded ê character [1]. The second
line also shows its hexadecimal value ea. On the last line, the
character is displayed as � since the terminal is configured for
UTF-8 and can't display this byte. Of course this last character would
be displayed correctly in a Latin-1 enabled terminal
3 | void taille_fenêtre(int dx, int dy)
but the issue is that the compiler will refuse a function name
containing a byte with the hex value EA.
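To make the byte values concrete, here is a small sketch (the helper name latin1_to_utf8 is hypothetical) that encodes one Latin-1 byte as UTF-8. For ê, Latin-1 uses the single byte 0xEA (octal 352), while UTF-8 uses the two bytes C3 AA. In UTF-8, a lone 0xEA byte would announce a three-byte sequence, so when it is followed by a plain t the file is not valid UTF-8.

```c
#include <stddef.h>

/* Encode one Latin-1 byte (equal to the Unicode code point U+0000..U+00FF)
   as UTF-8. Writes 1 or 2 bytes into out and returns how many were written. */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {                               /* ASCII: same in both encodings */
        out[0] = c;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (c >> 6));    /* leading byte: 110xxxxx */
    out[1] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation byte: 10xxxxxx */
    return 2;
}
```

Calling latin1_to_utf8(0xEA, out) stores 0xC3 and 0xAA in out, which are the bytes found in the UTF-8 version of accents.c.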
Latin-1 characters in binary symbols
In the following, all commands are run in a Latin-1 enabled
terminal, so when you see an ê below, it is really a Latin-1
encoded ê, i.e. a byte with hexadecimal value EA.
The restriction on C identifiers apparently doesn't apply to object files: it seems a name can be any sequence of bytes.
A long time ago, I started implementing a toy programming language in
which, for some reason, the identifiers use the Latin-1 encoding.
This was originally only an interpreter, but I later added a module
that generates LLVM intermediate representation, which turns it into a
compiler. LLVM IR has no problem with character encoding. Here's an
LLVM .ll file in which a function main calls a previously declared
function taille_fen\EAtre. As we'll see, \EA is just a way to
represent a byte of hex value EA, so the function name is actually
taille_fenêtre, in Latin-1.
; ModuleID = 'algo-llvm'
source_filename = "algo-llvm"
@c = global i8 0
; Function Attrs: argmemonly nofree nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #0
declare void @read_char(i8*)
declare void @"taille_fen\EAtre"(i64, i64)
define void @main() {
entry:
  call void @"taille_fen\EAtre"(i64 200, i64 600)
  call void @read_char(i8* @c)
  ret void
}
attributes #0 = { argmemonly nofree nounwind willreturn }
This file can be compiled to assembly and finally to object code
$ llc-14 accents.ll
$ as accents.s -o accents.o
$ nm accents.o
                 U _GLOBAL_OFFSET_TABLE_
0000000000000000 B c
0000000000000000 T main
                 U read_char
                 U taille_fenêtre
that contains a main function and needs definitions for the symbols
read_char and taille_fenêtre (where, again, the ê here is the byte
0xEA).
To execute this I need an object file providing definitions for
read_char and taille_fenêtre, and I'd like to create it from C
source.
Please note: I don't claim in any way that what I'm going to do is a good idea. In general, it most probably is not. I could have mangled (or transcoded) the names at some point to avoid the issue, but that seemed boring, and I like to experiment! After all, LLVM IR supports this out of the box.
Patching the binary
Since the compiler will refuse to build an object file with the symbol
I want, I can compile with a compiler-accepted symbol and then patch
the object file. Assume that in my source file the function is simply
named taille_fenetre, with only ASCII characters. The object file
will then state that it contains a definition for the symbol
taille_fenetre. It is certainly possible to edit the object file to
change the symbol. Changing the length of the symbol would certainly
corrupt the file, since everything stored after it would shift, but
simply substituting one byte for another in a symbol shouldn't break
anything.
Actually, we can simply use sed for such a substitution, even though
it is meant to edit text files. Here's a command to do that:
sed -i -e "s/taille_fenetre/taille_fen"$(/bin/echo -e "\xEA")"tre/g" libgraph.o
The full makefile recipe can be:
libgraph.o: libgraph.c
	gcc -Wall -fPIC -c $<
	sed -i -e "s/taille_fenetre/taille_fen"$$(/bin/echo -e "\xEA")"tre/g" \
	    libgraph.o
Granted, this assumes that the only occurrences of a sequence of bytes
that can represent the string taille_fenetre are indeed meant to
represent that symbol, and not a string constant or, worse, some
instructions whose binary form happens to produce that sequence of
bytes. Did I say doing this may not be a terrific idea?
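For readers who would rather not run sed on a binary file, the same same-length substitution can be sketched in C. This is only an illustration under assumptions: patch_same_length is a hypothetical helper, the object file is assumed to have been read entirely into memory, and like the sed command it cannot tell a symbol name apart from any other matching bytes.

```c
#include <stddef.h>
#include <string.h>

/* Replace every occurrence of `from` by `to` in buf[0..len).
   Both patterns have length n, so no byte moves and every offset
   stored in the file stays valid. Returns the number of replacements. */
size_t patch_same_length(unsigned char *buf, size_t len,
                         const unsigned char *from,
                         const unsigned char *to, size_t n)
{
    size_t count = 0;
    size_t i = 0;

    if (n == 0 || len < n)
        return 0;
    while (i + n <= len) {
        if (memcmp(buf + i, from, n) == 0) {
            memcpy(buf + i, to, n);
            count++;
            i += n;     /* continue after the patched bytes, like sed's /g */
        } else {
            i++;
        }
    }
    return count;
}
```

Returning the number of replacements gives a cheap sanity check: if the count is not the one expected, the pattern probably matched something other than the symbol, and the patched file should not be trusted.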
Last point: I've spent an unexpected amount of time getting this sed
command to work correctly in the Makefile, due to subtleties in the
way make runs the recipes. This is described in this post.
Footnotes
[1] clang is more explicit about what the problem is, although it could probably refrain from providing the 2nd and 3rd errors:
$ clang -c accents.c
accents.c:3:16: error: source file is not valid UTF-8
void taille_fen<EA>tre(int dx, int dy)
               ^
accents.c:3:6: error: variable has incomplete type 'void'
void taille_fen<EA>tre(int dx, int dy)
     ^
accents.c:3:16: error: expected ';' after top level declarator
void taille_fen<EA>tre(int dx, int dy)
               ^
               ;
3 errors generated.