Weird FFI problem in delivered image
Hi.
I have some FFI code for interfacing to a third-party C library.
Functions in this library take a C struct as input and (sometimes)
operate on it. Between operations I store this struct as a
base64-encoded string in a database. The whole setup works like a
charm in my development image, but not when I deliver my application.
Suddenly the library functions return a status code indicating that
the input is corrupted.
My delivered apps have a REPL available, and going through the steps
manually, I can't find any difference between the C structs in my
development image and the delivered image. The base64 strings are
verified identical, as is the value of every slot named by
fli:foreign-slot-names on the constructed C object.
This has led me to believe that the problem is not actually with the
input data, but somewhere on the FFI call chain.
So I started looking at documentation, and discovered that
fli:register-module should be called at runtime. I had it as a
toplevel form in the file containing fli definitions. Calling
fli:register-module from my application startup routine seems to fix
the problem.
But I don't really want to have all the applications that use this
library have to know about it and call my new initialization
function.
So I started thinking about action lists. But adding the
initialization to "When starting image" doesn't help. The C library is
available, but calls return the same erroneous status code about
corrupted input. I found an undocumented action list called "Reset
FFI on startup" that looked promising, but the same thing happened
when I put the initialization on that one.
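For concreteness, the action-list attempt looks roughly like the sketch
below. The function name register-pdata-module and the file name
"libpdata.so" are placeholders invented for illustration, not the real
definitions; only fli:register-module, lw:define-action and the "When
starting image" list are taken from the description above. The
#+lispworks conditionals merely let the forms be read in other Lisps.

```lisp
;; Sketch only: REGISTER-PDATA-MODULE and "libpdata.so" are hypothetical
;; placeholders; the real library name is not given in this post.
(defun register-pdata-module ()
  "Register the foreign module at runtime and return its name."
  #+lispworks
  (fli:register-module :pdata
                       :real-name "libpdata.so"
                       :connection-style :automatic)
  :pdata)

;; Put the registration on the startup action list, so that it runs
;; when the saved or delivered image starts rather than at load time.
#+lispworks
(lw:define-action "When starting image" "Register pdata module"
  'register-pdata-module)
```

As noted, with this arrangement the library is found, but the calls
still return the error status in the delivered image.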
To try to make sure there was no stale module registered before
delivery, I also called fli:disconnect-module with :remove t from a
delivery action. I verified that if I don't call my initialization
routine at all in the delivered image, I get an error about an
unresolved symbol.
My current guess is that not only must the fli:register-module call
happen at runtime, it must happen after multiprocessing has been
initialized, which I guess is after the "When starting image" action
list has run.
I have two questions, really:
* Has anyone else seen this kind of behaviour, or does anyone have a
clue about what may be causing the weird return values?
* Is there any way of specifying stuff to run after multiprocessing
has been initialized, but before (or in parallel with) the
application startup function takes over?
My platform is LWL 5.1.1. Delivered apps work fine in LWL 5.0 from
the same codebase (with the fli:register-module call on toplevel).
The C library I'm interfacing is very much a black box (RSA Securid
Authentication Engine, if anyone cares), there's not much hope of
debugging from that side.
The foreign function I'm calling (there's lots of them, but all return
the same error code) is defined as something like
(fli:define-foreign-function (%info "Info_NotTheRealFuncName" :source)
    ((bdata (:pointer (:struct binary-data)))
     (token-info (:pointer (:struct output-info-struct))))
  :result-type :int
  :language :ansi-c
  :module :pdata)
The error return value is 45 in the faulty case, 0 normally. This
return value means that the first argument is internally inconsistent,
but as I wrote, identical data in my development image is accepted.
The whole thing is under NDA, so I can't really show much more of the
definitions.
--
Mvh/Regards
Peder O. Klingenberg
Netfonds Bank ASA
Re: Weird FFI problem in delivered image
Nick Levine <ndl@ravenbrook.com> writes:
>> So I started thinking about action lists. But adding the
>> initialization to "When starting image" doesn't help. The C library
>> is available, but calls return the same erroneous status code about
>> corrupted input. I found an undocumented action list called "Reset
>> FFI on startup" that looked promising, but the same thing happened
>> when I put the initialization on that one.
>
> Was your action after all the predefined ones?
According to lw:print-actions, yes. I didn't explicitly set any
:after condition, but it ended up at the bottom of both lists, and the
doc says the output of print-actions is ordered.
> Seems unlikely.
No argument there, but I can't see what else is left. Testing that
hypothesis was the reason for my second question.
> Why can't your initialization function explicitly register the modules
> you need?
Until now, my lisp-level library didn't have an initialization
function. Application code using it just included my lisp file in the
build. I'd prefer to keep the thing nicely encapsulated and taking
care of its own needs, without involving the application startup code.
But calling an initialization function which registers the modules
from application startup is one way to avoid the problem.
Another workaround, which I'm leaning towards at the moment, relies on
the fact that all the FFI calls are wrapped in a single function, which
handles marshalling of foreign objects, C's stupid way of handling
multiple return values, and so on. It's easy to have this function
initialize the module on the first call. Considering that each call to
my Lisp functions causes at least one, usually two, round-trips to a
database, an extra test on each call to determine whether
initialization is needed is not terribly costly.
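A minimal sketch of that lazy-initialization idea follows. All names
here (*pdata-module-ready*, ensure-pdata-module, call-foreign) and the
file name "libpdata.so" are made up for illustration; the real wrapper
also does the marshalling, which is left out. It assumes the flag is
still NIL when the delivered image starts, so it would have to be reset
before delivery if the module had been used in the build image, and it
makes no attempt at thread safety.

```lisp
;; Sketch only: every name below is a hypothetical placeholder.
(defvar *pdata-module-ready* nil
  "True once the foreign module has been registered in this process.")

(defun ensure-pdata-module ()
  "Register the foreign module on first use; cheap no-op afterwards."
  (unless *pdata-module-ready*
    #+lispworks
    (fli:register-module :pdata :real-name "libpdata.so")
    (setf *pdata-module-ready* t))
  *pdata-module-ready*)

(defun call-foreign (function &rest args)
  "Stand-in for the single wrapper through which all FFI calls go."
  (ensure-pdata-module)          ; the extra test on each call
  (apply function args))
```

Application code then never sees the initialization; a call such as
(call-foreign #'%info bdata token-info) registers the module the first
time through and only checks the flag on later calls.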
Nevertheless, this solution leaves me with an itch in my aesteticles.
There should be a better way.
My curiosity has been piqued, and using either of the available,
tested and working workarounds seems like a cop-out at this point. :)
--
Mvh/Regards
Peder O. Klingenberg
Netfonds Bank ASA