[RESOURCE] Fast function lookups & calling from C++/DLLs


I've been sitting on this code for a while. It's already implemented in blocklandjs, but what the hell.
Code: [Select]
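//Assumed Torque console callback typedefs used by the union in Entry below.
//These match stock Torque console.h signatures, but double-check them against
//whatever your own Torque.h already declares:
typedef const char* (*StringCallback)(SimObject *obj, S32 argc, const char *argv[]);
typedef S32  (*IntCallback)   (SimObject *obj, S32 argc, const char *argv[]);
typedef F32  (*FloatCallback) (SimObject *obj, S32 argc, const char *argv[]);
typedef void (*VoidCallback)  (SimObject *obj, S32 argc, const char *argv[]);
typedef bool (*BoolCallback)  (SimObject *obj, S32 argc, const char *argv[]);
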
struct Namespace
{
   const char* mName;
   const char* mPackage;

   Namespace *mParent;
   Namespace *mNext;
   void *mClassRep;
   U32 mRefCountToParent;

   struct Entry
   {
      enum
      {
         GroupMarker = -3,
         OverloadMarker = -2,
         InvalidFunctionType = -1,
         ScriptFunctionType,
         StringCallbackType,
         IntCallbackType,
         FloatCallbackType,
         VoidCallbackType,
         BoolCallbackType
      };

      Namespace *mNamespace;
      //char _padding1[4];
      Entry *mNext;
      const char *mFunctionName;
      S32 mType;
      S32 mMinArgs;
      S32 mMaxArgs;
      const char *mUsage;
      const char *mPackage;
      void *mCode; // CodeBlock *mCode;
      U32 mFunctionOffset;
      union {
         StringCallback mStringCallbackFunc;
         IntCallback mIntCallbackFunc;
         VoidCallback mVoidCallbackFunc;
         FloatCallback mFloatCallbackFunc;
         BoolCallback mBoolCallbackFunc;
         const char* mGroupName;
      } cb;
   };

   Entry *mEntryList;
   Entry **mHashTable;
   U32 mHashSize;
   U32 mHashSequence; ///< @note The hash sequence is used by the autodoc console facility
                      ///       as a means of testing reference state.
   char *lastUsage;
};

static std::map<const char*, Namespace::Entry*> cache;
static std::map<const char*, Namespace*> nscache;
static Namespace* GlobalNS;

void rewrite__fatal() {
   Printf("!!! THIS SHOULD NEVER HAPPEN !!!");
}

Namespace::Entry* fastLookup(const char* ourNamespace, const char* name) {
   Namespace* ns;
   Namespace::Entry* entry;

   if(_stricmp(ourNamespace, "") == 0) { //If the namespace is blank, assume we're looking in the global namespace.
      if(GlobalNS == NULL) {
         GlobalNS = LookupNamespace(NULL);
      }
      ns = GlobalNS;
   }
   else {
      std::map<const char*, Namespace*>::iterator its;
      its = nscache.find(ourNamespace);
      if(its != nscache.end()) {
         ns = its->second;
         if(ns == NULL) {
            //Somehow it got nullptr'd..
            ns = LookupNamespace(ourNamespace);
            if(ns == NULL) {
               //THIS SHOULD NEVER HAPPEN!
               rewrite__fatal();
               Printf("Fatal: Found cached NS entry with nullptr, could not find namespace!");
               nscache.erase(its);
               return nullptr;
            }
            nscache.erase(its);
            nscache.insert(nscache.end(), std::make_pair(ourNamespace, ns));
         }
      }
      else {
         ns = LookupNamespace(StringTableEntry(ourNamespace));
         if(ns == NULL) {
            rewrite__fatal();
            Printf("Fatal: fastLookup FAILED (2/2)!");
            return nullptr;
         }
         nscache.insert(nscache.end(), std::make_pair(ourNamespace, ns));
      }
   }

   std::map<const char*, Namespace::Entry*>::iterator it;
   it = cache.find(name); //Look into our Namespace::Entry* cache..
   if(it != cache.end()) {
      entry = it->second;
      if(entry == NULL) {
         rewrite__fatal();
         Printf("Fatal: found nullptr in cache!");
         cache.erase(it); //Erase it so we don't encounter it again.
         return nullptr;
      }
   }
   else {
      entry = Namespace__lookup(ns, StringTableEntry(name));
      if(entry == NULL) {
         Printf("Could not find function.");
         return nullptr;
      }
      cache.insert(cache.end(), std::make_pair(name, entry)); //Insert it so further calls are optimized.
   }

   return entry;
}

If you're looking to implement this in the publicly available dllbase, add this to your Torque.h
Code: [Select]
BLFUNC_EXTERN(Namespace *, , LookupNamespace, const char *ns);
BLFUNC_EXTERN(Namespace::Entry *, __thiscall, Namespace__lookup, Namespace *this_, const char *name);
//blfunc_extern, or blfunc.. whichever matches the style of your dll.
and this to your Torque.cpp
Code: [Select]
//in torque_init, or InitTorqueStuff
BLSCAN(LookupNamespace, "\x8B\x44\x24\x04\x85\xC0\x75\x05", "xxxxxxxx");
BLSCAN(Namespace__lookup, "\x53\x56\x8B\xF1\x8B\x46\x24", "xxxxxxx");
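
(If you're wondering what those BLSCAN lines actually do: the first string is the byte pattern at the start of the target function, and the "x" mask marks which of those bytes have to match. Under the hood it boils down to a plain signature scan over the game module. A rough sketch of the idea, using a hypothetical ScanForPattern helper rather than the actual dllbase implementation:)
Code: [Select]
#include <windows.h>
#include <psapi.h>
#include <string.h>

//Scan the main module for a byte pattern; mask 'x' = byte must match, '?' = wildcard.
static unsigned int ScanForPattern(const char* pattern, const char* mask)
{
   MODULEINFO info = { 0 };
   GetModuleInformation(GetCurrentProcess(), GetModuleHandle(NULL), &info, sizeof(info));
   unsigned char* base = (unsigned char*)info.lpBaseOfDll;
   size_t size = info.SizeOfImage;
   size_t len = strlen(mask);

   for(size_t i = 0; i + len <= size; i++)
   {
      bool found = true;
      for(size_t j = 0; j < len; j++)
      {
         if(mask[j] == 'x' && (unsigned char)pattern[j] != base[i + j])
         {
            found = false;
            break;
         }
      }
      if(found)
         return (unsigned int)(base + i); //address the scanned function pointer gets bound to
   }
   return 0;
}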

This code can definitely be improved; this is just a rough "hacky" version. The point of the fast lookups is to avoid the "TorqueScript problem": when the engine looks up namespaces and entries, nothing is cached, so it has to search all over again on every call. This code avoids that by caching both in std::map objects. I've attempted to make this as "idiot-proof" as possible, but if you notice any problems, feel free to point them out. :)
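
One easy improvement, for example: the caches are keyed on raw const char* pointers, so std::map orders them by address rather than by string contents, and a cache hit only happens when the exact same pointer comes back in (fine for string literals and StringTableEntry values, easy to trip over otherwise). An untested sketch of a variant keyed on std::string:
Code: [Select]
#include <string>
#include <map>

//Same idea, but keys are compared by contents instead of by pointer value.
static std::map<std::string, Namespace::Entry*> cache;
static std::map<std::string, Namespace*> nscache;

//fastLookup stays the same apart from the iterator types, e.g.:
//std::map<std::string, Namespace*>::iterator its = nscache.find(ourNamespace);
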
« Last Edit: December 22, 2017, 10:42:25 AM by Metario »

Here's the code that goes in tandem with the lookups to perform fast calls (faster than TorqueScript, at least..)
Code: [Select]
void* ts__fastCall(Namespace::Entry* ourCall, SimObject* obj = NULL, int argc = 0, ...) {
   if(ourCall == NULL) {
      Printf("Invalid entry.");
      return nullptr;
   }
   const char* argv[argc + 1]; //VLA - relies on a GCC/MinGW extension; swap in a fixed-size array if your compiler complains.
   va_list args;
   va_start(args, argc);
   argv[0] = ourCall->mFunctionName;
   for(int i = 0; i < argc; i++) {
      argv[i + 1] = va_arg(args, const char*);
   }
   argc++;
   va_end(args);
   if(ourCall->mType == Namespace::Entry::ScriptFunctionType) {
      if(ourCall->mFunctionOffset) {
         const char* retVal = CodeBlock__exec(
            ourCall->mCode, ourCall->mFunctionOffset,
            ourCall->mNamespace, ourCall->mFunctionName,
            argc, argv, false, ourCall->mNamespace->mPackage,
            0);
         return (void*)retVal; //we know what it's supposed to return.
      }
      else {
         return nullptr;
      }
   }
   S32 mMinArgs = ourCall->mMinArgs, mMaxArgs = ourCall->mMaxArgs;
   if((mMinArgs && argc < mMinArgs) || (mMaxArgs && argc > mMaxArgs)) {
      Printf("Expected args between %d and %d, got %d", mMinArgs, mMaxArgs, argc);
      return nullptr;
   }
   SimObject* actualObj = obj; //NULL obj just means a global (non-object) call.
   switch(ourCall->mType) {
      case Namespace::Entry::StringCallbackType:
         return (void*)ourCall->cb.mStringCallbackFunc(actualObj, argc, argv);
      case Namespace::Entry::IntCallbackType:
         return (void*)ourCall->cb.mIntCallbackFunc(actualObj, argc, argv);
      case Namespace::Entry::FloatCallbackType: {
         //A float doesn't fit in a void*, so stash the result in a static and return its address.
         //(Returning a pointer to a local here would dangle as soon as the function returns.)
         static float floatRet;
         floatRet = ourCall->cb.mFloatCallbackFunc(actualObj, argc, argv);
         return (void*)&floatRet;
      }
      case Namespace::Entry::BoolCallbackType:
         return (void*)ourCall->cb.mBoolCallbackFunc(actualObj, argc, argv);
      case Namespace::Entry::VoidCallbackType:
         ourCall->cb.mVoidCallbackFunc(actualObj, argc, argv);
         return nullptr;
   }

   return nullptr;
}

Example usage:
Code: [Select]
ts__fastCall(fastLookup("", "quit"), NULL, 0); //would call quit, with 0 args. internally, it would actually have 1 arg, since for every call, argv[0] == mFunctionName.
ts__fastCall(fastLookup("", "echo"), NULL, 1, "Hello World"); //would call echo with "hello world".. you should be using Con::printf tho

Example, with return value:
Code: [Select]
SimObject* serverConnection = Sim__findObject_name("ServerConnection");
if(serverConnection != NULL) {
   int count = (int)ts__fastCall(fastLookup(serverConnection->mNameSpace->mName, "getCount"), serverConnection, 0);
   Printf("ServerConnection.getCount() = %d", count);
}
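
The same idea works for string-returning functions: a StringCallbackType entry hands its const char* straight back through the void*, so you just cast it. (Sketch; assumes getWord resolves through the global namespace.)
Code: [Select]
//getWord("a b c", 1) should hand back "b".
const char* word = (const char*)ts__fastCall(fastLookup("", "getWord"), NULL, 2, "a b c", "1");
Printf("getWord returned: %s", word);
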
« Last Edit: December 22, 2017, 11:37:26 AM by Metario »

Have you run any benchmarks or anything?

Quote
Have you run any benchmarks or anything?
I'll run some benchmarks when I get home, but even without them this should be faster: lookups in a std::map are O(log x), while looking up the entry normally forces the engine to loop through the entire list to find it, an O(x) situation, where x is the number of entries in the list.
Code: [Select]
//from torque engine source
Namespace *Namespace::find(StringTableEntry name, StringTableEntry package)
{
   for(Namespace *walk = mNamespaceList; walk; walk = walk->mNext)
      if(walk->mName == name && walk->mPackage == package)
         return walk;

   Namespace *ret = (Namespace *) mAllocator.alloc(sizeof(Namespace));
   constructInPlace(ret);
   ret->mPackage = package;
   ret->mName = name;
   ret->mNext = mNamespaceList;
   mNamespaceList = ret;
   return ret;
}
Namespace::Entry *Namespace::lookupRecursive(StringTableEntry name)
{
   for(Namespace *ns = this; ns; ns = ns->mParent)
      for(Entry *walk = ns->mEntryList; walk; walk = walk->mNext)
         if(walk->mFunctionName == name)
            return walk;

   return NULL;
}

Namespace::Entry *Namespace::lookup(StringTableEntry name)
{
   if(mHashSequence != mCacheSequence)
      buildHashTable();

   U32 index = HashPointer(name) % mHashSize;
   while(mHashTable[index] && mHashTable[index]->mFunctionName != name)
   {
      index++;
      if(index >= mHashSize)
         index = 0;
   }
   return mHashTable[index];
}


Benchmarks!
All benchmarks were run with hayai.
Compile options were:
Code: [Select]
-I./hayai/src \
-O2 \
-static \
-static-libgcc \
-static-libstdc++ \
-lstdc++ \
-lpsapi \
-lpthread \
-w

Code used for benchmarking:
Code: [Select]
class Benchmarks
{
public:
   void CachedCall() {
      Namespace::Entry* us = fastLookup(ns, fn);
      ts__fastCall(us, NULL, 2, "55", "123");
   }

   void UnCachedCall() {
      Namespace* ourNs;
      if(_stricmp(ns, "") == 0) {
         ourNs = LookupNamespace(NULL);
      }
      else {
         ourNs = LookupNamespace(ns);
         if(ourNs == NULL) {
            Printf("Benchmarks::UnCachedCall()- found invalid namespace");
            return;
         }
      }
      Namespace::Entry* ourEntry = Namespace__lookup(ourNs, StringTableEntry(fn));
      if(ourEntry == NULL) {
         Printf("Benchmarks::UnCachedCall()- found invalid Namespace::Entry...");
         return;
      }
      ts__fastCall(ourEntry, NULL, 2, "55", "123");
   }

   Benchmarks(const char* nameforget, const char* fnName)
   {
      ns = nameforget;
      fn = fnName;
   }
private:
   const char* ns;
   const char* fn;
};

BENCHMARK(Benchmarks, CachedCall, 500, 10) {
   Benchmarks("", "mPow").CachedCall();
}

BENCHMARK(Benchmarks, UnCachedCall, 500, 10) {
   Benchmarks("", "mPow").UnCachedCall();
}

static void ts__runBench(SimObject* obj, int argc, const char* argv[]) {
   hayai::ConsoleOutputter consoleOutputter;

   hayai::Benchmarker::AddOutputter(consoleOutputter);
   hayai::Benchmarker::RunAllTests();
}

Quote
==>dll_runBench();
% [==========] Running 2 benchmarks.
[ RUN      ] Benchmarks.CachedCall (500 runs, 10 iterations per run)
[     DONE ] Benchmarks.CachedCall (3.068500 ms)
[   RUNS   ]        Average time: 6.137 us (~0.776 us)
                    Fastest time: 5.889 us (-0.248 us / -4.041 %)
                    Slowest time: 17.789 us (+11.652 us / +189.865 %)
                     Median time: 5.989 us (1st quartile: 5.989 us | 3rd quartile: 6.089 us)
                                  
             Average performance: 162946.06485 runs/s
                Best performance: 169808.11683 runs/s (+6862.05198 runs/s / +4.21124 %)
               Worst performance: 56214.51459 runs/s (-106731.55026 runs/s / -65.50115 %)
              Median performance: 166972.78344 runs/s (1st quartile: 166972.78344 | 3rd quartile: 164230.57973)
                                  
[ITERATIONS]        Average time: 0.614 us (~0.078 us)
                    Fastest time: 0.589 us (-0.025 us / -4.041 %)
                    Slowest time: 1.779 us (+1.165 us / +189.865 %)
                     Median time: 0.599 us (1st quartile: 0.599 us | 3rd quartile: 0.609 us)
                                  
             Average performance: 1629460.64853 iterations/s
                Best performance: 1698081.16828 iterations/s (+68620.51975 iterations/s / +4.21124 %)
               Worst performance: 562145.14588 iterations/s (-1067315.50265 iterations/s / -65.50115 %)
              Median performance: 1669727.83436 iterations/s (1st quartile: 1669727.83436 | 3rd quartile: 1642305.79734)
[ RUN      ] Benchmarks.UnCachedCall (500 runs, 10 iterations per run)
[     DONE ] Benchmarks.UnCachedCall (4.807700 ms)
[   RUNS   ]        Average time: 9.615 us (~2.624 us)
                    Fastest time: 7.189 us (-2.426 us / -25.235 %)
                    Slowest time: 24.989 us (+15.374 us / +159.885 %)
                     Median time: 7.389 us (1st quartile: 7.289 us | 3rd quartile: 11.889 us)
                                  
             Average performance: 103999.83360 runs/s
                Best performance: 139101.40492 runs/s (+35101.57132 runs/s / +33.75156 %)
               Worst performance: 40017.60775 runs/s (-63982.22585 runs/s / -61.52147 %)
              Median performance: 135336.31073 runs/s (1st quartile: 137193.03059 | 3rd quartile: 84111.36345)
                                  
[ITERATIONS]        Average time: 0.962 us (~0.262 us)
                    Fastest time: 0.719 us (-0.243 us / -25.235 %)
                    Slowest time: 2.499 us (+1.537 us / +159.885 %)
                     Median time: 0.739 us (1st quartile: 0.729 us | 3rd quartile: 1.189 us)
                                  
             Average performance: 1039998.33600 iterations/s
                Best performance: 1391014.04924 iterations/s (+351015.71324 iterations/s / +33.75156 %)
               Worst performance: 400176.07747 iterations/s (-639822.25853 iterations/s / -61.52147 %)
              Median performance: 1353363.10732 iterations/s (1st quartile: 1371930.30594 | 3rd quartile: 841113.63445)
[==========] Ran 2 benchmarks.

So, yes: ts__fastCall with fastLookup takes roughly 64% of the time of looking the namespace and function up on every call (6.137 us vs. 9.615 us average per run), which works out to about 57% more calls per second.
« Last Edit: December 23, 2017, 11:03:08 PM by Metario »