Performance
Performance of the FreeToken SDK depends largely on the LLM, network latency, the number of tokens in the prompt, and device capabilities. That said, we've tuned the SDK to be as fast as possible.
Tip: Know which models are recommended for each device
We don't have a hard-and-fast rule about which devices can run which models, but we do publish recommendations that you can view in the console.
Pre-warming the model
Models are loaded automatically whenever you begin using an AI model. However, loading can take several seconds, especially the first time a model is loaded after download.
To help with this, we suggest pre-loading models when you expect a user to use the model imminently. The best way to do this is to call the prewarmAIFor method with the appropriate parameters.
// Use a unique run ID for this prewarm
await FreeToken.shared.prewarmAIFor(runIdentifier: "your_run_id")

// ... Later, when you go to use the model ...
await FreeToken.shared.runMessageThread(
    id: messageThreadID,
    runIdentifier: "your_run_id", // Keep the ID the same as when you prewarmed
    success: { message in
        // Handle success
    },
    error: { error in
        // Handle error
    }
)
What happens when you pre-warm?
We load the model into memory, initialize the model context, and (if not already done) pre-cache the system prompt onto disk for faster future loads. This is especially helpful for larger system prompts that contain many tool definitions or examples.
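In practice, you can trigger the prewarm as soon as the user lands on a screen that leads into your chat experience, so the model is ready by the time they start typing. Below is a minimal SwiftUI sketch; the view, module import, and run identifier are hypothetical, and only prewarmAIFor comes from the SDK calls shown above.

import SwiftUI
import FreeToken // assumed module name for the FreeToken SDK

// Hypothetical screen that sits one tap away from your chat UI.
struct ChatEntryView: View {
    var body: some View {
        Text("Ask our assistant anything")
            .task {
                // Prewarm while the user is still reading this screen.
                // Reuse the same run identifier later when calling runMessageThread.
                await FreeToken.shared.prewarmAIFor(runIdentifier: "chat_entry_prewarm")
            }
    }
}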
runMessageThread vs localChat
We have optimized runMessageThread for performance and recommend using it over localChat whenever possible. localChat is still a great way to run simple conversations where you do not need to persist the response or use advanced features.
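For reference, a quick throwaway exchange with localChat might look like the sketch below. This is an assumption-heavy sketch: we're assuming localChat accepts a message string plus success and error closures, mirroring the runMessageThread call above, so check the API reference for the exact signature.

// Sketch only: the localChat signature here is assumed to mirror
// runMessageThread's success/error closures; verify against the API reference.
await FreeToken.shared.localChat(
    message: "Summarize today's release notes in one sentence.",
    success: { reply in
        // The reply is not persisted to a message thread; use it immediately.
        print(reply)
    },
    error: { error in
        print("localChat failed: \(error)")
    }
)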
GPU vs CPU
We always opt to run the model and all of its operations on the GPU, with one exception (described below). This allows for the fastest possible performance for your application. The trade-off is that GPU memory is slightly more limited than CPU memory (even on unified memory systems like Apple Silicon).
In our performance evaluations on Apple hardware, we found that keeping the entire model on the GPU is around 2.5x faster than splitting the KV cache off to the CPU while running model operations on the GPU.
The one exception is when the client is running on a device that does not support on-device AI. In this case, we run only the embedding model locally, and we run it in CPU mode while creating embeddings for documents and queries. We keep this on device because it preserves security and privacy, allowing encrypted documents to be searched without passing string queries to the server.