
Ollama Models

A drag-and-drop component for running local LLMs through Ollama. Configure model parameters and connect inputs/outputs to other components while keeping all processing on your machine.

Image: Ollama component interface and configuration

Local Setup Required: This component requires Ollama to be installed and running on your machine or a remote server you can access. Ensure you have downloaded the necessary models before using this component.

Component Inputs

  • Base URL: The URL where Ollama is running

    Example: "http://localhost:11434" (Default for local installation)

  • Template: Custom prompt template for model instructions

    Example: "[INST] input [/INST]" (For Llama-based models)

  • Format: Response format specification

    Example: "json" (To force JSON output from supported models)

  • System: System prompt to guide model behavior

    Example: "You are a helpful AI assistant that answers questions concisely."

  • Input: User input text

    Example: "Explain the concept of transfer learning in machine learning."

Component Outputs

  • Text: Generated text output

    Example: "Transfer learning is a machine learning technique where a model developed for one task is reused as the starting point for a model on a second task..."

  • Language Model: Model information and metadata

    Example: model: llama3:8b, created_at: 2024-07-07T12:34:56.789Z, done: true
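These outputs correspond directly to fields of Ollama's /api/generate response: the generated text arrives in the response field, and the remaining fields (model, created_at, done, and timing/token counts) make up the metadata. A minimal non-streaming sketch, assuming the request body from the previous example and the Base URL input:

// Sketch: sending the request and splitting the reply into the two component outputs
async function callOllama(baseUrl, requestBody) {
  const res = await fetch(`${baseUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(requestBody)
  });
  const data = await res.json();
  return {
    text: data.response,          // Text output
    languageModel: {              // Language Model output (metadata)
      model: data.model,
      created_at: data.created_at,
      done: data.done
    }
  };
}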

Model Parameters

Temperature

Controls randomness in the output - higher values increase creativity

Default: 0.7
Range: 0.0 to 2.0
Recommendation: Lower (0.1-0.3) for factual/consistent responses, higher (0.7-1.0) for creative tasks

Context Window Size

Maximum number of tokens to use from the context

Default: Model-dependent
Range: 1 to model maximum (e.g., 8192 for Llama3 8B)
Recommendation: Adjust based on available system memory

Number of GPU

Number of GPUs to use for inference

Default: All available
Range: 0 to number of available GPUs
Recommendation: Use all available GPUs for optimal performance

Number of Threads

CPU threads to utilize

Default: Auto-detected
Range: 1 to available CPU threads
Recommendation: Set to the number of physical cores for a good balance of performance and system responsiveness
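When Ollama is called directly, these parameters are passed in the options object of the request, using Ollama's own option names (temperature, num_ctx, num_gpu, num_thread); the labels exposed by the component may differ slightly from the raw API names. A brief sketch:

// Sketch: core model parameters expressed as Ollama API options
const coreOptions = {
  temperature: 0.3,   // lower values for factual/consistent responses
  num_ctx: 8192,      // context window size in tokens
  num_gpu: 1,         // GPU usage (see Number of GPU above)
  num_thread: 8       // CPU threads, roughly the number of physical cores
};
// Passed as the options field of the /api/generate request body, e.g. { ...requestBody, options: coreOptions }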

Advanced Settings

Mirostat

Adaptive sampling algorithm for controlling perplexity

Options: Disabled (0), Mirostat (1), Mirostat 2.0 (2)
Default: Disabled (0)
Recommendation: Enable for more consistent output quality

Mirostat Eta

Learning rate for mirostat algorithm

Default: 0.1
Range: 0.0 to 1.0
Recommendation: Start with the default and adjust if needed

Mirostat Tau

Target entropy for mirostat algorithm

Default: 5.0
Range: 0.0 to 10.0
Recommendation: Lower values (3-5) for more focused text

Repeat Penalty

Penalty for repeated token sequences

Default: 1.1
Range: 1.0 to 2.0
Recommendation: Higher values (1.2-1.5) to reduce repetition

Top K

Limits the vocabulary at each generation step to the k most likely tokens

Default: 40
Range: 0 (disabled) to any positive integer
Recommendation: 40-100 for balanced diversity

Top P

Nucleus sampling - restricts sampling to the most likely tokens whose cumulative probability reaches p

Default: 0.9
Range: 0.0 to 1.0
Recommendation: 0.9-0.95 for most use cases
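These advanced settings also travel in the same options object, under Ollama's option names (mirostat, mirostat_eta, mirostat_tau, repeat_penalty, top_k, top_p). A sketch using the defaults listed above:

// Sketch: advanced sampling settings as Ollama API options (defaults from this section)
const samplingOptions = {
  mirostat: 0,        // 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0
  mirostat_eta: 0.1,  // learning rate, used only when Mirostat is enabled
  mirostat_tau: 5.0,  // target entropy, used only when Mirostat is enabled
  repeat_penalty: 1.1,
  top_k: 40,
  top_p: 0.9
};
// Merged into the request alongside the core parameters, e.g. options: { ...coreOptions, ...samplingOptions }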

Implementation Example

// Basic configuration
const ollamaConfig = {
  baseUrl: "http://localhost:11434",
  model: "llama3",
  system: "You are a helpful programming assistant."
};

// Advanced configuration
const advancedOllamaConfig = {
  baseUrl: "http://localhost:11434",
  model: "mistral",
  temperature: 0.5,
  numGpu: 1,
  numThread: 8,
  repeatPenalty: 1.2,
  topK: 50,
  topP: 0.9,
  mirostat: 2,
  mirostatEta: 0.1,
  mirostatTau: 5.0,
  template: "{{system}}\n\nUser: {{input}}\n\nAssistant:"
};

// Usage example
async function generateResponse(input) {
  const response = await ollamaComponent.generate({
    input: input,
    system: "You are an AI assistant that explains complex concepts simply.",
    temperature: 0.3
  });
  return response.text;
}

Use Cases

  • Privacy-Focused Applications: Use local LLMs for sensitive data that shouldn't leave your system
  • Offline Development: Create AI-powered applications that work without internet connectivity
  • Cost-Effective Solutions: Eliminate API costs by running models locally
  • Low-Latency Requirements: Reduce response time by eliminating network latency
  • Custom Model Integration: Run specialized or fine-tuned models not available on cloud services

Best Practices

  • Ensure Ollama is running before attempting to connect (see the health-check sketch after this list)
  • Adjust thread count based on your CPU capabilities
  • Configure GPU usage appropriately for your hardware
  • Test with different sampling methods to find the best for your use case
  • Use smaller models (7B-13B) for faster responses or on limited hardware
  • Consider using quantized models (Q4_K_M) to reduce memory requirements
  • Monitor system resource usage when running larger models
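For the first item above, a quick way to confirm that Ollama is reachable is to query its /api/tags endpoint, which lists the locally available models. A minimal health-check sketch assuming the default local URL:

// Sketch: check that Ollama is reachable before wiring up the component
async function ollamaIsRunning(baseUrl = "http://localhost:11434") {
  try {
    const res = await fetch(`${baseUrl}/api/tags`);   // lists locally pulled models
    return res.ok;
  } catch {
    return false;                                     // connection refused: Ollama is not running
  }
}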