Configure LLM platforms

  • Tier: Premium, Ultimate
  • Offering: GitLab Self-Managed

The AI Gateway supports multiple LLM providers through LiteLLM. Each platform has different features and strengths, so you can choose the one that best fits your needs. The following documentation summarizes the providers we have validated and tested. If the platform you want to use is not in this documentation, provide feedback in the platform request issue (issue 526144).

Use multiple models and platforms

You can use multiple models and platforms in the same GitLab instance.

For example, you can configure one feature to use Azure OpenAI, and another feature to use AWS Bedrock or self-hosted models served with vLLM.

This setup gives you flexibility to choose the best model and platform for each use case. Models must be supported and served through a compatible platform.

Self-hosted model deployments

vLLM

vLLM is a high-performance inference server optimized for serving LLMs with memory efficiency. It supports model parallelism and integrates easily with existing workflows.

To install vLLM, see the vLLM Installation Guide. You should install version v0.6.4.post1 or later.

Configuring the endpoint URL

When configuring the endpoint URL for any OpenAI API compatible platforms (such as vLLM) in GitLab:

  • The URL must be suffixed with /v1
  • If using the default vLLM configuration, the endpoint URL would be https://<hostname>:8000/v1
  • If your server is configured behind a proxy or load balancer, you might not need to specify the port, in which case the URL would be https://<hostname>/v1
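These URL rules can be sketched as a small helper. This function is purely illustrative (it is not part of GitLab or vLLM), but it captures the two cases above:

```python
def openai_endpoint_url(hostname, port=8000, behind_proxy=False):
    """Build an OpenAI-API-compatible endpoint URL for vLLM.

    The URL must end in /v1. The port (8000 in the default vLLM
    configuration) is omitted when a proxy or load balancer fronts
    the server.
    """
    if behind_proxy:
        return f"https://{hostname}/v1"
    return f"https://{hostname}:{port}/v1"


print(openai_endpoint_url("vllm.example.internal"))
# https://vllm.example.internal:8000/v1
print(openai_endpoint_url("vllm.example.internal", behind_proxy=True))
# https://vllm.example.internal/v1
```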

Find the model name

After the model has been deployed, to get the model name for the model identifier field in GitLab, query the vLLM server’s /v1/models endpoint:

curl \
  --header "Authorization: Bearer API_KEY" \
  --header "Content-Type: application/json" \
  http://your-vllm-server:8000/v1/models

The model name is the value of the id field for each entry in the data array of the response.

Example response:

{
  "object": "list",
  "data": [
    {
      "id": "Mixtral-8x22B-Instruct-v0.1",
      "object": "model",
      "created": 1739421415,
      "owned_by": "vllm",
      "root": "mistralai/Mixtral-8x22B-Instruct-v0.1",
      // Additional fields removed for readability
    }
  ]
}

In this example, if the model’s id is Mixtral-8x22B-Instruct-v0.1, you would set the model identifier in GitLab as custom_openai/Mixtral-8x22B-Instruct-v0.1.
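As a minimal sketch of that mapping, using a trimmed copy of the example response above, the GitLab model identifier can be derived like this:

```python
import json

# Trimmed /v1/models response, as shown in the example above
payload = json.loads("""
{
  "object": "list",
  "data": [
    {"id": "Mixtral-8x22B-Instruct-v0.1", "object": "model", "owned_by": "vllm"}
  ]
}
""")

# GitLab expects the served model name prefixed with custom_openai/
model_name = payload["data"][0]["id"]
gitlab_model_identifier = f"custom_openai/{model_name}"
print(gitlab_model_identifier)  # custom_openai/Mixtral-8x22B-Instruct-v0.1
```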

The following sections show how to download and serve specific models with vLLM.

Mistral-7B-Instruct-v0.3

  1. Download the model from HuggingFace:

    git clone https://<your-hugging-face-username>:<your-hugging-face-token>@huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
  2. Run the server:

    vllm serve <path-to-model>/Mistral-7B-Instruct-v0.3 \
       --served_model_name <choose-a-name-for-the-model>  \
       --tokenizer_mode mistral \
       --tensor_parallel_size <number-of-gpus> \
       --load_format mistral \
       --config_format mistral \
       --tokenizer <path-to-model>/Mistral-7B-Instruct-v0.3

Mixtral-8x7B-Instruct-v0.1

  1. Download the model from HuggingFace:

    git clone https://<your-hugging-face-username>:<your-hugging-face-token>@huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
  2. Rename the token config:

    cd <path-to-model>/Mixtral-8x7B-Instruct-v0.1
    cp tokenizer.model tokenizer.model.v3
  3. Run the model:

    vllm serve <path-to-model>/Mixtral-8x7B-Instruct-v0.1 \
      --tensor_parallel_size 4 \
      --served_model_name <choose-a-name-for-the-model> \
      --tokenizer_mode mistral \
      --load_format safetensors \
      --tokenizer <path-to-model>/Mixtral-8x7B-Instruct-v0.1

Disable request logging to reduce latency

When running vLLM in production, you can reduce latency by using the --disable-log-requests flag to disable request logging.

Disabling request logging minimizes the overhead introduced by verbose logs, especially under high load, and can improve performance. Use this flag only when you do not need detailed request logging.

vllm serve <path-to-model>/<model-version> \
  --served_model_name <choose-a-name-for-the-model> \
  --disable-log-requests

This change has been observed to notably improve response times in internal benchmarks.

Cloud-hosted model deployments

GitLab has validated and tested the following providers. The AI Gateway supports LLM providers that are compatible with LiteLLM.

Configure authentication with AWS Bedrock

You can use several methods to authenticate AWS Bedrock with your AI Gateway.

Prerequisites:

  • Models are automatically enabled in Bedrock when first invoked. For more information, see Bedrock model access.
  • Have AWS credentials configured with appropriate IAM permissions.

Use IRSA (IAM Roles for Service Accounts) for your AI Gateway pods to authenticate to AWS Bedrock without storing static credentials.

After you authenticate Amazon EKS with IRSA, the AI Gateway automatically obtains temporary credentials from the IRSA role.

To use IRSA to authenticate Amazon EKS:

  1. Create an IAM policy that grants access to Bedrock models, and save it as bedrock-policy.json. You can scope this policy to specific models if you require more security:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "bedrock:InvokeModel",
            "bedrock:InvokeModelWithResponseStream"
          ],
          "Resource": "arn:aws:bedrock:*::foundation-model/*"
        }
      ]
    }
    aws iam create-policy \
      --policy-name bedrock-ai-gateway-access \
      --policy-document file://bedrock-policy.json \
      --description "Bedrock access for AI Gateway"
  2. Optional. For stricter access control, replace the wildcard resource with specific model Amazon Resource Names (ARNs). This ensures that only approved models can be accessed, even if the GitLab configuration changes. For available model ARNs, see Amazon Bedrock model IDs.

    "Resource": [
      "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
      "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
    ]

    Some models might use different ARN formats. For example, newer models might require inference profile ARNs in addition to foundation model ARNs. To check the ARN format for your specific model, see the Amazon Bedrock model IDs.

  3. Create an IAM role with a trust policy for your Amazon EKS service account to use. Save the trust policy as trust-policy.json, and replace the following values:

    • YOUR_ACCOUNT_ID: Your AWS account ID.
    • REGION: Your Amazon EKS cluster region (for example, us-east-1).
    • YOUR_OIDC_ID: Your Amazon EKS cluster’s OIDC provider ID.
    • NAMESPACE: Kubernetes namespace where AI Gateway is deployed.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "arn:aws:iam::YOUR_ACCOUNT_ID:oidc-provider/oidc.eks.REGION.amazonaws.com/id/YOUR_OIDC_ID"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "oidc.eks.REGION.amazonaws.com/id/YOUR_OIDC_ID:sub": "system:serviceaccount:NAMESPACE:ai-gateway",
              "oidc.eks.REGION.amazonaws.com/id/YOUR_OIDC_ID:aud": "sts.amazonaws.com"
            }
          }
        }
      ]
    }
    # Create the role
    aws iam create-role \
      --role-name eks-ai-gateway-bedrock \
      --assume-role-policy-document file://trust-policy.json \
      --description "EKS IRSA role for AI Gateway to access Bedrock"
  4. Attach the Bedrock IAM policy to this role.

    # Attach the policy to the role
    aws iam attach-role-policy \
      --role-name eks-ai-gateway-bedrock \
      --policy-arn arn:aws:iam::YOUR_ACCOUNT_ID:policy/bedrock-ai-gateway-access
  5. To configure the Helm chart, install the AI Gateway with the IAM role annotation:

    serviceAccount:
      create: true
      name: ai-gateway
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_ROLE_NAME
    extraEnvironmentVariables:
      - name: AWS_REGION
        value: us-east-1

For more information, see IAM roles for service accounts.
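The foundation-model ARNs used in step 2 follow a fixed pattern, so you can generate them rather than hand-edit each one. A sketch (the model IDs are the examples from step 2; confirm the exact ARN format for your model in the Bedrock documentation):

```python
def foundation_model_arn(model_id, region="*"):
    """Build a Bedrock foundation-model ARN.

    Note the empty account-ID segment (the double colon): foundation
    models are AWS-owned, so no account ID appears in the ARN.
    """
    return f"arn:aws:bedrock:{region}::foundation-model/{model_id}"


# Scoped Resource list for the IAM policy in step 2
allowed_models = [
    foundation_model_arn("anthropic.claude-3-5-sonnet-20241022-v2:0", region="us-east-1"),
    foundation_model_arn("anthropic.claude-3-haiku-20240307-v1:0", region="us-east-1"),
]
print(allowed_models[0])
# arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0
```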

Docker deployments

Configure IAM credentials through environment variables when starting the AI Gateway container:

docker run -d \
  -e AWS_ACCESS_KEY_ID=your-access-key \
  -e AWS_SECRET_ACCESS_KEY=your-secret-key \
  -e AWS_REGION=us-east-1 \
  -p 5052:5052 \
  registry.gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/model-gateway:self-hosted-vX.Y.Z-ee

The IAM user or role must have a policy similar to the one you would create for Amazon EKS with the Helm chart.

Kubernetes deployments

For Kubernetes clusters other than Amazon EKS, you can use Kubernetes secrets to store AWS credentials:

  1. Create a Kubernetes secret:

    kubectl create secret generic aws-credentials \
      --from-literal=access-key-id=YOUR_ACCESS_KEY_ID \
      --from-literal=secret-access-key=YOUR_SECRET_ACCESS_KEY \
      -n YOUR_NAMESPACE
  2. Configure the Helm chart to reference the secret:

    extraEnvironmentVariables:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: aws-credentials
            key: access-key-id
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: aws-credentials
            key: secret-access-key
      - name: AWS_REGION
        value: us-east-1

AWS Bedrock API keys

To use AWS Bedrock API keys as an alternative to IAM credentials:

  1. Create a Bedrock API key.

  2. Create a Kubernetes secret with the API key:

    kubectl create secret generic bedrock-api-key \
      --from-literal=token=YOUR_BEDROCK_API_KEY \
      -n YOUR_NAMESPACE
  3. Configure the AI Gateway (add to your values.yaml):

    extraEnvironmentVariables:
      - name: AWS_BEARER_TOKEN_BEDROCK
        valueFrom:
          secretKeyRef:
            name: bedrock-api-key
            key: token
      - name: AWS_REGION
        value: us-east-1

Private VPC endpoints

To use a private Bedrock endpoint in a VPC, set the AWS_BEDROCK_RUNTIME_ENDPOINT environment variable.

For Helm deployments:

extraEnvironmentVariables:
  - name: AWS_BEDROCK_RUNTIME_ENDPOINT
    value: https://bedrock-runtime.us-east-1.amazonaws.com

For Docker deployments:

docker run -d \
  -e AWS_BEDROCK_RUNTIME_ENDPOINT=https://bedrock-runtime.us-east-1.amazonaws.com \
  -e AWS_REGION=us-east-1 \
  # ... other configuration

For VPC endpoints, use the format: https://vpce-{vpc-endpoint-id}-{service-name}.{region}.vpce.amazonaws.com
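As a sketch of that format (the endpoint ID and service name below are placeholders, not real values):

```python
def vpce_url(vpc_endpoint_id, service_name, region):
    """Build a private VPC endpoint URL in the documented format:
    https://vpce-{vpc-endpoint-id}-{service-name}.{region}.vpce.amazonaws.com
    """
    return f"https://vpce-{vpc_endpoint_id}-{service_name}.{region}.vpce.amazonaws.com"


print(vpce_url("0123456789abcdef0", "bedrock-runtime", "us-east-1"))
# https://vpce-0123456789abcdef0-bedrock-runtime.us-east-1.vpce.amazonaws.com
```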

Configure authentication with Google Vertex AI

To use models from Google Vertex AI, you must authenticate your AI Gateway instance. You can use any of the following mechanisms:

  • Set the following environment variables when running the AI Gateway Docker container:

    GOOGLE_APPLICATION_CREDENTIALS=/path/to/application_default_credentials.json
    VERTEXAI_PROJECT=<gcp-project-id>
    VERTEXAI_LOCATION=global
  • Run the AI Gateway container on Google Cloud Run and use the Cloud Run service account for Vertex AI access.