whisper-base

Brief: Transcribe audio to text using OpenAI whisper-base.


Overview

  • Method: POST
  • Path: /v1/audio/transcriptions
  • Content-Type: multipart/form-data

Authentication

  • Header: Authorization: Bearer <token>
  • Supports bearer token authentication

Request Body Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file | file | Yes | Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm. Maximum file size: 25 MB. |
| model | string | Yes | Model name. Set to whisper-base. |
| language | string | No | Language of the audio as an ISO-639-1 code. Supported values: zh, en, de, es. Specifying the language can improve accuracy. |
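The constraints in the parameter table can be checked on the client before uploading, which avoids sending a request that will be rejected. The following is a minimal sketch, not part of the API: the helper name `check_request` is hypothetical, the file format is inferred from the file extension only, and "25 MB" is assumed to mean 25 × 1024 × 1024 bytes.

```python
import os

# Values copied from the parameter table.
SUPPORTED_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
SUPPORTED_LANGUAGES = {"zh", "en", "de", "es"}
# Assumption: "25 MB" is read as 25 * 1024 * 1024 bytes; the service may count differently.
MAX_FILE_BYTES = 25 * 1024 * 1024


def check_request(filename, size_bytes, language=None):
    """Return a list of problems found; an empty list means the request looks valid."""
    problems = []
    # Assumption: the extension is used as a stand-in for the real audio format.
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file extension: {extension!r}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the 25 MB limit")
    # language is optional, so None is always accepted.
    if language is not None and language not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language: {language!r}")
    return problems
```

For a file on disk, the size can be obtained with `os.path.getsize(path)` and passed as `size_bytes`.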

Optional header: X-Failover-Enabled. Setting it to true enables failover: if the current compute service is unavailable, the request may be routed to a backup compute endpoint.
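The authentication header and the optional failover header can be assembled in one place so that every request sends them consistently. This is a sketch under assumptions: `build_headers` is a hypothetical helper, and the environment variable name `WHISPER_API_TOKEN` is chosen for illustration and is not defined by this API.

```python
import os


def build_headers(token, failover=False):
    """Build request headers: bearer auth, plus the optional failover header."""
    headers = {"Authorization": f"Bearer {token}"}
    if failover:
        # The examples on this page send the literal string "true".
        headers["X-Failover-Enabled"] = "true"
    return headers


# Assumption: the token is stored in an environment variable named
# WHISPER_API_TOKEN (hypothetical); "sk-xxxx" is the placeholder used on this page.
token = os.environ.get("WHISPER_API_TOKEN", "sk-xxxx")
headers = build_headers(token, failover=True)
```

Content-Type is deliberately left out: HTTP clients such as requests generate the multipart/form-data header, including its boundary, when a file is attached, and setting it by hand usually breaks the upload.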


curl Example

```bash
curl -X POST "https://api.gpt.ge/v1/audio/transcriptions" \
  -H "Authorization: Bearer sk-xxxx" \
  -H "X-Failover-Enabled: true" \
  -F "file=@./audio.wav" \
  -F "model=whisper-base" \
  -F "language=zh"
```

JavaScript (fetch) Example

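```javascript
// audioFile is a File or Blob, for example from an <input type="file"> element.
const formData = new FormData();
formData.append('file', audioFile);
formData.append('model', 'whisper-base');
formData.append('language', 'zh');

// Do not set Content-Type manually; fetch adds the multipart boundary itself.
fetch('https://api.gpt.ge/v1/audio/transcriptions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer sk-xxxx',
    'X-Failover-Enabled': 'true'
  },
  body: formData
})
  .then(r => r.json())
  .then(console.log)
  .catch(console.error);
```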

Python (requests) Example

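```python
import requests

# Open the audio file in binary mode so requests can upload it as multipart/form-data.
with open('audio.wav', 'rb') as f:
    files = {'file': f}
    data = {
        'model': 'whisper-base',
        'language': 'zh'
    }
    response = requests.post(
        'https://api.gpt.ge/v1/audio/transcriptions',
        headers={
            'Authorization': 'Bearer sk-xxxx',
            'X-Failover-Enabled': 'true'
        },
        files=files,
        data=data
    )

# Raise an exception on 4xx/5xx responses instead of parsing an error body.
response.raise_for_status()
print(response.json())
```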

Response Example (200)

```json
{
  "text": "Hello, this is OpenAI Whisper Base transcription."
}
```
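A client typically needs only the `text` field from a successful response. The sketch below shows one way to extract it; `extract_text` is a hypothetical helper, and because this page does not document the error format, any non-200 status is simply surfaced together with the raw body.

```python
import json


def extract_text(status_code, body):
    """Return the transcript from a 200 response body, or raise on anything else."""
    if status_code != 200:
        # Assumption: the error format is undocumented here, so the raw body is reported.
        raise RuntimeError(f"transcription failed with HTTP {status_code}: {body}")
    payload = json.loads(body)
    if "text" not in payload:
        raise ValueError("response JSON has no 'text' field")
    return payload["text"]
```

With the requests library, this could be called as `extract_text(response.status_code, response.text)`.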

Note: whisper-base is an open-source automatic speech recognition (ASR) model. As one of the smaller Whisper variants, it processes audio faster than the larger ones, and it is suited to transcribing multilingual, accented, and noisy audio.