AI Transcriber Documentation
Overview
The AI Transcriber component in the Vantage platform transcribes audio data into text. It can process audio supplied either as direct URLs or as base64-encoded content from file uploads. Transcription is performed by the user's selected AI provider, which can be OpenAI Whisper, Gemini multimodal, or a designated fallback option. The results are stored in a new column in the output dataset.
Purpose
The primary purpose of the AI Transcriber is to facilitate audio data processing by extracting spoken content and converting it into written text. This is particularly useful for applications involving audio content analysis, automated captioning, content generation, and more.
Settings
The AI Transcriber has several configurable settings that dictate its behavior. Below is a detailed explanation of each setting:
| Setting Name | Input Type | Description | Default Value |
|---|---|---|---|
| audioColumn | String | The name of the column in the input dataset that contains the audio file URLs or base64 content. If this column does not exist or is misnamed, the transcriber returns an error for the affected rows. | audio_url |
| outputColumn | String | The name of the column in the output dataset that will hold the transcribed text. | transcript |
| batchSize | Numeric | The maximum number of rows processed in a single batch. Adjusting this can improve performance. Valid range: 1 to 10. | 5 |
Detailed Settings Explanation
- audioColumn
  - Input Type: String
  - Description: The name of the column in the input dataset that contains either audio URLs or base64-encoded audio content. If this column cannot be found in the dataset, the component outputs an error message for the affected rows, so the name must match exactly for transcription to succeed.
  - Default Value: audio_url
- outputColumn
  - Input Type: String
  - Description: The name of the new column where the transcribed text for each audio file is stored. If you change it, make sure the new name does not conflict with an existing column, to avoid overwriting valid data.
  - Default Value: transcript
- batchSize
  - Input Type: Numeric
  - Description: The number of rows processed at once. The batch size affects performance: a higher value may lead to faster processing but could also cause memory issues if the dataset is large. Valid range is 1–10.
  - Default Value: 5
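The defaults and constraints above can be expressed as a small validation sketch. This is only an illustration, not Vantage's implementation: the `TranscriberSettings` class and `check_output_conflict` helper are hypothetical names, and the sketch assumes that batchSize must be a whole number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriberSettings:
    """Hypothetical container mirroring the three documented settings."""

    audioColumn: str = "audio_url"
    outputColumn: str = "transcript"
    batchSize: int = 5

    def __post_init__(self) -> None:
        # Assumption: batchSize is a whole number (the docs only say "Numeric").
        if isinstance(self.batchSize, bool) or not isinstance(self.batchSize, int):
            raise TypeError("batchSize must be an integer")
        # The documented valid range is 1-10.
        if not 1 <= self.batchSize <= 10:
            raise ValueError("batchSize must be between 1 and 10")
        for name in ("audioColumn", "outputColumn"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be a non-empty column name")


def check_output_conflict(settings: TranscriberSettings, columns: list[str]) -> bool:
    """Return True if outputColumn would overwrite an existing input column."""
    return settings.outputColumn in columns
```

Checking for a column-name conflict before running the component is one way to follow the outputColumn advice about not overwriting existing data.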
How It Works
Upon execution, the AI Transcriber performs the following steps:
- Input Handling: It checks the user-defined configuration settings and unwraps the input data. If the incoming data format is valid, it processes it accordingly.
- Validation: If the incoming data does not include base64 content, the component checks that the specified audio column is present.
- AI Integration: The transcriber retrieves the preferred AI integration based on user context. If no integration is available, it outputs an error message.
- Batch Processing: The audio files are processed in batches determined by the batchSize setting. Each audio file undergoes an error check before being sent for transcription.
- Audio Processing: The component transcribes each audio item, whether it arrives as a URL or as base64 content, and appends the text to the output dataset.
- Cleanup: After transcription, heavy binary fields are stripped from the output to prevent bloating. Only the relevant transcribed text is returned in the final output.
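The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the component's actual code: the `transcribe` callable stands in for whichever AI provider is selected, and the `BINARY_FIELDS` names and the `error` key are hypothetical.

```python
from typing import Callable, Iterator

# Hypothetical names for heavy fields that a file upload node might attach.
BINARY_FIELDS = ("base64", "content")


def batches(rows: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def transcribe_rows(
    rows: list[dict],
    transcribe: Callable[[str], str],
    audio_column: str = "audio_url",
    output_column: str = "transcript",
    batch_size: int = 5,
) -> list[dict]:
    """Transcribe rows batch by batch and strip binary fields from the output."""
    output = []
    for batch in batches(rows, batch_size):
        for row in batch:
            # Cleanup: copy the row without heavy binary fields.
            result = {k: v for k, v in row.items() if k not in BINARY_FIELDS}
            # Prefer uploaded base64 content, otherwise use the audio column.
            source = row.get("base64") or row.get(audio_column)
            if not source:
                # Per-row error check before anything is sent for transcription.
                result[output_column] = None
                result["error"] = f"missing audio in column '{audio_column}'"
            else:
                try:
                    result[output_column] = transcribe(str(source))
                except Exception as exc:  # provider failures become row errors
                    result[output_column] = None
                    result["error"] = str(exc)
            output.append(result)
    return output
```

Recording failures per row, rather than aborting the whole run, matches the documented behavior of returning an error only for the affected rows.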
Data Expectations
The AI Transcriber expects the input data to contain either:
- URLs in the audioColumn which point to audio files in supported formats (e.g., .mp3, .wav, etc.).
- Base64-encoded audio content if integrated with a file upload node.
When audio is supplied as URLs, the input dataset must contain a column whose name matches the audioColumn setting.
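To check input data before running the component, a heuristic like the one below can tell the two accepted forms apart. It is a sketch only: `classify_audio_value` is a hypothetical helper, the handling of a `data:` URI prefix is an assumption, and a short plain word can also pass as valid base64, so a "base64" result does not prove the value is audio.

```python
import base64
import binascii
from urllib.parse import urlparse


def classify_audio_value(value: str) -> str:
    """Return "url", "base64", or "invalid" for a single cell value."""
    text = value.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Assumption: uploads may carry a data URI prefix such as
    # "data:audio/wav;base64,"; drop it before decoding.
    if text.startswith("data:") and "," in text:
        text = text.split(",", 1)[1]
    if not text:
        return "invalid"
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return "invalid"
    return "base64"
```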
AI Integrations
The AI Transcriber component routes the audio data through the preferred AI provider, which can be:
- OpenAI Whisper
- Gemini multimodal
- An alternative fallback option as defined in the integration setup.
This flexibility allows users to switch between different AI models depending on their needs, ensuring that they can choose the best model for their audio transcription tasks.
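Provider selection with a fallback can be pictured as in the sketch below. The provider identifiers and the preference order are assumptions made for illustration, since this document does not specify how Vantage names or ranks integrations. The error for the empty case mirrors the documented behavior when no integration is available.

```python
from typing import Optional

# Hypothetical identifiers and ordering; the real integration keys are not
# given in this documentation.
PREFERENCE_ORDER = ("openai_whisper", "gemini_multimodal", "fallback")


def pick_provider(available: set[str], preferred: Optional[str] = None) -> str:
    """Return the preferred provider if available, else the first one found."""
    if preferred is not None and preferred in available:
        return preferred
    for name in PREFERENCE_ORDER:
        if name in available:
            return name
    # No integration configured: surface an error instead of guessing.
    raise LookupError("no AI integration is available for transcription")
```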
Billing Impacts
Using the AI Transcriber incurs costs based on the selected AI provider and the volume of transcribed audio. In general, billing depends on:
- The number of audio files processed.
- The duration of audio being transcribed.
- Rate limits and pricing models of the chosen AI provider.
Users should consult the respective provider's pricing details and monitor their usage within Vantage to manage costs effectively.
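For a rough budget check, the arithmetic can be sketched as below, assuming a provider that bills per minute of audio. Not every provider does, and `rate_per_minute` is a placeholder you must take from the provider's current pricing page. No real price is implied here.

```python
def estimate_cost(durations_seconds: list[float], rate_per_minute: float) -> float:
    """Estimate cost as total audio minutes times a caller-supplied rate."""
    total_minutes = sum(durations_seconds) / 60
    return round(total_minutes * rate_per_minute, 4)
```

With three minutes of audio in total and a placeholder rate of 0.5 per minute, the estimate is 3 × 0.5 = 1.5 in the provider's billing currency.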
Use Cases & Examples
Use Cases
- Podcast Transcription: A media company that produces weekly podcasts can use the AI Transcriber to automatically convert their audio episodes into written transcripts, enhancing accessibility and SEO reach.
- Legal Document Drafting: Legal firms can record client interviews or depositions and transcribe them to create detailed, accurate written records that serve as foundational documents.
- Content Creation: Content creators can take audio interviews or discussions and turn them into blog posts or articles for their audiences, streamlining content generation processes.
Example Configuration
Use Case: Podcast Transcription
Scenario: A media company wants to transcribe their weekly podcast episodes stored as audio URLs in a dataset.
Configuration Data:
    {
      "audioColumn": "podcast_audio_url",
      "outputColumn": "podcast_transcript",
      "batchSize": 5
    }

- Description: In this configuration, the audio files are located in the column podcast_audio_url, and the resulting transcriptions will be stored in the podcast_transcript column. The transcriber is set to process up to 5 audio files at once, balancing efficiency and resource usage.
By following this configuration, the media company can seamlessly convert their audio content into text, allowing them to provide additional value to their audience and increase engagement through written content.
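As a quick sanity check on a configuration like this one, the sketch below loads the same JSON, works out how many batches a dataset would need, and lists rows that lack the audio column. The helpers `batch_count` and `missing_audio_column` are hypothetical and not part of Vantage.

```python
import json
import math

CONFIG_JSON = """
{
  "audioColumn": "podcast_audio_url",
  "outputColumn": "podcast_transcript",
  "batchSize": 5
}
"""

config = json.loads(CONFIG_JSON)


def batch_count(row_count: int, batch_size: int) -> int:
    """Number of batches needed when each holds at most batch_size rows."""
    return math.ceil(row_count / batch_size)


def missing_audio_column(rows: list[dict], audio_column: str) -> list[int]:
    """Return indexes of rows with no value in the configured audio column."""
    return [i for i, row in enumerate(rows) if not row.get(audio_column)]
```

For example, 12 episodes with a batchSize of 5 would be processed in three batches (5, 5, and 2 rows).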