Multimodal Format

Comprehensive multimodal content format that enables AI understanding of images, videos, audio, and interactive media. Provides rich metadata, AI analysis results, cross-modal relationships, and accessibility information for complete media comprehension.

4 Media Types · AI Analysis Ready · WCAG Accessible · Cross-Modal Links

Supported Media Types
Complete coverage of multimodal content with AI-optimized metadata

Images

Comprehensive image metadata with AI analysis, variants, attribution, and accessibility support

Multiple formats (JPEG, PNG, WebP, SVG)
Responsive variants
AI object detection
Color analysis
EXIF metadata
Geographic coordinates

Videos

Rich video content with quality variants, subtitles, chapters, and semantic analysis

Multiple qualities (240p-4K)
Subtitle tracks
Chapter markers
Transcript support
Thumbnail generation
Platform integration

Audio

Audio content with transcription, speaker identification, and sentiment analysis

Multiple formats (MP3, OGG, WAV)
Speech-to-text
Speaker diarization
Sentiment scoring
Topic detection
Sound classification

Interactive Visuals

3D models, VR/AR content, panoramas, and interactive media experiences

3D model support
360° panoramas
AR/VR content
Interactive tours
Product configurators
Virtual simulations

JSON Schema Definition
Core structure of the multimodal format component
multimodal-format.json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Standardized Multimodal Content Format",
  "description": "Reusable schema component for multimodal content across all entity types",
  "type": "object",
  "aimlVersion": "2.0.1",
  "schemaVersion": "2.0.1",
  "properties": {
    "visualContent": {
      "type": "object",
      "description": "Visual content and metadata",
      "properties": {
        "images": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "imageId": { "type": "string" },
              "url": { "type": "string", "format": "uri" },
              "alt": { "type": "string" },
              "width": { "type": "integer" },
              "height": { "type": "integer" },
              "assetRole": {
                "type": "string",
                "enum": ["primary", "secondary", "detail", "gallery", "background", "logo", "icon", "banner", "thumbnail"]
              }
            }
          }
        },
        "videos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "videoId": { "type": "string" },
              "url": { "type": "string", "format": "uri" },
              "title": { "type": "string" },
              "duration": { "type": "number" },
              "thumbnail": { "type": "string", "format": "uri" }
            }
          }
        }
      }
    },
    "audioContent": {
      "type": "object",
      "description": "Audio content and metadata",
      "properties": {
        "audioClips": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "audioId": { "type": "string" },
              "url": { "type": "string", "format": "uri" },
              "title": { "type": "string" },
              "duration": { "type": "number" },
              "transcript": { "type": "string" }
            }
          }
        }
      }
    }
  }
}
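Since entity schemas embed media data directly, a lightweight conformance check can be useful. Below is a minimal, stdlib-only Python sketch that checks a single image entry against the type and enum constraints from the schema above; `validate_image` is a hypothetical helper, not part of the format, and a real deployment would use a full JSON Schema validator instead.

```python
# Sketch: checking an image entry against the constraints declared in
# multimodal-format.json above. Hypothetical helper for illustration only.

ASSET_ROLES = {"primary", "secondary", "detail", "gallery",
               "background", "logo", "icon", "banner", "thumbnail"}

def validate_image(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry conforms."""
    problems = []
    # Type checks mirror the "type" keywords in the schema.
    for field, expected in [("imageId", str), ("url", str), ("alt", str),
                            ("width", int), ("height", int)]:
        if field in entry and not isinstance(entry[field], expected):
            problems.append(f"{field} must be {expected.__name__}")
    # Enum check mirrors the assetRole enum.
    role = entry.get("assetRole")
    if role is not None and role not in ASSET_ROLES:
        problems.append(f"unknown assetRole: {role!r}")
    return problems

image = {"imageId": "product-hero",
         "url": "https://techbazaar.com/assets/product-hero.jpg",
         "alt": "TechBazaar marketplace main interface",
         "width": 1200, "height": 800, "assetRole": "primary"}

print(validate_image(image))                   # → []
print(validate_image({"assetRole": "splash"})) # → ["unknown assetRole: 'splash'"]
```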

Usage in Entity Schemas
How multimodal content is implemented within schemas - data is included directly, not via $ref

Note: The multimodal format defines the structural standard, but entity schemas embed media data directly rather than referencing it via $ref.

Multimodal content included directly in entity schema
{
  "visualContent": {
    "images": [
      {
        "imageId": "product-hero",
        "url": "https://techbazaar.com/assets/product-hero.jpg",
        "alt": "TechBazaar marketplace main interface showing product categories",
        "width": 1200,
        "height": 800,
        "assetRole": "primary",
        "semanticDescription": "Modern e-commerce interface with clean navigation and featured products"
      }
    ],
    "videos": [
      {
        "videoId": "platform-demo",
        "url": "https://techbazaar.com/assets/demo.mp4",
        "title": "TechBazaar Platform Overview",
        "duration": 120,
        "thumbnail": "https://techbazaar.com/assets/demo-thumb.jpg"
      }
    ]
  }
}
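Because the format carries accessibility metadata alongside the media itself, a simple lint pass over an entity's visualContent can flag gaps before publishing. A hedged Python sketch: `accessibility_gaps` is a hypothetical helper, the field names follow the example above, and missing thumbnails stand in for a fuller WCAG audit.

```python
# Sketch: flagging media entries that are missing basic accessibility
# metadata. Hypothetical helper; not part of the format itself.

entity = {
    "visualContent": {
        "images": [
            {"imageId": "product-hero",
             "url": "https://techbazaar.com/assets/product-hero.jpg",
             "alt": "TechBazaar marketplace main interface",
             "assetRole": "primary"},
            {"imageId": "footer-logo",
             "url": "https://techbazaar.com/assets/logo.svg",
             "assetRole": "logo"},   # note: alt text missing
        ],
        "videos": [
            {"videoId": "platform-demo",
             "url": "https://techbazaar.com/assets/demo.mp4",
             "title": "TechBazaar Platform Overview",
             "duration": 120},       # note: thumbnail missing
        ],
    }
}

def accessibility_gaps(entity: dict) -> list[str]:
    """List media entries missing basic accessibility metadata."""
    gaps = []
    visual = entity.get("visualContent", {})
    for img in visual.get("images", []):
        if not img.get("alt"):
            gaps.append(f"image {img.get('imageId', '?')}: missing alt text")
    for vid in visual.get("videos", []):
        if not vid.get("thumbnail"):
            gaps.append(f"video {vid.get('videoId', '?')}: missing thumbnail")
    return gaps

print(accessibility_gaps(entity))
# → ['image footer-logo: missing alt text',
#    'video platform-demo: missing thumbnail']
```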

AI Analysis & Intelligence Features

AI-Powered Analysis

  • Object detection with confidence scores
  • Dominant color extraction
  • Image type classification
  • Style property analysis
  • Sentiment scoring
  • Text content extraction

Cross-Modal Relationships

  • Media relationship mapping
  • Time synchronization
  • Spatial coordinate mapping
  • Semantic element connections
  • Translation equivalents
  • Alternative representations

Accessibility Features

  • Alternative text generation
  • Extended descriptions
  • Audio descriptions
  • Closed captions
  • Sign language support
  • WCAG compliance tracking

Media Management

  • Quality variant handling
  • Attribution tracking
  • License management
  • Version control
  • Display prioritization
  • Device optimization

Asset Role Classification

Primary Assets

  • primary: Main content
  • hero: Hero image
  • logo: Brand logo
  • banner: Page banner

Supporting Assets

  • secondary: Supporting content
  • detail: Detail views
  • gallery: Gallery images
  • thumbnail: Preview images

Functional Assets

  • tutorial: How-to content
  • testimonial: Customer stories
  • demonstration: Product demos
  • background: Background media

Advanced Multimodal Features

Cross-Modal Relationships

Link related content across different media types with time and spatial synchronization.

Image captions ↔ Audio descriptions ↔ Video transcripts
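One plausible shape for such a link record, sketched in Python: the field names (`relationshipId`, `timeRange`, and so on) are assumptions, since the schema excerpt above does not define the cross-modal structure.

```python
# Hypothetical cross-modal link record: tying a video segment to its
# narration audio and the image shown during that time range.
link = {
    "relationshipId": "demo-intro",   # assumed field names throughout
    "source": {"mediaType": "video", "mediaId": "platform-demo"},
    "targets": [
        {"mediaType": "audio", "mediaId": "demo-narration"},
        {"mediaType": "image", "mediaId": "product-hero"},
    ],
    "timeRange": {"start": 0.0, "end": 12.5},  # seconds into the video
}

def targets_of(link: dict, media_type: str) -> list[str]:
    """Return the ids of linked media of a given type."""
    return [t["mediaId"] for t in link["targets"]
            if t["mediaType"] == media_type]

print(targets_of(link, "image"))  # → ['product-hero']
```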

AI Content Analysis

Automated content understanding with object detection, sentiment analysis, and style classification.

Objects, colors, emotions, topics, speakers, text extraction

Accessibility Integration

Complete accessibility support with WCAG compliance tracking and alternative content formats.

Alt text, captions, transcripts, audio descriptions, sign language

The multimodal format component enables comprehensive AI understanding of visual, audio, and interactive content, combining rich metadata with analysis capabilities.