Multimodal Format

Comprehensive multimodal content format that enables AI understanding of images, videos, audio, and interactive media. Provides rich metadata, AI analysis results, cross-modal relationships, and accessibility information for complete media comprehension.

4 Media Types · AI Analysis Ready · WCAG Accessible · Cross-Modal Links

Supported Media Types
Complete coverage of multimodal content with AI-optimized metadata

Images

Comprehensive image metadata with AI analysis, variants, attribution, and accessibility support

Multiple formats (JPEG, PNG, WebP, SVG)
Responsive variants
AI object detection
Color analysis
EXIF metadata
Geographic coordinates

Videos

Rich video content with quality variants, subtitles, chapters, and semantic analysis

Multiple qualities (240p-4K)
Subtitle tracks
Chapter markers
Transcript support
Thumbnail generation
Platform integration

Audio

Audio content with transcription, speaker identification, and sentiment analysis

Multiple formats (MP3, OGG, WAV)
Speech-to-text
Speaker diarization
Sentiment scoring
Topic detection
Sound classification

Interactive Visuals

3D models, VR/AR content, panoramas, and interactive media experiences

3D model support
360° panoramas
AR/VR content
Interactive tours
Product configurators
Virtual simulations

JSON Schema Definition
Core structure of the multimodal format component
multimodal-format.json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Standardized Multimodal Content Format",
  "description": "Reusable schema component for multimodal content across all entity types",
  "type": "object",
  "aimlVersion": "2.0.1",
  "schemaVersion": "2.0.1",
  "properties": {
    "visualContent": {
      "type": "object",
      "description": "Visual content and metadata",
      "properties": {
        "images": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "imageId": { "type": "string" },
              "url": { "type": "string", "format": "uri" },
              "alt": { "type": "string" },
              "width": { "type": "integer" },
              "height": { "type": "integer" },
              "assetRole": {
                "type": "string",
                "enum": ["primary", "secondary", "detail", "gallery", "background", "logo", "icon", "banner", "thumbnail"]
              }
            }
          }
        },
        "videos": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "videoId": { "type": "string" },
              "url": { "type": "string", "format": "uri" },
              "title": { "type": "string" },
              "duration": { "type": "number" },
              "thumbnail": { "type": "string", "format": "uri" }
            }
          }
        }
      }
    },
    "audioContent": {
      "type": "object",
      "description": "Audio content and metadata",
      "properties": {
        "audioClips": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "audioId": { "type": "string" },
              "url": { "type": "string", "format": "uri" },
              "title": { "type": "string" },
              "duration": { "type": "number" },
              "transcript": { "type": "string" }
            }
          }
        }
      }
    }
  }
}
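Since entity schemas embed media data directly, a lightweight conformance check can be useful. Below is a minimal, stdlib-only Python sketch that checks a single image entry against the type and enum constraints from the schema above; `validate_image` is a hypothetical helper, not part of the format, and a real deployment would use a full JSON Schema validator instead.

```python
# Sketch: checking an image entry against the constraints declared in
# multimodal-format.json above. Hypothetical helper for illustration only.

ASSET_ROLES = {"primary", "secondary", "detail", "gallery",
               "background", "logo", "icon", "banner", "thumbnail"}

def validate_image(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry conforms."""
    problems = []
    # Type checks mirror the "type" keywords in the schema.
    for field, expected in [("imageId", str), ("url", str), ("alt", str),
                            ("width", int), ("height", int)]:
        if field in entry and not isinstance(entry[field], expected):
            problems.append(f"{field} must be {expected.__name__}")
    # Enum check mirrors the assetRole enum.
    role = entry.get("assetRole")
    if role is not None and role not in ASSET_ROLES:
        problems.append(f"unknown assetRole: {role!r}")
    return problems

image = {"imageId": "product-hero",
         "url": "https://techbazaar.com/assets/product-hero.jpg",
         "alt": "TechBazaar marketplace main interface",
         "width": 1200, "height": 800, "assetRole": "primary"}

print(validate_image(image))                   # → []
print(validate_image({"assetRole": "splash"})) # → ["unknown assetRole: 'splash'"]
```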

Usage in Entity Schemas
How multimodal content is implemented within schemas - data is included directly, not via $ref

Note: The multimodal format defines the structural standard, but entity schemas embed media data directly rather than referencing it via $ref.

Multimodal content included directly in entity schema
{
  "visualContent": {
    "images": [
      {
        "imageId": "product-hero",
        "url": "https://techbazaar.com/assets/product-hero.jpg",
        "alt": "TechBazaar marketplace main interface showing product categories",
        "width": 1200,
        "height": 800,
        "assetRole": "primary",
        "semanticDescription": "Modern e-commerce interface with clean navigation and featured products"
      }
    ],
    "videos": [
      {
        "videoId": "platform-demo",
        "url": "https://techbazaar.com/assets/demo.mp4",
        "title": "TechBazaar Platform Overview",
        "duration": 120,
        "thumbnail": "https://techbazaar.com/assets/demo-thumb.jpg"
      }
    ]
  }
}
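Because the format carries accessibility metadata alongside the media itself, a simple lint pass over an entity's visualContent can flag gaps before publishing. A hedged Python sketch: `accessibility_gaps` is a hypothetical helper, the field names follow the example above, and missing thumbnails stand in for a fuller WCAG audit.

```python
# Sketch: flagging media entries that are missing basic accessibility
# metadata. Hypothetical helper; not part of the format itself.

entity = {
    "visualContent": {
        "images": [
            {"imageId": "product-hero",
             "url": "https://techbazaar.com/assets/product-hero.jpg",
             "alt": "TechBazaar marketplace main interface",
             "assetRole": "primary"},
            {"imageId": "footer-logo",
             "url": "https://techbazaar.com/assets/logo.svg",
             "assetRole": "logo"},   # note: alt text missing
        ],
        "videos": [
            {"videoId": "platform-demo",
             "url": "https://techbazaar.com/assets/demo.mp4",
             "title": "TechBazaar Platform Overview",
             "duration": 120},       # note: thumbnail missing
        ],
    }
}

def accessibility_gaps(entity: dict) -> list[str]:
    """List media entries missing basic accessibility metadata."""
    gaps = []
    visual = entity.get("visualContent", {})
    for img in visual.get("images", []):
        if not img.get("alt"):
            gaps.append(f"image {img.get('imageId', '?')}: missing alt text")
    for vid in visual.get("videos", []):
        if not vid.get("thumbnail"):
            gaps.append(f"video {vid.get('videoId', '?')}: missing thumbnail")
    return gaps

print(accessibility_gaps(entity))
# → ['image footer-logo: missing alt text',
#    'video platform-demo: missing thumbnail']
```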

AI Analysis & Intelligence Features

AI-Powered Analysis

  • Object detection with confidence scores
  • Dominant color extraction
  • Image type classification
  • Style property analysis
  • Sentiment scoring
  • Text content extraction

Cross-Modal Relationships

  • Media relationship mapping
  • Time synchronization
  • Spatial coordinate mapping
  • Semantic element connections
  • Translation equivalents
  • Alternative representations

Accessibility Features

  • Alternative text generation
  • Extended descriptions
  • Audio descriptions
  • Closed captions
  • Sign language support
  • WCAG compliance tracking

Media Management

  • Quality variant handling
  • Attribution tracking
  • License management
  • Version control
  • Display prioritization
  • Device optimization

Asset Role Classification

Primary Assets

  • primary: Main content
  • hero: Hero image
  • logo: Brand logo
  • banner: Page banner

Supporting Assets

  • secondary: Supporting content
  • detail: Detail views
  • gallery: Gallery images
  • thumbnail: Preview images

Functional Assets

  • tutorial: How-to content
  • testimonial: Customer stories
  • demonstration: Product demos
  • background: Background media

Advanced Multimodal Features

Cross-Modal Relationships

Link related content across different media types with time and spatial synchronization.

Image captions ↔ Audio descriptions ↔ Video transcripts
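One plausible shape for such a link record, sketched in Python: the field names (`relationshipId`, `timeRange`, and so on) are assumptions, since the schema excerpt above does not define the cross-modal structure.

```python
# Hypothetical cross-modal link record: tying a video segment to its
# narration audio and the image shown during that time range.
link = {
    "relationshipId": "demo-intro",   # assumed field names throughout
    "source": {"mediaType": "video", "mediaId": "platform-demo"},
    "targets": [
        {"mediaType": "audio", "mediaId": "demo-narration"},
        {"mediaType": "image", "mediaId": "product-hero"},
    ],
    "timeRange": {"start": 0.0, "end": 12.5},  # seconds into the video
}

def targets_of(link: dict, media_type: str) -> list[str]:
    """Return the ids of linked media of a given type."""
    return [t["mediaId"] for t in link["targets"]
            if t["mediaType"] == media_type]

print(targets_of(link, "image"))  # → ['product-hero']
```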

AI Content Analysis

Automated content understanding with object detection, sentiment analysis, and style classification.

Objects, colors, emotions, topics, speakers, text extraction

Accessibility Integration

Complete accessibility support with WCAG compliance tracking and alternative content formats.

Alt text, captions, transcripts, audio descriptions, sign language

The multimodal format component enables comprehensive AI understanding of visual, audio, and interactive content, combining rich metadata with analysis capabilities.