Multimodal Large Language Models for Human-AI Interaction

Foundations, Agents, and Inclusive Applications

Half-day tutorial at EACL 2026

Overview

Multimodal large language models (MLLMs) are redefining how humans communicate and collaborate with machines. They extend the capabilities of text-based LLMs to perceive, reason, and act across text, images, charts, forms, and graphical user interfaces (GUIs). These models are now capable of answering questions about charts, summarizing infographics, operating software through natural language, and supporting multilingual and accessible visualization.

This tutorial offers a concise, three-hour introduction to the foundations, agentic capabilities, and inclusive applications of MLLMs, with a focus on visually grounded and interactive language tasks. We will cover core architectural designs (encoders, connectors, fusion and decoding mechanisms), multimodal alignment and learning strategies, and reasoning techniques for structured visuals such as charts, forms, and infographics. The tutorial then examines multimodal and conversational agents that perform dialogue-driven reasoning and co-creative analysis in graphical user interfaces. We conclude with discussions on accessibility, multilingual communication, responsible deployment, and future challenges in building human-centered multimodal AI.

Schedule & Materials

Part I: Foundations of Multimodal Large Language Models [45 mins]
  • From Text to Multimodality: evolution of LLMs from text-only to multimodal reasoning.
  • Architectural Principles:
    • Vision encoders (ViT, Swin, CLIP, SigLIP) and language decoders (Llama, Qwen).
    • Connector modules for cross-modal fusion (linear projections, adapters, gated tokens); a minimal connector sketch appears at the end of this part.
    • Unified multimodal architectures: Flamingo, Kosmos, LLaVA, PaLI, Qwen-VL.
  • Alignment and Reinforcement Learning:
    • Instruction tuning and multimodal alignment.
    • Reinforcement learning for grounding and reward optimization: RLHF, GRPO, and visual reward shaping.
    • Case studies: ChartGemma (visual instruction tuning), BigCharts-R1 (RL for charts).
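To make the connector idea concrete, here is a minimal sketch of one common design, a LLaVA-style linear projection that maps vision-encoder patch features into the language model's embedding space. The dimensions, tensor shapes, and random inputs are illustrative placeholders, not values from any specific model covered in the tutorial.

```python
# Minimal sketch of a linear connector: vision-encoder patch features are
# projected into the language model's embedding space and prepended to the
# text token embeddings. All shapes below are illustrative assumptions.
import torch
import torch.nn as nn

class LinearConnector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: [batch, num_patches, vision_dim]
        return self.proj(patch_features)  # [batch, num_patches, llm_dim]

# Toy usage with random tensors standing in for real encoder/decoder outputs.
vision_dim, llm_dim = 1024, 4096
connector = LinearConnector(vision_dim, llm_dim)
patches = torch.randn(1, 576, vision_dim)   # e.g. patch features from a ViT
text_embeds = torch.randn(1, 32, llm_dim)   # embedded text prompt tokens
visual_tokens = connector(patches)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the decoder
print(llm_inputs.shape)  # torch.Size([1, 608, 4096])
```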
Part II: Reasoning Across Visual and Structured Modalities [45 mins]
  • Datasets and Benchmarks:
    • Visualization datasets (FigureQA, CharXiv, ChartQA, ChartQAPro, DashboardQA).
    • Generative and bidirectional corpora (Chart-to-Text, Text2Vis, DataNarrative, and InfographicVQA).
    • Document and GUI datasets (DocVQA, FormNet, ScreenAI).
  • Reasoning Techniques: multimodal chain-of-thought, ReAct prompting, and structured decoding for spatial/tabular data; a prompt sketch appears at the end of this part.
  • Vision-Language Alignment: challenges of mapping visual features to linguistic embedding spaces; case study: AlignVLM.
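As a concrete illustration of the reasoning techniques above, the following is a hedged sketch of a ReAct-style prompt for chart question answering. The Thought/Action/Observation format, the tool names, and the build_react_prompt helper are illustrative assumptions, not the interface of any particular model or benchmark discussed in the tutorial.

```python
# Sketch of a ReAct-style prompt for chart question answering.
# Tool names and the helper function are hypothetical placeholders.
REACT_TEMPLATE = """You are answering a question about the attached chart.
Interleave your reasoning with tool calls in the following format:
Thought: <what you need to find out next>
Action: <one of: read_axis, lookup_value, compute>
Observation: <result of the action>
...
Final Answer: <short answer grounded in the chart>

Question: {question}
"""

def build_react_prompt(question: str) -> str:
    """Fill the template with the user's question about the chart image."""
    return REACT_TEMPLATE.format(question=question)

if __name__ == "__main__":
    prompt = build_react_prompt(
        "By how much did revenue grow between 2021 and 2023?"
    )
    print(prompt)  # this text plus the chart image would be sent to the MLLM
```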
Part III: Human-AI Interaction and Multimodal Agents [45 mins]
  • Emergence of Multimodal Agents: from passive perception to active reasoning, tool use, and environment manipulation (WebArena, Mind2Web, OSWorld, AndroidWorld, UI-TARS, OpenCUA); a toy agent-loop sketch appears at the end of this part.
  • GUI Grounding & Interactive Data Analysis: natural-language control of UIs and visualizations; case study: DashboardQA.
  • Co-Creation and Collaboration: agentic frameworks for visualizations and interactive data analysis; case study: DataNarrative.
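To ground the agent discussion, here is a toy observe-think-act loop of the kind GUI agents run. The ToyEnvironment, Action class, and propose_action stub are hypothetical stand-ins; real systems evaluated on benchmarks such as WebArena or OSWorld replace them with a browser/OS controller and an MLLM policy.

```python
# Toy sketch of a GUI-agent observe-think-act loop. Everything here is a
# placeholder: a real agent would ground actions in screenshots via an MLLM.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "stop"
    target: str = ""   # element description or grounded coordinates
    text: str = ""     # text to type, if any

class ToyEnvironment:
    """Stand-in environment: one button that completes the task when clicked."""
    def __init__(self):
        self.done = False
    def observe(self) -> str:
        return "screenshot: a dashboard with a button labeled 'Export CSV'"
    def step(self, action: Action) -> None:
        if action.kind == "click" and "Export" in action.target:
            self.done = True

def propose_action(observation: str, goal: str) -> Action:
    # Placeholder policy; an MLLM would reason over the screenshot and goal here.
    if "Export CSV" in observation:
        return Action(kind="click", target="button 'Export CSV'")
    return Action(kind="stop")

def run_agent(goal: str, max_steps: int = 5) -> bool:
    env = ToyEnvironment()
    for _ in range(max_steps):
        action = propose_action(env.observe(), goal)
        if action.kind == "stop":
            break
        env.step(action)
        if env.done:
            return True
    return env.done

print(run_agent("Download the dashboard data as CSV"))  # True
```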

Coffee Break

Part IV: Inclusive & Responsible MLLMs [20 mins]
  • Accessibility: alt-text, captions, sonification, and assistive systems (SeeChart, VizAbility).
  • Multilingual Narration: cross-cultural and low-resource multimodal content understanding.
  • Responsible AI Practices: fairness, bias, hallucination, and visual deception.
Part V: Future Challenges & Outlook [30 mins]
  • Toward unified agentic frameworks combining perception, reasoning, and interaction.
  • Expanding evaluation standards for factuality, explainability, accessibility, and inclusivity.
  • Human-in-the-loop and reinforcement-based grounding for reliable multimodal reasoning.

Slides

Slides will be made available here.

Organizers

Shafiq Joty

Salesforce Research & NTU Singapore

Enamul Hoque

York University

Ahmed Masry

York University

Spandana Gella

ServiceNow Research

Samira Ebrahimi Kahou

University of Calgary & Mila