Multimodal Large Language Models for Human-AI Interaction
Foundations, Agents, and Inclusive Applications
Half-day tutorial at EACL 2026
Overview
Multimodal large language models (MLLMs) are redefining how humans communicate and collaborate with machines. They extend the capabilities of text-based LLMs to perceive, reason, and act across text, images, charts, forms, and graphical user interfaces (GUIs). These models are now capable of answering questions about charts, summarizing infographics, operating software through natural language, and supporting multilingual and accessible visualization.
This tutorial offers a concise, three-hour introduction to the foundations, agentic capabilities, and inclusive applications of MLLMs, with a focus on visually grounded and interactive language tasks. We will cover core architectural designs (encoders, connectors, fusion and decoding mechanisms), multimodal alignment and learning strategies, and reasoning techniques for structured visuals such as charts, forms, and infographics. The tutorial then examines multimodal and conversational agents that perform dialogue-driven reasoning and co-creative analysis in graphical user interfaces. We conclude with discussions on accessibility, multilingual communication, responsible deployment, and future challenges in building human-centered multimodal AI.
Schedule & Materials
- From Text to Multimodality: evolution of LLMs from text-only to multimodal reasoning.
- Architectural Principles:
- Vision encoders (ViT, Swin, CLIP, SigLIP) and language decoders (Llama, Qwen).
- Connector modules for cross-modal fusion (linear projections, adapters, gated tokens); see the connector sketch after this list.
- Unified multimodal architectures: Flamingo, Kosmos, LLaVA, PaLI, Qwen-VL.
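To make the connector idea concrete, below is a minimal PyTorch sketch of an MLP-style projection that maps frozen vision-encoder patch features into the language model's embedding space, in the spirit of LLaVA-style connectors. The module name, dimensions, and patch counts are illustrative assumptions, not any specific model's implementation.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Project vision-encoder patch features into the LLM embedding space.

    Dimensions are illustrative: 1024-d ViT patch features -> 4096-d LLM tokens.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns visual "tokens" that can be prepended to text embeddings
        return self.proj(patch_features)

# Usage: fuse visual tokens with text embeddings before the LLM decoder.
connector = MLPConnector()
visual_tokens = connector(torch.randn(1, 576, 1024))       # e.g. a 24x24 patch grid
text_embeds = torch.randn(1, 32, 4096)                      # embedded prompt tokens
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)  # (1, 608, 4096)
```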
- Alignment and Reinforcement Learning:
- Instruction tuning and multimodal alignment.
- Reinforcement learning for grounding and reward optimization: RLHF, GRPO, and visual reward shaping (a GRPO advantage sketch follows this list).
- Case studies: ChartGemma (visual instruction tuning), BigCharts-R1 (RL for charts).
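For the RL portion, the snippet below sketches the group-relative advantage computation at the heart of GRPO: several responses are sampled per prompt, scored by a reward function (for example, whether a chart answer matches the gold label), and each reward is normalized against its group's mean and standard deviation. The reward values here are made-up placeholders.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean/std of its own group (responses to the same prompt).

    rewards: (num_prompts, group_size) scalar rewards, e.g. 1.0 if the
    model's chart answer matches the gold label, else 0.0.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each (placeholder rewards).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = group_relative_advantages(rewards)
# Positive advantages upweight responses that beat their group average in the
# clipped policy-gradient objective; negative advantages downweight them.
```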
- Datasets and Benchmarks:
- Visualization datasets (FigureQA, CharXiv, ChartQA, ChartQAPro, DashboardQA); an evaluation-metric sketch follows this list.
- Generative and bidirectional corpora (Chart-to-Text, Text2Vis, DataNarrative, and InfographicVQA).
- Document and GUI datasets (DocVQA, FormNet, ScreenAI).
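Several chart QA benchmarks score numeric answers with a relaxed-accuracy criterion, where a prediction counts as correct if it falls within a small relative tolerance of the gold value and string answers must match exactly. The helper below is a simplified sketch of that idea with an assumed 5% tolerance; each benchmark's official evaluation script remains the authoritative reference.

```python
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Relaxed-accuracy check in the style of chart QA benchmarks:
    numeric answers may deviate by up to `tolerance` (relative error),
    non-numeric answers must match exactly (case-insensitive)."""
    try:
        pred_val = float(prediction.strip().rstrip("%"))
        target_val = float(target.strip().rstrip("%"))
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()
    if target_val == 0:
        return pred_val == 0
    return abs(pred_val - target_val) / abs(target_val) <= tolerance

# Example usage on placeholder predictions.
assert relaxed_match("42.5", "42")       # within 5% of 42
assert not relaxed_match("50", "42")     # off by roughly 19%
assert relaxed_match("Canada", "canada") # exact string match, ignoring case
```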
- Reasoning Techniques: multimodal chain-of-thought, ReAct prompting, and structured decoding for spatial/tabular data (a ReAct loop sketch follows).
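As a preview of ReAct-style prompting, here is a minimal, model-agnostic loop in which the model alternates Thought and Action steps, a tool executes each action (for example, looking up a value in a chart's underlying table), and the observation is fed back until a final answer appears. The call_mllm function, the action format, and the tool set are hypothetical placeholders rather than a real API.

```python
import re

def call_mllm(prompt: str, image: bytes) -> str:
    """Hypothetical placeholder for a multimodal model call."""
    raise NotImplementedError

def react_loop(question: str, image: bytes, tools: dict, max_steps: int = 5) -> str:
    """Minimal ReAct loop: Thought -> Action -> Observation, repeated."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = call_mllm(transcript + "Thought:", image)
        transcript += f"Thought: {output}\n"
        # Expect actions of the (assumed) form: Action: tool_name[argument]
        match = re.search(r"Action:\s*(\w+)\[(.*?)\]", output)
        if not match:
            break  # the model produced a final answer instead of an action
        tool_name, arg = match.groups()
        observation = tools.get(tool_name, lambda _: "unknown tool")(arg)
        transcript += f"Observation: {observation}\n"
    return transcript

# Example tool set (placeholder): look up a series value in the chart's data table.
tools = {"lookup": lambda series: f"value of {series} = 3.2"}
```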
- Vision-Language Alignment: challenges of mapping visual features to linguistic embedding spaces; case study: AlignVLM (see the alignment sketch below).
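One way to keep projected visual features inside the language model's embedding space, simplified here in the spirit of connectors such as AlignVLM, is to predict a softmax distribution over the LLM's vocabulary embeddings and take their convex combination, so every visual token lies within the span of existing text embeddings. The sketch below is illustrative only; the dimensions and module names are assumptions, not a faithful reimplementation.

```python
import torch
import torch.nn as nn

class ConvexVocabConnector(nn.Module):
    """Map visual features to convex combinations of the LLM's vocabulary
    embeddings, so projected tokens stay inside the text embedding space.
    A simplified, illustrative sketch, not a faithful AlignVLM reimplementation."""

    def __init__(self, vision_dim: int, vocab_embeddings: torch.Tensor):
        super().__init__()
        vocab_size, llm_dim = vocab_embeddings.shape
        self.scorer = nn.Linear(vision_dim, vocab_size)
        # Frozen copy of the LLM's input embedding table.
        self.register_buffer("vocab_embeddings", vocab_embeddings)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        weights = self.scorer(patch_features).softmax(dim=-1)  # over the vocabulary
        return weights @ self.vocab_embeddings                 # (batch, patches, llm_dim)

# Toy usage with a random stand-in for an LLM embedding table.
vocab = torch.randn(1_000, 256)
connector = ConvexVocabConnector(vision_dim=64, vocab_embeddings=vocab)
visual_tokens = connector(torch.randn(2, 16, 64))  # (2, 16, 256)
```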
- Emergence of Multimodal Agents: from passive perception to active reasoning, tool use, and environment manipulation (WebArena, Mind2Web, OSWorld, AndroidWorld, UI-TARS, OpenCUA).
- GUI Grounding & Interactive Data Analysis: natural-language control of UIs and visualizations; case study: DashboardQA (an action-parsing sketch follows).
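GUI-grounded agents typically emit actions as structured text that a controller parses and executes against the interface. The snippet below sketches a parser for a toy action format, click(x, y) and type("text"); the format and the Action dataclass are assumptions for illustration, not the output schema of any particular benchmark or agent.

```python
import re
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click" or "type"
    x: int | None = None
    y: int | None = None
    text: str | None = None

def parse_action(model_output: str) -> Action:
    """Parse a toy action string emitted by a GUI agent into a structured Action."""
    click = re.search(r"click\((\d+),\s*(\d+)\)", model_output)
    if click:
        return Action(kind="click", x=int(click.group(1)), y=int(click.group(2)))
    typed = re.search(r'type\("([^"]*)"\)', model_output)
    if typed:
        return Action(kind="type", text=typed.group(1))
    raise ValueError(f"unrecognized action: {model_output!r}")

# Example: a parsed action would then be routed to a (hypothetical) UI controller.
action = parse_action('click(412, 87)')
assert action.kind == "click" and (action.x, action.y) == (412, 87)
```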
- Co-Creation and Collaboration: agentic frameworks for visualizations and interactive data analysis; case study: DataNarrative.
Coffee Break
- Accessibility: alt-text, captions, sonification, and assistive systems (SeeChart, VizAbility); a sonification sketch follows.
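As one concrete accessibility technique, sonification maps data values to sound so that a trend can be heard rather than seen. The sketch below turns a small data series into a sequence of sine tones whose pitch rises with the value, written to a WAV file with Python's standard library; the frequency range and note duration are arbitrary choices.

```python
import math
import struct
import wave

def sonify(values, path="series.wav", rate=44100, note_sec=0.3,
           f_low=220.0, f_high=880.0):
    """Map each data value to a sine tone: higher values -> higher pitch."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    frames = bytearray()
    for v in values:
        freq = f_low + (v - lo) / span * (f_high - f_low)
        for i in range(int(rate * note_sec)):
            sample = math.sin(2 * math.pi * freq * i / rate)
            frames += struct.pack("<h", int(sample * 32767 * 0.5))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))

# Example: an upward trend should be heard as rising pitch.
sonify([3, 5, 4, 8, 12, 15])
```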
- Multilingual Narration: cross-cultural and low-resource multimodal content understanding.
- Responsible AI Practices: fairness, bias, hallucination, and visual deception.
- Toward unified agentic frameworks combining perception, reasoning, and interaction.
- Expanding evaluation standards for factuality, explainability, accessibility, and inclusivity.
- Human-in-the-loop and reinforcement-based grounding for reliable multimodal reasoning.
Slides
Slides will be made available here.