User Interfaces In the beginning: Wimpy Computing

Multimodal Rooms: “Smart Rooms”, “Intelligent Environments” (Seminar, summer semester 2003)

User Interfaces In the beginning: Wimpy Computing
Windows, Icons, Menus, Pointing

2nd Generation: Human-Machine Interaction
Speaking
Pointing, gesturing
Handwriting
Drawing
Presence / focus of attention
Combination: speech + handwriting + gestures
Repair
Multimodal NLP & dialog: “Please show me… hm… all hotels in THIS area… er… part of the city”

“Perceptual” User Interfaces
Perceptive: human-like perceptual capabilities (what is the user saying, who is the user, where is the user, what is he doing?)
Multimodal: people use multiple modalities to communicate (speech, gestures, facial expressions, …)
Multimedia: text, graphics, audio and video
(Matthew Turk (Ed.), Proceedings of the 1998 Workshop on Perceptual User Interfaces)

Next: Pervasive Computing
Human-computer interaction is not the only exchange: humans want to interact with other humans
Computers in the Human Interaction Loop (CHIL)
The transparent, invisible computer:
Computers need to be context aware
Should require little or no learning or attention
Should be proactive rather than command driven
Should produce little or no distraction
Should permit a mix of HCI and CHIL

Smart/Intelligent Rooms
Use of computation to enhance everyday activity
Integrate computers seamlessly into the real world (e.g. offices, homes)
Use “natural” interfaces for communication (voice, gesture, etc.)
The computer should adapt to the human, not vice versa!

Perception
In order to respond appropriately, objects and rooms need to pay attention to people and context
Machines have to be aware of their environment: who, what, when, where and why?
Interfaces must be adaptive to the overall situation and to the individual user

Intelligent Environments
Classroom 2000 (Georgia Tech)
Mozer’s Adaptive House
Enhanced Meeting Rooms
Kids Room (MIT)
…
Enhanced objects such as whiteboards, desks, chairs, …
See also the Intelligent Environments Resource Page

Intelligent Rooms, Univ. California, San Diego

Classroom 2000
Capturing activity in a classroom: speaker’s voice, video, slides, handwritten notes

Classroom 2000
Presenting (recorded) lectures through a web-based interface
Integration of slides, notes, audio, video
Searching
Adding additional material

Microsoft Easy Living Project
XML-based distributed agent system.
Computer vision for person tracking and visual user interaction.
Multiple sensor modalities combined.
Use of a geometric model of the world to provide context.
Automatic or semi-automatic sensor calibration and model building.
Fine-grained events and adaptation of the user interface.
Device-independent communication and data protocols.
Ability to extend the system in many ways.

Mozer’s Adaptive House
Operated as an ordinary home, with the usual light switches, thermostats, doors, etc.
Adjustments are measured and used to train the house to automatically adjust temperature, adjust lighting, and choose music or TV channel
The house infers the users’ desires from their actions and behaviours

Adaptive House (Mozer)
Sensors: light level, sound level, temperature, motion, door status, window status, light settings, fan, heaters, …
(M. Mozer, Univ. of Colorado, Boulder)

Issues in Perception
Visual: face detection / tracking, body tracking, face recognition, gesture recognition, action recognition, gaze tracking / tracking focus of attention
Auditory: speech recognition, speaker tracking, auditory scene analysis, speaker identification
Other: haptic, olfactory, … ?

Enhanced Meeting Rooms
Capturing of meetings
Transcription
Summarization
Dialog processing
Who was there? Who talked to whom?

Work at ISL
Face tracking
Facial feature tracking (eyes, nose, mouth)
Head pose estimation / gaze tracking
Lip-reading (audio-visual speech recognition)
3D person tracking
Pointing gesture tracking
Other modalities: speech (!!!, see John), dialogue, translation, handwriting, ...

Tracking of Human Faces
A face provides different functions: identification, perception of emotional expressions
Human-computer interaction requires tracking of faces: lip-reading, eye/gaze tracking, facial action analysis / synthesis
Video conferencing / video telephony applications: tracking the speaker, achieving low bit rate transmission

Demo: FaceTracker

Color Based Face Tracking
Human skin colors cluster in a small area of a color space
Skin colors of different people mainly differ in intensity!
Variance can be reduced by color normalization
Distribution can be characterized by a Gaussian model
Chromatic colors:
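The color normalization and the Gaussian skin-color model can be sketched as follows. This is only a minimal illustration, assuming the usual definition of chromatic colors, r = R / (R + G + B) and g = G / (R + G + B). The function names, the sample pixels and the small regularization term are hypothetical and are not taken from the FaceTracker itself.

```python
import math

def chromatic(rgb):
    """Map an (R, G, B) pixel to chromatic colors (r, g); intensity is divided out."""
    R, G, B = rgb
    total = R + G + B
    if total == 0:
        return (1 / 3, 1 / 3)  # assumption: treat a black pixel as neutral
    return (R / total, G / total)

def fit_gaussian(samples, eps=1e-6):
    """Estimate mean and covariance (crr, crg, cgg) of a list of (r, g) samples.

    eps is a hypothetical regularizer that keeps the covariance invertible.
    """
    n = len(samples)
    mr = sum(s[0] for s in samples) / n
    mg = sum(s[1] for s in samples) / n
    crr = sum((s[0] - mr) ** 2 for s in samples) / n + eps
    cgg = sum((s[1] - mg) ** 2 for s in samples) / n + eps
    crg = sum((s[0] - mr) * (s[1] - mg) for s in samples) / n
    return (mr, mg), (crr, crg, cgg)

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of (r, g) to the model; small means skin-like."""
    crr, crg, cgg = cov
    det = crr * cgg - crg * crg
    dr, dg = x[0] - mean[0], x[1] - mean[1]
    # Uses the closed-form inverse of the 2x2 covariance matrix.
    return (cgg * dr * dr - 2 * crg * dr * dg + crr * dg * dg) / det

# Hypothetical skin samples of different brightness.
skin_pixels = [(200, 140, 120), (180, 120, 100), (220, 160, 140), (150, 100, 90)]
mean, cov = fit_gaussian([chromatic(p) for p in skin_pixels])
```

A pixel would then be labeled as skin when its distance falls below a threshold; adapting the model means re-estimating the mean and covariance from recently tracked face pixels.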

Color Model
Advantages: very fast; orientation invariant; stable object representation; not person-dependent; model parameters can be quickly adapted
Disadvantages: environment dependent (light sources heavily affect the color distribution)

Tracking Gaze and Focus of Attention
In meetings: to determine the addressee of a speech act; to track the participants’ attention; to analyse who was in the center of focus; for meeting indexing / retrieval
Interactive rooms: to guide the environment’s focus to the right application; to suppress unwanted responses
Virtual collaborative workspaces (CSCW)
Human-robot cooperation
Cars (driver monitoring)

Tracking a User’s Focus of Attention
Focus of attention tracking: to detect a person’s interest; to know what a user is interacting with; to understand his actions/intentions; to know whether a user is aware of something
In meetings: to determine the addressee of a speech act; to understand the dynamics of interaction; for meeting indexing / retrieval
Other areas: smart environments, video conferencing, human-robot interaction

Head Pose Estimation
Model-based approaches: locate and track a number of facial features; compute head pose from 2D to 3D correspondences (Gee & Cipolla '94, Stiefelhagen et al. '96, Jebara & Pentland '97, Toyama '98)
Example-based approaches: estimate new pose with a function approximator such as an ANN (Beymer et al. '94, Schiele & Waibel '95, Rae & Ritter '98); use a face database to encode images (Pentland et al. '94)

Model-based Head Pose estimation
Find correspondences between points in a 3D model and points in the image
Iteratively solve a linear equation system to find the pose parameters (rx, ry, rz, tx, ty, tz)
[Figure: 3D model with X, Y, Z axes; real world, image, feature tracking, pose estimation]

Demo: Facial Feature Tracking

Demo: Model-based Head Pose

Model-based Head Pose
Pose estimation accuracy depends on correct feature localization!
Problems: choice of good features; occlusion due to strong head rotation; fast head movement; detection of tracking failure / re-initialization; requires good image resolution
Video

Estimating Head Pose with ANNs
Train neural network to estimate head orientation Preprocessed image of the face used as input

Network Architecture
Output: pan (tilt)
Hidden layer: 40 to 150 units
Input retina: up to 3 x 20x30 pixels = 1,800 units

Tracking People in a Panoramic View
[Figure: camera view, perspective view, panoramic view]

Training
Separate nets for pan and tilt
Trained with standard backpropagation with momentum term
Datasets: training on 6100 images from 12 users; cross-evaluation on 750 images from the same users; testing on 750 images from the same users
Additional user-independent test set: 1500 images from two new users
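The momentum term mentioned above can be illustrated with the textbook weight update, in which each step adds a fraction of the previous step to the current gradient step. This is a generic sketch, not the training code used for the pose networks; the learning rate and momentum values are hypothetical.

```python
def momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One gradient-descent step with momentum on flat lists of floats.

    velocity holds the previous weight change; lr and momentum are
    illustrative values only.
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Two steps with a constant gradient: the second step is larger than the first
# because the previous change is carried over.
w, v = momentum_step([1.0], [2.0], [0.0], lr=0.1, momentum=0.9)
w2, v2 = momentum_step(w, [2.0], v, lr=0.1, momentum=0.9)
```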

Results
histo: histogram-normalized image used as input
edges: horizontal and vertical edge images used as input
both: histogram-normalized image plus edge images used as input

Spatial-Awareness in Smart Rooms
Motivation: tracking people indoors
To focus sensors on people
To resolve spatial relationships
To avoid bumping into humans
To analyze activity

Person Tracking
Vision-based localization of people/objects:
Single perspective: Pfinder, W3S, Hydra, etc.
Multiple perspectives: AVIARY, Easy Living

Person Tracking in the ISL Smart Room
[Figure: Cam0 to Cam3, feature extractors, features, tracking agent, people]

Person Tracking with Multiple Cameras
Goal: 3D tracking of people in rooms
Segmentation of foreground objects in each image
“3D intersection” of the rays through the object centers
Kalman filter
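Because rays from two cameras rarely intersect exactly, a common way to realize such a "3D intersection" is to take the midpoint of the shortest segment between the two rays. The sketch below shows that geometric step only; it is an assumption about how the intersection could be computed, and the camera positions and directions are made-up values.

```python
def ray_intersection_3d(p1, d1, p2, d2, eps=1e-12):
    """Midpoint of the shortest segment between two 3D lines.

    p1, p2: camera centers; d1, d2: ray directions through the object centers.
    Returns None when the rays are (nearly) parallel.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w = [a - b for a, b in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    q1 = [p + s * k for p, k in zip(p1, d1)]
    q2 = [p + t * k for p, k in zip(p2, d2)]
    return tuple((x + y) / 2 for x, y in zip(q1, q2))

# Hypothetical setup: two cameras 4 m apart, both seeing the same object center.
point = ray_intersection_3d((0, 0, 0), (2, 2, 1), (4, 0, 0), (-2, 2, 1))
```

The resulting 3D point would then serve as the measurement that a Kalman filter smooths over time.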

Adaptive Silhouette Extraction
Background subtraction: adaptive multi-Gaussian background model [Stauffer et al., CVPR 1998]
Morphological operators smooth the foreground output
Connected components form silhouettes
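The idea of an adaptive background model can be shown with a deliberately simplified sketch that keeps a single Gaussian per pixel rather than the mixture of Gaussians named above. Pixels far from the background mean are marked as foreground, and background pixels slowly update the model. All parameter values are hypothetical.

```python
def update_background(mean, var, frame, alpha=0.05, k=2.5, min_var=4.0):
    """One update step over flat lists of grayscale pixel values.

    Simplification: one Gaussian per pixel instead of a mixture.
    A pixel is foreground if it lies more than k standard deviations
    from the background mean. Returns (new_mean, new_var, foreground_mask).
    """
    new_mean, new_var, mask = [], [], []
    for m, v, x in zip(mean, var, frame):
        d = x - m
        is_foreground = d * d > k * k * v
        if is_foreground:
            # Do not absorb foreground pixels into the background model.
            new_mean.append(m)
            new_var.append(v)
        else:
            new_mean.append(m + alpha * d)
            new_var.append(max(min_var, (1 - alpha) * v + alpha * d * d))
        mask.append(is_foreground)
    return new_mean, new_var, mask

# Three pixels; the middle one is suddenly much brighter than the background.
bg_mean, bg_var, fg = update_background([100.0] * 3, [16.0] * 3, [101, 180, 99])
```

Morphological smoothing and connected-component labeling would then run on the resulting mask to obtain silhouettes.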

Locating people
1. Extract reference point: centroid
2. Use calibrated sensors to calculate absolute position
3. Create list of location hypotheses: i) (X,Y), ii) (X,Y)
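Extracting the centroid of a silhouette amounts to averaging the coordinates of its foreground pixels. The sketch below assumes the silhouette is given as a binary mask (a list of rows); converting the image-space centroid into an absolute (X, Y) position depends on the camera calibration and is not shown.

```python
def silhouette_centroid(mask):
    """Return the (x, y) centroid of the nonzero pixels in a binary mask, or None if empty."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)

# Hypothetical 2x3 mask with a silhouette in the right two columns.
centroid = silhouette_centroid([[0, 1, 1],
                                [0, 1, 1]])
```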

Tracking people
Best-hypothesis tracking: match location hypotheses to tracks
Smooth tracks with Kalman filter
[Figure: location hypotheses i) (X,Y) and ii) (X,Y) assigned to Track 1 and Track 2]
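Matching hypotheses to tracks and smoothing the result can be sketched as below. The greedy nearest-neighbor assignment and the scalar constant-position Kalman update are assumptions chosen for brevity, not necessarily the scheme used in the ISL smart room; the gating distance and noise values are hypothetical.

```python
import math

def match_hypotheses(tracks, hypotheses, max_dist=1.0):
    """Greedily pair each track with its closest unused hypothesis.

    tracks, hypotheses: lists of (X, Y) positions.
    Returns a dict {track_index: hypothesis_index}.
    """
    candidates = sorted(
        (math.dist(t, h), ti, hi)
        for ti, t in enumerate(tracks)
        for hi, h in enumerate(hypotheses)
    )
    assigned, used = {}, set()
    for dist, ti, hi in candidates:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if ti in assigned or hi in used:
            continue
        assigned[ti] = hi
        used.add(hi)
    return assigned

def kalman_update(x, p, z, q=0.01, r=0.1):
    """Scalar Kalman step for a constant-position model.

    x, p: previous estimate and its variance; z: new measurement;
    q, r: process and measurement noise. Returns (x_new, p_new).
    """
    p = p + q                # predict: uncertainty grows
    gain = p / (p + r)       # Kalman gain
    return x + gain * (z - x), (1 - gain) * p

# Two existing tracks and two new location hypotheses (hypothetical values).
pairs = match_hypotheses([(0.0, 0.0), (5.0, 5.0)], [(5.2, 4.9), (0.1, -0.1)])
```

Each coordinate of a matched track would be passed through `kalman_update` with the matched hypothesis as the measurement.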

Tracking Problems
Imperfect and merged silhouettes
Counterstrategies: better vision algorithm; probabilistic multi-hypothesis tracking; reference point: head

Reference point: Head
Use head as reference point instead of centroid
Head tracker has significantly lower tracking error and false alarm rate
[Figure: tracking error and false alarm rate]

Recognition of Pointing Gestures
Goals: recognize human pointing gestures; extract the pointing direction in 3D
Applications: human-robot interaction, smart rooms
Requirements: person-independent; real-time operation; camera motion possible

Recognition of Pointing Gestures
Stereo camera: left/right image

3D Tracker: Processing Steps
Camera, skin color, disparity
3D clustering of skin-color pixels provides hints about the position of the head and hands.

Gesture Recognition: Movement Phases
Pointing gestures consist of three intuitively distinguishable movement phases: begin, hold, end
Precise localization of the hold phase is important for determining the pointing direction
Mean duration of the movement phases:
                   μ [sec]   σ [sec]
Complete gesture   1.75      0.48
Begin              0.52      0.17
Hold               0.76      0.40
End                0.47      0.12

Gesture Recognition: Models
Modeling of the 3 phases with separate models
Continuous HMMs with 2 Gaussians per state
Null model as threshold for the phase models
Training on hand-labeled data

Gesture Recognition: Detection
A pointing gesture is detected if 3 points in time t_B < t_H < t_E are found such that
P_E(t_E) > P_B(t_E) and P_E(t_E) > 0
P_B(t_B) > P_E(t_B) and P_B(t_B) > 0
P_H(t_H) > 0
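The detection rule above can be written as a short search over per-frame scores. The sketch assumes that P_B, P_H and P_E are available as lists with one score per frame for the begin, hold and end models, already expressed relative to the null model so that a value above 0 means the phase model wins. It returns the earliest qualifying triple, which is an arbitrary choice; function and variable names are hypothetical.

```python
def detect_pointing_gesture(pb, ph, pe):
    """Return (t_B, t_H, t_E) with t_B < t_H < t_E satisfying the rule, or None.

    pb, ph, pe: per-frame scores of the begin, hold and end models
    (assumed to be measured relative to the null model).
    """
    n = len(pb)

    def first_after(start, condition):
        # Index of the first frame >= start that fulfils the condition.
        for t in range(start, n):
            if condition(t):
                return t
        return None

    t_b = first_after(0, lambda t: pb[t] > pe[t] and pb[t] > 0)
    if t_b is None:
        return None
    t_h = first_after(t_b + 1, lambda t: ph[t] > 0)
    if t_h is None:
        return None
    t_e = first_after(t_h + 1, lambda t: pe[t] > pb[t] and pe[t] > 0)
    if t_e is None:
        return None
    return (t_b, t_h, t_e)

# Hypothetical score sequences over five frames.
result = detect_pointing_gesture(
    [1.0, 0.5, -1.0, -1.0, -1.0],   # begin model
    [-1.0, -1.0, 2.0, 1.0, -1.0],   # hold model
    [-1.0, -1.0, -1.0, -1.0, 3.0],  # end model
)
```

Taking the earliest frame at each stage is sufficient to decide whether any valid triple exists, since an earlier t_B or t_H never rules out a later partner.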

Gesture Recognition: Features
Feature vector: (r, Δθ, Δy)
Experiments: cylindrical coordinates work better than spherical and Cartesian coordinates
Hand relative to head → independent of the position in the room
Δθ, Δy → no adaptation to the pointing targets from training
Spline interpolation of the feature sequences to a constant 40 Hz.
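A feature vector of this kind can be sketched as follows. The code assumes that y is the vertical axis, that θ is the angle in the horizontal plane, and that Δθ and Δy are frame-to-frame differences; these conventions are assumptions for illustration, and the resampling to 40 Hz is not shown.

```python
import math

def cylindrical(head, hand):
    """Cylindrical coordinates (r, theta, y) of the hand relative to the head.

    head, hand: (x, y, z) positions with y assumed to be the vertical axis.
    """
    dx, dy, dz = (hand[i] - head[i] for i in range(3))
    return (math.hypot(dx, dz), math.atan2(dz, dx), dy)

def wrap_angle(angle):
    """Wrap an angle difference into the range [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def gesture_features(heads, hands):
    """Compute (r, delta_theta, delta_y) for each frame after the first."""
    coords = [cylindrical(h, p) for h, p in zip(heads, hands)]
    features = []
    for prev, cur in zip(coords, coords[1:]):
        features.append((cur[0], wrap_angle(cur[1] - prev[1]), cur[2] - prev[2]))
    return features

# Hypothetical two-frame sequence: the hand swings a quarter turn and rises 10 cm.
feats = gesture_features(
    [(0.0, 1.7, 0.0), (0.0, 1.7, 0.0)],
    [(0.5, 1.2, 0.0), (0.0, 1.3, 0.5)],
)
```

Because only the radius and the changes in angle and height enter the features, the same gesture yields similar values wherever the person stands and whichever target is pointed at.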

Pointing Direction
Head-hand line (line of sight from eye to hand): easy to measure
Forearm line: potentially superior when the arm is bent; harder to measure

Audio-Visual Speech Recognition

Lip Tracking Module
Feature based
Tracks facial features (pupils, nostrils, lips)
Detects localization failures and automatically recovers from failures

Audio-Visual Recognition
hyp_c = λ_a · hyp_a + λ_v · hyp_v, with λ_a + λ_v = 1
Combination methods: SNR weights, entropy weights, trained weights
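The weighted combination of the acoustic and visual hypotheses can be illustrated with a few lines of code. The sketch assumes that each recognizer delivers one score per candidate word and that the audio weight is given; how that weight is derived from SNR, entropy or training is not shown. The words and scores are made up.

```python
def combine_hypotheses(hyp_a, hyp_v, lambda_a):
    """Combine audio and visual scores with weights that sum to 1.

    hyp_a, hyp_v: dicts mapping each candidate to its audio / visual score
    (assumed to share the same keys). lambda_a: audio weight in [0, 1].
    """
    lambda_v = 1.0 - lambda_a
    return {w: lambda_a * hyp_a[w] + lambda_v * hyp_v[w] for w in hyp_a}

def best_hypothesis(combined):
    """Return the candidate with the highest combined score."""
    return max(combined, key=combined.get)

# Hypothetical scores: audio slightly prefers "bat", video clearly prefers "pat".
audio = {"bat": 0.6, "pat": 0.4}
video = {"bat": 0.2, "pat": 0.8}
noisy = combine_hypotheses(audio, video, lambda_a=0.7)   # moderate trust in audio
clean = combine_hypotheses(audio, video, lambda_a=0.9)   # high trust in audio
```

With a lower audio weight, as one would choose in noisy conditions, the visual evidence decides the outcome; with a high audio weight the acoustic hypothesis wins.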

Fusion Levels
Word level (vote; decide based on A and V score)
Phoneme level (combine by different weighting schemes)
Feature level (combine features)

Audio-Visual Speech

Possible Topics
Person tracking
Gesture recognition
Attentive interfaces
Face detection
Lip-reading (audio-visual speech recognition)
Audio-visual tracking
Emotion recognition
Person identification
Microphone arrays
Sensor fusion
Smart room infrastructure
Intelligent camera control
Self-calibration
Other smart room projects (MIT, Georgia Tech, IM2)
Other sensors: pressure, IR, etc.
Speech recognition in meetings, far-field, efficient microphone arrays
