Published by: Wotan Karren. Modified more than 10 years ago.
1
Multimodal Rooms: “Smart Rooms”, “Intelligent Environments” (Seminar, SS 03)
2
User Interfaces
In the beginning: “Wimpy” Computing (WIMP: Windows, Icons, Menus, Pointing)
3
2nd Generation: Human-Machine Interaction
Speaking
Pointing, Gesturing
Handwriting
Drawing
Presence / Focus of Attention
Combination: Speech + Handwriting + Gesture
Repair
Multimodal NLP & Dialog: “Please show me… hm… all hotels in THIS area… er… part of the city”
4
“Perceptual” User Interfaces
Perceptive: human-like perceptual capabilities (what is the user saying, who is the user, where is the user, what is he doing?)
Multimodal: people use multiple modalities to communicate (speech, gestures, facial expressions, …)
Multimedia: text, graphics, audio and video
(Matthew Turk (Ed.), Proceedings of the 1998 Workshop on Perceptual User Interfaces)
5
Next: Pervasive Computing
Human-Computer Interaction Is Not the Only Exchange
Humans Want to Interact with Other Humans
Computers in the Human Interaction Loop (CHIL)
The Transparent, Invisible Computer
Computers Need to Be Context Aware
Should Require Little or No Learning or Attention
Should Be Proactive Rather than Command Driven
Should Produce Little or No Distraction
Should Permit a Mix of HCI and CHIL
6
Smart/Intelligent Rooms
Use of computation to enhance everyday activity Integrate computers seamlessly into the real world (e.g. offices, homes) Use “natural” interfaces for communication (voice, gesture, etc. ) Computer should adapt to the human, not vice-versa!
7
Perception In order to respond appropriately, objects/room need(s) to pay attention to People and Context Machines have to be aware of their environment: Who, What, When, Where and Why? Interfaces must be adaptive to Overall situation Individual User
8
Intelligent Environments
Classroom 2000 (Georgia Tech) Mozer’s Adaptive House Enhanced Meeting Rooms Kids Room (MIT) … Enhanced Objects such as Whiteboards, Desks, Chairs, … See also the Intelligent Environments Resource Page (
9
Intelligent Rooms, Univ. California, San Diego
10
Classroom 2000
Capturing activity in a classroom: speaker’s voice, video, slides, handwritten notes
11
Classroom 2000 Presenting (recorded) lectures through a web-based interface Integration of Slides, Notes, Audio, Video Searching Adding additional material
12
Microsoft Easy Living Project
XML-based distributed agent system Computer vision for person-tracking and visual user interaction. Multiple sensor modalities combined. Use of a geometric model of the world to provide context. Automatic or semi-automatic sensor calibration and model building. Fine-grained events and adaptation of the user interface. Device-independent communication and data protocols. Ability to extend the system in many ways.
13
Mozer’s Adaptive House
Operated as an ordinary home: usual light switches, thermostats, doors, etc.
Adjustments are measured and used to train the house to automatically adjust temperature, adjust lighting, and choose music or TV channel
The house infers the users' desires from their actions and behaviours
14
Adaptive House (Mozer)
Sensors: Light Level Sound Level Temperature Motion Door status Window status Light settings Fan Heaters … (M. Mozer, Univ. of Colorado, Boulder)
15
Issues in Perception
Visual: Face Detection / Tracking, Body Tracking, Face Recognition, Gesture Recognition, Action Recognition, Gaze Tracking / Tracking Focus of Attention
Auditory: Speech Recognition, Speaker Tracking, Auditory Scene Analysis, Speaker Identification
Other: Haptic, Olfactory, … ?
16
Enhanced Meeting Rooms
Capturing of Meetings Transcription Summarization Dialog Processing Who was there ? Who talked to whom ?
17
Work at ISL Face Tracking Facial Feature Tracking (Eyes, Nose, Mouth)
Head Pose Estimation / Gaze Tracking Lip-Reading (Audio-Visual Speech Reco.) 3D Person Tracking Pointing Gesture Tracking Other Modalities: Speech (!!!, see John), Dialogue, Translation, Handwriting, ...
18
Tracking of Human Faces
A face provides different functions: identification perception of emotional expressions Human Computer Interaction requires tracking of faces: lip-reading eye/gaze tracking facial action analysis / synthesis Video Conferencing / video telephony application: tracking the speaker achieving low bit rate transmission
19
Demo: FaceTracker
20
Color Based Face Tracking
Human skin-colors: cluster in a small area of a color space skin-colors of different people mainly differ in intensity! variance can be reduced by color normalization distribution can be characterized by a Gaussian model Chromatic colors:
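The skin-color model above can be sketched as follows. This is a minimal illustration, assuming that "chromatic colors" means the usual intensity-normalized pair r = R/(R+G+B), g = G/(R+G+B); the function names, the sample pixels, and the small regularization term `eps` are hypothetical and not taken from the slides.

```python
import math

def chromatic(r, g, b):
    """Map an RGB pixel to intensity-normalized chromatic colors (r, g)."""
    total = r + g + b
    if total == 0:
        return (1.0 / 3.0, 1.0 / 3.0)  # black pixel: assume neutral chromaticity
    return (r / total, g / total)

def fit_gaussian(points, eps=1e-6):
    """Fit a 2D Gaussian to chromatic samples; returns (mean, covariance terms)."""
    n = len(points)
    mr = sum(p[0] for p in points) / n
    mg = sum(p[1] for p in points) / n
    # eps keeps the covariance matrix invertible for nearly degenerate samples
    srr = sum((p[0] - mr) ** 2 for p in points) / n + eps
    sgg = sum((p[1] - mg) ** 2 for p in points) / n + eps
    srg = sum((p[0] - mr) * (p[1] - mg) for p in points) / n
    return (mr, mg), (srr, srg, sgg)

def skin_density(pixel, mean, cov):
    """Gaussian density of an RGB pixel under the chromatic skin-color model."""
    r, g = chromatic(*pixel)
    dr, dg = r - mean[0], g - mean[1]
    srr, srg, sgg = cov
    det = srr * sgg - srg * srg
    # squared Mahalanobis distance using the closed-form 2x2 inverse
    maha = (sgg * dr * dr - 2.0 * srg * dr * dg + srr * dg * dg) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))
```

Because the normalization divides out intensity, two pixels that differ only in brightness map to the same chromatic point, which is why a single compact Gaussian can cover different skin tones. Thresholding `skin_density` per pixel would give a candidate face mask.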
21
Color Model
Advantages: very fast; orientation invariant; stable object representation; not person-dependent; model parameters can be quickly adapted
Disadvantages: environment dependent (light sources heavily affect the color distribution)
22
Tracking Gaze and Focus of Attention
In meetings: to determine the addressee of a speech act; to track the participants' attention; to analyse who was in the center of focus; for meeting indexing / retrieval
Interactive rooms: to guide the environment's focus to the right application; to suppress unwanted responses
Virtual collaborative workspaces (CSCW)
Human-Robot Cooperation
Cars (driver monitoring)
23
Tracking a User’s Focus of Attention
Focus of Attention tracking: To detect a person’s interest To know what a user is interacting with To understand his actions/intentions To know whether a user is aware of something In meetings: to determine the addressee of a speech act to understand the dynamics of interaction for meeting indexing / retrieval Other areas Smart environments Video-conferencing Human-Robot Interaction
24
Head Pose Estimation
Model-based approaches: locate and track a number of facial features; compute head pose from 2D to 3D correspondences (Gee & Cipolla '94, Stiefelhagen et al. '96, Jebara & Pentland '97, Toyama '98)
Example-based approaches: estimate new pose with a function approximator such as an ANN (Beymer et al. '94, Schiele & Waibel '95, Rae & Ritter '98); use a face database to encode images (Pentland et al. '94)
25
Model-based Head Pose estimation
Find correspondences between points in a 3D model and points in the image
Iteratively solve a linear equation system to find the pose parameters (rx, ry, rz, tx, ty, tz)
[Figure: feature tracking and pose estimation relate the 3D model (axes X, Y, Z) to the real-world image]
26
Demo: Facial Feature Tracking
27
Demo: Model-based Head Pose
28
Model-based Head Pose
Pose estimation accuracy depends on correct feature localization!
Problems: choice of good features; occlusion due to strong head rotation; fast head movement; detection of tracking failure / re-initialization; requires good image resolution
Video
29
Estimating Head Pose with ANNs
Train neural network to estimate head orientation Preprocessed image of the face used as input
30
Network Architecture
Output: Pan (Tilt)
Hidden layer: 40 to 150 units
Input retina: up to 3 x 20x30 pixels = 1,800 units
31
Tracking People in a Panoramic View
Camera View Perspective View Panoramic View
32
Training
Separate nets for pan and tilt
Trained with standard backpropagation with a momentum term
Datasets: training on 6100 images from 12 users; cross-evaluation on 750 images from the same users; testing on 750 images from the same users
Additional user-independent test set: 1500 images from two new users
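The momentum term mentioned above changes only the weight-update step of backpropagation. The sketch below shows that step in isolation; a toy quadratic error stands in for the network's error function, and the learning rate, momentum factor, and function names are illustrative assumptions rather than values from the slides.

```python
def momentum_step(weights, grads, velocity, lr=0.1, momentum=0.9):
    """One gradient-descent update with a momentum term (lists updated in place)."""
    for i, g in enumerate(grads):
        # new step = fraction of the previous step minus the scaled gradient
        velocity[i] = momentum * velocity[i] - lr * g
        weights[i] += velocity[i]

def train_toy(steps=300):
    """Minimize the toy error E(w) = (w0 - 3)^2 + (w1 + 1)^2 with momentum."""
    weights = [0.0, 0.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        # analytic gradient of the toy error; backprop would supply this for a net
        grads = [2.0 * (weights[0] - 3.0), 2.0 * (weights[1] + 1.0)]
        momentum_step(weights, grads, velocity)
    return weights
```

In a real pan or tilt network the gradients would come from backpropagating the output error through the hidden layer; the update rule itself stays the same.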
33
Results
histo: histogram-normalized image used as input
edges: horizontal and vertical edge images used as input
both: histogram-normalized image plus edge images used as input
34
Demo
35
Spatial-Awareness in Smart Rooms
Motivation: tracking people indoors
To focus sensors on people
To resolve spatial relationships
To avoid bumping into humans
To analyze activity
36
Person Tracking Vision based localization of people/objects:
Single Perspective: Pfinder - W3S - Hydra - etc. Multiple Perspective: AVIARY - Easy Living
37
Person Tracking in the ISL Smart Room
[Diagram: cameras Cam0 to Cam3 each feed a feature extractor; the extracted features are passed to a tracking agent, which outputs the tracked people]
38
Person Tracking with Multiple Cameras
Goal: 3D tracking of people in rooms
Segmentation of foreground objects in each image
“3D intersection” of the rays through the object centers
Kalman filter
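The "3D intersection" of viewing rays can be illustrated with a small sketch. Rays from two cameras rarely meet exactly, so one common choice (an assumption here, not stated on the slide) is the midpoint of the shortest segment between them. Each ray is assumed to be given as a camera position and a direction in world coordinates, which presupposes calibrated cameras; the function name is hypothetical.

```python
def closest_point_between_rays(p1, d1, p2, d2):
    """Midpoint of the shortest segment between the 3D lines p1 + s*d1 and p2 + t*d2.

    Returns None if the lines are (nearly) parallel.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w0 = [a - b for a, b in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # no unique closest point for parallel rays
    # line parameters of the two mutually closest points
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1 = [p + s * v for p, v in zip(p1, d1)]
    q2 = [p + t * v for p, v in zip(p2, d2)]
    return [(u + v) / 2.0 for u, v in zip(q1, q2)]
```

With more than two cameras, the same idea generalizes to a least-squares point that minimizes the summed squared distance to all rays.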
39
Adaptive Silhouette Extraction
Background subtraction: Adaptive Multi-Gaussian background model [Stauffer et al., CVPR 1998] Morphological operators smooth foreground output Connected components form silhouettes Silhouette extraction
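As a rough illustration of adaptive background subtraction, the sketch below keeps one running Gaussian (mean and variance) per pixel. This is a deliberate simplification: the cited model maintains several Gaussians per pixel, and the morphological smoothing and connected-component steps are omitted. The parameter values and the function name are assumptions.

```python
def segment_and_update(mean, var, frame, alpha=0.05, k=2.5, min_var=4.0):
    """Classify pixels as foreground and adapt the background model in place.

    mean, var, frame: 2D lists of grayscale values with identical shape.
    A pixel is foreground if it lies more than k standard deviations from
    the background mean; only background pixels update the model.
    """
    mask = []
    for y, row in enumerate(frame):
        mask_row = []
        for x, value in enumerate(row):
            diff = value - mean[y][x]
            is_foreground = diff * diff > (k * k) * var[y][x]
            if not is_foreground:
                # exponential running average of mean and variance
                mean[y][x] += alpha * diff
                var[y][x] = max(min_var, (1.0 - alpha) * var[y][x] + alpha * diff * diff)
            mask_row.append(is_foreground)
        mask.append(mask_row)
    return mask
```

The returned mask is what the later steps (morphological operators, connected components) would clean up and group into silhouettes.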
40
Locating people
1. Extract reference point: centroid
2. Use calibrated sensors to calculate absolute position
3. Create list of location hypotheses
Location hypotheses: i) (X,Y) ii) (X,Y)
41
Tracking people
Best-hypothesis tracking: match location hypotheses to tracks; smooth tracks with a Kalman filter
[Figure: hypotheses i) (X,Y) and ii) (X,Y) assigned to Track 1 and Track 2]
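Matching hypotheses to tracks and smoothing with a Kalman filter could look roughly like the sketch below. It assumes greedy nearest-neighbour matching with a distance gate and a scalar constant-position Kalman filter applied per coordinate; the slides do not specify the matching rule or the motion model, so both are illustrative choices, as are all names and noise values.

```python
import math

def match_hypotheses(tracks, hypotheses, gate=1.0):
    """Greedily assign each track to its nearest unused hypothesis within `gate`.

    tracks, hypotheses: lists of (x, y) positions.
    Returns a dict mapping track index to hypothesis index.
    """
    pairs = []
    for ti, (tx, ty) in enumerate(tracks):
        for hi, (hx, hy) in enumerate(hypotheses):
            pairs.append((math.hypot(tx - hx, ty - hy), ti, hi))
    pairs.sort()  # closest pairs are assigned first
    assignment, used = {}, set()
    for dist, ti, hi in pairs:
        if dist <= gate and ti not in assignment and hi not in used:
            assignment[ti] = hi
            used.add(hi)
    return assignment

class ScalarKalman:
    """1D Kalman filter with a constant-position model (one per coordinate)."""

    def __init__(self, x0, p0=1.0, q=0.01, r=0.25):
        self.x, self.p, self.q, self.r = x0, p0, q, r

    def update(self, z):
        self.p += self.q                  # predict: uncertainty grows
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x += k * (z - self.x)        # correct with measurement z
        self.p *= (1.0 - k)
        return self.x
```

A track would hold one `ScalarKalman` for X and one for Y and feed each matched hypothesis into them; unmatched hypotheses could start new tracks.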
42
Tracking Problems
Imperfect and merged silhouettes
Counterstrategies: better vision algorithm; probabilistic multi-hypothesis tracking; reference point: head
43
Reference point: Head
Use head as reference point instead of centroid
Head tracker has significantly lower tracking error and false alarm rate
[Figure: a) tracking error, b) false alarm rate]
44
Demo
45
Recognition of Pointing Gestures
Goals: recognize human pointing gestures; extract the pointing direction in 3D
Application areas: human-robot interaction; smart rooms
Requirements: person-independent; real-time operation; camera motion possible
46
Recognition of Pointing Gestures
Stereo camera: left/right image
47
3D Tracker: Processing Steps
Camera, skin color, disparity
3D clustering of skin-color pixels provides cues to the position of the head and hands.
48
Gesture Recognition: Movement Phases
Pointing gestures consist of three intuitively distinguishable movement phases: Begin, Hold, End
Precise localization of the hold phase is important for determining the pointing direction

Mean duration of the movement phases:
                   μ [sec]   σ [sec]
Complete gesture   1.75      0.48
Begin              0.52      0.17
Hold               0.76      0.40
End                0.47      0.12
49
Gesture Recognition: Models
Modeling of the 3 phases with separate models
Continuous HMMs with 2 Gaussians per state
Null model as threshold for the phase models
Training on hand-labeled data
50
Gesture Recognition: Detection
A pointing gesture is detected if 3 time points t_B < t_H < t_E are found such that
P_E(t_E) > P_B(t_E) and P_E(t_E) > 0
P_B(t_B) > P_E(t_B) and P_B(t_B) > 0
P_H(t_H) > 0
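The detection rule above can be written as a short search over three score sequences. The sketch assumes that the begin, hold, and end scores are already given per frame and expressed relative to the null model, so that a value above 0 means the phase model beats the threshold; that normalization and the function name are assumptions, not details from the slides.

```python
def detect_pointing_gesture(pb, ph, pe):
    """Find frame indices (t_b, t_h, t_e) with t_b < t_h < t_e that satisfy the rule.

    pb, ph, pe: per-frame scores of the begin, hold, and end models.
    Returns the earliest valid triple, or None if no gesture is detected.
    """
    n = len(pb)
    # begin: begin model beats the end model and the threshold
    t_b = next((t for t in range(n) if pb[t] > pe[t] and pb[t] > 0), None)
    if t_b is None:
        return None
    # hold: hold model exceeds the threshold after the begin point
    t_h = next((t for t in range(t_b + 1, n) if ph[t] > 0), None)
    if t_h is None:
        return None
    # end: end model beats the begin model and the threshold after the hold point
    t_e = next((t for t in range(t_h + 1, n) if pe[t] > pb[t] and pe[t] > 0), None)
    if t_e is None:
        return None
    return (t_b, t_h, t_e)
```

Taking the earliest admissible frame at each stage is sufficient to decide whether any valid triple exists, since an earlier begin or hold point never removes later candidates.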
51
Gesture Recognition: Features
Feature vector: (r, Δθ, Δy)
Experiments: cylindrical coordinates work better than spherical and Cartesian ones
Hand position relative to the head: independent of the position in the room
Δθ, Δy: no adaptation to the pointing targets seen in training
Spline interpolation of the feature sequences to a constant 40 Hz
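A possible reading of the feature vector (r, Δθ, Δy) is sketched below: the hand position is expressed relative to the head in cylindrical coordinates, and θ and y enter as frame-to-frame differences. The axis convention (y vertical, θ measured in the horizontal x-z plane) and the function names are assumptions; the spline resampling to 40 Hz is not shown.

```python
import math

def cylindrical(head, hand):
    """Hand position relative to the head as (r, theta, y), with y the vertical axis."""
    dx = hand[0] - head[0]
    dy = hand[1] - head[1]
    dz = hand[2] - head[2]
    return (math.hypot(dx, dz), math.atan2(dz, dx), dy)

def wrap_angle(a):
    """Wrap an angle difference into [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi

def gesture_features(heads, hands):
    """Per-frame features (r, delta_theta, delta_y) from head and hand tracks."""
    feats = []
    prev = None
    for head, hand in zip(heads, hands):
        r, theta, y = cylindrical(head, hand)
        if prev is not None:
            feats.append((r, wrap_angle(theta - prev[0]), y - prev[1]))
        prev = (theta, y)
    return feats
```

Using differences for θ and y removes the absolute pointing target from the features, and measuring relative to the head makes them independent of where the person stands.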
52
Pointing Direction
Head-hand line: easy to measure
Forearm line: potentially superior when the arm is bent; harder to measure
Line of sight (eye-hand)
53
Audio-Visual Speech Recognition
54
Lip Tracking Module
Feature based
Tracks facial features (pupils, nostrils, lips)
Detects localization failures and automatically recovers from them
55
Audio-Visual Recognition
hyp_c = λ_a · hyp_a + λ_v · hyp_v, with λ_a + λ_v = 1
Combination methods: SNR weights, entropy weights, trained weights
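The weighted combination of the acoustic and visual hypothesis scores can be sketched as below, with the two stream weights summing to 1. The linear mapping from SNR to the audio weight, its 0 to 30 dB range, and the function names are hypothetical; the slides name SNR weighting as a method but do not give its form.

```python
def snr_to_audio_weight(snr_db, snr_min=0.0, snr_max=30.0):
    """Map an SNR estimate in dB to an audio weight in [0, 1] (assumed linear)."""
    x = (snr_db - snr_min) / (snr_max - snr_min)
    return min(1.0, max(0.0, x))

def combine_hypotheses(audio_scores, visual_scores, lambda_a):
    """Combine per-hypothesis scores; the visual weight is 1 - lambda_a."""
    lambda_v = 1.0 - lambda_a
    return {
        word: lambda_a * audio_scores[word] + lambda_v * visual_scores[word]
        for word in audio_scores
    }

def best_hypothesis(scores):
    """Return the hypothesis with the highest combined score."""
    return max(scores, key=scores.get)
```

With clean audio the weight approaches 1 and the acoustic scores dominate; in noise the visual (lip-reading) scores take over. Entropy-based or trained weights would replace only `snr_to_audio_weight`.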
56
Fusion Levels Word Level (Vote, Decide based on A and V score)
Phoneme Level (Combine by Diff. Weighting Schemes) Feature Level (Combine Features)
57
Audio-Visual Speech
58
Possible Topics
Person Tracking
Gesture Recognition
Attentive Interfaces
Face Detection
Lip-Reading (Audio-Visual Speech Reco.)
Audio-Visual Tracking
Emotion Recognition
Person Identification
Microphone-Arrays
Sensor Fusion
Smart Room Infrastructure
Intelligent Camera Control
Self-Calibration
Other Smart Room Projects (MIT, Georgia Tech, IM2)
Other Sensors: Pressure, IR, etc.
Speech Recognition in Meetings Far-Field Efficient Microphone-Arrays