Project Seminar Image Processing (Projektseminar Bildverarbeitung)

Guido H. Bruck

Project Seminar (Lab Course) Image Processing
Elective course: project seminar "Application of Image Data Compression Methods"
- Methods: MPEG-1, MPEG-2, MPEG-4, RealMedia, Windows Media, etc.
- Each group works on one method.
- Discussion and group presentations take place at the meetings.
- Registration is online.
- Thursdays, 11:00 to 14:00, room BB130.
- Participants should pick up the course materials in advance, because preparatory exercises have to be solved.
- At the end there is a written exam, because a grade is required.

Application of Image Data Compression Methods
- Get to know different methods for image data compression.
- Find the software.
- Install the software.
- Encode image sequences with different parameters.
- Compare the resulting image quality.
- Goal: which method is currently the most suitable for which parameters and ...
- Sources: c't 10/2001, p. 122 ff.; c't 10/2003, p. 146 ff.; c't 10/2005, p. 146 ff.

Procedure: splitting the overall project into subprojects
- Form groups of students and assign the subprojects to the groups.
- Each group works on its subproject independently.
- Every week each group gives a short talk on the progress of its work, raises questions for discussion and collects suggestions.
- Each group summarizes its presentation on "paper" and makes it available to all other groups.
- By the end of the semester we have worked through the overall project and answered the questions.

Image sizes and data rates of interest
- Image size: SD video source material, 720 x 576 picture elements
- Data rates: 50 kbit/s to 2 Mbit/s

Test sequences for image coding

Methods to be examined
- MPEG-1, MPEG-2: Tsunami MPEG Encoder
- DivX, DivX-6
- RealVideo
- Windows Media Video 9
- MediaRig
- On2
- Xvid

Procedure
- Find and read information about the encoding program on the Internet and download the required software.
- Install the software on a PC by arrangement.
- Put the software into operation.
- Encode various image sequences with different parameter settings and assess the results.

Further web sites with useful software
- Digital Digest
- Vcd-Helper
- The video capturing source
- VirtualDub

Group assignment I: up to 5 participants per group
- Group 1: MPEG-1, MPEG-2
- Group 2: RealVideo
- Group 3: Windows Media Series 9
- Group 4: DivX-6, DivX-3.11
- Group 5: H.264/AVC

H.264/AVC
- x264
- mpegable: provides a free VFW AVC/H.264 encoder/decoder (not based on the reference), which uses P-frames only.
- MainConcept: MainConcept's AVC/H.264 implementation (based on the reference software?) is currently available as a free preview version (adds a watermark).
- JM (Reference Software): the reference software of the AVC/H.264 standard already offers a very rich feature set.
- Hdot264: an open-source (GPL) VFW version of the reference software, written by doom9 member charact3r.

H.264/AVC http://forum.doom9.org/showthread.php?t=96059
- x264: the first publicly available High Profile encoder, open source (GPL). Available for VFW as x264vfw and ffdshow (output .avi), as command line x264cli (outputs .mp4, .mkv, raw), and through mencoder (outputs raw, .avi) (Doom9's MeGUI) or ffmpeg. x264 supports 2pass, CABAC, Loop, multiple B-Frames, B-References, multiple Reference Frames, 4x4 P-Frame and 8x8 B-Frame block sizes, anamorphic signalling, and High Profile: 8x8 DCT and intra prediction, lossless and custom quant matrices.
- NeroDigital AVC: usable in Nero Recode2, outputs .mp4. ND AVC supports 2pass, CABAC, (adaptive) Loop, multiple B-Frames, multiple Reference Frames, weighted prediction, 8x8 P-Frame block sizes, 16x16 B-Frame block sizes, Adaptive Quant. (Psy High).
- Sorenson: usable in Sorenson Squeeze 4, outputs .mp4. Sorenson supports 2pass, max 2 B-Frames, B-References, Loop and multiple Slices.
- Apple: usable in Quicktime 7, outputs .mp4, .3gp and .mov, very slow. Uses 2pass, max 1 B-Frame, Loop (0,0), P8x8, B8x8, I4x4, Adapt. Quant, 5 Slices, no CABAC, no Weighted Pred., no multi Ref.
- JM: the AVC Reference Software offers in v9.3 Main and High Profile: B/SP-Frames, CABAC, Loop Filter, 4x4 block sizes, multiple Reference Frames, Adaptive Quant, Error Resilience, RDO, Lossless Coding, Custom Quants, Rate Control and so on.
- Hdot264: open-source (GPL) VFW version of the reference software by doom9 member charact3r, still based on a very old version of the reference (JM 4.0c).
- VSS: free preview VFW encoder (limited to 5 days), based on the reference encoder.
- Elecard: usable in Elecard Mobile Converter, outputs .mp4, and in MainConcept's v2 encoder, outputs .264 and .mpg PS/TS.
Not publicly available anymore:
- Moonlight: usable in Moonlight's OneClick Compressor v1.1 and CyberLink's PowerEncoder, outputs .mpg. Moonlight supports 1pass (VBR/CBR/fixed quants), CABAC, Loop, 2 B-Frames, 8x8 P-Frame sizes, Adapt. Quant, PAR, Interlacing.
- MainConcept: was usable in the v1 encoder (adds a watermark), outputs .264 and .mpg PS/TS. 1pass (CBR/VBR/fixed quants), P-Frame Reorder, CABAC, Loop, multiple B-VOPs, multiple Ref, 4x4 P-Frame sizes, PAR, RDO.
- mpegable: offered a free VFW encoder for some time (not based on the reference), does not handle YV12. mpegable supports 1pass (fixed quants), uses P-Frames only, 8x8 P-Frame block sizes, CAVLC only, Loop.
- Envivio: usable in 4Coder, outputs .mp4.

Fundamentals: image quality assessment, investigation parameters
- Irrelevance reduction
- Interlaced scanning (interlace)
- Example of image data compression
- Pixel counts
- Sampling formats
- MPEG coding

Assessing the quality of video signals
Assessment on the 5-level "impairment scale" according to ITU-R BT.500:
- 5.0 = transparent (this should be the original signal)
- 4.0 = perceptible, but not annoying (first differences noticeable)
- 3.0 = slightly annoying
- 2.0 = annoying
- 1.0 = very annoying

Assessing test sequences at different data rates

- 2 Mbit/s: 720x576 pixels, 25 fps
- 1 Mbit/s: 720x576 pixels, 352x288 pixels, 25 fps
- 500 kbit/s: 720x576 pixels, 352x288 pixels, 25 fps
- 300 kbit/s: 720x576 pixels, 352x288 pixels, 25 fps
- 150 kbit/s: x288 pixels, 25 fps
- 64 kbit/s: 352x288 pixels, 25 fps
Data rate R, with S = file size and n = number of frames in the sequence.
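The mean data rate of a coded test sequence follows from its file size and playing time. The sketch below assumes R = 8 * S * f / n, with S in bytes and a frame rate f of 25 fps. The function name and the convention 1 kbit = 1000 bit are illustrative assumptions, not taken from the slides.

```python
def data_rate_kbit_per_s(file_size_bytes, num_frames, fps=25.0):
    """Mean data rate in kbit/s for a sequence of num_frames frames.

    Assumption: 1 kbit = 1000 bit, file size given in bytes.
    """
    duration_s = num_frames / fps          # playing time of the sequence
    return file_size_bytes * 8 / duration_s / 1000.0


# Example: 250 frames (10 s at 25 fps) coded into 625 000 bytes
rate = data_rate_kbit_per_s(625_000, 250)
print(rate)  # 500.0
```

Comparing the result with the target rate of the encoder setting shows whether a codec actually met the requested data rate.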

1.4.1 Exploiting the imperfections of human vision for irrelevance reduction
In electronic image transmission, the limited spatial resolution of the human eye is exploited by decomposing each picture into strip-shaped elements ("lines", occasionally "columns"). Seen from the normal viewing distance, the width of each strip subtends an angle of less than 1' (one arc minute), and the luminance that varies across the strip width is replaced, for example, by its "mean value". In electronic moving-picture transmission, the limited temporal resolution of human vision is exploited in addition by transmitting only a limited number of pictures per unit of time. In digital image transmission, the number of bits used to represent the amplitude is chosen so that no difference from a continuous signal becomes visible.

1.4.2 Exploiting the imperfections of human vision for irrelevance reduction
Principle of the standard picture decomposition into two fields: established television systems use the field method. A full frame is split into two fields that are interleaved line by line (interlaced scanning).
Figure: decomposition into 11 "lines" by interlaced scanning, showing the first and second field, the horizontal and vertical retrace, and the aperture during capture or the luminous spot during display.

3.1.1a Digital television standards
Standards and regions:
- DVB (-S, -C, -T, …): Europe, partly other continents
- ATSC: USA, Korea
- ISDB (-T): Japan, Brazil
- ISDTV: Brazil
- DTMB: China
Comparison of ATSC, DVB-T, ISDB-T, ISDTV and DTMB:
- Video digitization: MPEG-2, H.264
- Audio digitization: Dolby AC-3, MPEG-2, AAC
- Multiplexing: MPEG
- Signal transmission: 8-VSB modulation, COFDM multiplex, SCM and MCM

3.1.1b Digital television standards
- DVB = Digital Video Broadcasting
- ATSC = Advanced Television Systems Committee (USA, Korea)
- ISDB = Integrated Services Digital Broadcasting (Japan, Brazil)
- ISDTV = International System for Digital Television (Brazil)
- DTMB = Digital Terrestrial Television Multimedia Broadcasting (China)
- MPEG = Moving Picture Experts Group
- Dolby AC-3 = Dolby Digital ("Audio Coding 3")
- AAC = Advanced Audio Coding
- VSB = Vestigial Side Band
- COFDM = Coded Orthogonal Frequency Division Multiplex
- SCM = Single Carrier Modulation
- MCM = Multi Carrier Modulation

3.1.1c Digital television: SDTV
Different standards are used for the digital transmission of video signals within the studio and for transmission to the end user.
Standard for studio use (ITU-R BT.601): the luminance signal (Y) and the two chrominance signals (CR, CB) are sampled separately, but in a fixed temporal and spatial relation to each other. The sampling frequencies are chosen in the ratio 4:2:2. For standards based on Z = 625 lines and a field frequency fv = 50 Hz, the following values apply:
- Y: 864 samples per line (gross), 720 (net), sampling frequency 13.5 MHz
- CR, CB: 432 samples per line each (gross), 360 (net), sampling frequency 6.75 MHz
The signals are quantized uniformly with a resolution of 8 bits. To allow simple conversion between standards, the values for the sampling frequency and the number of samples per line apply worldwide. These values result in a total data rate of 216 Mbit/s (gross). The mean net data rate is 166 Mbit/s.
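The gross and net data rates can be checked with a few lines of arithmetic. The sketch below uses the sample counts from the text. The frame rate of 25 frames per second (two fields at 50 Hz) and the 576 active lines are assumptions taken from the 720 x 576 format used elsewhere in the seminar.

```python
FRAME_RATE = 25        # full frames per second (assumption: 50 fields/s, interlaced)
BITS_PER_SAMPLE = 8
LINES_TOTAL = 625      # all lines, including blanking
LINES_ACTIVE = 576     # assumption: active lines of the 720 x 576 format


def rate_mbit_per_s(samples_y, samples_c, lines):
    """Data rate in Mbit/s for one Y and two chrominance signals per line."""
    samples_per_line = samples_y + 2 * samples_c
    return samples_per_line * lines * FRAME_RATE * BITS_PER_SAMPLE / 1e6


gross = rate_mbit_per_s(864, 432, LINES_TOTAL)   # 216.0
net = rate_mbit_per_s(720, 360, LINES_ACTIVE)    # about 165.9
print(gross, net)
```

The gross value equals (13.5 + 2 * 6.75) MHz times 8 bit, and the net value rounds to the 166 Mbit/s quoted on the slide.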

3.1.3 Picture sampling formats (full-frame schemes)
Legend: X marks luminance samples (Y); the second symbol marks chrominance samples (CR, CB).

3.1.4 Picture sampling formats (full-frame schemes)
Legend: X marks luminance samples (Y); the second symbol marks chrominance samples (CR, CB).

3.1.5 Standard for transmission to the end user
The studio data rate stated earlier is far too high for transmission to the end user. To lower the rate, a data reduction ("data compression") is applied that removes both redundancy and irrelevance. Worldwide it follows the MPEG standards (specifically the MPEG-2 standard, ISO/IEC IS 13818).
The MPEG-2 standard (MPEG: "Moving Picture Experts Group"): in picture signals there are strong dependencies between temporally consecutive pictures and also between neighbouring picture elements within one picture. These dependencies are contained in the source signals. Removing them reduces the amount of data needed for transmission. At the receiver the dependencies can be reinserted into the signal, and the source signal can be reconstructed in this way.

3.1.6 Discrete cosine transform (DCT)
Image coding methods frequently use transform coding based on a 2-dimensional DCT. It reduces the dependencies between the picture elements within a picture. The MPEG-2 standard does this as well. The 2-dimensional DCT is described by:
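For reference, a commonly used form of the 2-D DCT of an N x N block, as found in JPEG- and MPEG-style codecs, is sketched here. The symbols f(x,y) for the picture samples, F(u,v) for the spectral coefficients and C(k) for the normalization factor are assumptions chosen for illustration; the notation and normalization used in the lecture may differ.

```latex
F(u,v) = \frac{2}{N}\, C(u)\, C(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1}
         f(x,y)\, \cos\frac{(2x+1)\,u\,\pi}{2N}\, \cos\frac{(2y+1)\,v\,\pi}{2N},
\qquad
C(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & k > 0 \end{cases}
```

With N = 8 the prefactor becomes 1/4, and a block of constant value c yields the single non-zero coefficient F(0,0) = 8c.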

3.1.6a Discrete cosine transform (DCT)
Because it is separable, the 2-dimensional DCT can also be regarded as two 1-dimensional DCTs applied one after the other:
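The separability can be illustrated with a short sketch that first transforms all rows and then all columns of a block. It assumes the orthonormal DCT-II; the function names are hypothetical, and the lecture may use a different scaling.

```python
import math


def dct_1d(values):
    """Orthonormal 1-D DCT-II of a list of numbers."""
    n = len(values)
    out = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        total = sum(values[x] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                    for x in range(n))
        out.append(scale * total)
    return out


def dct_2d(block):
    """2-D DCT as a row transform followed by a column transform."""
    rows = [dct_1d(row) for row in block]                # transform every row
    cols = [dct_1d(list(col)) for col in zip(*rows)]     # transform every column
    return [list(row) for row in zip(*cols)]             # back to [u][v] layout


# A flat block ("all 100") has energy only in the DC coefficient.
flat = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))  # 800.0
```

Computing the transform this way needs 2 * N one-dimensional transforms of length N instead of a full double sum for each of the N * N coefficients.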

Basis functions of the 1-dimensional DCT for a block length of N = 8:

Basis functions of the 2-dimensional DCT for a block length of N = 8:

DCT-Explorer by Lutz Lißeck
- Download from the web pages of the KT department ("Freie Software")
- Explain the program
- DCT without quantization: all values 100, lecture example, grey ramp, vertical line, horizontal line, random pattern
- DCT with quantization: all values 100, lecture example, grey ramp, vertical line, horizontal line, random pattern, each with different scaling

Quantization
For the transform, 8*8 picture elements (pixels) are combined into one transform block. The transform turns 8*8 original values into 8*8 spectral values. In a subsequent processing step the spectral coefficients are quantized, which as a rule reduces irrelevance. Each spectral coefficient is first divided by the Q value assigned to it, and the result is rounded to an integer. The Q values are arranged in two-dimensional tables that match the block layout. Different tables are customary for different picture types and methods:

3.1.11 Quantization: quantization tables for MPEG-2

3.1.12 Coding a block by means of the DCT
The following example of JPEG coding (Joint Photographic Experts Group), a still-image coding method similar to MPEG-2 coding, shows the application of the 2-dimensional DCT with N = 8 and the quantization of the spectral coefficients for one picture block.
Figures: signal values of the original picture; DCT spectral coefficients.

The DCT spectral coefficients are divided by the Q values assigned to them, and the results are rounded to integers. The resulting numbers are called "normalized spectral coefficients".
Figures: quantization table (Q values); normalized spectral coefficients.

At the receiver the normalized spectral coefficients are multiplied by the corresponding Q values (often misleadingly called "inverse quantization"). The inverse DCT then reconstructs a picture signal block from the denormalized spectral coefficients.
Figures: denormalized spectral coefficients; reconstructed picture signal values.
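The normalization and denormalization steps can be sketched in a few lines. The coefficient and Q values below are made-up illustration values, not the numbers from the lecture example, and rounding half away from zero is an assumption about the rounding rule.

```python
import math


def round_half_away(x):
    """Round to the nearest integer, with halves rounded away from zero."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def normalize(coeffs, q_table):
    """Divide each spectral coefficient by its Q value and round."""
    return [[round_half_away(c / q) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, q_table)]


def denormalize(levels, q_table):
    """Multiply the normalized coefficients by their Q values (receiver side)."""
    return [[lv * q for lv, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, q_table)]


# Hypothetical 2x2 excerpt of a coefficient block and its Q values
coeffs = [[800.0, -35.0], [12.0, 3.0]]
q_table = [[16, 11], [12, 14]]

levels = normalize(coeffs, q_table)      # [[50, -3], [1, 0]]
restored = denormalize(levels, q_table)  # [[800, -33], [12, 0]]
print(levels, restored)
```

The difference between `coeffs` and `restored` is the quantization error; the multiplication cannot bring back what the rounding removed, which is why the term "inverse quantization" is misleading.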

x8 DCT example



3.2.1 The MPEG-2 standard
In general one can assume that the programme material to be coded is available in the 4:2:2 format of the studio standard. The sample resolution is 8 bits (10 bits in special cases). Coding according to the MPEG-2 standard is normally based on the 4:2:0 sampling format. Starting from the 4:2:2 format, a sampling rate conversion is applied to the chrominance signals, which the standard calls CR and CB. Besides converting the sampling frequency in the vertical direction, it also shifts the "sampling position" by half a line spacing. For the transform, 8*8 picture elements (pixels) are combined into one transform block. With the 4:2:0 sampling format, a chrominance block therefore covers four times the picture area of a luminance block. Four luminance blocks arranged in a square and the two associated chrominance blocks form a macroblock.

3.2.2 The MPEG-2 standard: macroblock structure in the Main Profile
Figure: macroblock made up of the luminance blocks Y (EY), among them blocks 2 and 3, and the chrominance blocks CB (EU), block 4, and CR (EV), block 5.

3.2.3 The MPEG-2 standard
The MPEG-2 standard distinguishes several picture types:
- I-pictures: intraframe-coded pictures
- P-pictures: unidirectionally predicted pictures
- B-pictures: bidirectionally predicted pictures
The order of the picture types within a sequence can largely be chosen by the encoder. Fig. 1 shows an order that is favourable for television applications.
Fig. 1: 1 I, 2 B, 3 B, 4 P, 5 B, 6 B, 7 P, 8 B, 9 B, 10 P, 11 B, 12 B, 13 I

3.2.3 The MPEG-2 standard
I-pictures are coded on their own, without any prediction. They act as "anchor pictures" and give the receiver an entry point for decoding. P-pictures are predicted from a preceding I- or P-picture. The predicted picture is not simply the preceding picture itself, but a picture derived from it by motion estimation. Motion estimation works macroblock by macroblock: the encoder searches for the block of macroblock size that best approximates a macroblock of the current picture. The block movements are transmitted to the receiver as motion vectors. B-pictures are each estimated from a preceding I- or P-picture and a following I- or P-picture, which also requires bidirectionally estimated motion vectors. Using B-pictures noticeably increases the data reduction achieved by the coding.

3.2.4 The MPEG-2 standard
A method frequently used for motion vector estimation is block matching. Example:
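Block matching can be sketched as an exhaustive search. The sketch below assumes the sum of absolute differences (SAD) as the matching criterion and a small square search window; both are common choices but are not specified on the slide, and all function names are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def block_match(cur, ref, bx, by, size, search):
    """Full search over all displacements within +/- search pixels.
    Returns (dx, dy, cost) of the best match."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue  # candidate block would leave the picture
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]


# Reference picture and a current picture whose content moved one pixel to the left
ref = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[y][min(x + 1, 7)] for x in range(8)] for y in range(8)]

motion_vector = block_match(cur, ref, bx=2, by=2, size=4, search=2)
print(motion_vector)  # (1, 0, 0)
```

Real encoders use 16x16 macroblocks and faster search strategies; the exhaustive search is shown only because it is the easiest variant to follow.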

3.2.5 The MPEG-2 standard
Picture prediction requires picture memories, and motion vector estimation needs memory as well. To keep the memory requirement at the receiver as small as possible, the MPEG-2 standard provides for a reordering of the pictures. The picture sequence shown earlier, for example, is reordered for transmission as follows:
- Display order: I B B P B B P B B P B B I
- Transmission order: I P B B P B B P B B I B B
The picture prediction described above decorrelates the values in the difference pictures. Consequently the decorrelating effect of the DCT is small in these pictures compared with I-pictures. This is taken into account in the design of the quantization tables: the Q_I(ξ,η) values are used for I-pictures and the Q_P,B(ξ,η) values for P- and B-pictures. The table of Q_I(ξ,η) values reflects the fact that most of the energy of an I-picture block lies in the "low-frequency" spectral components. I-, P- and B-pictures are divided into blocks, and the DCT spectral coefficients are computed and quantized.
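The reordering rule can be sketched as follows: every B-picture is held back until the next I- or P-picture, on which it depends, has been sent. This is a minimal sketch with a hypothetical function name; it only handles picture types, not the coded data.

```python
def transmission_order(display_order):
    """Reorder picture types from display order to transmission order.

    B-pictures are delayed until the following anchor picture (I or P)
    has been emitted, because they are predicted from it.
    """
    out, pending_b = [], []
    for index, ptype in enumerate(display_order, start=1):
        if ptype == "B":
            pending_b.append((index, ptype))
        else:
            out.append((index, ptype))   # anchor picture goes first
            out.extend(pending_b)        # then the B-pictures that precede it
            pending_b = []
    out.extend(pending_b)                # trailing B-pictures without a later anchor
    return out


order = transmission_order("IBBPBBPBBPBBI")
print("".join(ptype for _, ptype in order))   # IPBBPBBPBBIBB
print([index for index, _ in order])          # [1, 4, 2, 3, 7, 5, 6, 10, 8, 9, 13, 11, 12]
```

With this order the decoder never has to hold more than the two anchor pictures needed for the B-pictures currently being decoded.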

3.2.6 The MPEG-2 standard
Before transmission, the quantized spectral coefficients undergo a redundancy-reducing coding, namely run-length coding followed by "optimal coding" (for example Huffman coding). The spectral values, which like the original values are arranged in a square, are first converted into a one-dimensional sequence. This sequence generally contains more or less long sequences of zeros, which are called runs. Each run is coded in combination with the non-zero value that follows it, the so-called level. Runs of length zero are possible. The run-level combinations are then subjected to "optimal coding" (corresponding to a truncated Huffman code).

3.2.7 The MPEG-2 standard
The spectral coefficients arranged in a square are converted into a one-dimensional sequence by means of a so-called zigzag scan. Such a scan is shown in Fig. 1. It excludes the DC coefficient, which is coded separately as a difference from the DC coefficient of the preceding block.
Fig. 1
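The zigzag scan and the run-level pairing can be sketched together. The code below is a simplified illustration with hypothetical function names: it generates the classic zigzag order, skips the DC coefficient, and emits (run, level) pairs; the subsequent variable-length coding is omitted.

```python
def zigzag_indices(n=8):
    """(row, column) positions of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):                       # s = row + column of a diagonal
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # even diagonals are traversed upwards, odd diagonals downwards
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def run_level_pairs(block):
    """Run-level pairs of the AC coefficients of a quantized block."""
    pairs, run = [], 0
    for r, c in zigzag_indices(len(block))[1:]:      # [1:] skips the DC coefficient
        value = block[r][c]
        if value == 0:
            run += 1                                 # extend the current run of zeros
        else:
            pairs.append((run, value))               # run of zeros, then the level
            run = 0
    return pairs                                     # trailing zeros are not coded


block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 50, -3, 1, 2

print(zigzag_indices()[:6])    # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
print(run_level_pairs(block))  # [(0, -3), (0, 1), (1, 2)]
```

The zigzag order places the low-frequency coefficients first, so the many zero-valued high-frequency coefficients end up in long runs at the end of the sequence.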

3.2.8 The MPEG-2 standard
The overall structure of the MPEG-2 coder is shown in Fig. 1 (for example with P-picture prediction).
Fig. 1: block diagram of the coder. Legend:
- DCT: discrete cosine transform
- DCT^-1: inverse discrete cosine transform
- Q: quantization unit
- Q^-1: recovery of the (quantized) spectral coefficients
- RLC: run-length coding
- VLC: "optimal coding"
- ME: motion estimation
- MC: motion compensation
- P: predictor
- PS: picture memory
- MV: motion vectors
- MUX: multiplexer
- B: buffer
- QF: quantization factor
- QD: quantization data

3.2.9 The MPEG-2 standard
Fig. 1 shows the MPEG-2 decoding circuit.
Fig. 1: block diagram of the decoder with B, MUX^-1, VLC^-1, RLC^-1, Q^-1, DCT^-1 and P; QD and MV are side information. Legend:
- MUX^-1: demultiplexer
- RLC^-1, VLC^-1: operations inverse to RLC and VLC
Because the operations DCT^-1 and Q^-1 are also applied on the coder side, coder and decoder are guaranteed to predict the same values.

3.2.10 The MPEG-2 standard: profiles and levels
Profiles:
- Simple: like Main Profile, without B-pictures
- Main: 4:2:0, no scalability
- SNR Scalable: like Main Profile, plus SNR scalability
- Spatial Scalable: like SNR Profile, plus spatial scalability
- High: like Spatial Profile, plus 4:2:2 coding
Maximum data rates in Mbit/s:
Level (max. picture size)          Simple   Main   SNR Scalable   Spatial Scalable   High
High, 1920*1152 (960*576)          -        80     -              -                  100 (80, 25)
High-1440, 1440*1152 (720*576)     -        60     -              60 (40, 15)        80 (60, 20)
Main, 720*576 (352*288)            15       15     15 (10)        -                  20 (15, 4)
Low, 352*288                       -        4      4 (3)          -                  -

The MPEG-2 standard
The MPEG-2 standard describes a so-called generic code for the whole class of moving-picture coding. It divides this overall domain into feature ranges that are distinguished by "qualitative" and "quantitative" characteristics. In MPEG terminology these ranges are distinguished by profiles and levels. The profile-level table (3.2.10) gives an overview of this structure. The parameter combinations given in the overview are maximum values. The values in parentheses refer to the lower layers of the profile-level combinations for the scalable forms of coding. MPEG-2 coding is designed to be backward compatible: a decoder assigned to one cell of the table must also be able to decode the data streams of the cells to the left of and below that cell.

3.3.1 MPEG-4
Compared with MPEG-2, MPEG-4 offers extended functionality:
- Compression
- Flexibility: no commitment to one particular method; often only the interfaces are standardized
- Instead: a collection of different "tools"
- "Profiles" limit the set of tools that a decoder implements
- Optimal quality for specific applications
As with MPEG-1 and MPEG-2, the decoder and the data stream are standardized. Different manufacturers can therefore build different encoders that deliver different picture qualities with a standard-compliant decoder. For clearer presentation, the encoding process is shown here in places.

3.3.2 MPEG-4
Development of the standard in the Visual area (MPEG-4 standard ISO 14496, Part 2, Visual):
- Version 1: basic MPEG-4, International Standard (IS) May 1999
- Version 2: new tools and profiles, IS Feb. 2000
- Amendment 1, Studio Profile: tools and profiles for studio and production, IS March 2001
- Amendment 2, Streaming Video Profile: further tools and profiles, IS March 2001
- Amendment 3, new levels and tools: IS Dec. 2002

3.3.3 MPEG-4
The data to be transmitted are coded in different ways:
- The scene is divided into "units", the so-called audio-visual objects.
- The composition of the audio-visual objects is stored in the scene graph.
- Audio-visual objects:
  - Video objects: natural (camera) or synthetic (computer-generated)
  - Audio objects: natural (microphone)
- The scene graph ensures the correct spatial and temporal positioning of the media objects.
- It is realized as a directed acyclic graph.

3.3.4 MPEG-4 scene graph
- Each audio-visual object has its own coordinate system.
- The scene graph itself also has a coordinate system.
- By computation a "virtual camera" becomes possible (viewing angle, picture section).
- The audio-visual objects of a scene are ordered hierarchically.
- Leaves: primitive audio-visual objects
  - Still images (for example the background)
  - Video objects (for example a person)
  - Audio objects (for example a voice)
Figure: example scene graph with the nodes scene, person, 2D background, video, voice and picture.
Source: MPEG-4 – Ein Überblick, Paderborn Center for mobile Computing

3.3.5 MPEG-4 coding
Example of the coding of individual audio-visual objects.
Source: Kohtaro Asai and Fuminobu Ogawa: MPEG Coding Technologies, Mitsubishi Electric ADVANCE, December 1998

3.4.1 MPEG-4 Visual data stream
Coding of visual objects in natural and synthetic moving pictures. Hierarchical description of a "visual scene":
- Visual Object Sequence (VS): the complete MPEG-4 scene; it can contain any number of 2D or 3D, natural or synthetic objects and their enhancements.
- Video Object (VO): corresponds to a particular (2D) object in the scene, in the simplest case a rectangular picture, otherwise an object of arbitrary shape.
- Video Object Layer (VOL): each video object can be coded scalably (multi-layer). There is spatial and temporal scalability, from coarse to fine resolution. The decoder can select the suitable scaling.
- Group of Video Object Planes (GOV): combines several video object planes. The start of a GOV can be accessed randomly.
- Video Object Plane (VOP)

Video Object Plane (VOP): a VOP is a temporal sample of a video object. It can be coded independently or, using motion compensation, in dependence on other VOPs. A conventional video picture can be represented by a VOP of rectangular shape.
A natural video object contains the following information:
- Shape
- Motion information
- Texture
- Additional compression by sprite coding. Sprites are time-invariant video objects, larger than the usual picture, that represent large static picture content (background).

Accordingly, the VOPs contain the following in macroblocks, as temporal samples of the video object:
- Shape information
- Motion vectors and motion parameters
- Texture data: the luminance signal and the subsampled chrominance signal, as in MPEG-2 in the 4:2:0 sampling format, for DCT coding in 8x8 blocks

Example of VOP decoding in MPEG-4 without sprites.
Source: Touradj Ebrahimi and Caspar Horne: MPEG-4 Natural Video Coding - An overview, from the Internet

Example of the superposition of several video object planes (VOPs).
Source: Andrew Perkis: MPEG-4 visual, SIE2070 Multimedia signal processing, 4/3/2002

3.5.1 Tools for coding VOPs of natural video objects
- Shape coding tools
  - Binary shape coding
  - Gray scale shape coding
- Motion compensation tools, with optional extensions: overlapped motion compensation, advanced motion compensation
- Texture coding
- Sprites
- Scalability
  - Spatial scalability
  - Temporal scalability
- Error resilience
- ...

3.6.1 Binary shape coding
A matrix (binary alpha map) defines whether a pixel belongs to the object:
- 0: the pixel lies outside the object
- 255: the pixel lies inside the object
Source: MPEG-4 – Ein Überblick, Paderborn Center for mobile Computing

3.6.2 Binary shape coding
- Each VOP is enclosed by a "bounding box".
- The VOP has a binary mask that holds the value 255 for every pixel belonging to the VOP. All other pixels have the value 0.
- The binary mask is divided into "binary alpha blocks" (BAB) of 16x16 pixels, which are coded individually.
- If all pixels have the value 0, the block is transparent and no further coding takes place.
- If all pixels have the value 255, the block is opaque; it belongs to the VOP and is coded.
- BABs are coded by "context-based arithmetic encoding" (CAE) with motion compensation. There are InterCAE and IntraCAE, with and without motion compensation respectively.
Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory, Swiss Federal Institute of Technology, Lausanne
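The division of the binary mask into alpha blocks can be sketched as a simple classification. The function name and the label "boundary" for mixed blocks are assumptions for illustration; the actual CAE coding of boundary blocks is not shown.

```python
def classify_babs(mask, size=16):
    """Classify each size x size block of a binary alpha mask (values 0 or 255).

    Assumes that the mask dimensions are multiples of `size`.
    """
    result = []
    for by in range(0, len(mask), size):
        row = []
        for bx in range(0, len(mask[0]), size):
            values = {mask[y][x]
                      for y in range(by, by + size)
                      for x in range(bx, bx + size)}
            if values == {0}:
                row.append("transparent")   # entirely outside the VOP
            elif values == {255}:
                row.append("opaque")        # entirely inside the VOP
            else:
                row.append("boundary")      # mixed block on the object edge
        result.append(row)
    return result


# Tiny demonstration with 2x2 blocks instead of 16x16
mask = [[0, 0, 255, 255, 0, 255],
        [0, 0, 255, 255, 255, 255]]
print(classify_babs(mask, size=2))  # [['transparent', 'opaque', 'boundary']]
```

Only the mixed blocks carry shape detail, which is why a per-block classification of this kind saves work before the arithmetic coder is applied.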

3.6.3 Gray scale shape coding
This corresponds to binary shape coding, except that each mask value can lie anywhere between 0 (transparent) and 255 (opaque); the intermediate values describe the transparency of the corresponding pixel. Gray scale shape information is coded with a motion-compensated DCT, similar to texture coding.

3.7.1 Motion compensation
The known methods of other video coding standards were adapted to the VOP structure of MPEG-4. There are therefore three types of video object planes (VOP):
- A VOP is coded independently of other VOPs. Such a VOP is called an intra VOP (I-VOP).
- A VOP is coded by motion compensation based on another, previously coded VOP. Such VOPs are called predicted VOPs (P-VOP).
- A VOP is coded by motion compensation based on earlier and future VOPs. Such VOPs are called bidirectional interpolated VOPs (B-VOP). They may only be interpolated from I-VOPs and P-VOPs.

The sketch shows different types of VOPs. If a macroblock of 16x16 pixels lies entirely inside the VOP, motion compensation is based on block matching with the macroblocks, or with blocks of 8x8 pixels in the "advanced prediction mode".
Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory, Swiss Federal Institute of Technology, Lausanne

If a macroblock does not lie entirely inside the VOP, various padding methods are used for the parts outside, together with modified block (polygon) matching methods.
Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory, Swiss Federal Institute of Technology, Lausanne

3.8.1 Texture coding
The texture information of a video object is described by the luminance signal Y and the two chrominance components CR and CB. For I-VOPs, the luminance and chrominance signals contain the sample values directly. For motion-compensated VOPs, they contain only the error signals that remain after motion compensation. The texture information is coded with a conventional 8x8 block-based DCT:
- Interlaced pictures: field DCT coding
- Progressive pictures: frame DCT coding
Overview of the coding process: VOP texture, DCT, quantization, prediction of the coefficients, scan of the coefficients, variable-length coding, data stream.

3.8.2 Texture coding: discrete cosine transform (DCT)
Blocks of 8x8 pixels are laid over the VOP. Blocks inside the VOP are coded without modification. Blocks that lie partly on the boundary of the VOP are treated differently.
Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory, Swiss Federal Institute of Technology, Lausanne

3.8.3 Texture coding
Blocks of 8x8 pixels at the boundary of the VOP contain values from the VOP only in part. The remaining values, which lie outside the VOP, are extrapolated from the values of the VOP. For luminance blocks the extrapolation works on a 16x16 basis, for chrominance blocks on an 8x8 basis. In P- and B-blocks, pixels outside the VOP are filled with the value zero. In I-blocks, the values outside the VOP are filled in a two-step process called low pass extrapolation (LPE).

3.8.4 Texture coding: low pass extrapolation
First, the values outside the VOP are set to the mean of the values inside the VOP. Second, the values near the boundary of the VOP are adjusted by an averaging equation. Only values that lie inside the VOP are used on the right-hand side of that equation; values outside are ignored and the denominator of the fraction is adjusted accordingly. After this extension the boundary blocks are treated exactly like the inner blocks.

3.8.5 Texture coding: quantization
The spectral coefficients obtained from the transform are quantized by dividing them by a value and rounding the results to integers. Several methods exist:
- Division by the values of a quantization matrix, similar to MPEG-2.
- Division by a fixed value.
A non-uniform ("non-linear") quantization of the DC coefficient is possible.

3.8.6 Texture coding: prediction of the spectral coefficients
The spectral coefficients are predicted depending on the differences between the coefficients in blocks A, B and C.
DC prediction: the DC coefficient is predicted from the coefficient of block A or of block C.
AC prediction: either the coefficients of the first row or those of the first column are predicted from the corresponding coefficients of a selected candidate block.
[Figure: DC prediction and AC prediction, each showing the current block with candidate block A, neighbouring block B and candidate block C]
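A possible form of the DC decision is sketched below. The text only states that the choice between A and C depends on the coefficient differences in A, B and C; the block positions (A to the left, B above-left, C above) and the gradient comparison used here are assumptions based on the commonly described MPEG-4 rule, and `predict_dc` is a hypothetical name.

```python
def predict_dc(dc_a, dc_b, dc_c):
    """Choose the DC predictor for the current block.

    Assumed layout: A is the left block, B the above-left block, C the block above.
    """
    # A small difference between A and B suggests little change in the vertical
    # direction around the current block, so block C (above) is used.
    if abs(dc_a - dc_b) < abs(dc_b - dc_c):
        return dc_c, "vertical"
    # Otherwise block A (left) is used.
    return dc_a, "horizontal"
```

Only the difference between the actual DC coefficient and the returned predictor would then need to be coded.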

3.8.7 Texture coding: scan of the coefficients
The coefficients are converted into a one-dimensional data vector in a prescribed order. This is done with the zig-zag scan or with one of two alternative patterns (alternate-horizontal scan and alternate-vertical scan), which are selected depending on the prediction of the DC coefficients.
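The zig-zag order can be generated by walking the anti-diagonals of the block in alternating directions. The sketch below covers only the classic zig-zag pattern, not the two alternate scans; the function names are chosen for illustration.

```python
def zigzag_order(n=8):
    """Return the (row, column) positions of an n x n block in zig-zag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + column indexes one anti-diagonal
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run from bottom-left to top-right, odd ones the other way.
        order.extend(reversed(diagonal) if s % 2 == 0 else diagonal)
    return order


def zigzag_scan(block):
    """Flatten a square block into a one-dimensional list in zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

Because quantized high-frequency coefficients are often zero, this ordering tends to collect the zeros at the end of the vector.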

3.8.8 Texture coding: static textures
MPEG-4 can map static textures, i.e. textures that do not move, onto animated 2-D or 3-D surfaces. For this purpose MPEG-4 provides methods for coding static textures that offer greater scalability than DCT-based coding. The coding is based on a wavelet transform, with the DC and AC components coded differently. The resulting coefficients are quantized and coded with a zero-tree algorithm and arithmetic coding.
5.7 Static texture
One of the functionalities supported by MPEG-4 is the mapping of static textures onto 2-D or 3-D surfaces. MPEG-4 visual supports this functionality by providing a separate mode for encoding static texture information. The static texture coding technique provides a high degree of scalability, more than the DCT-based texture coding technique. The static coding technique is based on a wavelet transform, where the AC and DC bands are coded separately. The wavelet coefficients are quantized and encoded using a zero-tree algorithm and arithmetic coding.
5.7.1 Wavelet
Texture information is separated into subbands by applying a discrete wavelet transform to the data. The inverse discrete wavelet transform is applied to the subbands to synthesize the texture information from the bitstream. The discrete wavelet transform can be applied either in floating point or in integer arithmetic, which is signaled in the bitstream. The discrete wavelet transform is applied recursively to the obtained subbands, yielding a decomposition tree of subbands. An example of a wavelet transform with two decomposition levels is shown in Fig. 10. The original texture is decomposed into four subbands, and the lowest-frequency subband is split again into four subbands. Here, subband 1 represents the lowest spectral band and is called the DC component. The other subbands are called the AC subbands. DC and AC subbands are processed differently.
Figure 10: Illustration of a wavelet transform with two decomposition levels.
5.7.2 DC subband
The wavelet coefficients of the DC subband are treated differently from the other subbands. The coefficients are coded using a predictive scheme. Each coefficient can be predicted from its left or its top neighbor. The choice of the predictor coefficient depends on the magnitude of the horizontal and vertical gradient of the neighboring coefficients. If the horizontal gradient is smallest, prediction from the left neighboring coefficient is performed; otherwise prediction from the top neighboring coefficient is performed. The coefficient is then quantized and encoded using arithmetic coding.
5.7.3 AC subbands
The wavelet coefficients in the remaining subbands are processed in the following way. Typically many coefficients in the remaining subbands become zero after quantization, and the coding efficiency depends heavily on encoding both the value and the location of the non-zero coefficients effectively. The technique used to achieve a high efficiency is based on the strong correlation between the amplitudes of the wavelet coefficients across scales, at the same spatial location, and of similar orientation. Thus, the coefficient at a coarse scale, the parent, and its descending coefficients at a finer scale, the children, exhibit a strong correlation. These relationships are illustrated in Fig. 11.
Figure 11: Parent child relationship of wavelet coefficients
A zero-tree algorithm exploiting such relationships is used to code both coefficient values and locations. The algorithm relies on the fact that if a wavelet coefficient is zero at a coarse scale, it is very likely that its descendant coefficients are also zero, forming a tree of zeros. Zero trees exist at any tree node where the coefficient is zero and all its descendants are also zero. Using this principle, wavelet coefficients in the tree are encoded by arithmetic coding, using a symbol that indicates whether a zero tree exists, and the value of the coefficient.
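The zero-tree condition can be checked recursively. The sketch below assumes one subband orientation stored as a list of 2-D arrays from coarse to fine, with each parent at (r, c) having four children at (2r, 2c) to (2r+1, 2c+1) in the next finer level. That dyadic layout and the name `is_zero_tree_root` are assumptions made for illustration.

```python
def is_zero_tree_root(levels, level, r, c):
    """Return True if the coefficient at (r, c) and all its descendants are zero.

    levels[0] is the coarsest subband; each following level doubles both dimensions.
    """
    if levels[level][r][c] != 0:
        return False
    if level + 1 == len(levels):  # finest scale: no children left to check
        return True
    return all(is_zero_tree_root(levels, level + 1, 2 * r + dr, 2 * c + dc)
               for dr in (0, 1) for dc in (0, 1))
```

An encoder could emit a single "zero tree" symbol for such a node instead of coding every zero descendant separately.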

3.9.1 Sprites
Sprites are background objects that do not move and are transmitted only once. They are transmitted like I-VOPs at the beginning of a bitstream and are then available to the receiver in a memory. When the camera position changes, the receiver computes the currently visible part from the data held in memory. High compression ratios can be achieved.
A sprite consists of those regions of a VO that are present in the scene throughout the video segment. An obvious example is a "background sprite" (also referred to as the "background mosaic"), which would consist of all pixels belonging to the background in a camera-panning sequence. This is essentially a static image that could be transmitted only once, at the beginning of the transmission. Sprites have been included in MPEG-4 mainly because they provide high compression efficiency in such cases. For any given instant of time, the background VOP can be extracted by warping/cropping this sprite appropriately. Sprite-based coding is very suitable for synthetic objects, but can also be used for objects in natural scenes that undergo rigid motion.
Similar to the representation of VOPs, the texture information for a sprite is represented by one luminance component and two chrominance components. The three components are processed separately, but the methods used for processing the chrominance components are the same as those used for the luminance components, after appropriate scaling. Shape and texture information for a sprite is encoded as for an I-VOP. Static sprites are generated, before the encoding process begins, using the original VOPs. The decoder receives each static sprite before the rest of the video segment. The static sprites are encoded in such a way that the reconstructed VOPs can be generated easily, by warping the quantized sprite with the appropriate parameters.
In order to support low-latency applications, several possibilities are envisaged for the transmission of sprites. One way to meet the latency requirements is to transmit only a portion of the sprite in the beginning. The transmitted portion should be sufficient for reconstructing the first few VOPs. The remainder of the sprite is transmitted piece-wise, as required or as the bandwidth allows. Another method is to transmit the entire sprite in a progressive fashion, starting with a low-quality version and gradually improving its quality by transmitting residual images. In practice, a combination of these methods can be used. Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory Swiss Federal Institute of Technology, Lausanne

Scalability
Spatial or temporal scaling is implemented through different video object layers, for example a base layer and an enhancement layer. In the case of spatial scaling, the base layer contains the information about the video object at low resolution, and the enhancement layer contains the additional information needed to increase the resolution.
MPEG-4 uses a generalized scalability framework to enable spatial and temporal scalabilities. This framework allows the inclusion of separate modules, as necessary, to enable the various scalabilities. As shown in Fig. 13, a scalability preprocessor is used to implement the desired scalability. It operates on VOPs. For example, in the case of spatial scalability, the preprocessor down-samples the input VOPs to produce the base-layer VOPs that are processed by the VOP encoder. The midprocessor takes the reconstructed base-layer VOPs and up-samples them. The difference between the original VOP and the output of the midprocessor forms the input to the encoder for the enhancement layer. To implement temporal scalability, the preprocessor separates out the frames into two streams. One stream forms the input for the base-layer encoder, while the other is processed by the enhancement-layer encoder. The midprocessor is bypassed in this case. Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory Swiss Federal Institute of Technology, Lausanne
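The spatial case of this framework can be sketched in a few lines. The sketch leaves out the actual VOP encoders (the base layer is treated as if it were coded losslessly) and uses the simplest possible filters, plain decimation by two and pixel repetition; both choices and all function names are assumptions for illustration.

```python
def downsample(frame):
    """Preprocessor: keep every second pixel in both directions."""
    return [row[::2] for row in frame[::2]]


def upsample(frame):
    """Midprocessor: repeat every pixel twice horizontally and vertically."""
    out = []
    for row in frame:
        wide = [value for value in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out


def encode_layers(frame):
    """Split a frame into a base layer and an enhancement-layer residual."""
    base = downsample(frame)
    prediction = upsample(base)
    enhancement = [[o - p for o, p in zip(orig_row, pred_row)]
                   for orig_row, pred_row in zip(frame, prediction)]
    return base, enhancement


def decode_layers(base, enhancement):
    """Rebuild the full-resolution frame from both layers."""
    prediction = upsample(base)
    return [[p + e for p, e in zip(pred_row, enh_row)]
            for pred_row, enh_row in zip(prediction, enhancement)]
```

A decoder that receives only `base` can still display the low-resolution picture; adding `enhancement` restores the full resolution.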

3.10.2 Spatial scalability
The VOPs of the base layer are coded as described before. The VOPs of the enhancement layer are coded as P-VOPs or B-VOPs.
The base-layer VOPs are encoded in the same way as in the non-scalable case discussed in previous sections. VOPs of the enhancement layer are encoded as P-VOPs or B-VOPs. If a VOP in the enhancement layer is temporally coincident with an I-VOP in the base layer, it could be treated as a P-VOP. VOPs in the enhancement layer that are coincident with P-VOPs in the base layer could be coded as B-VOPs. Since the base layer serves as the reference for the enhancement layer, VOPs in the base layer must be encoded before their corresponding VOPs in the enhancement layer. Figure 14 illustrates an example of how the enhancement layer can be decoded from the base layer using spatial scalability. Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory Swiss Federal Institute of Technology, Lausanne

3.10.3 Spatial scalability
Example of spatial scalability, here with two enhancement layers. Source: Andrew Perkis: MPEG-4 visual, SIE2070 Multimedia signal processing, 4/3/2002

3.10.4 Temporal scalability
Temporal scalability improves the repetition rate of the VOPs: the enhancement layer contains the information to be displayed between the pictures of the base layer. There are two types. Type I: the enhancement layer improves the temporal resolution of only a part of the base layer. Type II: the enhancement layer improves the temporal resolution of the entire base layer.
[Figure: base layer and enhancement layer. Enhancement type I: VOL 0: entire picture, VOL 1: car. Enhancement type II: VO 0 with VOL 0: entire picture and VOL 1: entire picture; VO 1 with VOL 0: car and VOL 1: car. The enhanced region is marked.]
The two enhancement types in MPEG-4 temporal scalability. In enhancement type I, only a selected region of the VOP (i.e. just the car) is enhanced, while the rest (i.e. the landscape) is not. In enhancement type II, enhancement is applicable only at entire VOP level. Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory Swiss Federal Institute of Technology, Lausanne
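The split into two temporal streams can be pictured as follows. This is a minimal sketch that assigns even-numbered pictures to the base layer and odd-numbered ones to the enhancement layer; the alternating pattern and the function names are assumptions, and a real encoder may distribute pictures differently.

```python
def split_temporal(frames):
    """Preprocessor: even-indexed pictures form the base layer, the rest enhance it."""
    return frames[0::2], frames[1::2]


def merge_temporal(base, enhancement):
    """Decoder side: interleave the layers again; missing pictures are just skipped."""
    merged = []
    for index, frame in enumerate(base):
        merged.append(frame)
        if index < len(enhancement):
            merged.append(enhancement[index])
    return merged
```

Decoding only the base layer plays the sequence at half the picture rate, and merging both layers restores the original rate.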

Type I: the enhancement layer improves the temporal resolution of only a part of the base layer. Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory Swiss Federal Institute of Technology, Lausanne

Type II: the enhancement layer improves the temporal resolution of the entire base layer. Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory Swiss Federal Institute of Technology, Lausanne

Example of temporal scalability with two enhancement layers. Source: Andrew Perkis: MPEG-4 visual, SIE2070 Multimedia signal processing, 4/3/2002

3.11.1 Error resilience tools
Especially for transmission over error-prone channels, such as mobile radio channels, MPEG-4 provides error resilience tools that are meant to reduce the visible effects of transmission errors: resynchronization markers, header extension code, data partitioning and reversible VLCs. The figure shows the effect of these tools on the bitstream syntax.
This functionality is important for universal access through error-prone environments, such as mobile communications. MPEG-4 provides several mechanisms to allow error resilience with different degrees of robustness and complexity [4]. These mechanisms are offered by tools providing means for resynchronization, error detection, data recovery and error concealment. There are four error resilience tools in MPEG-4 visual, namely resynchronization, data partitioning, header extension code, and reversible variable length codes.
1. Resynchronization: This is the most frequent way to bring error resilience to a bitstream. It consists of inserting unique markers in the bitstream so that in the case of an error, the decoder can skip the remaining bits until the next marker and restart decoding from that point on. MPEG-4 allows for insertion of resynchronization markers after an approximately constant number of coded bits (video packets), as opposed to MPEG-2 and H.263, which allow for resynchronization after a constant number of coded macroblocks (typically a row of macroblocks). Experiments show that the former is a more efficient way of recovering from transmission errors.
2. Data partitioning: This method separates the bits for coding the motion information from those for the texture information. In the event of an error, a more efficient error concealment may be applied, for instance when the error occurs in the texture bits only, by making use of the decoded motion information.
3. Header extension code: These binary codes allow an optional inclusion of redundant header information that is vital for correct decoding of the video. This way, the chances of corruption of header information and of complete skipping of large portions of the bitstream are reduced.
4. Reversible VLCs: These VLCs further reduce the influence of errors on the decoded data. RVLCs are codewords that can be decoded in the forward as well as the backward direction. In the event of an error and skipping of the bitstream until the next resynchronization marker, it is still possible to decode portions of the corrupted bitstream in reverse order to limit the influence of the error.
Figure 12 summarizes the influence of these tools on the MPEG-4 bitstream syntax. [Figure labels: forward decoding, backward decoding]
Source: Touradj Ebrahimi, MPEG-4 Natural Video Tools, Signal Processing Laboratory Swiss Federal Institute of Technology, Lausanne
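The difference between bit-based and macroblock-based resynchronization can be illustrated by grouping macroblocks into video packets. The sketch below closes a packet as soon as a bit budget is reached; the function name `packetize`, the budget value and the closing rule are illustrative assumptions, not the normative packetization.

```python
def packetize(macroblock_bits, target_bits):
    """Group macroblock indices into video packets of roughly target_bits each.

    A resynchronization marker would be written in front of every packet.
    """
    packets, current, used = [], [], 0
    for index, bits in enumerate(macroblock_bits):
        current.append(index)
        used += bits
        if used >= target_bits:  # budget reached: close this packet
            packets.append(current)
            current, used = [], 0
    if current:  # remaining macroblocks form the last packet
        packets.append(current)
    return packets
```

Busy picture regions that need many bits per macroblock therefore get markers more densely than regions that code cheaply.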

3.12.1 Synthetic video objects
MPEG-4 contains various tools for transmitting synthetic video objects (selection): facial animation; body animation; 2-D meshes; 3-D meshes; view-dependent scalability.
5.2 Synthetic video objects
This section gives an overview of the tools for representing synthetic video objects. Besides the possibilities for representing synthetic objects that the VRML standard already offers, MPEG-4 provides further tools for synthetic video objects. These include tools for representing and animating faces (facial animation) and bodies (body animation). There is also a tool for representing and animating 2-D meshes. The following sections describe these tools.

Facial Animation
Shape, texture and facial expression are conveyed by parameters only. Facial Definition Parameters (FDP): the face is described by 84 FDPs, which the decoder turns into a face. Animation: Facial Animation Parameters (FAP) control the facial expressions.
5.2.1 Face and body animation
MPEG-4 integrates a powerful tool for describing face and body models. Besides specifying shape and texture, the description of a face can also set the facial expression. These settings are determined by the facial definition parameters (FDPs). A face consists of 84 feature points that determine its shape (see Figure 7). The face is transmitted with a neutral expression and can be animated by the FAPs. The FAPs also provide the synchronization with speech output, e.g. lip movement. Figure 7 (image from [4]). Figure 8 (image from [4]). For the representation and animation of bodies, MPEG-4 provides, analogous to facial animation, the Body Definition Parameters (BDP) and Body Animation Parameters (BAP). An example is shown in Figure 8. For further information see [4].
FACE ANIMATION
The face animation part of the standard allows sending parameters that calibrate and animate synthetic faces. The face models themselves are not standardized by MPEG-4, only the parameters are. Figure 2 shows some of the control points for a face. The facial animation object can be used to render an animated face. Facial definition parameters (FDPs) and facial animation parameters (FAPs) control the shape, texture, and expressions of the face. Initially, a face object contains a generic face with a neutral expression. Then, the face is rendered and receives the animation parameters from the bitstream, which produce the animation of the face (expressions, speech, etc.). The systems binary format for scenes provides features to support facial animation. FDPs provide model data that may be downloaded. The data is used to configure a baseline face model stored in the terminal into a particular face before FAP decoding, or to install a specific face model at the beginning of a session along with the information about how to animate it. The face animation table (FAT) provides a functional mapping (which may be downloaded) from incoming FAPs to feature control points in the face mesh. It provides piecewise linear mappings of incoming FAPs for controlling facial movements. The face interpolation technique (FIT) provides a definition (which may be downloaded) of the mapping of incoming FAPs into a total set of FAPs before their application to feature points. This is accomplished by weighted rational polynomial functions invoked by conditional evaluation of a face interpolation graph. It can be used for complex cross coupling of FAPs to link their effects, or to interpolate FAPs missing in the stream using the FAPs that are available in the terminal. Source: MPEG-4 – Ein Überblick, Paderborn Center for mobile Computing

3.12.3 Body Animation
Comparable to face animation: shape, texture and posture of the body are conveyed by parameters only.
BODY ANIMATION
V.2 adds body animation to the face animation. The body is an object capable of producing virtual body models and animations in the form of a set of 3-D polygonal meshes ready for rendering. Two sets of parameters are defined for the body: body definition parameters (BDP) and body animation parameters (BAP). Figure 3 shows the location of the control points in the lower part of the body. The body animation standard has been developed by MPEG in concert with the Humanoid Animation Working Group within the VRML Consortium. The objective is to achieve consistent conventions and control of body models, which are being established by humanoid animation. Source: MPEG-4 – Ein Überblick, Paderborn Center for mobile Computing
[Figure: "Body control points" in the lower parts of the body. Source: Streaming Multimedia – The MPEG-4 Approach, Circuit Cellar, Issue 137, Dec. 2001]

3.12.4 2-D meshes
A 2-D mesh partitions the two-dimensional plane into polygons; MPEG-4 uses triangles only. The texture is laid over the mesh (texture mapping). Animation is achieved by moving the node points.
2-D AND 3-D MESHES
2-D meshes include 2-D Delaunay and regular meshes. The 2-D meshes are used for tracking synthetic objects. [3] A 2-D mesh is a partition of a 2-D planar region into polygonal patches. The vertices of the polygonal patches are referred to as the node points of the mesh. MPEG-4 considers only triangular meshes, where the patches are triangles. A 2-D dynamic mesh refers to the 2-D mesh geometry and the motion information of all mesh node points within a temporal segment of interest. Triangular meshes have long been used for efficient 3-D object shape (geometry) modeling and rendering in computer graphics. 2-D mesh modeling may be considered as the projection of such 3-D triangular meshes onto the image plane. A dynamic mesh is a forward-tracking mesh: the node points of the initial mesh track image features forward in time by their respective motion vectors (see Figure 4). The initial mesh may be regular or may be adapted to the image content, in which case it is called a content-based mesh. 2-D content-based mesh modeling then corresponds to non-uniform sampling of the motion field at a number of salient feature points (node points) along the contour and in the interior of a video object. Methods for selection and tracking of these node points are not subject to standardization.
In 2-D mesh-based texture mapping, triangular patches in the current frame are deformed by the movements of the node points into triangular patches in the reference frame. The texture inside each patch in the reference frame is warped onto the current frame using a parametric mapping, defined as a function of the node point motion vectors. For triangular meshes, the affine mapping is a common choice. Its linear form implies texture mapping with low computational complexity. Affine mappings can model translation, rotation, scaling, reflection and shear, and they preserve straight lines. The degrees of freedom given by the three motion vectors of the vertices of a triangle match the six parameters of the affine mapping. This implies that the original 2-D motion field can be compactly represented by the motion of the node points, from which a continuous, piecewise affine motion field can be reconstructed.
Now that I've covered 2-D meshes, let's move on to 3-D meshes. MPEG-4 V.2 provides a suite of tools for coding 3-D polygonal meshes. It compresses the connectivity, geometry, and properties such as shading normals, colors, and texture coordinates of 3-D polygonal meshes. Coding of generic 3-D polygonal meshes enables the efficient encoding of 3-D polygonal meshes. The incremental representation enables the reconstruction of a number of faces in a mesh proportional to the number of bits in the bitstream that have been processed. The error resilience enables a decoder to partially recover a mesh when subsets of the bitstream are missing or corrupted. And the level of detail (LOD) scalability enables reconstruction of a simplified version of the original mesh using a subset of the bitstream. Source: MPEG-4 – Ein Überblick, Paderborn Center for mobile Computing
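The match between three vertex motion vectors and six affine parameters can be made concrete. The sketch below assumes the parametrization x' = a1·x + a2·y + a3 and y' = a4·x + a5·y + a6 and solves the two 3x3 linear systems with Cramer's rule; the parameter names and both function names are chosen for illustration only.

```python
def affine_from_triangle(src, dst):
    """Derive the six affine parameters that map triangle src onto triangle dst.

    src and dst are lists of three (x, y) node points; dst is src displaced by
    the node point motion vectors.
    """
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if det == 0:
        raise ValueError("degenerate triangle: node points are collinear")

    def solve(v1, v2, v3):
        # Cramer's rule for a*x + b*y + c = v at the three node points.
        a = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        b = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        c = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return a, b, c

    return solve(*(p[0] for p in dst)), solve(*(p[1] for p in dst))


def warp_point(params, point):
    """Apply the affine mapping to one point inside the triangle."""
    (a1, a2, a3), (a4, a5, a6) = params
    x, y = point
    return a1 * x + a2 * y + a3, a4 * x + a5 * y + a6
```

Every texture pixel inside a patch can then be mapped with `warp_point`, so only the node point motion has to be transmitted.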

3.12.5 View-dependent scalability
Used in 3-D worlds; enables the transmission of textures. Only the visible information is transmitted, which saves bandwidth. The computation takes place on both the encoder and the decoder side, and a back channel is required.

Profiles
Most applications need only a part of the many tools defined in the MPEG-4 standard. The coding tools are therefore grouped into profiles. These profiles are tailored to particular applications, which reduces the implementation effort in encoder and decoder. Each profile can have several levels that limit the complexity, e.g. in terms of picture size, number of video objects or memory requirements. Types of profiles: scene description; object descriptors; audio (natural and synthetic): object types; visual (natural and synthetic): object types; graphics.

Profiles
Example of the handling of profiles in an MPEG-4 encoder/decoder. Profiles are defined in terms of video object types. Source: Olivier Avaro: Overview of MPEG-4 Developments, Davic European Information Day, Geneva, 12 January 1999

3.13.3 Profiles for natural video
Selection of video tools for some video object types. "Simple": rectangular video objects only (no shape processing). Source: Touradj Ebrahimi and Caspar Horne: MPEG-4 Natural Video Coding - An overview, from the Internet

Selection of video object types that can be handled within a selection of profiles. Source: Touradj Ebrahimi and Caspar Horne: MPEG-4 Natural Video Coding - An overview, from the Internet

Important intended applications of the profiles:
Simple: simple, error-protected transmission of rectangular pictures, e.g. for transmission in mobile radio systems.
Simple Scalable: applications that need scalability, e.g. because the available bandwidth varies or because computing power is limited: Internet or software decoding.
Core: transmission of arbitrarily shaped picture objects, for simple interactivity, e.g. Internet multimedia applications.
Main Visual: for interactive, television and DVD applications.

Further additions through new tools and profiles:
Advanced Simple: improved coding methods compared with the Simple profile.
Simple Studio: high picture quality, intended for picture editing in the studio, I-VOPs only, data rate up to 2 GBit/s.
Core Studio: like Simple Studio, but additionally with P-VOPs, which allow better data reduction at the cost of higher implementation effort.
Core Scalable: like Core, but additionally with scalability. Used on the Internet, in mobile radio transmission and for broadcasting.
Advanced Real-Time Simple: for real-time applications such as video telephony and teleconferencing.
Advanced Coding Efficiency: with all new methods for improved coding, for mobile broadcast transmission, camcorders etc.
Fine Granularity Scalable: several scalability methods that allow the bitstream to follow rapidly changing transmission rates more closely (streaming video).

Levels
Levels specify parameter constraints for the bitstream. The following table lists some levels for the most important profiles. The column "Total mblk memory" gives the amount of memory needed for decoding, in macroblock units.
A level within a profile defines constraints on parameters in the bitstream that relate to the tools of that profile. Currently there are 11 natural video profile and level definitions, each of which constrains about 15 parameters. To provide some insight into the levels, a subset of level constraints is given in Tab. 3 for the three most important profiles: core, simple, and main. The macroblock memory size is the bound on the memory (in macroblock units) that can be used by the Video reference Memory Verifier (VMV) algorithm. This algorithm models the pixel memory needed by the entire visual decoding process. Source: Touradj Ebrahimi and Caspar Horne: MPEG-4 Natural Video Coding - An overview, from the Internet

Data rate
The data rate can be set both for applications with variable data rate (VBR, variable bitrate) and for applications with fixed data rate (CBR, constant bitrate). This requires an algorithm that adjusts the quantization of the quantities to be transmitted. VBR applications try to achieve the best picture quality for a given bitrate. CBR applications must respect the given delay and the given size of the buffer that stores the irregularly produced data bits before they are sent over the transmission channel at a constant data rate. A comparable buffer exists on the decoder side. When the buffer is about to overflow, the quantization is made coarser so that less data is produced. When the buffer is about to run empty, the quantization is made finer so that more data is produced.
An important conformance point is the maximum bitrate, or the maximum size of the decoder bitstream buffer. Therefore, while encoding a video scene, rate control and buffer regulation algorithms are important building blocks in the implementation of the MPEG-4 video standard. In a variable bitrate (VBR) environment, the rate control scheme attempts to achieve optimum quality for the decoded scene given a target bitrate. In constant bitrate (CBR) applications, the rate controller has to meet the constraints of fixed latency and buffer size. To meet these requirements, the rate control algorithm controls the quantization parameters. As a guideline to implementers, the MPEG-4 video standard describes a possible implementation of a rate control algorithm. The algorithm uses a Scalable Rate Control scheme (SRC) that can satisfy both VBR and CBR requirements. The SRC is based on the assumption that the rate-distortion function can be modeled by the following equation:
$R = X_1 \frac{S}{Q} + X_2 \frac{S}{Q^2}$
where $R$ is the rate, $X_1$ and $X_2$ are modeling parameters, $S$ is a measure of activity in the frame, and $Q$ is the quantization parameter. The first and second order coefficients, $X_1$ and $X_2$, are initialized at the beginning of the process and updated based on the encoding results of each frame. The quantization parameter is computed based on this equation. This scheme achieves frame rate control for both CBR and VBR cases.
A rate control algorithm for multiple video objects is derived from this algorithm by using a bit allocation table based on Human Visual Sensitivity (HVS) of color tolerance. This results in a bit allocation for each object. Next, the number of coded bits per block is estimated based on block variance classification and a bits estimation model. This results in a reference quantization parameter. The object is encoded using this reference parameter. While encoding the object, small adjustments to the reference parameter can be made depending on the possible deviation of the predicted bits from the actual bits. If the number of actual bits produced is much higher than the allocated bit budget, a frame skip parameter is computed that allows several instances of the video object to be skipped to reduce the current buffer level.
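Computing the quantization parameter from such a model amounts to solving a quadratic equation. The sketch below assumes the commonly cited quadratic form R = X1·S/Q + X2·S/Q² with positive activity and non-negative X2, and clips the result to the range 1 to 31; the function name, the argument names and the clipping range are assumptions for illustration.

```python
import math


def solve_quantizer(target_bits, x1, x2, activity, q_min=1, q_max=31):
    """Solve target_bits = x1*activity/Q + x2*activity/Q**2 for Q.

    Assumes activity > 0 and x2 >= 0; the result is clipped to [q_min, q_max].
    """
    if target_bits <= 0:
        return q_max  # no bits available: use the coarsest quantizer
    if x2 == 0:
        q = x1 * activity / target_bits  # the model degenerates to first order
    else:
        # Substitute u = 1/Q: (x2*S)*u**2 + (x1*S)*u - R = 0, take the positive root.
        a = x2 * activity
        b = x1 * activity
        u = (-b + math.sqrt(b * b + 4 * a * target_bits)) / (2 * a)
        q = 1 / u
    return min(max(q, q_min), q_max)
```

A smaller bit budget yields a larger Q, i.e. coarser quantization, which matches the buffer behaviour described above.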

H.264/AVC
In the early 1990s, the first video compression standards were introduced: H.261 (1990) and H.263 (1995) from ITU; MPEG-1 (1993) and MPEG-2 (1996) from ISO. Since then, the technology has advanced rapidly: H.263 was followed by H.263+, H.263++ and H.26L; MPEG-1/2 were followed by MPEG-4 visual. But industry and research coders are still way ahead. H.264/AVC is a joint project of ITU and ISO to create an up-to-date standard.
Sources: Gidon Shavit: The H.264/AVC Video Coding Standard, based on material from IEEE Transactions on Circuits and Systems for Video Technology, July 2003, homes/gidon/presentations/h264.ppt; Tom McMahon: The Emerging H.264/AVC Video Coding Standard; Dr. Pankaj Topiwala: H.264: The Next Revolution?; Mohsen Saneei: H-264, ece.ut.ac.ir/classpages/Multimedia/h264.ppt

3.14.2 H.264/AVC - Scope and Context
Aimed at providing high-quality compression for various services:
- IP streaming media ( kbps)
- SDTV and HDTV broadcast and video-on-demand ( Mbps)
- DVD
- Conversational services (<1 Mbps, low latency)
The standard defines:
- Decoder functionality (but not the encoder)
- File and stream structure
Final results:
- 2-fold improvement in compression
- Same fidelity at half the size compared to H.263 and MPEG-2
Names: MPEG-4 Part 10 (official MPEG term), ISO/IEC AVC, H.264 (official ITU term)

3.14.3 H.264/AVC - New features
New features in H.264:
- Motion compensation and intra-prediction
- Image transform
- Deblocking filters
- Entropy coding
- Frames and slices

3.14.4 H.264/AVC - Changes in Motion Compensation
- Quarter-pixel accuracy: a gain of 1.5-2 dB across the board over ½-pixel accuracy
- Variable block size: every 16x16 macroblock can be subdivided, and each sub-block is predicted separately
- Multiple and arbitrary reference frames, vs. only the previous frame (H.263) or the previous and next frames (MPEG)
- Anti-aliasing sub-pixel interpolation: removes some common artifacts in the residual

3.14.5 H.264/AVC - Variable Block-Size MC
Motivation: the size of moving and stationary objects is variable
- Many small blocks may take too many bits to encode
- Few large blocks give poor prediction
In H.264/AVC, each 16x16 macroblock may be:
- Kept whole,
- Divided horizontally (vertically) into two sub-blocks of size 16x8 (8x16),
- Divided into 4 sub-blocks of size 8x8
In the last case, each of the 4 sub-blocks may be divided once more into 2 or 4 smaller blocks (8x4, 4x8, or 4x4).
Bit rate savings of more than 15% compared to using only a 16x16 block size.
Figure: macroblock (MB) types 16x16, 16x8, 8x16, 8x8 and 8x8 sub-block types 8x8, 8x4, 4x8, 4x4; motion vector accuracy 1/4 pixel
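
The partitioning rules can be made concrete with a small Python sketch. It only enumerates block shapes; the mode names and the helpers `partition_blocks` and `count_layouts` are hypothetical and chosen for this illustration, they are not syntax elements of the standard.

```python
# Hypothetical mode tables: mode name -> list of (width, height) blocks.
MB_MODES = {
    "16x16": [(16, 16)],
    "16x8": [(16, 8)] * 2,
    "8x16": [(8, 16)] * 2,
}
SUB_MODES = {
    "8x8": [(8, 8)],
    "8x4": [(8, 4)] * 2,
    "4x8": [(4, 8)] * 2,
    "4x4": [(4, 4)] * 4,
}


def partition_blocks(mb_mode, sub_modes=("8x8",) * 4):
    """Return the motion-compensated blocks of one macroblock."""
    if mb_mode in MB_MODES:
        return list(MB_MODES[mb_mode])
    if mb_mode != "8x8" or len(sub_modes) != 4:
        raise ValueError("unknown macroblock mode or wrong number of sub-modes")
    blocks = []
    for sub in sub_modes:  # each 8x8 quadrant picks its own sub-mode
        blocks.extend(SUB_MODES[sub])
    return blocks


def count_layouts():
    """Number of distinct ways to partition one macroblock."""
    return len(MB_MODES) + len(SUB_MODES) ** 4


print(len(partition_blocks("8x8", ["4x4"] * 4)))  # blocks, one motion vector each
print(count_layouts())
```

The finest layout yields sixteen 4x4 blocks, which shows why many small blocks cost bits: each block carries its own motion vector.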

3.14.6 H.264/AVC - Variable Block Sizes

3.14.7 H.264/AVC - Motion Scale Example

3.14.10 H.264/AVC – Variable Block Size Example
Figure: example frames at T=1 and T=2

3.14.11 H.264/AVC – Arbitrary Reference Frames
- In H.263, the reference frame for prediction is always the previous frame
- In MPEG and H.26L, some frames are predicted from both the previous and the next frames (bi-prediction)
- In H.264/AVC, any one frame may be used as reference:
  - Encoder and decoder maintain synchronized buffers of available (previously decoded) frames
  - The reference frame is specified as an index into this buffer
- In bi-predictive mode, each macroblock may be:
  - Predicted from one of the two references
  - Predicted from both, using a weighted mean of the predictors
Figure: Multiple reference frames for motion compensation

3.14.12 H.264/AVC – Intra Prediction
Motivation: intra-frames are natural images, so they exhibit strong spatial correlation
- Implemented to some extent in H and MPEG-4, but in the transform domain
- Macroblocks in intra-coded frames are predicted based on previously coded ones, located above and/or to the left of the current block
- The macroblock may be divided into 16 4x4 sub-blocks, which are predicted in cascading fashion
- An encoded parameter specifies which neighbors should be used to predict, and how

3.14.13 H.264/AVC – Intra Prediction
For I macroblocks:
- For luminance samples: 4x4 prediction process or 16x16 prediction process
- For chrominance samples: 8x8 prediction process

3.14.14 H.264/AVC – 4x4 Intra Prediction Modes
9 types for luminance signal

For example, in Mode 3 (Diagonal-Down-Left prediction) the values of a to p are given as follows:
- a = (A+2B+C+2)/4
- b, e = (B+2C+D+2)/4
- c, f, i = (C+2D+E+2)/4
- d, g, j, m = (D+2E+F+2)/4
- h, k, n = (E+2F+G+2)/4
- l, o = (F+2G+H+2)/4
- p = (G+3H+2)/4
Mode 2 (DC): predict all pixels from (A+B+C+D+I+J+K+L+4)/8, or from (A+B+C+D+2)/4 or (I+J+K+L+2)/4 when only the upper samples (A-D) or only the left samples (I-L) are available
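
The two modes above can be written out as a short Python sketch. It assumes the usual labeling, which the slide figure presumably showed: A to H are the eight reconstructed samples above and above-right of the block, I to L are the four samples to its left, and a to p are the 4x4 block samples in row-major order. The divisions are implemented as integer right shifts, and the fallback value 128 for the case without any neighbors is an assumption for 8-bit video. The function names are chosen for this example.

```python
def predict_dc(top=None, left=None):
    """Mode 2 (DC): fill the 4x4 block with the rounded mean of the neighbors."""
    if top is not None and left is not None:
        dc = (sum(top[:4]) + sum(left[:4]) + 4) >> 3
    elif top is not None:
        dc = (sum(top[:4]) + 2) >> 2
    elif left is not None:
        dc = (sum(left[:4]) + 2) >> 2
    else:
        dc = 128  # assumption: mid-gray when no neighbor is available
    return [[dc] * 4 for _ in range(4)]


def predict_diag_down_left(top8):
    """Mode 3: every sample on an anti-diagonal (x + y constant) shares one value."""
    t = top8  # t[0..7] correspond to A..H
    pred = [[0] * 4 for _ in range(4)]
    for y in range(4):
        for x in range(4):
            i = x + y
            if i == 6:  # sample p uses the special rule (G + 3H + 2) / 4
                pred[y][x] = (t[6] + 3 * t[7] + 2) >> 2
            else:
                pred[y][x] = (t[i] + 2 * t[i + 1] + t[i + 2] + 2) >> 2
    return pred


for row in predict_diag_down_left([10, 20, 30, 40, 50, 60, 70, 80]):
    print(row)
```

Indexing by x + y makes the grouping in the list above visible: b and e share a value, as do c, f, and i, and so on.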

3.14.17 H.264/AVC - Intra Prediction
Intra 16x16 luminance (for smooth macroblocks containing little detail) and 8x8 chrominance prediction modes:
- Mode 0 (Vertical)
- Mode 1 (Horizontal)
- Mode 2 (DC)
- Mode 3 (Plane): a linear “plane” function is fitted to the upper and left-hand samples H and V
If any of the 8x8 blocks in the luminance component are coded in intra mode, both chrominance blocks (Cr, Cb) are also intra coded.

3.14.18 H.264/AVC - Intra Prediction

3.14.19 H.264/AVC - Image Transform
Motivation:
- The DCT requires real-number operations, which may cause inaccuracies in inversion
- Better motion compensation means less spatial correlation, so there is no need for an 8x8 transform
H.264 uses a very simple integer 4x4 transform:
- A (pretty crude) approximation to the 4x4 DCT
- The transform matrix contains only +/-1 and +/-2
- Can be computed with only additions, subtractions, and shifts
- Results show negligible loss in quality (~0.02 dB)
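
A minimal Python sketch of such a transform follows. It uses the core matrix with rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1), (1, -2, 2, -1), applied first to the rows and then to the columns of a 4x4 block. The scaling that a full codec folds into quantization is left out, and the function names are chosen for this example.

```python
def forward_1d(a, b, c, d):
    """One 4-point transform using only additions, subtractions, and shifts."""
    s0, s1 = a + d, b + c
    d0, d1 = a - d, b - c
    return (s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1))


def forward_4x4(block):
    """Apply the 1-D transform to every row, then to every column."""
    rows = [forward_1d(*row) for row in block]
    cols = [forward_1d(*col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order


flat = [[5] * 4 for _ in range(4)]
print(forward_4x4(flat))  # all energy ends up in the DC coefficient
```

Because every operation is an exact integer operation, encoder and decoder compute bit-identical results, which is the point of avoiding a real-valued DCT.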

Exact-match simplified transform:
- Based primarily on a 4x4 transform (all prior standards: 8x8)
- Requires only 16-bit arithmetic (including intermediate values)
- Easily extensible to bits per component

Transform in the “baseline” profile of H.264/AVC:
- 4x4 block transform for residual data
- 4x4 transform of luminance DC coefficients in intra macroblocks (intra predicted in 16x16 mode)
- 2x2 transform of chrominance DC coefficients (in any macroblock)
Figure: luminance signal

- In “adaptive block size transform” mode, further transforms are chosen depending on the motion compensation block size (4x8, 8x4, 8x8, 16x8, etc.)
- The transform operates on 4x4 blocks of residual data after motion-compensated prediction or intra prediction

3.14.23 H.264/AVC - Deblocking Filters
Motivation: block-based MC and transforms generate blocking artifacts
- Very visible to the human eye at low bit rates
- Previous standards applied simple filters to “smudge” edges between blocks
H.264/AVC adaptively chooses for each edge which one of 5 deblocking filters to apply. For instance, if both blocks have the same motion vector, less filtering is needed.
Improves objective quality as well: about 7-9% reduction in bit rate for the same PSNR.

3.14.24 H.264/AVC - Deblocking Filters
Figure: 1) without filter, 2) with H.264/AVC deblocking

H.264/AVC - Entropy Coding
Motivation: traditional coders use fixed variable-length codes
- Essentially Huffman-style codes
- Non-adaptive
- Can't encode symbols with probability > 0.5 efficiently, since at least one bit is required per symbol
H.263 Annex E defines an arithmetic coder
- Still non-adaptive
- Uses multiple non-binary alphabets, which results in high computational complexity
H.264/AVC uses two entropy coding methods:
- CAVLC (context-adaptive variable-length coding) together with UVLC (universal variable-length coding)
- CABAC (context-adaptive binary arithmetic coding)
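
The inefficiency of variable-length codes for very probable symbols can be quantified with a small worked example. The Python sketch below considers the simplest case, a source with only two symbols, and compares its entropy with the one bit per symbol that any variable-length code must spend. The function names are chosen for this illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits per symbol of a two-symbol source with P(symbol) = p."""
    if not 0.0 < p < 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def vlc_overhead(p):
    """Bits per symbol wasted by a code that needs at least one bit per symbol."""
    return 1.0 - binary_entropy(p)


for p in (0.5, 0.9, 0.99):
    print(p, round(binary_entropy(p), 3), round(vlc_overhead(p), 3))
```

At p = 0.5 nothing is wasted, but the overhead grows quickly as one symbol dominates. An arithmetic coder can spend a fraction of a bit on such a symbol, which is the motivation for CABAC.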

3.14.26 H.264/AVC - Entropy Coding: CABAC
- Arithmetic coding framework designed specifically for H.264
- Binarization: all syntax symbols are translated to bit strings
- 399 predefined context models, used in groups
  - E.g. models used to code macroblock type for inter-frames
  - The model to use next is selected based on previously coded information (the context)
  - Each context model adapts to the empirical distribution
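
The effect of context selection and adaptation can be illustrated with a toy model in Python. This sketch is not the CABAC state machine: it assumes a simple count-based probability estimate per context, uses the previous bit as the only context, and reports the ideal arithmetic-coding cost of -log2(p) bits per coded bit. The class and function names are hypothetical.

```python
import math


class AdaptiveBinaryModel:
    """Count-based probability estimate that adapts to the bits it has seen."""

    def __init__(self):
        self.counts = [1, 1]  # start with equal pseudo-counts for 0 and 1

    def cost_and_update(self, bit):
        p = self.counts[bit] / sum(self.counts)
        self.counts[bit] += 1
        return -math.log2(p)  # ideal code length of this bit


def coded_length(bits, use_context):
    """Total ideal code length, with or without the previous bit as context."""
    models = {}
    prev = 0
    total = 0.0
    for bit in bits:
        ctx = prev if use_context else 0
        model = models.setdefault(ctx, AdaptiveBinaryModel())
        total += model.cost_and_update(bit)
        prev = bit
    return total


bits = [0, 1] * 50  # strictly alternating, so the previous bit predicts the next
print(round(coded_length(bits, use_context=False), 1))
print(round(coded_length(bits, use_context=True), 1))
```

Without context the alternating sequence looks random and costs about one bit per symbol. With the previous bit as context each model sees an almost deterministic source and the cost drops sharply.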

3.14.27 H.264/AVC - Frames and Slices
In H.263 and MPEG, each frame is either inter (P-frame) or intra (I-frame). Exception: some macroblocks in P-frames may be intra-coded, and are called I-blocks.
H.264/AVC generalizes this: each frame consists of one or more slices
- Contiguous groups of macroblocks
- Processed in internal raster order
- Each is independently encoded and decoded
- I-slices, P-slices, B-slices (two reference frames)
Figure: frame divided into Slice 0, Slice 1, Slice 2

3.14.28 H.264/AVC vs MPEG-2, Low Rate
Foreman, QCIF, 30 fps: MPEG-2 at 203 kbps, H.264 at 39 kbps!

3.14.29 Quick Look at WM9
- WM4 onward were variants of MPEG-4
- WM9 is a variant of H.264/AVC:
  - Fast, 16-bit integer transform
  - Sophisticated motion estimation
  - Scalar quantization
  - Adaptive VLC (no CABAC)
- This would give it performance similar to H.264/AVC
- Might expect ~10% less since there is no CABAC
- In our tests, they appear comparable
- WM9 performed well in tests at the DVD Forum. But WM9 may have used optimized, tuned, 2-pass encoding and prefiltering, while H.264/AVC used the 1-pass reference software (no 2-pass, tuning, filtering, or optimization).

3.14.30 Profiles
Table: feature support (yes/no) for the profiles Baseline, Extended, Main, High, High 10, and High 4:2:2; the per-profile entries are not recoverable from this transcript. Features compared:
- I and P slices
- B slices
- SI and SP slices
- Multiple reference frames
- In-loop deblocking filter
- CAVLC entropy coding
- CABAC entropy coding
- Flexible macroblock ordering (FMO)
- Arbitrary slice ordering (ASO)
- Redundant slices (RS)
- Data partitioning
- Interlaced coding (PicAFF, MBAFF)
- 4:2:0 chroma format
- 4:2:2 chroma format
- 4:4:4 chroma format
- 8-bit sample depth

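3.14.31 Profiles
Table (continued): feature support (yes/no) for the profiles Baseline, Extended, Main, High, High 10, and High 4:2:2; the per-profile entries are not recoverable from this transcript. Features compared:
- 9- and 10-bit sample depth
- 11- and 12-bit sample depth
- 8x8 vs. 4x4 transform adaptivity
- Quantization scaling matrices
- Separate Cb and Cr QP control
- Monochrome video format
- Residual color transform
- Predictive lossless coding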

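3.14.32 Levels and Profiles
Table columns: level; macroblocks per frame; macroblocks per second; example resolutions / frame rates for this level; video bit rates (VCL) across the profiles Baseline, Extended, Main, High, High 10, High 4:2:2, High 4:4:4
- Level 1: 99 macroblocks per frame; 1,485 macroblocks per second; 176 × 144 / 15, 128 × 96 / 30; 64 kbit/s, 80 kbit/s, 192 kbit/s, 256 kbit/s
- Level 1b: 128 kbit/s, 160 kbit/s, 384 kbit/s, 512 kbit/s
- Level 1.1: 396 macroblocks per frame; 3,000 macroblocks per second; 352 × 288 / × 240 / 10, 176 × 144 / 30; 240 kbit/s, 576 kbit/s, 768 kbit/s
- Level 1.2: 6,000 macroblocks per second; 352 × 288 / 15, 320 × 240 / 20, 176 × 144 / 60; 480 kbit/s, 1152 kbit/s, 1536 kbit/s
- Level 1.3: 11,880 macroblocks per second; 352 × 288 / 30, 320 × 240 / 40; 960 kbit/s, 2304 kbit/s, 3072 kbit/s
- Level 2: 2 Mbit/s, 2.5 Mbit/s, 6 Mbit/s, 8 Mbit/s
- Level 2.1: 792 macroblocks per frame; 19,800 macroblocks per second; 352 × 576 / 25, 352 × 288 / 50; 4 Mbit/s, 5 Mbit/s, 12 Mbit/s, 16 Mbit/s
- Level 2.2: 1,620 macroblocks per frame; 20,250 macroblocks per second; 720 × 480 / 15, 352 × 288 / 50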

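3.14.33 Levels and Profiles
Table columns: level; macroblocks per frame; macroblocks per second; example resolutions / frame rates for this level; video bit rates (VCL) across the profiles Baseline, Extended, Main, High, High 10, High 4:2:2, High 4:4:4
- Level 3: 1,620 macroblocks per frame; 40,500 macroblocks per second; 720 × 576 / 25, 720 × 480 / 30; 10 Mbit/s, 12.5 Mbit/s, 30 Mbit/s, 40 Mbit/s
- Level 3.1: 3,600 macroblocks per frame; 108,000 macroblocks per second; 1280 × 720 / 30, 720 × 576 / 60; 14 Mbit/s, 17.5 Mbit/s, 42 Mbit/s, 56 Mbit/s
- Level 3.2: 5,120 macroblocks per frame; 216,000 macroblocks per second; 1280 × 1024 / × 720 / 60; 20 Mbit/s, 25 Mbit/s, 60 Mbit/s, 80 Mbit/s
- Level 4: 8,192 macroblocks per frame; 245,760 macroblocks per second; 1920 × 1080 / × 720 / 60
- Level 4.1: 50 Mbit/s, 62.5 Mbit/s, 150 Mbit/s, 200 Mbit/s
- Level 4.2: 8,704 macroblocks per frame; 522,240 macroblocks per second; 2048 × 1080 / 60
- Level 5: 22,080 macroblocks per frame; 589,824 macroblocks per second; 3072 × 1620 / × 1080 / 60; 135 Mbit/s, 168.75 Mbit/s, 405 Mbit/s, 540 Mbit/s
- Level 5.1: 36,864 macroblocks per frame; 983,040 macroblocks per second; 4096 × 2304 / × 1080 / 100; 240 Mbit/s, 300 Mbit/s, 720 Mbit/s, 960 Mbit/s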
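
The macroblock figures in the level tables follow directly from the picture size, since a macroblock covers 16x16 luminance samples. The Python sketch below recomputes them for a few formats and checks them against a small excerpt of the limits listed above. It is a simplified check under the assumption that only the two macroblock limits matter; real level conformance involves further constraints such as bit rate and buffer sizes. The names `LEVEL_LIMITS`, `macroblock_load`, and `fits_level` are chosen for this example.

```python
# Excerpt of the table: level -> (max macroblocks per frame, max macroblocks per second)
LEVEL_LIMITS = {
    "1": (99, 1485),
    "3": (1620, 40500),
    "3.1": (3600, 108000),
    "4": (8192, 245760),
}


def macroblock_load(width, height, fps):
    """Macroblocks per frame and per second; partial macroblocks count as whole."""
    per_frame = ((width + 15) // 16) * ((height + 15) // 16)
    return per_frame, per_frame * fps


def fits_level(level, width, height, fps):
    """Check only the two macroblock limits of the given level."""
    max_frame, max_rate = LEVEL_LIMITS[level]
    per_frame, per_second = macroblock_load(width, height, fps)
    return per_frame <= max_frame and per_second <= max_rate


print(macroblock_load(720, 576, 25))       # SD video as used in this seminar
print(fits_level("3", 720, 576, 25))
print(fits_level("3", 1280, 720, 30))
```

The SD source material of 720 x 576 samples at 25 frames per second lands exactly on the Level 3 limits, while 1280 x 720 at 30 frames per second needs Level 3.1.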

3.14.34 AVCHD
Subtypes: High Definition (AVCHD-HD), Standard Definition (AVCHD-SD)
- Frame size in pixels: 1920×1080, 1440×1080, 1280×720 (HD); 720×480, 720×576 (SD)
- Frame rate: 29.97 interlaced; 25 interlaced; 23.976 progressive; 59.94 progressive; 50 progressive
- Frame aspect ratio: 16:9 (HD); 4:3 or 16:9 (SD)
- Video compression: MPEG-4 AVC/H.264
- Luminance sampling frequency: 74.25 MHz, 55.7 MHz, 74.25 MHz, 13.5 MHz
- Chroma sampling format: 4:2:0
- Quantization: 8 bits (both luminance and chrominance)
- Stream type: MPEG transport stream
- System data rate: up to 18 Mbit/s (DVD media), up to 24 Mbit/s (all other media)

3.14.35 AVCHD 2.0
Subtypes: AVCHD Progressive, AVCHD 3D (stereoscopic)
- Frame size in pixels: 1440×1080, 1920×1080, 1280×720
- Frame rate: 59.94 progressive; 50 progressive; 23.976 progressive; 25 interlaced; 29.97 interlaced
- Frame aspect ratio: 16:9
- Video compression: MPEG-4 AVC/H.264
- Luminance sampling frequency: 111.4 MHz, 148.5 MHz, 74.25 MHz
- Chroma sampling format: 4:2:0
- Quantization: 8 bits (both luminance and chrominance)
- Stream type: MPEG transport stream
- System data rate: up to 28 Mbit/s

3.15.1 High Efficiency Video Coding (HEVC)
- MPEG-H Part 2 (ISO/IEC ), ITU-T Recommendation H.265
- Key issues addressed: increased video resolution and increased use of parallel processing architectures
- The video coding layer of HEVC employs the same hybrid approach (inter-/intrapicture prediction and 2-D transform coding) used in all video compression standards since H.261.

Figure: Typical HEVC video encoder
Source: Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, Thomas Wiegand: Overview of the High Efficiency Video Coding (HEVC) Standard, IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, December 2012

Each picture is split into block-shaped regions, with the exact block partitioning being conveyed to the decoder. The first picture of a video sequence is coded using only intrapicture prediction, with no dependence on other pictures. For all remaining pictures of a sequence or between random access points, interpicture temporally predictive coding modes are typically used for most blocks. The encoding process for interpicture prediction consists of choosing motion data comprising the selected reference picture and motion vector (MV) to be applied for predicting the samples of each block. The encoder and decoder generate identical interpicture prediction signals by applying motion compensation (MC) using the MV and mode decision data, which are transmitted as side information.
The residual signal of the intra- or interpicture prediction, which is the difference between the original block and its prediction, is transformed by a linear spatial transform. The transform coefficients are then scaled, quantized, entropy coded, and transmitted together with the prediction information.

The encoder duplicates the decoder processing loop (see the gray-shaded boxes in the encoder figure) such that both will generate identical predictions for subsequent data.
Video material to be encoded by HEVC is generally expected to be input as progressive scan imagery. No explicit coding features are present in the HEVC design to support the use of interlaced scanning, as interlaced scanning is no longer used for displays and is becoming substantially less common for distribution. A metadata syntax has been provided in HEVC to allow an encoder to indicate that interlace-scanned video has been sent by coding each field (i.e., the even or odd numbered lines of each video frame) of interlaced video as a separate picture or that it has been sent by coding each interlaced frame as an HEVC coded picture. This provides an efficient method of coding interlaced video without burdening decoders with a need to support a special decoding process for it.

Source: Jérôme Vieron: HEVC: High-Efficiency Video Coding - Next generation video compression, WBU-ISOG Forum, November 2012

Coding tree units (CTUs) and coding tree block (CTB) structure: The CTU has a size selected by the encoder and can be larger than a traditional macroblock. The CTU consists of a luma CTB and the corresponding chroma CTBs and syntax elements. The size L×L of a luma CTB can be chosen as L = 16, 32, or 64 samples, with the larger sizes typically enabling better compression. HEVC then supports a partitioning of the CTBs into smaller blocks using a tree structure and quadtree-like signaling.

Coding units (CUs) and coding blocks (CBs): The quadtree syntax of the CTU specifies the size and positions of its luma and chroma CBs. The root of the quadtree is associated with the CTU. Hence, the size of the luma CTB is the largest supported size for a luma CB. The splitting of a CTU into luma and chroma CBs is signaled jointly. One luma CB and ordinarily two chroma CBs, together with associated syntax, form a coding unit (CU). A CTB may contain only one CU or may be split to form multiple CUs, and each CU has an associated partitioning into prediction units (PUs) and a tree of transform units (TUs).
Prediction units (PUs) and prediction blocks (PBs): The decision whether to code a picture area using interpicture or intrapicture prediction is made at the CU level. A PU partitioning structure has its root at the CU level.
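
The quadtree splitting of a luma CTB into coding blocks can be sketched as a short recursive function in Python. The split decision is passed in as a callback, because how an encoder decides is not covered here; the callback, the function name `split_ctb`, and the minimum block size of 8 samples are assumptions for this illustration.

```python
def split_ctb(x, y, size, should_split, min_size=8):
    """Return the leaf coding blocks of a CTB as (x, y, size) tuples.

    should_split(x, y, size) is a hypothetical encoder decision callback.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        blocks = []
        for dy in (0, half):      # visit the four quadrants in z-order:
            for dx in (0, half):  # top-left, top-right, bottom-left, bottom-right
                blocks += split_ctb(x + dx, y + dy, half, should_split, min_size)
        return blocks
    return [(x, y, size)]


# Example: split the 64x64 CTB once, then split only its top-left 32x32 quadrant.
def decide(x, y, size):
    return size == 64 or (size == 32 and x == 0 and y == 0)


for block in split_ctb(0, 0, 64, decide):
    print(block)
```

Only the split flags along the tree need to be signaled; the decoder derives the same block positions and sizes by running the same recursion.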

Transform units (TUs) and transform blocks (TBs): The prediction residual is coded using block transforms. A TU tree structure has its root at the CU level. The luma CB residual may be identical to the luma transform block (TB) or may be further split into smaller luma TBs. The same applies to the chroma TBs. Integer basis functions similar to those of a discrete cosine transform (DCT) are defined for the square TB sizes 4×4, 8×8, 16×16, and 32×32. For the 4×4 transform of luma intrapicture prediction residuals, an integer transform derived from a form of discrete sine transform (DST) is alternatively specified.
Motion vector signaling: Advanced motion vector prediction (AMVP) is used, including derivation of several most probable candidates based on data from adjacent PBs and the reference picture. A merge mode for MV coding can also be used, allowing the inheritance of MVs from temporally or spatially neighboring PBs. Moreover, compared to H.264/MPEG-4 AVC, improved skipped and direct motion inference are also specified.

Motion compensation: Quarter-sample precision is used for the MVs, and 7-tap or 8-tap filters are used for interpolation of fractional-sample positions (compared to six-tap filtering of half-sample positions followed by linear interpolation for quarter-sample positions in H.264/MPEG-4 AVC). Similar to H.264/MPEG-4 AVC, multiple reference pictures are used. For each PB, either one or two motion vectors can be transmitted, resulting either in unipredictive or bipredictive coding, respectively.
Intrapicture prediction: The decoded boundary samples of adjacent blocks are used as reference data for spatial prediction in regions where interpicture prediction is not performed. Intrapicture prediction supports 33 directional modes (compared to eight such modes in H.264/MPEG-4 AVC), plus planar (surface fitting) and DC (flat) prediction modes. The selected intrapicture prediction modes are encoded by deriving most probable modes (e.g., prediction directions) based on those of previously decoded neighboring PBs.
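
For comparison, the H.264/MPEG-4 AVC scheme mentioned in parentheses can be sketched in one dimension in Python. The sketch assumes the six-tap filter (1, -5, 20, 20, -5, 1)/32 for half-sample positions and a rounded average of the two nearest samples for quarter-sample positions, with 8-bit clipping. It works on a single row and ignores picture borders; the function names are chosen for this example.

```python
SIX_TAP = (1, -5, 20, 20, -5, 1)


def clip8(value):
    """Clip to the 8-bit sample range."""
    return max(0, min(255, value))


def half_sample(row, i):
    """Half-sample value between row[i] and row[i + 1]; needs row[i-2 .. i+3]."""
    acc = sum(tap * row[i - 2 + k] for k, tap in enumerate(SIX_TAP))
    return clip8((acc + 16) >> 5)  # divide by 32 with rounding


def quarter_sample(row, i):
    """Quarter-sample value between row[i] and the neighboring half sample."""
    return (row[i] + half_sample(row, i) + 1) >> 1


ramp = [0, 10, 20, 30, 40, 50]
print(half_sample(ramp, 2), quarter_sample(ramp, 2))
```

HEVC replaces the two-stage scheme with longer filters that compute each fractional position directly, which avoids the intermediate rounding of the averaging step.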

Figure: Intrapicture prediction modes

Quantization control: As in H.264/MPEG-4 AVC, uniform reconstruction quantization (URQ) is used in HEVC, with quantization scaling matrices supported for the various transform block sizes.
Entropy coding: Context adaptive binary arithmetic coding (CABAC) is used for entropy coding. This is similar to the CABAC scheme in H.264/MPEG-4 AVC, but has undergone several improvements to improve its throughput speed (especially for parallel-processing architectures) and its compression performance, and to reduce its context memory requirements.
In-loop deblocking filtering: A deblocking filter similar to the one used in H.264/MPEG-4 AVC is operated within the interpicture prediction loop. However, the design is simplified in regard to its decision-making and filtering processes, and is made more friendly to parallel processing.

Sample adaptive offset (SAO): A nonlinear amplitude mapping is introduced within the interpicture prediction loop after the deblocking filter. Its goal is to better reconstruct the original signal amplitudes by using a look-up table that is described by a few additional parameters that can be determined by histogram analysis at the encoder side.

New features are introduced in the HEVC standard to enhance the parallel processing capability or modify the structuring of slice data for packetization purposes.
Tiles: The option to partition a picture into rectangular regions called tiles has been specified. The main purpose of tiles is to increase the capability for parallel processing rather than provide error resilience. Tiles are independently decodable regions of a picture that are encoded with some shared header information. Tiles can additionally be used for the purpose of spatial random access to local regions of video pictures. A typical tile configuration of a picture consists of segmenting the picture into rectangular regions with approximately equal numbers of CTUs in each tile. Tiles provide parallelism at a more coarse level of granularity (picture/subpicture), and no sophisticated synchronization of threads is necessary for their use.

Wavefront parallel processing: When wavefront parallel processing (WPP) is enabled, a slice is divided into rows of CTUs. The first row is processed in an ordinary way, the second row can begin to be processed after only two CTUs have been processed in the first row, the third row can begin to be processed after only two CTUs have been processed in the second row, and so on. The context models of the entropy coder in each row are inferred from those in the preceding row with a two-CTU processing lag. WPP provides a form of processing parallelism at a rather fine level of granularity, i.e., within a slice. WPP may often provide better compression performance than tiles (and avoid some visual artifacts that may be induced by using tiles).
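
The resulting schedule can be illustrated with a small Python sketch. It assumes an idealized setting: every CTU takes exactly one time step, one thread is available per CTU row, and the two-CTU lag is modeled as the rule that a CTU may start once its left neighbor and the CTU above and to the right of it are finished. The function name `wpp_start_steps` is chosen for this example.

```python
def wpp_start_steps(rows, cols):
    """Earliest start step of each CTU under the idealized wavefront rule."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            earliest = 0
            if c > 0:
                earliest = start[r][c - 1] + 1  # left neighbor in the same row
            if r > 0:
                above_right = min(c + 1, cols - 1)  # last column has no above-right CTU
                earliest = max(earliest, start[r - 1][above_right] + 1)
            start[r][c] = earliest
    return start


schedule = wpp_start_steps(rows=3, cols=5)
for row in schedule:
    print(row)
print("steps with WPP:", schedule[-1][-1] + 1, "sequential:", 3 * 5)
```

Each row starts two steps after the row above it, so the active CTUs form a diagonal front that moves across the picture.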

Dependent slice segments: A structure called a dependent slice segment allows data associated with a particular wavefront entry point or tile to be carried in a separate NAL unit, and thus potentially makes that data available to a system for fragmented packetization with lower latency than if it were all coded together in one slice. A dependent slice segment for a wavefront entry point can only be decoded after at least part of the decoding process of another slice segment has been performed. Dependent slice segments are mainly useful in low-delay encoding, where other parallel tools might penalize compression performance.

Transform coding: HEVC uses transform coding of the prediction error residual in a similar manner as in prior standards. The residual block is partitioned into multiple square TBs. The supported transform block sizes are 4×4, 8×8, 16×16, and 32×32.
Core transform: Two-dimensional transforms are computed by applying 1-D transforms in the horizontal and vertical directions. The elements of the core transform matrices were derived by approximating scaled DCT basis functions, under considerations such as limiting the necessary dynamic range for transform computation and maximizing the precision and closeness to orthogonality when the matrix entries are specified as integer values. For simplicity, only one integer matrix for the length of 32 points is specified, and subsampled versions are used for other sizes.
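
The phrase "closeness to orthogonality" can be checked numerically for the smallest case. The Python sketch below takes the 4-point integer matrix with rows (64, 64, 64, 64), (83, 36, -36, -83), (64, -64, -64, 64), (36, -83, 83, -36), compares it with a DCT-II basis scaled by 128, and computes the inner products of its rows. The scale factor and the helper names are assumptions made for this illustration.

```python
import math

HEVC_DCT4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def scaled_dct4(scale=128.0):
    """Orthonormal 4-point DCT-II basis multiplied by `scale`."""
    n = 4
    rows = []
    for k in range(n):
        ck = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * ck * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def gram(matrix):
    """Inner products of all row pairs; diagonal only if the rows are orthogonal."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


reference = scaled_dct4()
max_dev = max(abs(HEVC_DCT4[k][i] - reference[k][i]) for k in range(4) for i in range(4))
print("largest deviation from the scaled DCT:", round(max_dev, 2))
for row in gram(HEVC_DCT4):
    print(row)
```

The rows turn out to be mutually orthogonal, while their squared norms differ slightly (16384 versus 16370), which is the kind of small compromise the integer approximation accepts.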

Table: HEVC levels, with the columns level, max luma picture size (samples), max luma sample rate (samples/s), Main tier max bit rate (1000 bit/s), High tier max bit rate (1000 bit/s), and min compression ratio
Source: Jérôme Vieron: HEVC: High-Efficiency Video Coding - Next generation video compression, WBU-ISOG Forum, November 2012
