ref: http://msdn.microsoft.com/en-us/library/windows/desktop/bb530104%28v=vs.85%29.aspx
10-bit and 16-bit YUV Video Formats
This topic describes the 10- and 16-bit YUV formats that are recommended for capturing, processing, and displaying video in the Microsoft Windows operating system.
The 16-bit representations described here use little-endian WORD values for each channel. The 10-bit formats also use 16 bits for each channel, with the lowest 6 bits set to zero, as shown in the following diagram.
Because the 10-bit and 16-bit representations of the same YUV format have the same memory layout, it is possible to cast a 10-bit representation to a 16-bit representation with no loss of precision. It is also possible to cast a 16-bit representation down to a 10-bit representation. (The Y416 and Y410 formats are an exception to this general rule, however, because they do not share the same memory layout.)
When the graphics hardware reads a surface that contains a 10-bit representation, it should ignore the low-order 6 bits of each channel. If a surface contains valid 16-bit data, however, it should be identified as a 16-bit surface.
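For example, a minimal sketch of this rule (hypothetical helper names, operating on a single channel value stored in a little-endian WORD):

#include <cstdint>

// Truncate a 16-bit channel value to a valid 10-bit representation by
// zeroing the 6 least significant bits, which 10-bit formats require
// to be zero. (Hypothetical helper; not part of any Windows API.)
uint16_t To10BitChannel(uint16_t ch16)
{
    return static_cast<uint16_t>(ch16 & 0xFFC0);
}

// A 10-bit channel value is already a valid 16-bit channel value, so
// the upward cast is lossless and changes no bits.
uint16_t To16BitChannel(uint16_t ch10)
{
    return ch10;
}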
In the formats that contain alpha, a completely transparent pixel has an alpha value of zero, and a completely opaque pixel has an alpha value of (2^n) – 1, where n is the number of alpha bits. Alpha is assumed to be a linear value that is applied to each component after the component has been converted into its normalized linear form.
For images in video memory, the graphics driver selects the memory alignment of the surface. The surface must be DWORD aligned. That is, individual lines within a surface are guaranteed to start at a 32-bit boundary, although the alignment can be larger than 32 bits. The origin (0,0) is always the upper-left corner of the surface.
For the purposes of this documentation, the term U is equivalent to Cb, and the term V is equivalent to Cr.
Subtype GUIDs have also been defined from these FOURCCs; see Video Subtype GUIDs.
P016 and P010

If the combined U-V array is addressed as an array of DWORDs, the least significant word (LSW) contains the U value and the most significant word (MSW) contains the V value. The stride of the combined U-V plane is equal to the stride of the Y plane. The U-V plane has half as many lines as the Y plane.
These two formats are the preferred 4:2:0 planar pixel formats for higher precision YUV representations. They are expected to be an intermediate-term requirement for DirectX Video Acceleration (DXVA) accelerators that support 10-bit or 16-bit 4:2:0 video.
P216 and P210

If the combined U-V array is addressed as an array of DWORDs, the LSW contains the U value and the MSW contains the V value. The stride of the combined U-V plane is equal to the stride of the Y plane. The U-V plane has the same number of lines as the Y plane.
These two formats are the preferred 4:2:2 planar pixel formats for higher precision YUV representations. They are expected to be an intermediate-term requirement for DirectX Video Acceleration (DXVA) accelerators that support 10-bit or 16-bit 4:2:2 video.
Y216 and Y210

The first WORD in the array contains the first Y sample in the pair, the second WORD contains the U sample, the third WORD contains the second Y sample, and the fourth WORD contains the V sample.
Y210 is identical to Y216 except that each sample contains only 10 bits of significant data. The least significant 6 bits are set to zero, as described previously.
Y410

Bits 0-9 contain the U sample, bits 10-19 contain the Y sample, bits 20-29 contain the V sample, and bits 30-31 contain the alpha value. To indicate that a pixel is fully opaque, an application must set the two alpha bits equal to 0x03.
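A minimal packing sketch for this layout (hypothetical helper, assuming the component values are already within range):

#include <cstdint>

// Pack one Y410 pixel: U in bits 0-9, Y in bits 10-19, V in bits 20-29,
// and the 2-bit alpha in bits 30-31 (0x03 means fully opaque).
uint32_t PackY410(uint32_t u, uint32_t y, uint32_t v, uint32_t alpha)
{
    return  (u     & 0x3FF)
         | ((y     & 0x3FF) << 10)
         | ((v     & 0x3FF) << 20)
         | ((alpha & 0x3)   << 30);
}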
Y416

Bits 0-15 contain the U sample, bits 16-31 contain the Y sample, bits 32-47 contain the V sample, and bits 48-63 contain the alpha value.
To indicate that a pixel is fully opaque, an application must set the 16 alpha bits equal to 0xFFFF. This format is intended primarily as an intermediate format during image processing to avoid the accumulation of errors.
It is recommended that if an object supports a given bit depth and chroma sampling scheme, it should support the corresponding YUV formats described in this topic. (Objects might support additional formats not listed here.)
Video FOURCCs

Many video formats have FOURCC codes assigned to them. A FOURCC code is a 32-bit unsigned integer that is created by concatenating four ASCII characters. For example, the FOURCC code for YUY2 video is 'YUY2'.
Various C/C++ macros are defined for declaring FOURCC values in source code. The MAKEFOURCC macro is defined in Mmsystem.h, and the FCC macro is defined in Aviriff.h and various other header files. You can also declare a FOURCC code directly as a string literal simply by reversing the order of the characters. Thus, the following statements are equivalent:
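DWORD fccYUY2 = MAKEFOURCC('Y','U','Y','2');
DWORD fccYUY2 = FCC('YUY2');
DWORD fccYUY2 = '2YUY';  // Declares the FOURCC 'YUY2'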
(In the last example, reversing the byte order is necessary because
Windows uses a little-endian architecture. 'Y' = 0x59, 'U' = 0x55, and
'2' = 0x32, so '2YUY' is 0x32595559.)
Some of the DirectX Video Acceleration 2.0 APIs use a D3DFORMAT value to describe a video format. A FOURCC code can be used in this context as well:
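For example (a minimal sketch; the variable name is hypothetical):

D3DFORMAT fmt = (D3DFORMAT)MAKEFOURCC('Y','U','Y','2');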
About YUV Video
Digital video is often encoded in a YUV format. This article explains the general concepts of YUV video, along with some terminology, without going deeply into the mathematics of YUV video processing.
If you have worked with computer graphics, you are probably familiar with RGB color. An RGB color is encoded using three values: red, green, and blue. These values correspond directly to portions of the visible spectrum. The three RGB values form a mathematical coordinate system, called a color space. The red component defines one axis of this coordinate system, blue defines the second, and green defines the third, as shown in the following illustration. Any valid RGB color falls somewhere within this color space. For example, pure magenta is 100% blue, 100% red, and 0% green.
Although RGB is a common way to represent colors, other coordinate systems are possible. The term YUV refers to a family of color spaces, all of which encode brightness information separately from color information. Like RGB, YUV uses three values to represent any color. These values are termed Y', U, and V. (In fact, this use of the term "YUV" is technically inaccurate. In computer video, the term YUV almost always refers to one particular color space named Y'CbCr, discussed later. However, YUV is often used as a general term for any color space that works along the same principles as Y'CbCr.)
The Y' component, also called luma, represents the brightness value of the color. The prime symbol (') is used to differentiate luma from a closely related value, luminance, which is designated Y. Luminance is derived from linear RGB values, whereas luma is derived from non-linear (gamma-corrected) RGB values. Luminance is a closer measure of true brightness but luma is more practical to use for technical reasons. The prime symbol is frequently omitted, but YUV color spaces always use luma, not luminance.
Luma is derived from an RGB color by taking a weighted average of the red, green, and blue components. For standard-definition television, the following formula is used:
Y' = 0.299R + 0.587G + 0.114B
This formula reflects the fact that the human eye is more sensitive to certain wavelengths of light than others, which affects the perceived brightness of a color. Blue light appears dimmest, green appears brightest, and red is somewhere in between. This formula also reflects the physical characteristics of the phosphors used in early televisions. A newer formula, taking into account modern television technology, is used for high-definition television:
Y' = 0.2126R + 0.7152G + 0.0722B
The luma equation for standard-definition television is defined in a specification named ITU-R BT.601. For high-definition television, the relevant specification is ITU-R BT.709.
The U and V components, also called chroma values or color difference values, are derived by subtracting the Y value from the red and blue components of the original RGB color:
U = B - Y'
V = R - Y'
Together, these values contain enough information to recover the original RGB value.
Benefits of YUV
Analog television uses YUV partly for historical reasons. Analog color television signals were designed to be backward compatible with black-and-white televisions. The color television signal carries the chroma information (U and V) superimposed onto the luma signal. Black-and-white televisions ignore the chroma and display the combined signal as a grayscale image. (The signal is designed so that the chroma does not significantly interfere with the luma signal.) Color televisions can extract the chroma and convert the signal back to RGB.

YUV has another advantage that is more relevant today. The human eye is less sensitive to changes in hue than changes in brightness. As a result, an image can have less chroma information than luma information without sacrificing the perceived quality of the image. For example, it is common to sample the chroma values at half the horizontal resolution of the luma samples. In other words, for every two luma samples in a row of pixels, there is one U sample and one V sample. Assuming that 8 bits are used to encode each value, a total of 4 bytes are needed for every two pixels (two Y', one U, and one V), for an average of 16 bits per pixel, or 33% less than the equivalent 24-bit RGB encoding.
YUV is not inherently any more compact than RGB. Unless the chroma is downsampled, a YUV pixel is the same size as an RGB pixel. Also, the conversion from RGB to YUV is not lossy. If there is no downsampling, a YUV pixel can be converted back to RGB with no loss of information. Downsampling makes a YUV image smaller and also loses some of the color information. If performed correctly, however, the loss is not perceptually significant.
YUV in Computer Video
The formulas listed previously for YUV are not the exact conversions used in digital video. Digital video generally uses a form of YUV called Y'CbCr. Essentially, Y'CbCr works by scaling the YUV components to the following ranges:

Component | Range |
---|---|
Y' | 16–235 |
Cb/Cr | 16–240, with 128 representing zero |
These ranges assume 8 bits of precision for the Y'CbCr components. Here is the exact derivation of Y'CbCr, using the BT.601 definition of luma:
- Start with RGB values in the range [0...1]. In other words, pure black is 0 and pure white is 1. Importantly, these are non-linear (gamma corrected) RGB values.
- Calculate the luma. For BT.601, Y' = 0.299R + 0.587G + 0.114B, as described earlier.
- Calculate the intermediate chroma difference values (B - Y') and (R - Y'). These values have a range of +/- 0.886 for (B - Y'), and +/- 0.701 for (R - Y').
- Scale the chroma difference values as follows:
  Pb = (0.5 / (1 - 0.114)) × (B - Y')
  Pr = (0.5 / (1 - 0.299)) × (R - Y')
  These scaling factors are designed to give both values the same numerical range, +/- 0.5. Together, they define a YUV color space named Y'PbPr. This color space is used in analog component video.
- Scale the Y'PbPr values to get the final Y'CbCr values:
  Y' = 16 + 219 × Y'
  Cb = 128 + 224 × Pb
  Cr = 128 + 224 × Pr
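As a worked example, the following sketch implements these steps for BT.601 (a minimal illustration, assuming gamma-corrected RGB inputs in the range [0...1]; the function and type names are hypothetical):

#include <cmath>
#include <cstdint>

struct YCbCr { uint8_t y; uint8_t cb; uint8_t cr; };

// Convert non-linear (gamma-corrected) RGB in [0...1] to 8-bit BT.601
// Y'CbCr, following the derivation steps above.
YCbCr RgbToYCbCr601(double r, double g, double b)
{
    // Step 2: compute the luma.
    double luma = 0.299 * r + 0.587 * g + 0.114 * b;

    // Steps 3 and 4: scale the chroma difference values (Y'PbPr).
    double pb = (0.5 / (1.0 - 0.114)) * (b - luma);
    double pr = (0.5 / (1.0 - 0.299)) * (r - luma);

    // Step 5: scale to the final 8-bit Y'CbCr values.
    YCbCr out;
    out.y  = static_cast<uint8_t>(std::lround(16.0  + 219.0 * luma));
    out.cb = static_cast<uint8_t>(std::lround(128.0 + 224.0 * pb));
    out.cr = static_cast<uint8_t>(std::lround(128.0 + 224.0 * pr));
    return out;
}

For example, RgbToYCbCr601(1.0, 0.0, 0.0) yields (81, 90, 240), matching the Red row in the table that follows.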
The following table shows RGB and YCbCr values for various colors, again using the BT.601 definition of luma.
Color | R | G | B | Y' | Cb | Cr |
---|---|---|---|---|---|---|
Black | 0 | 0 | 0 | 16 | 128 | 128 |
Red | 255 | 0 | 0 | 81 | 90 | 240 |
Green | 0 | 255 | 0 | 145 | 54 | 34 |
Blue | 0 | 0 | 255 | 41 | 240 | 110 |
Cyan | 0 | 255 | 255 | 170 | 166 | 16 |
Magenta | 255 | 0 | 255 | 106 | 202 | 222 |
Yellow | 255 | 255 | 0 | 210 | 16 | 146 |
White | 255 | 255 | 255 | 235 | 128 | 128 |
As this table shows, Cb and Cr do not correspond to intuitive ideas about color. For example, pure white and pure black both contain neutral levels of Cb and Cr (128). The highest and lowest values for Cb are blue and yellow, respectively. For Cr, the highest and lowest values are red and cyan.
For More Information
- Recommended 8-Bit YUV Formats for Video Rendering
- Keith Jack. Video Demystified, Fifth Edition. Newnes, 2007.
- Charles A. Poynton. A Technical Introduction to Digital Video. Wiley, 1996.
Recommended 8-Bit YUV Formats for Video Rendering
Gary Sullivan and Stephen Estrop
Microsoft Corporation
April 2002, updated November 2008
This topic describes the 8-bit YUV color formats that are recommended for video rendering in the Windows operating system. This article presents techniques for converting between YUV and RGB formats, and also provides techniques for upsampling YUV formats. This article is intended for anyone working with YUV video decoding or rendering in Windows.
Introduction
Numerous YUV formats are defined throughout the video industry. This article identifies the 8-bit YUV formats that are recommended for video rendering in Windows. Decoder vendors and display vendors are encouraged to support the formats described in this article. This article does not address other uses of YUV color, such as still photography.

The formats described in this article all use 8 bits per pixel location to encode the Y channel (also called the luma channel), and use 8 bits per sample to encode each U or V chroma sample. However, most YUV formats use fewer than 24 bits per pixel on average, because they contain fewer samples of U and V than of Y. This article does not cover YUV formats with 10-bit or higher Y channels.
Note For the purposes of this article, the term U is equivalent to Cb, and the term V is equivalent to Cr.
This article covers the following topics:
- YUV Sampling. Describes the most common YUV sampling techniques.
- Surface Definitions. Describes the recommended YUV formats.
- Color Space and Chroma Sampling Rate Conversions. Provides some guidelines for converting between YUV and RGB formats and for converting between different YUV formats.
- Identifying YUV Formats in Media Foundation. Explains how to describe YUV format types in Media Foundation.
YUV Sampling
Chroma channels can have a lower sampling rate than the luma channel, without any dramatic loss of perceptual quality. A notation called the "A:B:C" notation is used to describe how often U and V are sampled relative to Y:
- 4:4:4 means no downsampling of the chroma channels.
- 4:2:2 means 2:1 horizontal downsampling, with no vertical downsampling. Every scan line contains four Y samples for every two U or V samples.
- 4:2:0 means 2:1 horizontal downsampling, with 2:1 vertical downsampling.
- 4:1:1 means 4:1 horizontal downsampling, with no vertical downsampling. Every scan line contains four Y samples for each U and V sample. 4:1:1 sampling is less common than other formats, and is not discussed in detail in this article.
The dominant form of 4:2:2 sampling is defined in ITU-R Recommendation BT.601. There are two common variants of 4:2:0 sampling. One of these is used in MPEG-2 video, and the other is used in MPEG-1 and in ITU-T Recommendations H.261 and H.263.
Compared with the MPEG-1 scheme, it is simpler to convert between the MPEG-2 scheme and the sampling grids defined for 4:2:2 and 4:4:4 formats. For this reason, the MPEG-2 scheme is preferred in Windows, and should be considered the default interpretation of 4:2:0 formats.
Surface Definitions
This section describes the 8-bit YUV formats that are recommended for video rendering. These fall into several categories:
- 4:4:4 Formats, 32 Bits per Pixel
- 4:2:2 Formats, 16 Bits per Pixel
- 4:2:0 Formats, 16 Bits per Pixel
- 4:2:0 Formats, 12 Bits per Pixel
- Surface origin. For the YUV formats described in this article, the origin (0,0) is always the top left corner of the surface.
- Stride. The stride of a surface, sometimes called the pitch, is the width of the surface in bytes. Given a surface origin at the top left corner, the stride is always positive.
- Alignment. The alignment of a surface is at the discretion of the graphics display driver. The surface must always be DWORD aligned; that is, individual lines within the surface are guaranteed to originate on a 32-bit (DWORD) boundary. The alignment can be larger than 32 bits, however, depending on the needs of the hardware.
- Packed format versus planar format. YUV formats are divided into packed formats and planar formats. In a packed format, the Y, U, and V components are stored in a single array. Pixels are organized into groups of macropixels, whose layout depends on the format. In a planar format, the Y, U, and V components are stored as three separate planes.
4:4:4 Formats, 32 Bits per Pixel
AYUV
A single 4:4:4 format is recommended, with the FOURCC code AYUV. This is a packed format, where each pixel is encoded as four consecutive bytes, arranged in the sequence shown in the following illustration.

The bytes marked A contain values for alpha.
4:2:2 Formats, 16 Bits per Pixel
Two 4:2:2 formats are recommended, with the following FOURCC codes:- YUY2
- UYVY
YUY2
In YUY2 format, the data can be treated as an array of unsigned char values, where the first byte contains the first Y sample, the second byte contains the first U (Cb) sample, the third byte contains the second Y sample, and the fourth byte contains the first V (Cr) sample, as shown in the following diagram.

If the image is addressed as an array of little-endian WORD values, the first WORD contains the first Y sample in the least significant bits (LSBs) and the first U (Cb) sample in the most significant bits (MSBs). The second WORD contains the second Y sample in the LSBs and the first V (Cr) sample in the MSBs.
YUY2 is the preferred 4:2:2 pixel format for Microsoft DirectX Video Acceleration (DirectX VA). It is expected to be an intermediate-term requirement for DirectX VA accelerators supporting 4:2:2 video.
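A minimal addressing sketch for this layout (hypothetical helper, assuming an 8-bit YUY2 surface):

#include <cstdint>

// Fetch the Y, U, and V samples for pixel (x, y) in a YUY2 surface.
// Each pair of horizontally adjacent pixels shares one U and one V
// sample, stored as the 4-byte group Y0 U0 Y1 V0.
void ReadYuy2Pixel(const uint8_t* pSurface, int stride, int x, int y,
                   uint8_t* pOutY, uint8_t* pOutU, uint8_t* pOutV)
{
    const uint8_t* pMacro = pSurface + y * stride + (x & ~1) * 2;
    *pOutY = pMacro[(x & 1) ? 2 : 0];  // Y0 at byte 0, Y1 at byte 2
    *pOutU = pMacro[1];                // shared U (Cb) sample
    *pOutV = pMacro[3];                // shared V (Cr) sample
}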
UYVY
This format is the same as the YUY2 format except the byte order is reversed—that is, the chroma and luma bytes are flipped (Figure 4). If the image is addressed as an array of two little-endian WORD values, the first WORD contains U in the LSBs and Y0 in the MSBs, and the second WORD contains V in the LSBs and Y1 in the MSBs.

4:2:0 Formats, 16 Bits per Pixel
Two 4:2:0 16-bits per pixel (bpp) formats are recommended, with the following FOURCC codes:
- IMC1
- IMC3
IMC1
All of the Y samples appear first in memory as an array of unsigned char values. This is followed by all of the V (Cr) samples, and then all of the U (Cb) samples. The V and U planes have the same stride as the Y plane, resulting in unused areas of memory, as shown in Figure 5. The U and V planes must start on memory boundaries that are a multiple of 16 lines. Figure 5 shows the origin of U and V for a 352 x 240 video frame. The starting address of the U and V planes are calculated as follows:

BYTE* pV = pY + (((Height + 15) & ~15) * Stride);
BYTE* pU = pY + (((((Height * 3) / 2) + 15) & ~15) * Stride);

where pY is a byte pointer to the start of the memory array.
IMC3
This format is identical to IMC1, except the U and V planes are swapped, as shown in the following diagram.

4:2:0 Formats, 12 Bits per Pixel
Four 4:2:0 12-bpp formats are recommended, with the following FOURCC codes:
- IMC2
- IMC4
- YV12
- NV12
IMC2
This format is the same as IMC1 except for the following difference: The V (Cr) and U (Cb) lines are interleaved at half-stride boundaries. In other words, each full-stride line in the chroma area starts with a line of V samples, followed by a line of U samples that begins at the next half-stride boundary (Figure 7). This layout makes more efficient use of address space than IMC1. It cuts the chroma address space in half, and thus the total address space by 25 percent. Among 4:2:0 formats, IMC2 is the second-most preferred format, after NV12. The following image illustrates this process.

IMC4
This format is identical to IMC2, except the U (Cb) and V (Cr) lines are swapped, as shown in the following illustration.

YV12
All of the Y samples appear first in memory as an array of unsigned char values. This array is followed immediately by all of the V (Cr) samples. The stride of the V plane is half the stride of the Y plane; and the V plane contains half as many lines as the Y plane. The V plane is followed immediately by all of the U (Cb) samples, with the same stride and number of lines as the V plane, as shown in the following illustration.

NV12
All of the Y samples appear first in memory as an array of unsigned char values with an even number of lines. The Y plane is followed immediately by an array of unsigned char values that contains packed U (Cb) and V (Cr) samples. When the combined U-V array is addressed as an array of little-endian WORD values, the LSBs contain the U values, and the MSBs contain the V values. NV12 is the preferred 4:2:0 pixel format for DirectX VA. It is expected to be an intermediate-term requirement for DirectX VA accelerators supporting 4:2:0 video. The following illustration shows the Y plane and the array that contains packed U and V samples.
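A minimal addressing sketch for this layout (hypothetical helper, assuming the U-V plane immediately follows the Y plane with the same stride):

#include <cstdint>

// Fetch the Y, U, and V samples for pixel (x, y) in an NV12 surface.
// The interleaved U-V plane has half as many lines as the Y plane, and
// each U-V pair covers a 2 x 2 block of Y samples.
void ReadNv12Pixel(const uint8_t* pY, int stride, int height,
                   int x, int y,
                   uint8_t* pOutY, uint8_t* pOutU, uint8_t* pOutV)
{
    const uint8_t* pUV = pY + stride * height;  // start of the U-V plane
    const uint8_t* pPair = pUV + (y / 2) * stride + (x & ~1);

    *pOutY = pY[y * stride + x];
    *pOutU = pPair[0];  // U (Cb) in the first (least significant) byte
    *pOutV = pPair[1];  // V (Cr) in the second (most significant) byte
}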
Color Space and Chroma Sampling Rate Conversions

This section provides guidelines for converting between YUV and RGB, and for converting between some different YUV formats. We consider two RGB encoding schemes in this section: 8-bit computer RGB, also known as sRGB or "full-scale" RGB, and studio video RGB, or "RGB with head-room and toe-room." These are defined as follows:
- Computer RGB uses 8 bits for each sample of red, green, and blue. Black is represented by R = G = B = 0, and white is represented by R = G = B = 255.
- Studio video RGB uses some number of bits N for each sample of red, green, and blue, where N is 8 or more. Studio video RGB uses a different scaling factor than computer RGB, and it has an offset. Black is represented by R = G = B = 16*2^(N-8), and white is represented by R = G = B = 235*2^(N-8). However, actual values may fall outside this range.
Conversion between RGB and 4:4:4 YUV
We first describe conversion between RGB and 4:4:4 YUV. To convert 4:2:0 or 4:2:2 YUV to RGB, we recommend converting the YUV data to 4:4:4 YUV, and then converting from 4:4:4 YUV to RGB. The AYUV format, which is a 4:4:4 format, uses 8 bits each for the Y, U, and V samples. YUV can also be defined using more than 8 bits per sample for some applications.
Two dominant YUV conversions from RGB have been defined for digital video. Both are based on the specification known as ITU-R Recommendation BT.709. The first conversion is the older YUV form defined for 50-Hz use in BT.709. It is the same as the relation specified in ITU-R Recommendation BT.601, also known by its older name, CCIR 601. It should be considered the preferred YUV format for standard-definition TV resolution (720 x 576) and lower-resolution video. It is characterized by the values of two constants Kr and Kb:
Kr = 0.299 Kb = 0.114
The second conversion is the newer YUV form defined for 60-Hz use in BT.709, and should be considered the preferred format for video resolutions above SDTV. It is characterized by different values for these two constants:

Kr = 0.2126 Kb = 0.0722
Conversion from RGB to YUV is defined by starting with the following:

L = Kr * R + Kb * B + (1 - Kr - Kb) * G
The YUV values are then obtained as follows:

Y = floor(2^(M-8) * (219*(L-Z)/S + 16) + 0.5)
U = clip3(0, (2^M)-1, floor(2^(M-8) * (112*(B-L) / ((1-Kb)*S) + 128) + 0.5))
V = clip3(0, (2^M)-1, floor(2^(M-8) * (112*(R-L) / ((1-Kr)*S) + 128) + 0.5))
where
- M is the number of bits per YUV sample (M >= 8).
- Z is the black-level variable. For computer RGB, Z equals 0. For studio video RGB, Z equals 16*2^(N-8), where N is the number of bits per RGB sample (N >= 8).
- S is the scaling variable. For computer RGB, S equals 255. For studio video RGB, S equals 219*2^(N-8).
clip3(x, y, z) = ((z < x) ? x : ((z > y) ? y : z))
Note clip3 should be implemented as a
function rather than a preprocessor macro; otherwise multiple
evaluations of the arguments will occur.
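For example, a minimal sketch of clip3 as an inline function, which evaluates each argument exactly once:

inline int clip3(int x, int y, int z)
{
    return (z < x) ? x : ((z > y) ? y : z);
}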
The Y sample represents brightness, and the U and V samples represent the color deviations toward blue and red, respectively. The nominal range for Y is 16*2^(M-8) to 235*2^(M-8). Black is represented as 16*2^(M-8), and white is represented as 235*2^(M-8). The nominal range for U and V is 16*2^(M-8) to 240*2^(M-8), with the value 128*2^(M-8) representing neutral chroma. However, actual values may fall outside these ranges.

For input data in the form of studio video RGB, the clip operation is necessary to keep the U and V values within the range 0 to (2^M)-1. If the input is computer RGB, the clip operation is not required, because the conversion formula cannot produce values outside of this range.
These are the exact formulas without approximation. Everything that follows in this document is derived from these formulas. This section describes the following conversions:
- Converting RGB888 to YUV 4:4:4
- Converting 8-bit YUV to RGB888
- Converting 4:2:0 YUV to 4:2:2 YUV
- Converting 4:2:2 YUV to 4:4:4 YUV
- Converting 4:2:0 YUV to 4:4:4 YUV
Converting RGB888 to YUV 4:4:4
In the case of computer RGB input and 8-bit BT.601 YUV output, we believe that the formulas given in the previous section can be reasonably approximated by the following:

Y = ( (  66 * R + 129 * G +  25 * B + 128) >> 8) +  16
U = ( ( -38 * R -  74 * G + 112 * B + 128) >> 8) + 128
V = ( ( 112 * R -  94 * G -  18 * B + 128) >> 8) + 128

These formulas produce 8-bit results using coefficients that require no more than 8 bits of (unsigned) precision. Intermediate results will require up to 16 bits of precision.
Converting 8-bit YUV to RGB888
From the original RGB-to-YUV formulas, one can derive the following relationships for BT.601.

Y = round( 0.256788 * R + 0.504129 * G + 0.097906 * B) +  16
U = round(-0.148223 * R - 0.290993 * G + 0.439216 * B) + 128
V = round( 0.439216 * R - 0.367788 * G - 0.071427 * B) + 128

Therefore, given:
C = Y - 16
D = U - 128
E = V - 128

the formulas to convert YUV to RGB can be derived as follows:
R = clip( round( 1.164383 * C + 1.596027 * E ) )
G = clip( round( 1.164383 * C - (0.391762 * D) - (0.812968 * E) ) )
B = clip( round( 1.164383 * C + 2.017232 * D ) )

where clip() denotes clipping to a range of [0..255]. These formulas use some coefficients that require more than 8 bits of precision to produce each 8-bit result, and intermediate results will require more than 16 bits of precision. We believe these formulas can be reasonably approximated by the following:

R = clip(( 298 * C           + 409 * E + 128) >> 8)
G = clip(( 298 * C - 100 * D - 208 * E + 128) >> 8)
B = clip(( 298 * C + 516 * D           + 128) >> 8)
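Putting these pieces together, a minimal sketch of a complete 8-bit BT.601 YUV-to-RGB conversion using the integer approximations above (the helper names are hypothetical):

#include <cstdint>

static inline uint8_t ClipByte(int v)
{
    return (uint8_t)((v < 0) ? 0 : ((v > 255) ? 255 : v));
}

// Convert one 8-bit BT.601 YUV sample to computer RGB using the
// integer approximations above.
void YuvToRgb(uint8_t y, uint8_t u, uint8_t v,
              uint8_t* pR, uint8_t* pG, uint8_t* pB)
{
    int c = (int)y - 16;
    int d = (int)u - 128;
    int e = (int)v - 128;

    *pR = ClipByte((298 * c           + 409 * e + 128) >> 8);
    *pG = ClipByte((298 * c - 100 * d - 208 * e + 128) >> 8);
    *pB = ClipByte((298 * c + 516 * d           + 128) >> 8);
}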
To convert 4:2:0 or 4:2:2 YUV to RGB, we recommend converting the YUV data to 4:4:4 YUV, and then converting from 4:4:4 YUV to RGB. The sections that follow present some methods for converting 4:2:0 and 4:2:2 formats to 4:4:4.
Converting 4:2:0 YUV to 4:2:2 YUV
Converting 4:2:0 YUV to 4:2:2 YUV requires vertical upconversion by a factor of two. This section describes an example method for performing the upconversion. The method assumes that the video pictures are progressive scan.
Note The 4:2:0 to 4:2:2 interlaced
scan conversion process presents atypical problems and is difficult to
implement. This article does not address the issue of converting
interlaced scan from 4:2:0 to 4:2:2.
Let each vertical line of input chroma samples be an array Cin[] that ranges from 0 to N - 1. The corresponding vertical line on the output image will be an array Cout[] that ranges from 0 to 2N - 1. To convert each vertical line, perform the following process:

Cout[0]     = Cin[0];
Cout[1]     = clip((9 * (Cin[0] + Cin[1]) - (Cin[0] + Cin[2]) + 8) >> 4);
Cout[2]     = Cin[1];
Cout[3]     = clip((9 * (Cin[1] + Cin[2]) - (Cin[0] + Cin[3]) + 8) >> 4);
Cout[4]     = Cin[2];
Cout[5]     = clip((9 * (Cin[2] + Cin[3]) - (Cin[1] + Cin[4]) + 8) >> 4);
...
Cout[2*i]   = Cin[i];
Cout[2*i+1] = clip((9 * (Cin[i] + Cin[i+1]) - (Cin[i-1] + Cin[i+2]) + 8) >> 4);
...
Cout[2*N-3] = clip((9 * (Cin[N-2] + Cin[N-1]) - (Cin[N-3] + Cin[N-1]) + 8) >> 4);
Cout[2*N-2] = Cin[N-1];
Cout[2*N-1] = clip((9 * (Cin[N-1] + Cin[N-1]) - (Cin[N-2] + Cin[N-1]) + 8) >> 4);
Note The equations for handling the
edges can be mathematically simplified. They are shown in this form to
illustrate the clamping effect at the edges of the picture.
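A minimal sketch of this process as a loop (hypothetical helper, assuming 8-bit samples and N >= 2, with the edge clamping written out explicitly):

#include <cstdint>

// Upconvert one vertical line of 4:2:0 chroma (cin, n samples) to
// 4:2:2 (cout, 2*n samples) using the interpolation filter above.
// Neighbor indices past the picture edges are clamped to the nearest
// edge sample, which reproduces the edge equations shown above.
void UpconvertChromaLine(const uint8_t* cin, int n, uint8_t* cout)
{
    for (int i = 0; i < n; i++)
    {
        int im1 = (i > 0)     ? i - 1 : 0;
        int ip1 = (i < n - 1) ? i + 1 : n - 1;
        int ip2 = (i < n - 2) ? i + 2 : n - 1;

        cout[2 * i] = cin[i];  // even output samples copy the input

        int v = (9 * (cin[i] + cin[ip1]) - (cin[im1] + cin[ip2]) + 8) >> 4;
        cout[2 * i + 1] = (uint8_t)((v < 0) ? 0 : ((v > 255) ? 255 : v));
    }
}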
In effect, this method calculates each missing value by interpolating the curve over the four adjacent pixels, weighted toward the values of the two nearest pixels (Figure 11). The specific interpolation method used in this example generates missing samples at half-integer positions using a well-known method called Catmull-Rom interpolation, also known as cubic convolution interpolation.

In signal processing terms, the vertical upconversion should ideally include a phase shift compensation to account for the half-pixel vertical offset (relative to the output 4:2:2 sampling grid) between the locations of the 4:2:0 sample lines and the location of every other 4:2:2 sample line. However, introducing this offset would increase the amount of processing required to generate the samples, and make it impossible to reconstruct the original 4:2:0 samples from the upsampled 4:2:2 image. It would also make it impossible to decode video directly into 4:2:2 surfaces and then use those surfaces as reference pictures for decoding subsequent pictures in the stream. Therefore, the method provided here does not take into account the precise vertical alignment of the samples. Ignoring the offset is probably not visually harmful at reasonably high picture resolutions.
If you start with 4:2:0 video that uses the sampling grid defined in H.261, H.263, or MPEG-1 video, the phase of the output 4:2:2 chroma samples will also be shifted by a half-pixel horizontal offset relative to the spacing on the luma sampling grid (a quarter-pixel offset relative to the spacing of the 4:2:2 chroma sampling grid). However, the MPEG-2 form of 4:2:0 video is probably more commonly used on PCs and does not suffer from this problem. Moreover, the distinction is probably not visually harmful at reasonably high picture resolutions. Trying to correct for this problem would create the same sort of problems discussed for the vertical phase offset.
Converting 4:2:2 YUV to 4:4:4 YUV
Converting 4:2:2 YUV to 4:4:4 YUV requires horizontal upconversion by a factor of two. The method described previously for vertical upconversion can also be applied to horizontal upconversion. For MPEG-2 and ITU-R BT.601 video, this method will produce samples with the correct phase alignment.

Converting 4:2:0 YUV to 4:4:4 YUV
To convert 4:2:0 YUV to 4:4:4 YUV, you can simply follow the two methods described previously. Convert the 4:2:0 image to 4:2:2, and then convert the 4:2:2 image to 4:4:4. You can also switch the order of the two upconversion processes, as the order of operation does not really matter to the visual quality of the result.

Other YUV Formats
Some other, less common YUV formats include the following:
- AI44 is a palettized YUV format with 8 bits per sample. Each sample contains an index in the 4 most significant bits (MSBs) and an alpha value in the 4 least significant bits (LSBs). The index refers to an array of YUV palette entries, which must be defined in the media type for the format. This format is primarily used for subpicture images.
- NV11 is a 4:1:1 planar format with 12 bits per pixel. The Y samples appear first in memory. The Y plane is followed by an array of packed U (Cb) and V (Cr) samples. When the combined U-V array is addressed as an array of little-endian WORD values, the U samples are contained in the LSBs of each WORD, and the V samples are contained in the MSBs. (This memory layout is similar to NV12 although the chroma sampling is different.)
- Y41P is a 4:1:1 packed format, with U and V sampled every fourth pixel horizontally. Each macropixel contains 8 pixels in 12 bytes, with the following byte layout:
  U0 Y0 V0 Y1 U4 Y2 V4 Y3 Y4 Y5 Y6 Y7
- Y41T is identical to Y41P, except the least-significant bit of each Y sample specifies the chroma key (0 = transparent, 1 = opaque).
- Y42T is identical to UYVY, except the least-significant bit of each Y sample specifies the chroma key (0 = transparent, 1 = opaque).
- YVYU is equivalent to YUYV except the U and V samples are swapped.
Identifying YUV Formats in Media Foundation
Each of the YUV formats described in this article has an assigned FOURCC code. A FOURCC code is a 32-bit unsigned integer that is created by concatenating four ASCII characters.

There are various C/C++ macros that make it easier to declare FOURCC values in source code. For example, the MAKEFOURCC macro is declared in Mmsystem.h, and the FCC macro is declared in Aviriff.h. Use them as follows:
DWORD fccYUY2 = MAKEFOURCC('Y','U','Y','2');
DWORD fccYUY2 = FCC('YUY2');
DWORD fccYUY2 = '2YUY'; // Declares the FOURCC 'YUY2'
In Media Foundation, formats are identified by a major type GUID and a subtype GUID. The major type for computer video formats is always MFMediaType_Video. The subtype can be constructed by mapping the FOURCC code to a GUID, as follows:
XXXXXXXX-0000-0010-8000-00AA00389B71

where XXXXXXXX is the FOURCC code. Thus, the subtype GUID for YUY2 is:

32595559-0000-0010-8000-00AA00389B71

Constants for the most common YUV format GUIDs are defined in the header file mfapi.h. For a list of these constants, see Video Subtype GUIDs.
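A minimal sketch of this mapping in code (the helper function is hypothetical; the constants in mfapi.h cover the common formats):

#include <windows.h>
#include <mmsystem.h>

// Build a video subtype GUID from a FOURCC code by substituting the
// FOURCC into the first DWORD of the base GUID
// XXXXXXXX-0000-0010-8000-00AA00389B71.
GUID SubtypeFromFourCC(DWORD fourcc)
{
    GUID subtype = { fourcc, 0x0000, 0x0010,
                     { 0x80, 0x00, 0x00, 0xAA, 0x00, 0x38, 0x9B, 0x71 } };
    return subtype;
}

// Usage: SubtypeFromFourCC(MAKEFOURCC('Y','U','Y','2')) yields the same
// value as the MFVideoFormat_YUY2 constant.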
Related topics
Picture Aspect Ratio
This topic describes two related concepts, picture aspect ratio and pixel aspect ratio. It then describes how these concepts are expressed in Microsoft Media Foundation using media types.
Picture Aspect Ratio
Picture aspect ratio defines the shape of the displayed video image. Picture aspect ratio is notated X:Y, where X:Y is the ratio of picture width to picture height. Most video standards use either 4:3 or 16:9 picture aspect ratio. The 16:9 aspect ratio is commonly called widescreen. Cinema film often uses a 1.85:1 or 1.66:1 aspect ratio. Picture aspect ratio is also called display aspect ratio (DAR).

Sometimes the video image does not have the same shape as the display area. For example, a 4:3 video might be shown on a widescreen (16×9) television. In computer video, the video might be shown inside a window that has an arbitrary size. In that case, there are three ways the image can be made to fit within the display area:
- Stretch the image along one axis to fit the display area.
- Scale the image to fit the display area, while maintaining the original picture aspect ratio.
- Crop the image.
Letterboxing
The process of scaling a widescreen image to fit a 4:3 display is called letterboxing, shown in the next diagram. The resulting rectangular areas at the top and bottom of the image are typically filled with black, although other colors can be used.

The reverse case, scaling a 4:3 image to fit a widescreen display, is sometimes called pillarboxing. However, the term letterbox is also used in a general sense, to mean scaling a video image to fit any given display area.
Pan-and-Scan
Pan-and-scan is a technique whereby a widescreen image is cropped to a 4×3 rectangular area, for display on a 4:3 display device. The resulting image fills the entire display, without requiring black letterbox areas, but portions of the original image are cropped out of the picture. The area that is cropped can move from frame to frame, as the area of interest shifts. The term "pan" in pan-and-scan refers to the panning effect that is caused by moving the pan-and-scan area.

Pixel Aspect Ratio
Pixel aspect ratio (PAR) measures the shape of a pixel.

When a digital image is captured, the image is sampled both vertically and horizontally, resulting in a rectangular array of quantized samples, called pixels or pels. The shape of the sampling grid determines the shape of the pixels in the digitized image.
Here is an example that uses small numbers to keep the math simple. Suppose that the original image is square (that is, the picture aspect ratio is 1:1); and suppose the sampling grid contains 12 elements, arranged in a 4×3 grid. The shape of each resulting pixel will be taller than it is wide. Specifically, the shape of each pixel will be 3×4. Pixels that are not square are called non-square pixels.
Pixel aspect ratio also applies to the display device. The physical shape of the display device and the physical pixel resolution (across and down) determine the PAR of the display device. Computer monitors generally use square pixels. If the image PAR and the display PAR do not match, the image must be scaled in one dimension, either vertically or horizontally, in order to display correctly. The following formula relates PAR, display aspect ratio (DAR), and image size in pixels:
DAR = (image width in pixels / image height in pixels) × PAR
Note that image width and image height in this formula refer to the image in memory, not the displayed image.
Here is a real-world example: NTSC-M analog video contains 480 scan lines in the active image area. ITU-R Rec. BT.601 specifies a horizontal sampling rate of 704 visible pixels per line, yielding a digital image with 704 x 480 pixels. The intended picture aspect ratio is 4:3, yielding a PAR of 10:11.
- DAR: 4:3
- Width in pixels: 704
- Height in pixels: 480
- PAR: 10/11
To display this image correctly on a display device with square pixels, you must scale either the width by 10/11 or the height by 11/10.
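As a quick check, plugging these values into the formula above gives DAR = (704 / 480) × (10 / 11) = 7040 / 5280 = 4/3, which matches the intended 4:3 picture.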
Working with Aspect Ratios
The correct shape of a video frame is defined by the pixel aspect ratio (PAR) and the display area.
- The PAR defines the shape of the pixels in an image. Square pixels have an aspect ratio of 1:1. Any other aspect ratio describes a non-square pixel. For example, NTSC television uses a 10:11 PAR. Assuming that you are presenting the video on a computer monitor, the display will have square pixels (1:1 PAR). The PAR of the source content is given in the MF_MT_PIXEL_ASPECT_RATIO attribute on the media type.
- The display area is the region of the video image that is intended to be shown. There are two relevant display areas that might be specified in the media type:
- Pan-and-scan aperture. The pan-and-scan aperture is a 4×3 region of video that should be displayed in pan/scan mode. It is used to show wide-screen content on a 4×3 display without letterboxing. The pan-and-scan aperture is given in the MF_MT_PAN_SCAN_APERTURE attribute and should be used only when the MF_MT_PAN_SCAN_ENABLED attribute is TRUE.
- Display aperture. This aperture is defined in some video standards. Anything outside of the display aperture is the overscan region and should not be displayed. For example, NTSC television is 720×480 pixels with a display aperture of 704×480. The display aperture is given in the MF_MT_MINIMUM_DISPLAY_APERTURE attribute. If present, it should be used when pan-and-scan mode is FALSE.
Code Examples
Finding the Display Area
The following code shows how to get the display area from the media type.

MFVideoArea MakeArea(float x, float y, DWORD width, DWORD height);

HRESULT GetVideoDisplayArea(IMFMediaType *pType, MFVideoArea *pArea)
{
    HRESULT hr = S_OK;
    BOOL bPanScan = FALSE;
    UINT32 width = 0, height = 0;

    bPanScan = MFGetAttributeUINT32(pType, MF_MT_PAN_SCAN_ENABLED, FALSE);

    // In pan-and-scan mode, try to get the pan-and-scan region.
    if (bPanScan)
    {
        hr = pType->GetBlob(MF_MT_PAN_SCAN_APERTURE, (UINT8*)pArea,
                            sizeof(MFVideoArea), NULL);
    }

    // If not in pan-and-scan mode, or the pan-and-scan region is not set,
    // get the minimum display aperture.
    if (!bPanScan || hr == MF_E_ATTRIBUTENOTFOUND)
    {
        hr = pType->GetBlob(MF_MT_MINIMUM_DISPLAY_APERTURE, (UINT8*)pArea,
                            sizeof(MFVideoArea), NULL);

        if (hr == MF_E_ATTRIBUTENOTFOUND)
        {
            // Minimum display aperture is not set.
            // For backward compatibility with some components,
            // check for a geometric aperture.
            hr = pType->GetBlob(MF_MT_GEOMETRIC_APERTURE, (UINT8*)pArea,
                                sizeof(MFVideoArea), NULL);
        }

        // Default: Use the entire video area.
        if (hr == MF_E_ATTRIBUTENOTFOUND)
        {
            hr = MFGetAttributeSize(pType, MF_MT_FRAME_SIZE, &width, &height);
            if (SUCCEEDED(hr))
            {
                *pArea = MakeArea(0.0, 0.0, width, height);
            }
        }
    }
    return hr;
}
MFOffset MakeOffset(float v)
{
    MFOffset offset;
    offset.value = short(v);
    offset.fract = WORD(65536 * (v - offset.value));
    return offset;
}
MFVideoArea MakeArea(float x, float y, DWORD width, DWORD height)
{
    MFVideoArea area;
    area.OffsetX = MakeOffset(x);
    area.OffsetY = MakeOffset(y);
    area.Area.cx = width;
    area.Area.cy = height;
    return area;
}
Converting Between Pixel Aspect Ratios
The following code shows how to convert a rectangle from one pixel aspect ratio (PAR) to another, while preserving the picture aspect ratio.

//-----------------------------------------------------------------------------
// Converts a rectangle from one pixel aspect ratio (PAR) to another PAR.
// Returns the corrected rectangle.
//
// For example, a 720 x 486 rect with a PAR of 9:10, when converted to 1x1 PAR,
// must be stretched to 720 x 540.
//-----------------------------------------------------------------------------
RECT CorrectAspectRatio(const RECT& src, const MFRatio& srcPAR, const MFRatio& destPAR)
{
    // Start with a rectangle the same size as src, but offset to (0,0).
    RECT rc = {0, 0, src.right - src.left, src.bottom - src.top};

    // If the source and destination have the same PAR, there is nothing to do.
    // Otherwise, adjust the image size, in two steps:
    //  1. Transform from source PAR to 1:1
    //  2. Transform from 1:1 to destination PAR.
    if ((srcPAR.Numerator != destPAR.Numerator) ||
        (srcPAR.Denominator != destPAR.Denominator))
    {
        // Correct for the source's PAR.
        if (srcPAR.Numerator > srcPAR.Denominator)
        {
            // The source has "wide" pixels, so stretch the width.
            rc.right = MulDiv(rc.right, srcPAR.Numerator, srcPAR.Denominator);
        }
        else if (srcPAR.Numerator < srcPAR.Denominator)
        {
            // The source has "tall" pixels, so stretch the height.
            rc.bottom = MulDiv(rc.bottom, srcPAR.Denominator, srcPAR.Numerator);
        }
        // else: PAR is 1:1, which is a no-op.

        // Next, correct for the target's PAR. This is the inverse operation of
        // the previous.
        if (destPAR.Numerator > destPAR.Denominator)
        {
            // The destination has "wide" pixels, so stretch the height.
            rc.bottom = MulDiv(rc.bottom, destPAR.Numerator, destPAR.Denominator);
        }
        else if (destPAR.Numerator < destPAR.Denominator)
        {
            // The destination has "tall" pixels, so stretch the width.
            rc.right = MulDiv(rc.right, destPAR.Denominator, destPAR.Numerator);
        }
        // else: PAR is 1:1, which is a no-op.
    }
    return rc;
}
Calculating the Letterbox Area
The following code calculates the letterbox area, given a source and destination rectangle. It is assumed that both rectangles have the same PAR.

RECT LetterBoxRect(const RECT& rcSrc, const RECT& rcDst)
{
    // Compute source/destination ratios.
    int iSrcWidth  = rcSrc.right - rcSrc.left;
    int iSrcHeight = rcSrc.bottom - rcSrc.top;

    int iDstWidth  = rcDst.right - rcDst.left;
    int iDstHeight = rcDst.bottom - rcDst.top;

    int iDstLBWidth;
    int iDstLBHeight;

    if (MulDiv(iSrcWidth, iDstHeight, iSrcHeight) <= iDstWidth)
    {
        // Column letterboxing ("pillar box")
        iDstLBWidth  = MulDiv(iDstHeight, iSrcWidth, iSrcHeight);
        iDstLBHeight = iDstHeight;
    }
    else
    {
        // Row letterboxing.
        iDstLBWidth  = iDstWidth;
        iDstLBHeight = MulDiv(iDstWidth, iSrcHeight, iSrcWidth);
    }

    // Create a centered rectangle within the current destination rect.
    LONG left = rcDst.left + ((iDstWidth - iDstLBWidth) / 2);
    LONG top  = rcDst.top + ((iDstHeight - iDstLBHeight) / 2);

    RECT rc;
    SetRect(&rc, left, top, left + iDstLBWidth, top + iDstLBHeight);
    return rc;
}
Related topics
- Media Types
- Video Media Types
- MF_MT_MINIMUM_DISPLAY_APERTURE
- MF_MT_PAN_SCAN_APERTURE
- MF_MT_PAN_SCAN_ENABLED
- MF_MT_PIXEL_ASPECT_RATIO
Extended Color Information
Suppose you are given an RGB triplet, say (255, 0, 0). What color does it describe? The answer may be surprising: without some additional information, this triplet does not define any particular color! The meaning of any RGB value depends on the color space. If we don't know the color space, then strictly speaking we don't know the color.
A color space defines how the numeric representation of a given color value should be reproduced as physical light. When video is encoded in one color space but displayed in another, the colors are distorted, unless the video is color-corrected. To achieve accurate color fidelity, therefore, it is crucial to know the color space of the source video. Previously, the video pipeline in Windows did not carry information about the intended color space. Starting in Windows Vista, both DirectShow and Media Foundation support extended color information in the media type. This information is also available for DirectX Video Acceleration (DXVA).
The standard way to describe a color space mathematically is to use the CIE XYZ color space, defined by the International Commission on Illumination (CIE). It is not practical to use CIE XYZ values directly in video, but the CIE XYZ color space can be used as an intermediate representation when converting between color spaces.
To reproduce colors accurately, the following information is needed:
- Color primaries. The color primaries define how CIE XYZ tristimulus values are represented as RGB components. In effect, the color primaries define the "meaning" of a given RGB value.
- Transfer function. The transfer function is a function that converts linear RGB values to non-linear R'G'B' values. This function is also called gamma correction.
- Transfer matrix. The transfer matrix defines how R'G'B' is converted to Y'PbPr.
- Chroma sampling. Most YUV video is transmitted with less resolution in the chroma components than in the luma. Chroma sampling is indicated by the FOURCC of the video format. For example, YUY2 is a 4:2:2 format, meaning the chroma samples are subsampled horizontally by a factor of 2.
- Chroma siting. When the chroma is sampled, the position of the chroma samples relative to the luma samples determines how the missing samples should be interpolated.
- Nominal range. The nominal range defines how Y'PbPr values are scaled to Y'CbCr.
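To make the last item concrete, here is a minimal sketch of the standard 8-bit studio-swing quantization (nominal range 16-235), which maps Y' in [0, 1] and Pb/Pr in [-0.5, 0.5] to Y'CbCr codes. The helper names are illustrative, not part of any Windows API:

    // 8-bit studio-swing quantization: Y' maps to 16-235, Cb/Cr to 16-240.
    unsigned char QuantizeLuma(double y)        // y in [0.0, 1.0]
    {
        return (unsigned char)(16.0 + 219.0 * y + 0.5);
    }

    unsigned char QuantizeChroma(double pbOrPr) // pbOrPr in [-0.5, 0.5]
    {
        return (unsigned char)(128.0 + 224.0 * pbOrPr + 0.5);
    }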
Color Space in Media Types
DirectShow, Media Foundation, and DirectX Video Acceleration (DXVA) all have different ways to represent video formats. Fortunately, it is easy to translate the color-space information from one to another, because the relevant enumerations are the same.

- DXVA 1.0: Color-space information is given in the DXVA_ExtendedFormat structure.
- DXVA 2.0: Color-space information is given in the DXVA2_ExtendedFormat structure. This structure is identical to the DXVA 1.0 structure, and the meaning of the fields is the same.
- DirectShow: Color-space information is given in the VIDEOINFOHEADER2 structure. The information is stored in the upper 24 bits of the dwControlFlags field. If color-space information is present, set the AMCONTROL_COLORINFO_PRESENT flag in dwControlFlags. When this flag is set, the dwControlFlags field should be interpreted as a DXVA_ExtendedFormat structure, except that the lower 8 bits of the structure are reserved for AMCONTROL_xxx flags.
- Video capture drivers: Color-space information is given in the KS_VIDEOINFOHEADER2 structure. This structure is identical to the VIDEOINFOHEADER2 structure, and the meaning of the fields is the same.
- Media Foundation: Color-space information is stored as attributes in the media type, as shown in the following table. (An example of setting these attributes follows the table.)
Color information | Attribute |
---|---|
Color primaries | MF_MT_VIDEO_PRIMARIES |
Transfer function | MF_MT_TRANSFER_FUNCTION |
Transfer matrix | MF_MT_YUV_MATRIX |
Chroma subsampling | MF_MT_SUBTYPE (given by the FOURCC, which is stored in the first DWORD of the subtype GUID) |
Chroma siting | MF_MT_VIDEO_CHROMA_SITING |
Nominal range | MF_MT_VIDEO_NOMINAL_RANGE |
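For example, a component that produces BT.709 studio-swing video might tag the media type as follows. This is a sketch, assuming pType is a valid IMFMediaType pointer obtained elsewhere:

    #include <mfapi.h>

    HRESULT SetColorSpace709(IMFMediaType *pType)
    {
        // Tag the media type with BT.709 color-space information.
        HRESULT hr = pType->SetUINT32(MF_MT_VIDEO_PRIMARIES, MFVideoPrimaries_BT709);
        if (SUCCEEDED(hr))
            hr = pType->SetUINT32(MF_MT_TRANSFER_FUNCTION, MFVideoTransFunc_709);
        if (SUCCEEDED(hr))
            hr = pType->SetUINT32(MF_MT_YUV_MATRIX, MFVideoTransferMatrix_BT709);
        if (SUCCEEDED(hr))
            hr = pType->SetUINT32(MF_MT_VIDEO_CHROMA_SITING, MFVideoChromaSubsampling_MPEG2);
        if (SUCCEEDED(hr))
            hr = pType->SetUINT32(MF_MT_VIDEO_NOMINAL_RANGE, MFNominalRange_16_235);
        return hr;
    }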
Color Space Conversion
Conversion from one Y'CbCr space to another requires the following steps.

- Inverse quantization: Convert the Y'CbCr representation to a Y'PbPr representation, using the source nominal range.
- Upsampling: Convert the sampled chroma values to 4:4:4 by interpolating chroma values.
- YUV to RGB conversion: Convert from Y'PbPr to non-linear R'G'B', using the source transfer matrix.
- Inverse transfer function: Convert non-linear R'G'B' to linear RGB, using the inverse of the transfer function.
- RGB color space conversion: Use the color primaries to convert from the source RGB space to the target RGB space.
- Transfer function: Convert linear RGB to non-linear R'G'B', using the target transfer function.
- RGB to YUV conversion: Convert R'G'B' to Y'PbPr, using the target transfer matrix.
- Downsampling: Convert 4:4:4 to 4:2:2, 4:2:0, or 4:1:1 by filtering the chroma values.
- Quantization: Convert Y'PbPr to Y'CbCr, using the target nominal range.
For example, converting from BT.601 to BT.709 requires the following stages (a code sketch follows the list):
- Inverse quantization: Y'CbCr(601) to Y'PbPr(601)
- Upsampling: Y'PbPr(601)
- YUV to RGB: Y'PbPr(601) to R'G'B'(601)
- Inverse transfer function: R'G'B'(601) to RGB(601)
- RGB color space conversion: RGB(601) to RGB(709)
- Transfer function: RGB(709) to R'G'B'(709)
- RGB to YUV: R'G'B'(709) to Y'PbPr(709)
- Downsampling: Y'PbPr(709)
- Quantization: Y'PbPr(709) to Y'CbCr(709)
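The following sketch shows the heart of stages 3 and 7 for a single pixel, converting normalized Y'PbPr(601) to Y'PbPr(709) through non-linear R'G'B'. To stay short it deliberately skips resampling, the transfer functions, and the RGB primary conversion (the BT.601 and BT.709 primaries are close, so this is a common approximation, not the full pipeline described above). The coefficients are the standard BT.601 and BT.709 luma weights:

    // Approximate Y'PbPr(601) -> Y'PbPr(709) for one pixel, via non-linear
    // R'G'B'. Skips the linear-light primary conversion (stages 4-6).
    struct YPbPr { double Y, Pb, Pr; };

    YPbPr Convert601To709(const YPbPr& in)
    {
        // Stage 3: Y'PbPr(601) to R'G'B', using the BT.601 matrix.
        double R = in.Y + 1.402    * in.Pr;
        double G = in.Y - 0.344136 * in.Pb - 0.714136 * in.Pr;
        double B = in.Y + 1.772    * in.Pb;

        // Stage 7: R'G'B' to Y'PbPr(709), using the BT.709 matrix.
        YPbPr out;
        out.Y  = 0.2126 * R + 0.7152 * G + 0.0722 * B;
        out.Pb = (B - out.Y) / 1.8556;
        out.Pr = (R - out.Y) / 1.5748;
        return out;
    }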
Using Extended Color Information
To preserve color fidelity throughout the pipeline, color-space information must be introduced at the source or the decoder and conveyed all the way through the pipeline to the sink.

Video Capture Devices
Most analog capture devices use a well-defined color space when capturing video. Capture drivers should offer a format with a KS_VIDEOINFOHEADER2 format block that contains the color information. For backward compatibility, the driver should accept formats that do not contain the color information. This will enable the driver to work with components that do not accept the extended color information.

File-based Sources
When parsing a video file, the media source (or parser filter, in DirectShow) might be able to provide some of the color information. For example, the DVD Navigator can determine the color space based on the DVD content. Other color information might be available to the decoder. For example, an MPEG-2 elementary video stream gives the color information in the sequence_display_extension field. If the color information is not explicitly described in the source, it might be defined implicitly by the type of content. For example, the NTSC and PAL varieties of DV video each use different color spaces. Finally, the decoder can use whatever color information it gets from the source's media type.

Other Components
Other components might need to use the color-space information in a media type:

- Software color-space converters should use color-space information when selecting a conversion algorithm.
- Video mixers, such as the enhanced video renderer (EVR) mixer, should use the color information when mixing video streams from different types of content.
- The DXVA video processing APIs and DDIs enable the caller to specify color-space information. The GPU should use this information when it performs hardware video mixing.
Related topics
10-bit and 16-bit YUV Video Formats
This topic contains the following sections:
- Overview
- FOURCC Codes For 10-Bit and 16-Bit YUV
- Surface Definitions
- Preferred YUV Formats
- Related topics
Overview
These formats use a fixed-point representation for both the luma channel and the chroma (C'b and C'r) channels. Sample values are scaled 8-bit values, using a scaling factor of 2^(n − 8), where n is either 10 or 16, as per sections 7.7-7.8 and 7.11-7.12 of SMPTE 274M. Precision conversions can be performed using simple bit shifts. For example, if the white point of an 8-bit format is 235, the corresponding 10-bit format has a white point at 940 (235 × 4).

The 16-bit representations described here use little-endian WORD values for each channel. The 10-bit formats also use 16 bits for each channel, with the lowest 6 bits set to zero, as shown in the following diagram.
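These conversions reduce to bit shifts; a minimal sketch:

    // Precision conversion by bit shift, per the 2^(n - 8) scaling rule.
    // An 8-bit value v8 becomes v8 << 2 as a 10-bit value (235 -> 940) and
    // v8 << 8 as a 16-bit value. Since a stored 10-bit sample occupies the
    // upper 10 bits of a WORD (low 6 bits zero), both representations store
    // the same WORD: (v8 << 2) << 6 == v8 << 8.
    unsigned short To10Bit(unsigned char v8) { return (unsigned short)(v8 << 2); }
    unsigned short To16Bit(unsigned char v8) { return (unsigned short)(v8 << 8); }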
Because the 10-bit and 16-bit representations of the same YUV format have the same memory layout, it is possible to cast a 10-bit representation to a 16-bit representation with no loss of precision. It is also possible to cast a 16-bit representation down to a 10-bit representation. (The Y416 and Y410 formats are an exception to this general rule, however, because they do not share the same memory layout.)
When the graphics hardware reads a surface that contains a 10-bit representation, it should ignore the low-order 6 bits of each channel. If a surface contains valid 16-bit data, however, it should be identified as a 16-bit surface.
In the formats that contain alpha, a completely transparent pixel has an alpha value of zero, and a completely opaque pixel has an alpha value of (2^n) – 1, where n is the number of alpha bits. Alpha is assumed to be a linear value that is applied to each component after the component has been converted into its normalized linear form.
For images in video memory, the graphics driver selects the memory alignment of the surface. The surface must be DWORD aligned. That is, individual lines within a surface are guaranteed to start at a 32-bit boundary, although the alignment can be larger than 32 bits. The origin (0,0) is always the upper-left corner of the surface.
For the purposes of this documentation, the term U is equivalent to Cb, and the term V is equivalent to Cr.
FOURCC Codes For 10-Bit and 16-Bit YUV
The FOURCC codes for the formats described here use the following convention:

- If the format is planar, the first character in the FOURCC code is 'P'. If the format is packed, the first character is 'Y'.
- The second character in the FOURCC code is determined by the chroma sampling, as shown in the following table.
Chroma sampling | FOURCC code letter |
---|---|
4:4:4 | '4' |
4:2:2 | '2' |
4:1:1 | '1' |
4:2:0 | '0' |
- The final two characters in the FOURCC indicate the number of bits per channel, either '16' for 16 bits or '10' for 10 bits.
FOURCC | Description |
---|---|
P016 | Planar, 4:2:0, 16-bit. |
P010 | Planar, 4:2:0, 10-bit. |
P216 | Planar, 4:2:2, 16-bit. |
P210 | Planar, 4:2:2, 10-bit. |
Y216 | Packed, 4:2:2, 16-bit. |
Y210 | Packed, 4:2:2, 10-bit. |
Y416 | Packed, 4:4:4, 16-bit. |
Y410 | Packed, 4:4:4, 10-bit. |
Subtype GUIDs have also been defined from these FOURCCs; see Video Subtype GUIDs.
Surface Definitions
This section describes the memory layout of each format. In the descriptions that follow, the term WORD refers to a little-endian 16-bit value, and the term DWORD refers to a little-endian 32-bit value.

4:2:0 Formats
Two 4:2:0 formats are defined, with the FOURCC codes P016 and P010. They share the same memory layout, but P016 uses 16 bits per channel and P010 uses 10 bits per channel.

P016 and P010
In these two formats, all Y samples appear first in memory as an array of WORDs with an even number of lines. The surface stride can be larger than the width of the Y plane. This array is followed immediately by an array of WORDs that contains interleaved U and V samples, as shown in the following diagram.

If the combined U-V array is addressed as an array of DWORDs, the least significant word (LSW) contains the U value and the most significant word (MSW) contains the V value. The stride of the combined U-V plane is equal to the stride of the Y plane. The U-V plane has half as many lines as the Y plane.
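To illustrate the layout, here is a minimal sketch that reads the Y, U, and V samples for the pixel at (x, y) from a P016 or P010 surface. The helper name and parameters are illustrative; stride is the surface stride in bytes, and height is the height of the Y plane in lines:

    #include <windows.h>

    void ReadPixelP016(const BYTE* pBase, LONG stride, DWORD height,
                       DWORD x, DWORD y, WORD* pY, WORD* pU, WORD* pV)
    {
        // Y plane: one WORD per pixel.
        const WORD* pYRow = (const WORD*)(pBase + (SIZE_T)y * stride);
        *pY = pYRow[x];

        // The interleaved U-V plane follows the Y plane, uses the same
        // stride, and has half as many lines as the Y plane.
        const BYTE* pUVPlane = pBase + (SIZE_T)stride * height;
        const WORD* pUVRow = (const WORD*)(pUVPlane + (SIZE_T)(y / 2) * stride);
        *pU = pUVRow[x & ~1u];        // Lower-addressed WORD holds U.
        *pV = pUVRow[(x & ~1u) + 1];  // Next WORD holds V.
    }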
These two formats are the preferred 4:2:0 planar pixel formats for higher precision YUV representations. They are expected to be an intermediate-term requirement for DirectX Video Acceleration (DXVA) accelerators that support 10-bit or 16-bit 4:2:0 video.
4:2:2 Formats
Four 4:2:2 formats are defined, two planar and two packed. They have the following FOURCC codes:

- P216
- P210
- Y216
- Y210
P216 and P210
In these two planar formats, all Y samples appear first in memory as an array of WORDs with an even number of lines. The surface stride can be larger than the width of the Y plane. This array is followed immediately by an array of WORDs that contains interleaved U and V samples, as shown in the following diagram.

If the combined U-V array is addressed as an array of DWORDs, the LSW contains the U value and the MSW contains the V value. The stride of the combined U-V plane is equal to the stride of the Y plane. The U-V plane has the same number of lines as the Y plane.
These two formats are the preferred 4:2:2 planar pixel formats for higher precision YUV representations. They are expected to be an intermediate-term requirement for DirectX Video Acceleration (DXVA) accelerators that support 10-bit or 16-bit 4:2:2 video.
Y216 and Y210
In these two packed formats, each pair of pixels is stored as an array of four WORDs, as shown in the following illustration.

The first WORD in the array contains the first Y sample in the pair, the second WORD contains the U sample, the third WORD contains the second Y sample, and the fourth WORD contains the V sample.
Y210 is identical to Y216 except that each sample contains only 10 bits of significant data. The least significant 6 bits are set to zero, as described previously.
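In other words, the WORD order for each pixel pair is Y0, U, Y1, V. A minimal sketch of reading one pair (the helper name is illustrative):

    #include <windows.h>

    void ReadPairY216(const WORD* pRow, DWORD pairIndex,
                      WORD* pY0, WORD* pU, WORD* pY1, WORD* pV)
    {
        // Pixel pair i occupies four consecutive WORDs: Y0, U, Y1, V.
        const WORD* p = pRow + (SIZE_T)pairIndex * 4;
        *pY0 = p[0];
        *pU  = p[1];
        *pY1 = p[2];
        *pV  = p[3];
    }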
4:4:4 Formats
Two 4:4:4 formats are defined, with the FOURCC codes Y410 and Y416. Both are packed formats.

Y410
This format is a packed 10-bit representation that includes 2 bits of alpha. Each pixel is encoded as a single DWORD with the memory layout shown in the following diagram.

Bits 0-9 contain the U sample, bits 10-19 contain the Y sample, bits 20-29 contain the V sample, and bits 30-31 contain the alpha value. To indicate that a pixel is fully opaque, an application must set the two alpha bits equal to 0x03.
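Packing a Y410 pixel therefore reduces to shifts and masks; a minimal sketch (the helper name is illustrative):

    #include <windows.h>

    DWORD PackY410(WORD u, WORD y, WORD v, BYTE alpha)
    {
        // U in bits 0-9, Y in bits 10-19, V in bits 20-29, alpha in 30-31.
        return  (DWORD)(u & 0x3FF)
             | ((DWORD)(y & 0x3FF) << 10)
             | ((DWORD)(v & 0x3FF) << 20)
             | ((DWORD)(alpha & 0x3) << 30);  // 0x03 = fully opaque.
    }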
Y416
This format is a packed 16-bit representation that includes 16 bits of alpha. Each pixel is encoded as a pair of DWORDs, as shown in the following illustration.

Bits 0-15 contain the U sample, bits 16-31 contain the Y sample, bits 32-47 contain the V sample, and bits 48-63 contain the alpha value.
To indicate that a pixel is fully opaque, an application must set the 16 alpha bits to 0xFFFF. This format is intended primarily as an intermediate format during image processing to avoid the accumulation of errors.
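Because each channel occupies a full little-endian WORD, a Y416 pixel can also be viewed as four consecutive WORDs; a minimal sketch:

    #include <windows.h>

    // One Y416 pixel viewed as four little-endian WORDs; the field order
    // follows the bit layout described above.
    struct Y416Pixel
    {
        WORD U;  // Bits 0-15.
        WORD Y;  // Bits 16-31.
        WORD V;  // Bits 32-47.
        WORD A;  // Bits 48-63; 0xFFFF = fully opaque.
    };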
Preferred YUV Formats
The following table lists the preferred YUV formats, including 8-bit formats.

Format | Chroma sampling | Packed or planar | Bits per channel |
---|---|---|---|
AYUV | 4:4:4 | Packed | 8 |
Y410 | 4:4:4 | Packed | 10 |
Y416 | 4:4:4 | Packed | 16 |
AI44 | 4:4:4 | Packed | Palettized |
YUY2 | 4:2:2 | Packed | 8 |
Y210 | 4:2:2 | Packed | 10 |
Y216 | 4:2:2 | Packed | 16 |
P210 | 4:2:2 | Planar | 10 |
P216 | 4:2:2 | Planar | 16 |
NV12 | 4:2:0 | Planar | 8 |
P010 | 4:2:0 | Planar | 10 |
P016 | 4:2:0 | Planar | 16 |
NV11 | 4:1:1 | Planar | 8 |
If an object supports a given bit depth and chroma sampling scheme, it is recommended that the object also support the corresponding YUV formats listed in this table. (Objects might support additional formats not listed here.)
Related topics
Video FOURCCs
Various C/C++ macros are defined for declaring FOURCC values in source code. The MAKEFOURCC macro is defined in Mmsystem.h, and the FCC macro is defined in Aviriff.h and various other header files. You can also declare a FOURCC code directly as a string literal simply by reversing the order of the characters. Thus, the following statements are equivalent:
    DWORD fccYUY2 = MAKEFOURCC('Y','U','Y','2');
    DWORD fccYUY2 = FCC('YUY2');
    DWORD fccYUY2 = '2YUY';  // Declares the FOURCC 'YUY2'.
Some of the DirectX Video Acceleration 2.0 APIs use a D3DFORMAT value to describe a video format. A FOURCC code can be used in this context as well:
    D3DFORMAT fmt = (D3DFORMAT)MAKEFOURCC('Y','U','Y','2');
    D3DFORMAT fmt = (D3DFORMAT)FCC('YUY2');
    D3DFORMAT fmt = D3DFORMAT('2YUY');  // Coerce to D3DFORMAT type.
FOURCC Constants
The following table lists some common FOURCC codes.

FOURCC value | Description |
---|---|
'H264' | H.264 video. |
'I420' | YUV video stored in planar 4:2:0 format. |
'IYUV' | YUV video stored in planar 4:2:0 format. |
'M4S2' | MPEG-4 part 2 video. |
'MP4S' | Microsoft MPEG 4 codec version 3. This codec is no longer supported. |
'MP4V' | MPEG-4 part 2 video. |
'MPG1' | MPEG-1 video. |
'MSS1' | Content encoded with the Windows Media Video 7 screen codec. |
'MSS2' | Content encoded with the Windows Media Video 9 screen codec. |
'UYVY' | YUV video stored in packed 4:2:2 format. Similar to YUY2 but with different ordering of data. |
'WMV1' | Content encoded with the Windows Media Video 7 codec. |
'WMV2' | Content encoded with the Windows Media Video 8 codec. |
'WMV3' | Content encoded with the Windows Media Video 9 codec. |
'WMVA' | Content encoded with the older, obsolete version of the Windows Media Video 9 Advanced Profile codec. |
'WMVP' | Content encoded with the Windows Media Video 9.1 Image codec. |
'WVC1' | SMPTE 421M ("VC-1"). Content encoded with Windows Media Video 9 Advanced Profile. |
'WVP2' | Content encoded with the Windows Media Video 9.1 Image v2 codec. |
'YUY2' | YUV video stored in packed 4:2:2 format. |
'YV12' | YUV video stored in planar 4:2:0 or 4:1:1 format. Identical to I420/IYUV except that the U and V planes are switched. |
'YVU9' | YUV video stored in planar 16:1:1 format. |
'YVYU' | YUV video stored in packed 4:2:2 format. Similar to YUY2 but with different ordering of data. |