Sometimes in their daily lives, video engineers forget that not everybody understands the alphabet soup of three-letter acronyms (TLAs) that get used regularly in our industry. In this Back to Basics series of blog posts, I revisit core and fundamental concepts in video compression and delivery and attempt to explain them in accessible, easy-to-understand ways.

In this post, I explain the concept of a Group of Pictures (GOP), which is used in most modern video compression standards, including MPEG-2, H.264, and H.265. First, let’s start at the beginning.

Why do we need to compress video in the first place?

Raw, uncompressed video carried over an HDMI, HD-SDI, or Ethernet cable requires a lot of bandwidth. The following chart provides approximate required bitrates to carry uncompressed digital video of different sizes (resolutions):

Resolution                    Uncompressed Bitrate
1280×720 (720p)               ~1.5 Gbps
1920×1080 (1080p)             ~3 Gbps
3840×2160 (2160p or 4K)       ~12 Gbps

While sending video around at multiple gigabits per second may be feasible within a professional production facility, it’s obviously not possible to stream 1.5 Gbps into most viewers’ homes or to their mobile devices. For much of the world, residential internet services aren’t even capable of sustaining speeds near that. So in order to deliver video over the internet, or over another fixed-capacity medium (such as satellite or Blu-ray Disc), it is necessary to compress the video to more manageable bitrates (usually in the 1-20 Mbps range).
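The figures in the table above can be sanity-checked with a quick back-of-the-envelope calculation. Here is a minimal sketch, assuming 60 frames per second and 24 bits per pixel (8-bit samples, no chroma subsampling); the exact parameters vary by format:

```python
def uncompressed_bitrate(width, height, fps=60, bits_per_pixel=24):
    """Approximate bitrate of raw video, in bits per second."""
    return width * height * bits_per_pixel * fps

for name, (w, h) in {"720p":  (1280, 720),
                     "1080p": (1920, 1080),
                     "2160p": (3840, 2160)}.items():
    print(f"{name}: ~{uncompressed_bitrate(w, h) / 1e9:.1f} Gbps")
```

Pixel data alone works out to roughly 1.3 Gbps for 720p60; the slightly higher ~1.5 Gbps in the table reflects link-level overhead (such as blanking intervals on HD-SDI) on real transports.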

How does video compression work?

How much time do you have? Deeply understanding the specifics of video compression requires years of study and is often the subject of research by graduate students in Computer Science or Mathematics. In this blog post, I will scratch the surface by explaining general concepts of video compression.

At its fundamental level, video compression works by finding and eliminating redundancies within an image or a sequence of images. If you think of a typical video that you might watch online or on TV, a newscast for example, each frame of video is typically very similar to those that precede and follow it. A newscaster’s lips may move as they talk, and information may scroll along the bottom of the screen, but a large portion of the image is either the same, or very similar, from frame to frame.

Three sequential frames of video from Big Buck Bunny at 30 fps. Notice that most of the content from frame to frame remains the same, only the butterfly’s wings change slightly.

To take advantage of this, most video compression is achieved by sending only some frames in full (referred to as keyframes), then sending only the difference between the keyframe and the subsequent frames. The receiver (decoder) can use the keyframe plus these differences to re-create the desired frame with reasonable accuracy. This method of compression is known as temporal compression because it exploits the fact that information in a video changes slowly over time.
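The keyframe-plus-differences idea can be sketched in a few lines. This toy example treats frames as flat lists of pixel values; real codecs work on blocks with motion compensation, but the principle is the same:

```python
def encode_delta(reference, frame):
    # Send only the per-pixel differences from the reference frame;
    # long runs of zeros (unchanged pixels) compress extremely well.
    return [cur - ref for cur, ref in zip(frame, reference)]

def decode_delta(reference, delta):
    # Reconstruct the frame from the reference plus the differences.
    return [ref + d for ref, d in zip(reference, delta)]

keyframe = [10, 10, 10, 200, 200, 10]   # sent in full
frame2   = [10, 10, 12, 201, 200, 10]   # nearly identical to the keyframe

delta = encode_delta(keyframe, frame2)
print(delta)  # [0, 0, 2, 1, 0, 0]: mostly zeros
assert decode_delta(keyframe, delta) == frame2
```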

A second type of compression, known as spatial compression, is also used to compress the keyframes themselves by finding and eliminating redundancies within the same image. Again, picture a photo of a newscaster reading the news. In most cases, pixels within the image are similar to the pixels that surround them, so we can apply the same technique of only sending the differences between one group of pixels and the subsequent group. This is the same technique used to compress images that we are all familiar with when saving an image in JPEG format.
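As a rough illustration of the spatial idea, here is a sketch that predicts each pixel in a row from its left neighbour and keeps only the small residuals. (Actual codecs use block transforms such as the DCT and more elaborate intra prediction, but the intuition of exploiting neighbouring-pixel similarity is the same.)

```python
def predict_row(row):
    # Predict each pixel from its left neighbour; transmit only residuals.
    residuals = [row[0]]  # the first pixel is sent as-is
    for left, cur in zip(row, row[1:]):
        residuals.append(cur - left)
    return residuals

sky = [118, 119, 119, 120, 121, 121]  # a smooth run of blue sky pixels
print(predict_row(sky))  # [118, 1, 0, 1, 1, 0]: small values, cheap to code
```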

Sample video frame extracted from Big Buck Bunny. Notice that pixels tend to be surrounded by other pixels of similar color – for example, in the sky, blue pixels are surrounded by other blue pixels, and white pixels are surrounded by other white pixels in the cloud. Encoders exploit this property to compress images using spatial compression.

Got it! So what’s all this about GOPs?

In an ideal world, video encoders could send a keyframe for the first frame of a video, then every subsequent frame would be represented as differences until the end of the video. There are, however, a few reasons why that doesn’t work well in practice:

  1. Random access: Sending the first frame as a keyframe and subsequent differences could work if every viewer started from the first frame and only watched forward. But this isn’t how viewers actually consume video. Viewers skip ahead, and they join live video streams at random points. To accommodate this behavior, more keyframes need to be placed throughout the video so that viewers can begin watching from those points. These are called random access points.
  2. Error resiliency: The other problem with only sending differences for the majority of the video is that delivery media are imperfect. Packets get lost, bits get flipped, and all sorts of other errors happen in the real world. If you only send differences from what came before and an error or corruption occurs, that error will continue to propagate through the rest of the video stream until it concludes. Adding additional keyframes throughout a video provides error resiliency by returning the decoder to a “known good” frame and clearing previous errors that may have been propagating. You have probably seen this happen while watching videos, where some error is introduced and the screen gets blocky, or green-tinged shapes appear on the screen. Then suddenly, the picture returns to a crisp and clear image.
  3. Scene changes: Sending only differences between frames works very well when the differences between frames are relatively small. During content changes, or scene cuts, it’s possible for nearly the entire image to be filled with new information from one frame to the next. When this happens, it usually doesn’t make sense to continue sending only differences. A video encoder will often detect this situation and automatically insert a new keyframe at the boundary point. This is called scene change detection.

So, now that you understand why it’s important to regularly insert keyframes into video streams, I can talk about a Group of Pictures (GOP). Simply put, a GOP is the distance between two keyframes, measured either in the number of frames or in the amount of time between them. For example, if a keyframe is inserted every 1 second into a video at 30 frames per second, the GOP length is 30 frames, or 1 second. While real-world GOP lengths vary by application, they typically fall in the 0.5 – 2 second range.
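The frames-versus-seconds relationship is just the frame rate times the keyframe interval, sketched below:

```python
def gop_frames(fps, gop_seconds):
    # GOP length in frames = frame rate x keyframe interval in seconds
    return round(fps * gop_seconds)

print(gop_frames(30, 1))     # 30 frames
print(gop_frames(29.97, 2))  # 60 frames
```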

‘Keyframes’? ‘Differences’? Aren’t there more formal names for these?

Of course there are! In MPEG-2 compression and beyond, keyframes are typically known as Intra-coded frames, or I-frames for short. They are named this because they are compressed using spatial compression, so all of the information required to decode this type of frame comes from within the frame itself (“intra”). The decoder does not depend on or require any other frames in order to create the image. In H.264 and beyond, a special type of frame called an Instantaneous Decoder Refresh frame (or IDR-frame) was introduced. While there are subtle differences between I- and IDR-frames, for the purposes of understanding GOPs we can treat them as if they’re the same. You can think of an I-frame as essentially a JPEG image of the frame. Typically, I-frames consume the most bits within a video stream because they take advantage of spatial compression only, not temporal.

How about the “differences”? There are two types of frames that we use to carry difference information to a decoder. The first is called a Predicted frame, or a P-frame. It’s called a predicted frame because it predicts its content from frames that came before it. P-frames provide the “differences” between the current frame and one (or more) frames that precede it. P-frames offer much better compression than I-frames because they take advantage of both temporal and spatial compression, and they use fewer bits within a video stream.

The last type of “difference” frame is a Bi-directional Predicted frame, or a B-frame. B-frames are bi-directional because they reference frames that come both before and after them, sending only the differences between the current frame and those past and future reference frames. Because they take temporal compression and amp it up to 11, B-frames offer the highest compression and take up the fewest bits within a video stream.

For those of you still following along, you might ask how a B-frame is able to see into the future to specify differences between it and a future frame. Again, the specifics are a little bit beyond the scope of this blog post, but what happens is that the encoder buffers the past and future frames before encoding the frames in between. It sends these frames to the decoder out of order, so the future frame actually arrives at the decoder before the B-frames do. The decoder then constructs the frames in between and plays them out in the correct order using timing information included by the encoder.
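The reordering described above can be sketched as follows. The frame labels here are hypothetical (the number is the display position); the sketch simply moves each anchor (I- or P-) frame ahead of the B-frames that precede it in display order, since the decoder needs that anchor first:

```python
def to_decode_order(frames):
    # Emit each anchor (I/P) frame before the B-frames that precede it
    # in display order; the decoder restores display order from the
    # timing information the encoder embeds in the stream.
    out, pending_b = [], []
    for f in frames:
        if f[0] == "B":
            pending_b.append(f)     # hold B-frames until their anchor
        else:
            out.append(f)           # anchor first,
            out.extend(pending_b)   # then its dependent B-frames
            pending_b = []
    return out + pending_b

display_order = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
print(to_decode_order(display_order))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

Notice that P3 is transmitted before B1 and B2, even though it is displayed after them.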

Sizes (in kB) of different frame types within a GOP. Notice that I-frames use the most bits, followed by P-frames, while B-frames use the fewest.

I and B, and P, oh my! What does it look like in an actual video?

A typical GOP contains a repeating pattern of B and P frames sandwiched between I frames. An example of a typical pattern might be something like:

I B B P B B P B B P B B I

A sequence such as the above can be represented by two numbers: M and N. M represents the distance between two consecutive I- or P-frames (the anchor frames), whereas N represents the distance between two I-frames. The above GOP is described as M=3, N=12.
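The M and N values can be read straight off a frame-type pattern. A small sketch, using the sequence above written as a string:

```python
def gop_params(pattern):
    # M: distance between consecutive anchor (I or P) frames.
    # N: distance between two I-frames (the GOP length).
    anchors = [i for i, f in enumerate(pattern) if f in "IP"]
    m = anchors[1] - anchors[0] if len(anchors) > 1 else 1
    n = pattern.index("I", 1) if pattern.count("I") > 1 else len(pattern)
    return m, n

print(gop_params("IBBPBBPBBPBBI"))  # (3, 12)
```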

Professional video analyzers can depict GOPs and frame types visually.

But you can also see the GOP order of an encoded movie using open-source tools such as ffprobe:

$ ffprobe -i SAMPLE_MOVIE.mp4 -show_frames | grep 'pict_type'
pict_type=I
pict_type=B
pict_type=B
pict_type=P
pict_type=B
pict_type=B
pict_type=P
pict_type=B
pict_type=B
pict_type=P
pict_type=B
pict_type=P
pict_type=I

From the above output, our sample movie has a GOP length of 12 frames.
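Counting the distance between I-frames by hand gets tedious for longer files. A short sketch that computes GOP lengths from the same pict_type lines (assuming the `pict_type=` output format shown above):

```python
def gop_lengths(lines):
    """Frames between successive I-frames in ffprobe pict_type output."""
    i_frames = [i for i, line in enumerate(lines)
                if line.strip() == "pict_type=I"]
    return [b - a for a, b in zip(i_frames, i_frames[1:])]

# The thirteen pict_type lines from the ffprobe output above:
sample = (["pict_type=I"]
          + ["pict_type=B", "pict_type=B", "pict_type=P"] * 3
          + ["pict_type=B", "pict_type=P", "pict_type=I"])
print(gop_lengths(sample))  # [12]
```

In practice you would feed it the ffprobe output, for example via `sys.stdin.readlines()` at the end of the same pipeline.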

How does GOP configuration impact video quality?

The shorter the GOP length, the fewer B- and P-frames exist between I-frames. Remember that B- and P-frames offer the most efficient compression, so in a lower bitrate movie, a short GOP length will result in poorer video quality. A longer GOP length compresses the content more efficiently, providing higher video quality at lower bitrates, but at the expense of random access points and error resiliency. Most encodes use GOP lengths in the 1-2 second range.

As an example, the following two images represent a zoomed-in section of a frame from Big Buck Bunny, encoded at 2.5 Mbps using identical encoding settings with the exception of the GOP length. The first image is configured with a GOP length of 4 frames, and the second is configured with a GOP length of 90 frames.

Frame of the same video encoded at 2.5 Mbps with a GOP of 4 frames (top) and a GOP of 90 frames (bottom) as an extreme illustration of the impact GOP settings can have on video quality. Because there are fewer B and P frames in the top example, the encoder has to quantize the I-frames more coarsely (compress them more) to fit within the configured bitrate, which results in blockiness, blurriness, and loss of detail.

The above is true for the majority of use cases. For very high bitrate encodes where maintaining high picture quality is more important than saving bits (typically 50 Mbps and higher), a GOP length of 1 can be used (that is, every frame is an I-frame!). This is typically only done for broadcast, production-quality, and archival encodes.

How do I specify my GOP settings while encoding?

In AWS Elemental MediaConvert, you can customize GOP settings by clicking on the Video track of your output and scrolling down to the advanced settings.

In this example, I indicate a GOP length of 90 frames, with 2 B-frames between reference frames. In other words, the configuration above represents M=3, N=90.

In AWS Elemental MediaLive, find the GOP settings by clicking on the video track of your live output and scrolling down to the “GOP Structure” section.

In this example, the settings are as follows:

  • GOP Size: 60
  • GOP Size Units: FRAMES
  • Num B-Frames: 3
  • Closed GOP Cadence: 1
  • Num Ref-Frames: 3
  • B-Frame Reference: ENABLED
  • Sub-GOP Length: FIXED

Finally, if you are using an open-source codec and encoder such as ffmpeg with x264, you can specify GOP settings by including the keyint= and bframes= arguments:

$ ffmpeg -i SAMPLE_MOVIE.mp4 -c:v libx264 -b:v 4M -x264-params keyint=24:bframes=2 OUTPUT.mp4

This will encode the video using a GOP length of 24, with 2 B-frames between reference frames, or M=3, N=24.

Which type of frame is most important?

One question that I like to ask when interviewing video engineers is, “Which type of frame is the most important?” A common (and perfectly acceptable!) answer is that I-frames are the most important, because without them, the other types of frames would have nothing to base their differences on. But the subtle nuance of this question is that there is no single correct answer. One might just as correctly say that B-frames are the most important because they offer the best compression. After all, what’s the point of video compression if it doesn’t compress the video very well? The purpose of this question isn’t to get the correct answer, it’s to hear the candidate provide their justification for the answer, as that provides a big clue as to how well they understand these fundamental concepts.

Hopefully, after reading this blog post, you understand the different types of compression used on video frames, and how they come together to form a GOP. Now the question is, “Which type of frame do you think is the most important?”