Like all of Amazon’s major technology groups, Amazon Prime Video has a dedicated team of scientists who are working constantly to find new ways to delight our customers and improve our products.
Our work was on display at this year’s IEEE Winter Conference on Applications of Computer Vision, where we presented two papers. One was on sports field registration, or understanding the spatial relationships between objects depicted in sports videos. The other was on recap and intro detection, or automatically identifying the recaps and intros at the beginnings of TV episodes, so viewers can skip them if they want.
American football, with dense features
At top is video of an American football play; bottom left is a visualization of our grid keypoints; bottom right is a visualization of our dense features.
Sports field registration involves mapping video images onto a topographical model of the field, to enable enhancement of the video feed. It’s the technology behind the virtual first-down lines in American-football broadcasts or the virtual world-record lines in swimming broadcasts.
Usually, sports field registration requires onsite cameras equipped with sensors and calibrated to reference points on the field. Combining the sensor output with the cameras’ video yields very accurate field registration.
We address the problem of sports field registration in the absence of instrumentation, using video from a single camera capable of pan, tilt, and zoom (PTZ) motion. This could enable the addition of cutting-edge graphics to broadcasts of minor-league or amateur sporting events, broadcasts of less-popular sports, or even video signals from uninstrumented secondary cameras at major sporting events.
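To make the core operation concrete, here’s a minimal sketch of what registration from a single camera amounts to, using OpenCV. The point correspondences are hypothetical, and this is the textbook homography construction rather than our full system:

```python
import cv2
import numpy as np

# Hypothetical correspondences: four pixels in the broadcast frame matched to
# known positions on the field template (in yards along and across the field).
image_pts = np.array([[320, 180], [960, 170], [1100, 620], [200, 640]], dtype=np.float32)
field_pts = np.array([[20, 0], [40, 0], [40, 53.3], [20, 53.3]], dtype=np.float32)

# Estimate the homography that maps image coordinates to field coordinates.
H, _ = cv2.findHomography(image_pts, field_pts)

# Map an arbitrary pixel (say, a player's feet) onto the field model.
pixel = np.array([[[640.0, 400.0]]], dtype=np.float32)
print(cv2.perspectiveTransform(pixel, H))  # approximate (x, y) in field coordinates
```

Inverting the same homography lets graphics defined in field coordinates, like a first-down line, be drawn back into the video frame.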
Where previous work on this problem modeled field topography using only a few keypoints — usually, intersections of lines laid down on the field — we model the field using a dense grid of keypoints.
Using video annotated according to our modeling scheme, we train a neural network to correlate image pixels with specific keypoints in our model of the field.
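As a rough illustration of what such a grid might look like, here’s a short NumPy sketch. The field dimensions are those of an American-football field, but the grid density is an illustrative choice, not the paper’s setting:

```python
import numpy as np

# Field dimensions in yards (length includes the end zones); the grid density
# here is an illustrative choice.
FIELD_LENGTH, FIELD_WIDTH = 120.0, 53.3
GRID_COLS, GRID_ROWS = 30, 15

xs = np.linspace(0.0, FIELD_LENGTH, GRID_COLS)
ys = np.linspace(0.0, FIELD_WIDTH, GRID_ROWS)
grid_keypoints = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)

# Each row is one keypoint's field position; its row index doubles as the
# class ID the network learns to associate with image pixels.
print(grid_keypoints.shape)  # (450, 2)
```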
The dense grid increases the precision of our registration — provided that we correctly identify the keypoints. But of course, keypoints that don’t lie at the intersections of field lines are harder to identify.
Consequently, we use a second source of information to improve our mapping. This is a set of dense field features that represent the standard distances between lines on the field and between other identifiable regions of the field.
In the figure below, for instance, the black-and-white model at left illustrates the lines of an American-football field, while the black-and-white model at right illustrates the numbers marking the yard lines.
The glowing green elements of the bottom images are meant to indicate that features of the black-and-white models are being represented, not according to their absolute location on the field, but according to normalized distances between black pixels and white pixels.
That is, whereas the keypoints represent absolute field positions, the dense feature set represents field position relative to recurring visual elements of the field. It’s thus a complementary feature set that improves the mapping between a video frame and the sports field.
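One simple way to build features of this relative flavor, sketched here to illustrate the general idea rather than the paper’s exact construction, is a normalized distance transform over a binary template of the field lines:

```python
import cv2
import numpy as np

# Illustrative template: white field lines on black, at 10 pixels per yard.
template = np.zeros((533, 1200), dtype=np.uint8)
template[:, ::100] = 255  # a vertical line every 10 yards

# Distance from each pixel to the nearest line pixel...
dist = cv2.distanceTransform(255 - template, cv2.DIST_L2, 3)

# ...normalized to [0, 1], so each value encodes position relative to the
# recurring field markings rather than absolute field location.
dense_feature = dist / dist.max()
```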
Using the dense features to verify keypoints adds computational overhead, however, and our system needs to work in real time. Our network architecture therefore incorporates several properties meant to reduce this overhead.
The first is that it is a multitask network: from the input data, it produces a single vector representation that passes to both the keypoint estimator and the dense-feature extractor.
The second is that the network uses the dense features for verification only if it has reason to believe that the keypoint estimates are inaccurate. Specifically, given the initial keypoint estimate for a frame of video, the network takes several different samples of keypoints and determines whether they align with each other. If they don’t, it uses the dense features to refine its estimate (the self-verification and online-refinement modules in the diagram above).
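To illustrate both properties, here’s a minimal sketch in PyTorch and OpenCV. The layer sizes, the consistency test, and its threshold are all illustrative choices, not the paper’s:

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

class FieldRegistrationNet(nn.Module):
    """Multitask sketch: one shared encoder feeding two task heads."""
    def __init__(self, num_keypoints=450, dense_channels=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.keypoint_head = nn.Conv2d(128, num_keypoints, 1)  # per-keypoint heatmaps
        self.dense_head = nn.Conv2d(128, dense_channels, 1)    # relative dense features

    def forward(self, frame):
        shared = self.encoder(frame)  # computed once, consumed by both heads
        return self.keypoint_head(shared), self.dense_head(shared)

def keypoints_self_consistent(img_pts, field_pts, n_samples=4, tol=2.0):
    """Self-verification sketch: fit homographies on random subsets of the
    detected keypoints (at least four are assumed); if their projections of a
    probe point disagree by more than tol pixels, trigger refinement."""
    rng = np.random.default_rng(0)
    probe = np.array([[[60.0, 26.65]]], dtype=np.float32)  # e.g., midfield
    projections = []
    for _ in range(n_samples):
        idx = rng.choice(len(img_pts), size=max(4, len(img_pts) // 2), replace=False)
        H, _ = cv2.findHomography(field_pts[idx], img_pts[idx], cv2.RANSAC, 3.0)
        if H is None:
            return False
        projections.append(cv2.perspectiveTransform(probe, H).ravel())
    return bool(np.all(np.ptp(np.stack(projections), axis=0) < tol))
```

A `False` result is the cue to run the dense-feature refinement; a `True` result lets the frame pass through with no extra work.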
By combining these techniques, we were able to get our sports field registration system to work in real time. In tests, we compared it to multiple state-of-the-art sports field registration systems on five data sets: soccer, American football, ice hockey, basketball, and tennis.
Across the five sports, our system’s performance ranged from comparable to the best baselines’ to significantly better. For American football, for instance, according to the standard version of the intersection-over-union measure, our system was 2.5 times as accurate as the best-performing baseline.
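For readers unfamiliar with the metric: intersection over union compares two binary masks, here, for instance, the field template warped into the frame by the predicted homography versus by the ground-truth one. A minimal sketch:

```python
import numpy as np

def field_iou(pred_mask, gt_mask):
    """IoU between two binary masks, e.g., the field template warped by the
    predicted homography versus the ground-truth homography."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 0.0
```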
Five sports
At left are grid keypoints and the projection of field templates onto the videos of five different sports; at right are mappings of the camera’s field of view onto models of the fields.
Intro and recap detection
Fans of Prime Video’s hit shows, such as The Marvelous Mrs. Maisel, are familiar with the option of skipping the introductions (which usually feature credits and theme music) and recaps (quick summaries of the narrative to date) at the beginnings of individual episodes.
With existing content, however, providing the option to skip intros and recaps requires hand coding. We’d like to extend that option to other Prime Video programming through automatic detection of intros and recaps.
Both intros and recaps have distinguishing features that should make them detectable. Intros tend to involve text (credits) superimposed on the screen, often with extended musical performances in the background, while recaps usually involve unusually quick cuts between scenes. Frequently, they’re also introduced by text.
Our detector is a neural network, with an architecture chosen to maximize responsiveness to these elements of intros and recaps. Unlike alternative approaches that require an entire video series to find intro and recap timestamps, our approach can work on each episode independently, which makes it more practical.
With our system, a given frame of video passes first to a convolutional neural network (CNN). CNNs are designed to step through input images, applying the same filters to successive blocks of pixels. They can thus learn to identify text regardless of what region of the screen it falls in. We also pass the input audio to the same CNN, which learns a fused representation of audio and video.
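Here’s one way that fusion might look, with the audio spectrogram resized and stacked as an extra input channel. This is an illustrative pattern, not necessarily the paper’s exact fusion scheme:

```python
import torch
import torch.nn as nn

class FusedCNN(nn.Module):
    """Sketch of a CNN that ingests a video frame plus an audio spectrogram."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),  # 3 RGB + 1 audio
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, frame, spectrogram):
        # frame: (B, 3, H, W); spectrogram: (B, 1, h, w), resized to (H, W)
        spec = nn.functional.interpolate(spectrogram, size=frame.shape[-2:])
        x = self.conv(torch.cat([frame, spec], dim=1))
        return self.proj(x.flatten(1))  # one fused embedding per frame

emb = FusedCNN()(torch.randn(2, 3, 180, 320), torch.randn(2, 1, 64, 100))  # (2, 256)
```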
The output of the CNN then passes to a bidirectional long short-term memory (Bi-LSTM) network. An LSTM is a type of neural network that processes sequential inputs in order, so that each output reflects both the inputs and outputs that preceded it. A Bi-LSTM passes through the same sequence both forward and backward. This allows our network to recognize longer-term dependencies, such as the cutting rates in particular video sequences.
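A minimal sketch of this sequence stage, with illustrative dimensions:

```python
import torch
import torch.nn as nn

embed_dim, hidden = 256, 128
bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)

frames = torch.randn(1, 500, embed_dim)  # 500 per-frame embeddings from the CNN
outputs, _ = bilstm(frames)              # (1, 500, 2 * hidden)
# A per-frame classifier over these outputs would then score each frame as
# recap, intro, or regular content before the smoothing stage.
```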
Finally, the output of the Bi-LSTM passes to a conditional random field (CRF), which essentially performs curve smoothing. Smoother contours within a segment of video enable clearer identification of the boundaries between segments — between, say, intros and recaps, or between either and the new content of an episode.
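The effect of the smoothing can be illustrated with a simple Viterbi decode that penalizes label switches. This stands in for the learned CRF rather than reproducing it; the penalty here is a made-up constant:

```python
import numpy as np

def viterbi_smooth(frame_scores, switch_penalty=5.0):
    """frame_scores: (T, K) per-frame label scores, e.g., from the Bi-LSTM.
    Returns the label sequence maximizing score minus switch penalties."""
    T, K = frame_scores.shape
    trans = -switch_penalty * (1 - np.eye(K))  # cost for changing labels
    best = frame_scores[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = best[:, None] + trans           # (previous label, next label)
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + frame_scores[t]
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]  # e.g., 0 = recap, 1 = intro, 2 = regular content
```

Long runs of a single label emerge naturally, which makes the boundaries between recap, intro, and new content easy to read off.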
In tests, we compared the performance of our system to baselines that used the same CNN but different methods to process the CNN’s output: a single-layer LSTM; a two-layer LSTM; a Bi-LSTM; and a Bi-LSTM that uses Viterbi decoding, rather than a CRF, for smoothing. We found that our system dramatically outperformed all four baselines.