"Can you SEE me now?" A Measurement Study of Mobile Video Calls

Video telephony is increasingly being adopted by end consumers, yet delivering video calls over wireless networks remains extremely challenging. In this paper, we conduct a measurement study of three popular mobile video call applications: FaceTime, Google+ Hangout, and Skype, over both WiFi and Cellular links. We study the following questions:

1) How do they encode/decode video in real time under the tight resource constraints of mobile devices?

2) How do they transmit video smoothly in the face of various wireless network impairments?

3) What video conferencing quality do they deliver under different mobile network conditions?

4) How do different system architectures and design choices contribute to their delivered quality?

Through detailed analysis of the measurement results, we obtain valuable insights into the unique challenges of mobile video calls, the advantages and disadvantages of existing design solutions, and possible directions for delivering high-quality video calls over wireless networks.

Related Work

There are many measurement studies of WiFi networks and of Cellular networks, as well as studies comparing the performance of WiFi and Cellular networks, which concluded that WiFi provides higher download rates and lower latency than 3G. Most recently, Huang et al. studied the performance and power characteristics of 4G LTE networks and observed that LTE can offer higher downlink and uplink throughput than 3G and even WiFi. Different from those studies, we focus on the performance of video calls over WiFi and Cellular networks. Most measurement studies of real-time communication over the Internet have focused on Skype's VoIP service. Baset et al. analyzed Skype's P2P topology, call establishment protocol, and NAT traversal mechanism. Skype's FEC mechanism has also been studied. Huang et al. proposed a user satisfaction index model to quantify VoIP user satisfaction. More recently, there have been measurement studies of video calls. Cicco et al. measured the responsiveness of Skype video calls to bandwidth variations. In our previous work, we conducted a measurement study of Skype two-party video calls under different network conditions, and we measured three computer-based multi-party video conferencing solutions: iChat, Google+ Hangout, and Skype. Different from those studies, in this paper we focus on mobile video calls over wireless networks.

Measurement Platform

FaceTime, Skype, and Google+ Hangout all use proprietary protocols and encrypt their signaling and data. Using a methodology similar to the one we developed for studying computer-based video conferencing systems, we measure them as "black boxes", reverse-engineer their design choices, and compare their performance in wireless networks. We perform IP-level packet sniffing, application-level information window capturing, and video-level quality analysis. Among the three, only Google+ offers a multi-party conferencing feature on mobile platforms. We therefore restrict our comparison study to two-party video calls.

Testbed

Overall Platform

Figure 1: Testbeds for Mobile Video Call Measurement

Our measurement platform (shown in Fig. 1) consists of two parts: the wireless user side and the wireline user side. At the wireless user side, a smartphone is connected to the Internet through WiFi or the 3G cellular data service of a top-three US carrier. At the wireline user side, a PC or Mac is connected to the Internet through campus Ethernet. Software-based network emulators are inserted on both ends of the connection to emulate network conditions in controlled experiments. Packet traces are captured at different points using Wireshark. Experimental video calls are established between the smartphone and the computer. To emulate a consistent and repeatable video call, we choose the standard TV news video sequence "Akiyo" from the JVT (Joint Video Team) test sequence pool. The sequence has mostly head-and-shoulder movements and closely resembles a typical video call scene. To inject this video sequence into the video calls of all systems, at the computer side we use a virtual video camera tool. Since we cannot find a virtual camera tool for our smartphone, we simply point the smartphone's camera at a screen displaying the "Akiyo" video.

Information Collection

We collect measurements through multiple channels.

1. IP Packet Traces: We sniff packets at both the computer side (with Wireshark) and the smartphone side (with the command-line tcpdump). The collected packet traces are used to analyze protocols and calculate network-level statistics, such as packet delay, loss, and loss burst.

2. Video Quality Logs: At the computer side, Skype and Google+ report technical information about the received video quality, such as video rate, frame rate, and RTT, in their application windows. We use a screen text capture tool to capture this information periodically, with a sampling interval of 1 second.

3. End-to-End Video Delay Samples: As in our previous work, we use end-to-end video delay as an important measure of video call quality. End-to-end video delay is defined as the time lag from when a video frame is generated on the sender side until it is displayed on the receiver's screen. It consists of video capturing, encoding, transmission, decoding, and rendering delays. As illustrated in Fig. 3, to measure the one-way video delay of a video call, we put computer A and phone B close to each other. The "Akiyo" video is played on computer A, and a stopwatch application is also running on A. We then start the video call between A and B, with the camera of B focused on the "Akiyo+Stopwatch" video on A's screen. Through the video call application, phone B sends the captured "Akiyo+Stopwatch" video to A. On computer A, we put the received video window next to the original source video window. By comparing the readings of the two stopwatch videos on computer A's screen, we obtain the one-way video delay from phone B to computer A. To automatically collect video delay information from the two stopwatches, we write a script that captures the screen of computer A once every 100 milliseconds and then decodes the captured stopwatch images using Optical Character Recognition (OCR) software.
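
As a concrete illustration, a minimal version of this delay-sampling script could look like the following Python sketch. The Pillow and pytesseract libraries, the screen-region coordinates, the output file name, and the stopwatch time format are our assumptions for illustration; they are not necessarily the exact tools used in the study.

    # Sketch: sample both stopwatch readings every 100 ms and log their difference.
    # Pillow (ImageGrab) and pytesseract are assumed to be installed; SRC_BOX and
    # RECV_BOX are placeholder screen regions for the source and received stopwatches.
    import time
    import pytesseract
    from PIL import ImageGrab

    SRC_BOX = (100, 100, 300, 160)    # region of the local "Akiyo+Stopwatch" window
    RECV_BOX = (700, 100, 900, 160)   # region of the received video window

    def read_stopwatch(box):
        # Grab one screen region and OCR the stopwatch digits (e.g. '01:23.4').
        img = ImageGrab.grab(bbox=box).convert('L')   # grayscale tends to help OCR
        return pytesseract.image_to_string(
            img, config='--psm 7 -c tessedit_char_whitelist=0123456789:.').strip()

    def to_seconds(reading):
        # Convert an assumed 'MM:SS.t' reading to seconds; None if OCR output is unusable.
        try:
            minutes, rest = reading.split(':')
            return int(minutes) * 60 + float(rest)
        except ValueError:
            return None

    with open('delay_samples.csv', 'w') as log:
        while True:
            src = to_seconds(read_stopwatch(SRC_BOX))
            recv = to_seconds(read_stopwatch(RECV_BOX))
            if src is not None and recv is not None:
                log.write('%f,%f\n' % (time.time(), src - recv))   # one-way delay sample
            time.sleep(0.1)                                        # 100 ms interval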

Figure 3: One-way video delay testbed

When the phone and the computer are not in the same location, e.g., in our mobility experiments on the subway, we cannot measure the one-way delay as in Fig. 3. Instead, we measure the round-trip video delay using the scheme illustrated in Fig. 4. There are five stopwatch videos in total during a video call. Stopwatch 1 is a standalone application running on a separate Android phone. During a video call, the iPhone captures the video of stopwatch 1; the captured video is marked as stopwatch 2 on the iPhone screen. The captured stopwatch video is then sent to the receiving computer.

Figure 4: Round-trip video delay testbed on subway

How Do They Work – Key Design Choices

We first need to understand the three systems' design choices regarding system architecture, video generation and adaptation, and packet loss recovery. Leveraging our earlier study of their computer versions, we are able to uncover important design choices of the mobile versions of Google+ and Skype. We also obtain a good understanding of FaceTime for the first time.

Architecture and Protocol

Similar to its computer version, the mobile version of Google+ is server-centric. Our mobile phone always connects to a Google conferencing server located close to New York City, with an RTT of 14 ms to a computer in our campus network. There is no direct communication between the phone and the computer in any of our experiments. Google+ uses UDP and only switches to TCP if we deliberately block UDP traffic. All voice and video data are encapsulated in RTP packets. Skype mobile is still hybrid: sometimes our mobile phone connects to the computer directly (mostly when using WiFi), and sometimes it routes the video call through a relay server (mostly when using Cellular). At the transport layer, Skype uses UDP or TCP. Compared with the computer version, Skype mobile is more likely to use TCP and a relay server. This might be because it is more complicated to establish a direct connection between mobile devices. Instead of RTP, Skype uses its own protocol to encapsulate voice and video. Skype relay servers are at different locations, with RTTs to our campus network ranging from 4 to 37 ms. In our WiFi experiments, FaceTime mostly uses a direct P2P connection between the smartphone and the computer. In our Cellular experiments, the smartphone and the computer are connected through relay servers, with RTTs to our campus network ranging from 2 ms to 20 ms. FaceTime always uses UDP; no video call can be established if we block UDP traffic.
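
As a rough illustration of how RTP encapsulation can be checked in the captured traces, the following Python sketch flags UDP payloads whose first bytes are consistent with an RTP header (version field equal to 2). The dpkt library, the trace file name, and the version-bit heuristic are our own choices for illustration, not checks prescribed by the applications themselves.

    # Sketch: count UDP packets whose payload is consistent with an RTP header.
    # Assumes an Ethernet-framed pcap file named 'call_trace.pcap' (hypothetical name).
    import dpkt

    def looks_like_rtp(payload):
        # Loose heuristic: at least a 12-byte fixed header and version bits equal to 2.
        return len(payload) >= 12 and (payload[0] >> 6) == 2

    rtp_count, other_count = 0, 0
    with open('call_trace.pcap', 'rb') as f:
        for ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            if not isinstance(ip, dpkt.ip.IP) or not isinstance(ip.data, dpkt.udp.UDP):
                continue
            if looks_like_rtp(bytes(ip.data.data)):
                rtp_count += 1
            else:
                other_count += 1

    print('RTP-looking UDP packets: %d, other UDP packets: %d' % (rtp_count, other_count))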

Video Encoding

Network conditions, such as available bandwidth, packet loss, and delay, are inherently dynamic in wireless environments. To meet the tight video playout deadline, only very limited receiver-side buffering is available to smooth out bandwidth variations and delay jitter.

Table 2: System Architecture and Protocol Comparison

To maintain a smooth video call, the source has to adapt its video encoding strategy to the network conditions. All three systems are capable of generating video at different rates in real time. We probe their video encoding parameter ranges by throttling the end-to-end bandwidth from the smartphone to the computer using the network emulator. On the computer side, both Skype and Google+ report the total rate, frame rate, and resolution of the video received from the smartphone in an application information window. Their video encoding parameter ranges are reported in Table 3. Using the same RTP header analysis technique introduced in our earlier work, we verified that Google+ still uses layered video coding on mobile phones. Both temporal and spatial scalability are used to generate video over a wide rate range. It is well known that layered video coding is computation-intensive. In our experiments with Google+, the iPhone CPU utilization is close to 100%, about 50% higher than with FaceTime and Skype. Google+ also consumes 40% more power than FaceTime and Skype, as described in Section 2.3.5. It is nevertheless impressive that Google managed to implement real-time layered video coding on mobile phones. Unfortunately, FaceTime reports very limited information about its video encoding parameters. We derive FaceTime's video rate from the captured video trace by discounting FEC packets. (We describe how we identify FaceTime's FEC packets in the following section.) To estimate its frame rate, we first calculate the timestamp difference between two adjacent RTP packets. If the RTP timestamp difference is zero, they belong to the same video frame. Let dt_min be the minimum non-zero timestamp difference. Any packet pair with timestamp difference dt_min must come from two adjacent video frames. We then use the difference t_c between the capture times of the two packets to approximate the gap between the generation times of their corresponding frames, and estimate the frame rate as 1/t_c. The inferred frame rate ranges from 1 to 30 FPS.
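
The frame-rate inference described above can be sketched as follows. The input format, a list of (capture time, RTP timestamp) pairs extracted from the FaceTime trace, and the averaging over all adjacent-frame pairs are our own framing of the method for illustration.

    # Sketch: infer the frame rate from an RTP packet trace.
    # packets: list of (capture_time_in_seconds, rtp_timestamp) pairs, in arrival order.
    def estimate_frame_rate(packets):
        # Differences between consecutive packets: a zero RTP timestamp difference
        # means the two packets belong to the same video frame.
        diffs = [(packets[i + 1][1] - packets[i][1],      # RTP timestamp difference
                  packets[i + 1][0] - packets[i][0])      # capture time gap t_c
                 for i in range(len(packets) - 1)]
        nonzero = [(dt, tc) for dt, tc in diffs if dt > 0]
        if not nonzero:
            return None
        # dt_min: minimum non-zero timestamp difference -> packets from adjacent frames.
        dt_min = min(dt for dt, _ in nonzero)
        # For packet pairs spanning adjacent frames, t_c approximates the inter-frame
        # gap; estimate the frame rate as the reciprocal of the mean gap.
        gaps = [tc for dt, tc in nonzero if dt == dt_min and tc > 0]
        return len(gaps) / sum(gaps) if gaps else None

    # Toy trace with a 90 kHz RTP clock and 3000-tick frame spacing (about 30 FPS).
    trace = [(0.000, 0), (0.004, 0), (0.033, 3000), (0.067, 6000), (0.100, 9000)]
    print(estimate_frame_rate(trace))   # roughly 30 frames per second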

Loss Recovery

Wireless networks suffer both congestion losses and random losses. To cope with losses, Skype, Google+, and FaceTime all use redundant data to protect video data. To gain more insight into their loss recovery strategies, we conduct controlled experiments and systematically inject random packet losses on the path from the smartphone to the computer. As illustrated in Figure 4, we start with a zero loss rate and then increase the loss rate by 5% every 120 seconds, recording the video rate and sending rate of each system. As indicated in Figure 4a, Google+'s total sending rate is only slightly higher than its actual video rate. This is consistent with our finding for the Google+ computer version, which selectively retransmits lost packets. Through RTP packet analysis, we verified that the Google+ mobile version also employs selective retransmission: lost video packets from the base layer are retransmitted, while lost packets from the upper layers may not be recovered. We showed previously that Google+'s retransmission strategy is highly robust to packet losses in wireline networks; we study its efficiency in wireless networks in the next section. Finally, Google+ reduces its sending rate and video rate once the loss rate exceeds 10%. In Figure 4b, Skype's redundancy traffic is significantly higher.

Figure 4: Redundancy Adaptation as Packet Loss Ratio Increases

As the packet loss rate increases, the video rate decreases while the total sending rate increases. This agrees with previous findings that Skype employs an adaptive but aggressive FEC scheme. As will be shown in the next section, Skype's aggressive FEC may lead to a vicious cycle. Both the video rate and the sending rate drop significantly after the packet loss rate reaches 15%, but the FEC redundancy ratio remains very high. FaceTime's redundancy ratio in Figure 4c lies between those of Google+ and Skype. We now examine its loss recovery strategy more closely. In Table 4, we compare RTP header traces of FaceTime without and with packet losses. Without packet loss, the RTP sequence number increases by 1 from packet to packet. All packets carrying the same timestamp belong to the same video frame, and the last packet of a frame carries the Mark bit set to 1. Due to the video encoding structure, some frames are larger and consist of more packets, but all RTP packets contain more than 750 bytes. In the trace with packet loss, immediately following the last packet of a frame we spot some packets (marked in shade) carrying the same sequence number and timestamp as the last packet of the frame. The payloads of these packets all differ from each other, suggesting that they are not duplicate packets. They have identical length, which is larger than the length of all the previous packets in the frame. Finally, with packet loss, all frames are broken into multiple packets, some with short lengths, e.g., 285 bytes. In all our experiments, FaceTime generates many more short packets after we inject packet losses. All these observations strongly suggest that FaceTime implements a frame-based FEC scheme: the original video packets of a frame are put into one FEC block, and redundancy packets are generated to protect the original packets. An FEC redundancy packet has to be longer than all the original packets it protects. Since it is generated immediately after a video frame is encoded, it carries the same timestamp as the original video packets of that frame. Finally, if an FEC block contains only one original video packet, the FEC redundancy ratio has to be a multiple of 100%, which is too coarse. Also, short FEC blocks (in terms of the number of packets) are vulnerable to bursty losses.

Table 4: RTP Packet Trace of FaceTime

To achieve finer FEC redundancy control and higher robustness against bursty losses, for a small video frame that could fit into one large packet, one should instead packetize the frame into multiple small packets and put them into one longer FEC block. This explains why FaceTime generates more short packets when packet losses are injected.
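
Based on these observations, a simple heuristic for separating suspected FEC redundancy packets from original video packets in an RTP trace could look like the sketch below. This heuristic is ours, derived from the trace pattern in Table 4; it is not a documented FaceTime format, and it will misclassify a redundancy packet as original if the last packet of a frame is itself lost.

    # Sketch: separate suspected FEC redundancy packets from original video packets.
    # packets: list of dicts with 'seq', 'ts' (RTP timestamp), and 'length', in arrival order.
    # Heuristic from the observed traces: a redundancy packet repeats the sequence
    # number and timestamp of the last packet of a frame.
    def split_fec(packets):
        original, fec = [], []
        seen = set()                       # (seq, ts) pairs already observed
        for p in packets:
            key = (p['seq'], p['ts'])
            if key in seen:
                fec.append(p)              # repeated seq + timestamp: count as redundancy
            else:
                seen.add(key)              # note: if the original last packet of a frame
                original.append(p)         # is lost, its first redundancy copy is miscounted
        return original, fec

    # The video rate can then be approximated by discounting the FEC bytes.
    def video_rate_kbps(packets, duration_seconds):
        original, _ = split_fec(packets)
        return 8.0 * sum(p['length'] for p in original) / duration_seconds / 1000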

Rate Control

To avoid congestion along the video transmission path, all three applications adapt their sending rates and video rates to the available network bandwidth. We test their bandwidth tracking capability through a sequence of bandwidth-limiting experiments. As illustrated in Figure 5, we use the network emulator to set the available network bandwidth and then record their sending rates and video rates. We start with "unlimited" bandwidth and record their rates. Both Google+ and Skype set their video rates between 300 and 400 kbps; FaceTime starts at 700 kbps. Two minutes into the "unlimited" bandwidth setting, we set the available bandwidth to 200 kbps above each system's current sending rate, and then keep dropping the bandwidth limit by 100 kbps every 2 minutes. While all three systems can pick a video rate lower than the available bandwidth, their aggressiveness differs considerably. Skype chooses a very aggressive video rate to fully utilize the available bandwidth. Since Skype uses FEC, its sending rate often even exceeds the available bandwidth. This causes congestion losses, which in turn trigger more aggressive FEC; we will revisit this in Sec. 2.3.3. Google+ also sets its video rate close to the available network bandwidth. Since Google+ uses retransmission, its sending rate is very close to its video rate and mostly stays below the bandwidth constraint, so it does not trigger many congestion losses. When there is a bandwidth limit, FaceTime is the most conservative of the three; it always reserves a considerable bit rate margin.

Figure 5: Total Rate and Video Rate Adaptation with Available Bandwidth

Interestingly, FaceTime can always track the available bandwidth well. Even though it also uses FEC, due to its conservative video rate selection it does not trigger congestion losses by itself, so the FEC redundancy is kept very low in this set of experiments.
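
For reference, the stepped bandwidth schedule used in this experiment can be scripted along the following lines. The sketch assumes a Linux machine on the path acting as the network emulator and uses the tc token bucket filter as the rate limiter; the interface name and the tc-based approach are stand-ins for whichever emulator was actually used.

    # Sketch: step the available bandwidth down by 100 kbps every 2 minutes,
    # starting 200 kbps above the application's current sending rate.
    # IFACE and the use of Linux tc (token bucket filter) are assumptions; any
    # network emulator with a scriptable rate limit would serve the same purpose.
    import subprocess
    import time

    IFACE = 'eth0'            # forwarding interface on the emulator box (assumption)
    STEP_KBPS = 100           # bandwidth decrement per step
    STEP_SECONDS = 120        # duration of each step

    def set_bandwidth(kbps):
        # Replace (or install) a token bucket filter limiting the egress rate.
        subprocess.run(['tc', 'qdisc', 'replace', 'dev', IFACE, 'root', 'tbf',
                        'rate', '%dkbit' % kbps, 'burst', '32kbit', 'latency', '400ms'],
                       check=True)

    def run_schedule(current_sending_rate_kbps, floor_kbps=100):
        limit = current_sending_rate_kbps + 200   # start 200 kbps above the sending rate
        while limit >= floor_kbps:
            set_bandwidth(limit)
            print('bandwidth limit set to %d kbps' % limit)
            time.sleep(STEP_SECONDS)
            limit -= STEP_KBPS
        # Remove the rate limit at the end of the experiment.
        subprocess.run(['tc', 'qdisc', 'del', 'dev', IFACE, 'root'], check=True)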

Power Consumption

Compared with computers, mobile devices are much more constrained in CPU cycles and battery supply. It is therefore important to gauge the CPU and battery consumption of the three mobile video call applications. Table 5 reports the CPU consumption of the top-3 processes when we use each of the three applications for a five-minute video call over WiFi or Cellular. Since the iPhone has a dual-core CPU, full utilization is 200%.

Table 5: Top-3 CPU-consuming Processes

To test the power consumption of the video call applications, we start a video call for one hour and continuously monitor the remaining battery life during the call. As shown in Figure 6, each system consumes more power over Cellular than over WiFi. This is expected, as the Cellular transceiver consumes more power than the WiFi transceiver. Among the three applications, Google+ is the most power-demanding, mirroring its CPU consumption. This again might be because Google+ uses real-time layered coding, which is known to consume more power than non-layered video coding. FaceTime is slightly more power-efficient than Skype, possibly because it is an app integrated into the iPhone.

Figure 6: Battery Life Remaining During 1-hour Video Calls (starting from 90% full)