Measuring Vibrations from Video Feeds

By using a high-speed camera, researchers at MIT in 2014 were able to recover human speech from videos of minute vibrations of objects in a room. For example, in one experiment a 2,200 fps camera was positioned outside a room behind sound-proof glass, videoing an empty crisp packet on the floor inside the room, while a researcher shouted “Mary had a little lamb” at the crisp packet. By detecting minute oscillations of the crisp packet of 1 μm (0.001 mm), and using hours of computer processing, a ten-second audio clip could be produced that was recognisably “Mary had a little lamb” in an American accent.
The purpose of this study group was to investigate whether this technique could be used in practice, with emphasis on the recovery of intelligible speech from a video feed of a room. During the week, the group investigated several aspects of the problem, including:
• how much an object vibrates due to sound;
• what can be done to maximize the vibration;
• how the MIT technique detects minute vibrations in videos;
• what affects the quality of the resulting recording; and
• how good a recording is needed for intelligible speech.
It was discovered that the MIT experiments would not have recovered intelligible speech from an ordinary conversation; their success depended on loud sounds and prior knowledge of “Mary had a little lamb”. Camera vibrations were also ignored by MIT; these are expected to be significant, but the technique could be adapted to be resilient to them. Other possibilities for enhancing their technique, by exploiting resonances or reflections, are discussed in the report. A high-speed low-noise camera is essential, and any existing video footage (such as from CCTV) is unlikely to be of sufficient quality. Further experiments with high-end high-speed cameras are needed to assess the feasibility of the technique in practice.


Report author
Ed Brambley (University of Warwick)

(1.1) Davis et al. [1] demonstrated the recovery of the sound in a room from a video of some objects present in the room. The idea is that sound is the vibration of the air in the room, which causes minute vibrations of objects in the room exposed to that sound. One can then attempt to detect these vibrations from a high-speed video of the objects, and use motion-enhancement signal processing techniques developed in the same laboratory [7] to extract the audio. The aim of this report is to investigate the feasibility of using this technique to extract intelligible speech in practical situations.
(1.2) Section 2 of the report analyses the vibration of simple objects, such as the ones used by Davis et al. [1], in order to model the amplitude of vibration as a function of the size and material properties of the object and the amplitude and frequencies of the sound. In addition to giving ballpark figures of what sort of equipment would be needed to detect the vibrations, one other aim of this modelling is to determine the ideal properties that an object would have in order to be used as a visual microphone. One novel possibility is to gain greater sensitivity to motion by looking at reflections in an object, rather than the object itself; this is considered further in section 3.
(1.3) Section 4 describes the earlier work from the MIT lab on which the recovery of sound is based. Wadhwa et al. [7] developed a technique to analyse video and enhance the motion shown in the video in a particular frequency range. This is how Davis et al. [1] were able to detect the tiny motion of objects due to sound.
(1.4) Section 5 investigates what is required for a recording of speech to be intelligible. This also gives ballpark figures on what frequencies and noise levels are needed in practice.
(1.5) Finally, section 6 summarizes the results in this report, and suggests further lines of inquiry.

Sound during conversations
(1.6) The human voice consists of frequencies ranging from 80 Hz to 4 kHz, excluding sibilants. In telephony, the voice band is approximately 300 Hz to 3.4 kHz, with the missing information below 300 Hz perceived as a missing fundamental. Sound restricted to the voice band is noticeably telephone-like, but is nonetheless fully intelligible.
(1.7) Sound in air is a wave, consisting of a small oscillating motion of the air particles and a corresponding small oscillation of air pressure. Sound volume is measured using the logarithmic decibel (dB) scale, shown in table 1. The corresponding maximum displacement of an air particle, $\xi$, due to the sound is approximately given by
$$\xi \approx \frac{2\times 10^{-5}\times 10^{\mathrm{dB}/20}}{2\pi f \rho_0 c_0}, \qquad (1)$$
where $f$ is the frequency of the sound in Hertz, dB is the sound amplitude in decibels, $\rho_0$ and $c_0$ are the density (1.2 kg/m³) and sound speed (340 m/s) of air, and $\xi$ is given in metres.
(1.8) As an example of the minute motion of objects due to sound, a loud conversation at 60 dB at a typical frequency of 300 Hz would cause the air to move by approximately 30 nm, or 1/2000th of the width of a human hair.
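The estimate in paragraph (1.8) can be checked with a short calculation. The sketch below assumes a plane wave, the standard 20 µPa reference pressure for sound levels in air, and that the decibel level refers to RMS pressure; the function name `air_displacement` is illustrative only. Using the peak pressure instead would increase the result by a factor of √2, so the output should be read as an order-of-magnitude figure.

```python
import math

RHO_0 = 1.2    # density of air in kg/m^3 (value quoted in the text)
C_0 = 340.0    # speed of sound in air in m/s (value quoted in the text)
P_REF = 20e-6  # assumed reference pressure for dB sound levels in air, in Pa


def air_displacement(level_db, freq_hz):
    """Approximate displacement (metres) of an air particle in a plane wave."""
    pressure = P_REF * 10 ** (level_db / 20)   # RMS acoustic pressure in Pa
    velocity = pressure / (RHO_0 * C_0)        # particle velocity in m/s
    return velocity / (2 * math.pi * freq_hz)  # displacement in m


if __name__ == "__main__":
    xi = air_displacement(60, 300)
    print(f"60 dB at 300 Hz: about {xi * 1e9:.0f} nm")
```

Doubling the frequency halves the displacement, and every additional 20 dB multiplies it by ten, which is why quiet, high-pitched sounds are the hardest to recover.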

2 Acoustic excitations of thin plates
(2.1) In this section, we investigate models of sound interacting with an object. The aim is to predict the amplitude of motion of the object given the incident sound, and hence to predict parameters that would make for a good visual microphone. First, in section 2.
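(2.2) The displacement $u$ of a thin elastic plate is governed by
$$\rho h \frac{\partial^2 u}{\partial t^2} + B\nabla^4 u = P, \qquad (2)$$
where $\rho$ is the plate density (kg/m³), $h$ is the plate thickness in metres, $P$ is the net force per unit area acting on the plate in Pascals, and
$$B = \frac{E h^3}{12(1-\nu^2)}$$
is the bending stiffness, where $E$ is the Young's modulus of the plate material and $\nu$ its Poisson ratio. Some approximate physical properties of relevant materials are given in table 2.

Sound exciting an infinite elastic beam
(2.3) In this section, we simplify the object to an infinite vertical elastic beam. We assume that a plane wave of frequency $\omega$ is incident from the left with an incident angle $\theta$, as shown in figure 1. Part of the incident wave is reflected back to the left, and part of it is transmitted. The pressure of the incoming wave is of the form
$$p_{inc}(x,y,t) = \mathrm{Re}\left[P_0 \exp i(\omega t - k_x x - k_y y)\right],$$
where
$$k_x = \frac{\omega}{c_0}\cos\theta, \qquad k_y = \frac{\omega}{c_0}\sin\theta.$$
The reflected and transmitted waves are therefore of the form
$$p_{ref} = \mathrm{Re}\left[P_R \exp i(\omega t + k_x x - k_y y)\right], \qquad p_{trans} = \mathrm{Re}\left[P_T \exp i(\omega t - k_x x - k_y y)\right],$$
with the corresponding horizontal velocities given by
$$v_{inc} = \frac{\cos\theta}{\rho_0 c_0}\,p_{inc}, \qquad v_{ref} = -\frac{\cos\theta}{\rho_0 c_0}\,p_{ref}, \qquad v_{trans} = \frac{\cos\theta}{\rho_0 c_0}\,p_{trans},$$
where $\rho_0$ and $c_0$ are respectively the density and the sound speed of air. This notation uses complex amplitudes $P_0$, $P_T$ and $P_R$ to describe both the amplitude and phase of the solutions for convenience, despite the underlying quantities for pressure and velocity being real.
(2.4) We call the displacement of the beam $u(y,t)$ and we then impose that the motion of the air and the beam must match on both sides of the beam,
$$v_{inc}(0,y,t) + v_{ref}(0,y,t) = \frac{\partial u}{\partial t}(y,t) = v_{trans}(0,y,t).$$
This implies that $u(y,t) = \mathrm{Re}\left[U \exp i(\omega t - k_y y)\right]$, with
$$i\omega U = \frac{\cos\theta}{\rho_0 c_0}(P_0 - P_R) = \frac{\cos\theta}{\rho_0 c_0}P_T. \qquad (9)$$
Equation (2) gives Newton's law of motion applied to the beam,
$$h\rho\frac{\partial^2 u}{\partial t^2} + B\frac{\partial^4 u}{\partial y^4} = p_{inc}(0,y,t) + p_{ref}(0,y,t) - p_{trans}(0,y,t). \qquad (10)$$
Balancing the forces on the beam using (10) means that $U$ must also satisfy
$$\left(B k_y^4 - h\rho\omega^2\right)U = P_0 + P_R - P_T. \qquad (11)$$
Solving (9) and (11) simultaneously leads to
$$U = \frac{2P_0}{2i\omega\rho_0 c_0/\cos\theta - h\rho\omega^2 + B(\omega/c_0)^4\sin^4\theta}. \qquad (13)$$
Equation (13) therefore gives the amplitude and phase of the oscillation of the beam, subjected to an incoming wave of amplitude $P_0$. Ideally, therefore, we would like the amplitude $|U|$ to be as large as possible to be most easily detected.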
(2.5) Since (13) is relatively complicated, it is helpful to look at some simplifying cases. In particular, for a wave perpendicular to the beam ($\sin\theta = 0$) the bending stiffness of the beam is unimportant, and we have
$$U = \frac{2P_0}{2i\omega\rho_0 c_0 - h\rho\omega^2}. \qquad (14)$$
In the limit of a very thin beam, or a very light beam, $h\rho \to 0$, and we recover $|U| = |P_0|/(\omega\rho_0 c_0)$, which is the expression for the displacement of the air; that is, a light beam moves with the air. For a heavier or thicker beam, the motion is smaller, especially at higher frequencies. The order of magnitude of the frequency at which the beam stops moving with the air is found by equating the sizes of the two terms in the denominator of (14), giving $\omega_c \sim 2\rho_0 c_0/(h\rho)$, or $f_c = \omega_c/(2\pi) \sim \rho_0 c_0/(\pi h\rho)$. For frequencies much lower than this critical frequency $f_c$ the beam moves with the air, while for frequencies much higher than $f_c$ the beam moves much less than the air.
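As a rough numerical companion to the infinite-beam model, the sketch below evaluates the beam amplitude from a force balance between the incident pressure, radiation damping, beam inertia and bending stiffness, as described in paragraphs (2.3) to (2.5). The formula coded here is reconstructed from that description rather than copied from the report, and the material values (a density of 920 kg/m³ for polyethylene) are assumptions, since table 2 is not reproduced here. The function names are illustrative.

```python
import math

RHO_0 = 1.2  # density of air, kg/m^3
C_0 = 340.0  # speed of sound in air, m/s


def beam_amplitude(p0, freq_hz, h, rho, bending=0.0, theta=0.0):
    """Complex displacement amplitude U of an infinite beam (metres).

    p0: incident pressure amplitude (Pa); h: thickness (m);
    rho: beam density (kg/m^3); bending: bending stiffness B (N m);
    theta: angle of incidence (radians), 0 meaning perpendicular.
    """
    omega = 2 * math.pi * freq_hz
    k_y = omega / C_0 * math.sin(theta)
    radiation = 2j * omega * RHO_0 * C_0 / math.cos(theta)  # radiation damping
    inertia = h * rho * omega ** 2                          # beam mass term
    stiffness = bending * k_y ** 4                          # bending term
    return 2 * p0 / (radiation - inertia + stiffness)


def critical_frequency(h, rho):
    """Frequency (Hz) above which the beam stops following the air."""
    return RHO_0 * C_0 / (math.pi * h * rho)


if __name__ == "__main__":
    h, rho = 50e-6, 920.0  # assumed values for a polyethylene crisp packet
    print(f"critical frequency: {critical_frequency(h, rho):.0f} Hz")
    for f in (100, 1000, 10000):
        air = 1 / (2 * math.pi * f * RHO_0 * C_0)  # air displacement per Pa
        ratio = abs(beam_amplitude(1.0, f, h, rho)) / air
        print(f"{f:>6} Hz: beam moves at {ratio:.2f} times the air amplitude")
```

Under these assumed values the critical frequency is a few kilohertz, so a thin polyethylene sheet follows the air closely across most of the voice band.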
(2.6) Figure 2 plots the amplitude of oscillation |U| of various beams. The first two sub-figures are for 50 µm thick polyethylene, emulating a crisp packet. Louder amplitudes of sound lead to larger displacements, and higher frequencies lead to smaller displacements; this is expected, as a sound wave in air has the same displacement profile. The displacement is relatively insensitive to the direction of the incident sound, provided it is not parallel to the surface (θ = 90°). The final sub-figure in figure 2 shows that several materials act the same, and effectively just act as passive tracers of the vibration in the air, at least at low to moderate frequencies, while high frequencies are more attenuated. This is not the case for all materials, however; 3 mm thick glass has a much smaller amplitude, especially at higher frequencies.
(2.7) This model is of course a rather simple one. For example, it assumes that the object is an infinite flat beam with no curvature or edges, that the beam is not clamped or pinned in any way, and that the incoming wave is only present to the left of the beam. Some of these could be incorporated as extensions of this model. This model also ignores resonances, as it assumes an infinite beam. Resonant frequencies occur in all finite elastic scatterers, causing the acoustic displacement to be much greater than at other non-resonant frequencies. In order to investigate resonance, we next describe a model of a finite size elastic plate and its resonances.

Resonances of a 2D rectangular plate
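(2.8) In this section, we develop a model of the resonances of an elastic plate supported by its edges. We assume a rectangular plate of size $L_x \times L_y$ resting on its edges and solve the unforced 2D version of (2),
$$h\rho\frac{\partial^2 u}{\partial t^2} + B\nabla^4 u = 0.$$
We use separation of variables and take $u(t,x,y) = g(t)W(x,y)$. Then $h\rho\,g''W + B\,g\,\nabla^4 W = 0$, that is,
$$\frac{g''}{g} = -\frac{B}{\rho h}\frac{\nabla^4 W}{W}.$$
For equality to hold, both sides must be a constant, say $-\omega^2$, and we may therefore take $g(t) = \sin(\omega t)$. Then
$$\frac{B}{\rho h}\nabla^4 W = \omega^2 W. \qquad (16)$$
We then impose the boundary conditions for a freely supported resting plate with no bending moments at the edges:
$$W = 0 \text{ and } \frac{\partial^2 W}{\partial x^2} = 0 \text{ at } x = 0, L_x; \qquad W = 0 \text{ and } \frac{\partial^2 W}{\partial y^2} = 0 \text{ at } y = 0, L_y.$$
We then notice that $W(x,y) = A\sin(k_x x)\sin(k_y y)$ trivially satisfies all the above boundary conditions if $k_x = n\pi/L_x$ and $k_y = m\pi/L_y$, where $n$ and $m$ are positive integers. Substituting this ansatz into (16), we get
$$\omega_{nm} = \sqrt{\frac{B}{\rho h}}\,\pi^2\left(\frac{n^2}{L_x^2} + \frac{m^2}{L_y^2}\right). \qquad (21)$$
For a polyethylene plate with $L_x = L_y = 0.1$ m and $h = 50$ µm, we have $B \approx 1.65\times10^{-5}$ N m, and this gives the resonant frequencies $f_{nm} = \omega_{nm}/(2\pi)$ from (21). Resonant frequencies may therefore be expected to be rather common, and hence the acoustic displacement of an object may well be much closer to the acoustic displacement of the air than might otherwise have been thought without considering resonances.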
(2.9) It should be noted that this model assumes the plate is flat, is supported only at its edges, and that there is no friction or loss at the edges. Again, such extensions could be incorporated into a more complicated model, but it is unlikely that the exact resonant frequencies will be of use in practice; rather, if the resonant frequencies were to be used explicitly in the algorithm for extracting sound from image motion, one would need to best-fit the resonant frequencies given the response of the object when forced by the sound in the room. The forcing of this resonant plate is considered in the next section.
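To give a feel for how closely spaced the resonances of such a plate are, the sketch below evaluates the standard simply supported plate result, ω_nm = √(B/(ρh)) π² (n²/L_x² + m²/L_y²), for the polyethylene plate of paragraph (2.8). The bending stiffness, plate size and thickness are those quoted in the text; the density of 920 kg/m³ is an assumption, since table 2 is not reproduced here, and the function name is illustrative.

```python
import math


def plate_resonances_hz(bending, rho, h, lx, ly, n_max=3):
    """Resonant frequencies (Hz) of a simply supported rectangular plate.

    Returns a dict mapping the mode numbers (n, m) to a frequency in Hz.
    """
    prefactor = math.sqrt(bending / (rho * h))
    frequencies = {}
    for n in range(1, n_max + 1):
        for m in range(1, n_max + 1):
            omega = prefactor * math.pi ** 2 * ((n / lx) ** 2 + (m / ly) ** 2)
            frequencies[(n, m)] = omega / (2 * math.pi)
    return frequencies


if __name__ == "__main__":
    # B, plate size and thickness from paragraph (2.8); density is assumed.
    modes = plate_resonances_hz(1.65e-5, 920.0, 50e-6, 0.1, 0.1)
    for (n, m), f in sorted(modes.items(), key=lambda item: item[1]):
        print(f"mode ({n},{m}): {f:6.1f} Hz")
```

With the assumed density the lowest mode lies at around 6 Hz and the higher modes scale with n² + m², so many resonances fall inside the voice band.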

A forced elastic plate
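(2.10) We now consider the elastic plate from the previous section subjected to an external forcing. For simplicity, in this section we consider only the 1D problem, so that the governing equation is
$$h\rho\frac{\partial^2 u}{\partial t^2} + B\frac{\partial^4 u}{\partial x^4} = P(x,t), \qquad (23)$$
where $P(x,t)$ is the force per unit area. Since the problem is linear, we may without loss of generality assume the wave has a single frequency, $P(x,t) = P(x)\sin(\omega t)$, since multiple frequencies may be summed over if required. If the plate has length $L$, the eigenmodes of a resting plate are given by (21) as
$$W_j(x) = \sin\left(\frac{j\pi x}{L}\right), \qquad \omega_j = \sqrt{\frac{B}{\rho h}}\left(\frac{j\pi}{L}\right)^2.$$
The solutions of (23) can be written as
$$u(x,t) = \sum_{j=1}^{\infty}\frac{A_j}{h\rho\left(\omega_j^2 - \omega^2\right)}\sin\left(\frac{j\pi x}{L}\right)\sin(\omega t), \qquad (25)$$
which may be seen by substituting (25) into (23) to get
$$\sum_{j=1}^{\infty}A_j\sin\left(\frac{j\pi x}{L}\right) = P(x).$$
The $A_j$ coefficients are therefore the Fourier series coefficients of the function $P(x)$,
$$A_j = \frac{2}{L}\int_0^L P(x)\sin\left(\frac{j\pi x}{L}\right)\mathrm{d}x.$$
In particular, if the wavelengths of the sound in the air are much longer than the length $L$, then the pressure $P(x)$ may be taken as a constant, $P(x) = P_0$, and then
$$A_j = \frac{2P_0}{j\pi}\left(1 - (-1)^j\right),$$
which is $4P_0/(j\pi)$ for odd $j$ and zero for even $j$. Note that, if the frequency of excitation $\omega$ is the same as one of the resonant frequencies of the plate $\omega_j$, then equation (25) predicts an infinite amplitude of oscillation of the plate. This is because the radiation damping from the back reaction of the plate movement on the forcing has not yet been included.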
(2.11) Figure 3 plots some examples of the forced response of plates of various materials. Unlike figure 2, different points on the plate are moving with different amplitudes, and so figure 3 plots the maximum amplitude occurring anywhere on the plate. The resonant frequencies are clearly visible as the peaks in figure 3. Most importantly, the number of resonant peaks is seen to be very important; polystyrene foam is lighter than polyethylene, but when resonances are included polyethylene oscillates more than polystyrene. This is perhaps why the MIT researchers [1] recovered better results from a crisp packet (made of polyethylene) than they did from a disposable drinks cup (made of polystyrene foam). Note that the amplitudes of oscillation in figure 3 are larger than those in figure 2, since in figure 2 energy is lost by radiating sound back into the air.
(2.12) In this section, we have neglected any dissipation that might limit resonance, such as dissipation within the air or friction of the plate with its supports. The forcing was also considered given and the response of the plate was calculated; this neglects the back reaction of the plate movement on the wave in the air, which will also limit the amplitude at resonance. This is addressed in the next section.
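The forced response described in this section can be sketched numerically as a sum over the sine eigenmodes of a simply supported 1D plate under a uniform oscillating pressure. The modal sum coded below is a reconstruction from the description in paragraph (2.10), not the report's own code; only odd modes are summed because a uniform pressure does not excite even modes. The density of 920 kg/m³ is an assumption and the function name is illustrative. As in the text, no damping is included, so the amplitude grows without bound as the forcing frequency approaches a resonance.

```python
import math


def forced_plate_max_amplitude(p0, freq_hz, bending, rho, h, length,
                               n_modes=51, n_points=201):
    """Largest displacement (m) anywhere on an undamped 1D plate.

    The plate is simply supported at both ends and forced by a uniform
    pressure p0 * sin(omega * t).
    """
    omega = 2 * math.pi * freq_hz
    largest = 0.0
    for i in range(n_points):
        x = length * i / (n_points - 1)
        u = 0.0
        for j in range(1, n_modes + 1, 2):  # even modes are not excited
            omega_j = math.sqrt(bending / (rho * h)) * (j * math.pi / length) ** 2
            a_j = 4 * p0 / (j * math.pi)    # Fourier coefficient of a constant
            u += (a_j * math.sin(j * math.pi * x / length)
                  / (h * rho * (omega_j ** 2 - omega ** 2)))
        largest = max(largest, abs(u))
    return largest


if __name__ == "__main__":
    # B, thickness and length as in paragraph (2.8); density is assumed.
    args = dict(bending=1.65e-5, rho=920.0, h=50e-6, length=0.1)
    for f in (1.0, 2.9, 10.0, 100.0):
        amp = forced_plate_max_amplitude(0.02, f, **args)
        print(f"{f:6.1f} Hz: maximum amplitude {amp:.3e} m")
```

In the low-frequency limit the sum reduces to the classical static deflection of a uniformly loaded simply supported beam, 5 P₀ L⁴/(384 B), which gives a convenient check on the series.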

Interaction between sound and a resonating object
(2.13) In the analysis above, section 2.1 accounts for the back reaction of the plate on the air (through wave reflection and transmission), but ignores resonances. Contrastingly, section 2.3 includes resonances, but ignores the back reaction of the plate on the air, leading to arbitrarily large plate motion at the resonant frequencies.
In this section, we modify the model in section 2.1 to include an artificial spring and damping term, in order to investigate the combination of back reaction and resonance.
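(2.14) We modify Newton's law for the beam (10) to include an artificial spring term $h\rho\omega_0^2$ (giving an undamped resonance at frequency $\omega_0$) and an artificial damping $\mu$. One could think of this as a crude model of a horizontal beam lying on a carpet, with the carpet providing the extra spring and damping terms, and the transmitted wave being totally absorbed by the carpet without reflection. The resulting governing equation is
$$h\rho\left(\frac{\partial^2 u}{\partial t^2} + \omega_0^2 u\right) + \mu\frac{\partial u}{\partial t} + B\frac{\partial^4 u}{\partial y^4} = p_{inc}(0,y,t) + p_{ref}(0,y,t) - p_{trans}(0,y,t).$$
By following the same method as in section 2.1, we arrive at the equivalent of equation (13),
$$U = \frac{2P_0}{h\rho\left(\omega_0^2 - \omega^2\right) + i\omega\left(2\rho_0 c_0/\cos\theta + \mu\right) + B(\omega/c_0)^4\sin^4\theta},$$
or, for a wave perpendicular to the beam ($\sin\theta = 0$), the equivalent of equation (14),
$$U = \frac{2P_0}{h\rho\left(\omega_0^2 - \omega^2\right) + i\omega\left(2\rho_0 c_0 + \mu\right)}.$$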
This shows that the 2iωρ 0 c 0 term found previously is a radiation damping term, which is increased by adding the artificial damping µ, while the resonance at ω 0 can cancel out the mass of the beam so as to give results near resonance as if the beam were much lighter. Without artificial damping (µ = 0), at resonance we recover U = P 0 /(iωρ 0 c 0 ), which is the displacement amplitude of the air. Without both artificial and radiation damping (setting µ = ρ 0 = 0), we find an infinite beam amplitude at resonance when ω = ω 0 , in agreement with the results of the previous section.
(2.15) Figure 4 plots the results of this for a 3 mm thick glass beam. With an artificial resonance at 100 Hz, the amplitude of oscillation of the beam at 100 Hz is the same as that of the air, while without the artificial resonance the beam amplitude is ten times smaller. Adding extra dissipation reduces the amplitude at resonance, while ignoring the radiation damping (by using a fixed forcing as in section 2.3) gives the expected infinite amplitude at resonance.
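The behaviour described for figure 4 can be explored with a small calculation of the beam amplitude relative to the air amplitude at normal incidence, with and without the artificial spring and damping. The expression coded below is reconstructed from the description in paragraph (2.14), and the glass density of 2500 kg/m³ is an assumption, since table 2 is not reproduced here; the numbers will therefore not match figure 4 exactly. The function name is illustrative.

```python
import math

RHO_0 = 1.2  # density of air, kg/m^3
C_0 = 340.0  # speed of sound in air, m/s


def amplitude_ratio(freq_hz, h, rho, f0_hz=None, mu=0.0):
    """Beam amplitude divided by air amplitude, at normal incidence.

    f0_hz: artificial resonant frequency in Hz (None for no spring term);
    mu: artificial damping coefficient (0 for radiation damping only).
    """
    omega = 2 * math.pi * freq_hz
    omega0 = 0.0 if f0_hz is None else 2 * math.pi * f0_hz
    denominator = complex(h * rho * (omega0 ** 2 - omega ** 2),
                          omega * (2 * RHO_0 * C_0 + mu))
    return abs(2j * omega * RHO_0 * C_0 / denominator)


if __name__ == "__main__":
    h, rho = 3e-3, 2500.0  # 3 mm glass; the density is an assumed value
    for f in (10, 50, 100, 200):
        plain = amplitude_ratio(f, h, rho)
        resonant = amplitude_ratio(f, h, rho, f0_hz=100)
        damped = amplitude_ratio(f, h, rho, f0_hz=100, mu=2000.0)
        print(f"{f:4d} Hz: plain {plain:.3f}, resonant {resonant:.3f}, "
              f"resonant with extra damping {damped:.3f}")
```

At the artificial resonance the ratio is exactly one when there is no extra damping, while well below the resonance the spring term makes the beam move less than it would without it, which is the anti-resonance effect noted in the text.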
(2.16) Importantly, note that introducing a resonance can also have negative effects, such as anti-resonance. This can be seen at low frequencies in figure 4, where the amplitude of the glass with artificial resonance is much smaller than the amplitude of the glass without artificial resonance. As is known from current research on generating electricity from water waves, designing resonant systems to capture energy from waves is far from easy.

3 Reflection from a bending mirror
(3.1) It may be that looking at the motion of reflections allows for greater sensitivity than simply looking for lateral motion. In this section, we investigate this by considering the motion of images in an oscillating mirror. Figure 5 shows an object (O) seen by an observer in a mirror (M). Using complex numbers for 2D coordinates, if a point on the mirror is at location $m = m_x + i m_y$ and the mirror is at an angle $\theta$ to the vertical (measured anticlockwise), and the object is located at $z = x + iy$, then the image (I) of the object seen by the observer at the origin is given by
$$I = m - e^{2i\theta}\,\overline{(z - m)},$$
where an overbar denotes the complex conjugate. If the mirror then rotates to a second angle $\theta'$, the image moves to $I'$ with $I' - m = e^{2i(\theta' - \theta)}(I - m)$, and this determines the distance the image appears to move in the mirror, $d$. It is helpful to express $d$ in terms of the original image position $I$ through the ratio $m/I$, which is the distance to the mirror normalized by the distance to the image. Since the image is always "behind" the mirror, this ratio always has modulus less than one. If the image is a significant distance away, such as the reflection of the sun, moon, or clouds, then $|m/I| \ll 1$, and in this case for small angular changes $\theta' - \theta$, we find $|d| = 2|m||\theta' - \theta|$. This is to be expected; if you were looking at yourself in a hand mirror, and then turned the hand mirror 45° upwards, you would see the ceiling in the mirror, which is a 90° = 2 × 45° change in direction.
(3.2) How does this compare with the motion of an actual object? For example, are we better to look at the motion of a crisp packet, or the motion of reflections in the crisp packet? In order to answer this, we consider a mirrored bending beam. The bottom of the beam is fixed, while the top of the beam has moved a distance $a$. The shape of the beam is therefore given by $x = a y^2/\ell^2$, where $\ell$ is the length of the beam. The angle of the top of the beam, for small deflections, is approximately $\theta \approx \mathrm{d}x/\mathrm{d}y|_{y=\ell} = 2a/\ell$, and therefore the displacement of a reflection in the top of the beam is $d \approx 4|m|a/\ell$; that is, the displacement $a$ is magnified by a factor $4|m|/\ell$ when looking at the reflection, where $\ell$ is the length of the beam (e.g. the size of the object) and $|m|$ is the distance to the camera. Clearly $|m| \gg \ell$, and hence tracking moving reflections in objects is predicted to lead to significantly better sensitivity than just tracking lateral motion of the object. As an example, for an object of size $\ell = 10$ cm oscillating with 1 µm amplitude viewed from a camera $|m| = 10$ m away, the effective motion of reflections in the object is predicted to be of the order $d = 400$ µm, which should easily be detectable.
(3.3) This section assumes that there are suitable objects in a room to cause reflections (such as lights, windows, etc.), and that motion of reflections may be detected as easily as motion of the object itself. This latter assumption is probably quite limiting, since diffusive reflections may be much harder to get accurate motion from. Practical tests with oscillating reflective objects would be helpful to test the validity of these assumptions.
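The magnification argument of paragraph (3.2) is simple enough to check directly. The sketch below assumes the parabolic beam shape and small-angle result stated there (a tip angle of twice the tip displacement divided by the beam length, and a reflection displacement of twice the camera distance times that angle); the function names are illustrative.

```python
def tip_angle(tip_displacement_m, beam_length_m):
    """Small-angle tilt (radians) at the free end of the bending beam."""
    return 2 * tip_displacement_m / beam_length_m


def reflection_displacement(tip_displacement_m, camera_distance_m, beam_length_m):
    """Apparent displacement (m) of a distant reflection in the beam tip."""
    return 2 * camera_distance_m * tip_angle(tip_displacement_m, beam_length_m)


if __name__ == "__main__":
    # Example from the text: 10 cm object, 1 micron amplitude, camera 10 m away.
    d = reflection_displacement(1e-6, 10.0, 0.1)
    print(f"apparent reflection displacement: {d * 1e6:.0f} microns")
    print(f"magnification factor: {d / 1e-6:.0f}")
```

The magnification grows linearly with the camera distance, which is the opposite of the usual situation where a more distant camera resolves less motion.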

Detecting motion from video
(4.1) The underlying process behind the visual microphone MIT paper [1] relies on being able to amplify the motion of objects in a video at certain frequencies; this is described in a previous publication by MIT researchers Wadhwa et al. [7]. The process makes use of a "complex-valued steerable pyramid" wavelet decomposition [4][5][6]. As described by [5], the transform is overcomplete (i.e. produces more bytes of data than the input), but is invertible (so that the original image can be reconstructed from the wavelet coefficients). The signal is separated into high-, low-, and band-limited spatial frequencies. The high frequencies are stored unencoded. The band-limited frequencies are encoded using several different positions and orientations of wavelets. The low frequencies are downsampled to half the resolution, and the process repeated (hence the pyramid structure). This ensures that details of the image at several different scales and at several different orientations are captured.
(4.2) Just as for a Fourier series, Wadhwa et al. [7] claim that the complex coefficients of the resulting transform can be separated into their amplitude and phase, with a change in phase corresponding to translation. By transforming each frame of a video, the phase of each coefficient can have particular temporal frequencies amplified, which then amplifies the motion in the image at these frequencies. The same technique of using the phase of each coefficient was used for the visual microphone paper by Davis et al. [1].
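The link between phase and translation is easiest to see in the plain Fourier case, which is the analogy drawn in paragraph (4.2). The sketch below is not the steerable-pyramid method of Wadhwa et al. [7]; it swaps in a single discrete Fourier coefficient of a 1D intensity profile to show that a sub-pixel shift appears as a change of phase, from which the shift can be read back. The Gaussian test profile and the function names are assumptions made for illustration.

```python
import cmath
import math


def dft_coefficient(signal, k):
    """k-th discrete Fourier coefficient of a real sequence."""
    n = len(signal)
    return sum(s * cmath.exp(-2j * math.pi * k * i / n)
               for i, s in enumerate(signal))


def estimate_shift(before, after, k=1):
    """Estimate the translation (in samples) between two 1D profiles.

    A shift by delta multiplies the k-th coefficient by
    exp(-2j*pi*k*delta/n), so the phase difference gives delta.
    """
    n = len(before)
    ratio = dft_coefficient(after, k) / dft_coefficient(before, k)
    return -cmath.phase(ratio) * n / (2 * math.pi * k)


def gaussian_profile(shift, n=64, width=6.0):
    """Smooth test profile centred at n/2 + shift (illustrative only)."""
    return [math.exp(-0.5 * ((i - n / 2 - shift) / width) ** 2)
            for i in range(n)]


if __name__ == "__main__":
    for true_shift in (0.5, 0.05, 0.001):
        found = estimate_shift(gaussian_profile(0.0), gaussian_profile(true_shift))
        print(f"true shift {true_shift} px, estimated {found:.4f} px")
```

Because the estimate comes from phase rather than from locating an edge, it is not limited to whole pixels, which is the property that allows motions of a small fraction of a pixel to be detected.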
(4.3) Because of the pyramid structure of the wavelets, information about the motion of the entire image will be encoded using the coefficients at the bottom of the pyramid. While these coefficients were not treated differently than the other coefficients in the MIT papers, it is likely that using these coefficients carefully could eliminate camera movements from the signal; although this was not investigated further here.
(4.4) To investigate the detection of motion from videos, we take the code from Ref. 7, available online, and investigate some of the results from the paper, and our own example. For our own example, we excite a projector screen of approximate height $h = 3.5$ m, and film the oscillations with both a DSLR camera on a tripod and a hand-held mobile phone camera. The frame rate for both was 30 fps, with a video size of 960 × 480 pixels. We take advantage of the first few resonant frequencies of the screen. Treating the screen as a simple pendulum gives an (angular) frequency of $\sqrt{g/h}$, while adding in some effects of torsion instead gives $\sqrt{3g/h}$ as the angular frequency. Converting to Hz then gives frequencies in the range 0.26-0.46 Hz. We therefore choose to selectively amplify from 0.2 Hz to 0.6 Hz when using the MIT software. The running time on a standard laptop was on the order of several minutes, depending on the video length and number of frames; our processing used a reduced resolution video to speed up the computation, as can be seen in the figures below when comparing the original and motion enhanced images.
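The frequency range quoted in paragraph (4.4) can be reproduced with a few lines of arithmetic. The sketch below reads the two angular frequencies as √(g/h) and √(3g/h), which is the reading that recovers the quoted 0.26-0.46 Hz range for a 3.5 m screen; the value g = 9.81 m/s² and the function name are assumptions.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def pendulum_frequency_hz(height_m, factor=1.0):
    """Frequency in Hz for an angular frequency of sqrt(factor * g / h)."""
    return math.sqrt(factor * G / height_m) / (2 * math.pi)


if __name__ == "__main__":
    h = 3.5  # approximate screen height in metres, from the text
    print(f"simple pendulum: {pendulum_frequency_hz(h):.2f} Hz")
    print(f"with torsion:    {pendulum_frequency_hz(h, factor=3.0):.2f} Hz")
```

Both values sit inside the 0.2 Hz to 0.6 Hz band selected for amplification, and well below half of the 30 fps frame rate.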
(4.5) We get good results for both the mobile phone footage and DSLR footage, provided the cameras are held steady. We recover the oscillations of the projector at the expected frequencies. Some stills from these movies are displayed in Figure 6.
(4.6) We are able to detect and magnify the motion, even when the video frame rate is reduced to 2 fps. We do this by selecting only the 15th and 30th frames of each second.
Figure 6: Stills from the unmagnified (a, left: original image) and motion magnified (b, right: motion magnified image) videos of a moving projector screen. Note that, to save processing time, the quality of the video on the right has been reduced.
(4.7) We are able to add noise to the original video of a crane that appears to be stationary, and still detect the motion of the crane at 0.2Hz to 0.4Hz as discussed in Wadhwa et al. [7]. It was pointed out in Wadhwa et al. [7] that this is entirely expected, since the technique may redistribute noise but will never amplify it. The noise we added was Gaussian white noise with zero mean and variance σ 2 = 0.01, using the "imnoise" command in Matlab.
(4.8) If the camera is moving (as with hand-held mobile phone footage), we are unable to get sensible results due to the motion of the camera. We see the whole image moving, and it is difficult to detect what is still and what is not after applying the algorithm. Pre-processing to reduce the movement of the camera might improve the algorithm, but it was not tested here.
(4.9) We conclude that the software works reasonably well and can be used on a standard laptop. Using larger image sizes requires more memory, which was the main limiting factor in the image size we chose.

Rolling shutter
(4.10) We now investigate the effect of a rolling shutter, which allows the recovery of frequencies higher than the frame rate. This technique was suggested in Davis et al. [1].
(4.11) For example, let us consider a standard 50 fps camera, and suppose we want to detect motion at 200 Hz, i.e. four times faster than the frame rate. For simplicity, we assume we have a couple of objects which move from position 1 to position 2 at a frequency of 200 Hz.   this case 1/40th of a frame). Each subsequent line is then offset by a frame delay, which would depend on the number of lines of the camera. Figure 7 illustrates the motion and rolling shutter for a single frequency.
(4.14) The results of using the rolling shutter are displayed in Figure 8, for two toy examples of a circle and multiple lines moving from position 1 to position 2. We can clearly see the effect of the rolling shutter. When the object changes position during the exposure time of a single line, we take the average value, which produces the grey parts of the image.
(4.15) The problem is then to recover the frequency signal in Figure 7 from the pictures in Figure 8. From Figure 8a we can clearly see this is difficult, as the image is only in the centre of the frame, while it is more feasible in Figure 8b. Clearly, more than one combination of frequency and original image is consistent with a given picture, so in that sense we have an inverse problem. For more realistic cases, we would expect the frequency signal between two positions to be a sine wave rather than a square signal, which would result in a curved line rather than the clear discontinuities in the images in Figure 8. We would also expect that when the motion is by different amounts in different parts of the picture, it would be harder to solve the inverse problem.
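A toy version of the rolling shutter effect can be simulated in a few lines. The sketch below makes two simplifying assumptions that are not taken from the report: each row is read instantaneously (so there is no averaging and no grey region), and the readout of the rows is spread evenly over the whole frame period. The object follows a square wave between position 1 and position 2, as in the toy examples of Figure 8. Integer arithmetic is used for the row timing to avoid rounding problems at the switching instants, and the function names are illustrative.

```python
def rolling_shutter_rows(signal_hz, fps, n_rows):
    """Position (1 or 2) recorded by each row of a single frame.

    Row r is read at time r / (n_rows * fps); the object switches
    position every half period of the square wave.
    """
    rows = []
    for r in range(n_rows):
        half_periods_elapsed = (2 * signal_hz * r) // (n_rows * fps)
        rows.append(1 if half_periods_elapsed % 2 == 0 else 2)
    return rows


def count_transitions(rows):
    """Number of places where consecutive rows record different positions."""
    return sum(1 for a, b in zip(rows, rows[1:]) if a != b)


if __name__ == "__main__":
    rows = rolling_shutter_rows(signal_hz=200, fps=50, n_rows=400)
    print("transitions within one frame:", count_transitions(rows))
    print("rows per half period:", rows.index(2))
```

With a 200 Hz motion and a 50 fps camera, four full periods fit into one frame, so the single frame contains bands of alternating position whose spacing encodes the frequency; recovering that spacing from a real image is the inverse problem described in paragraph (4.15).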
(4.16) In Davis et al. [1] results are presented for using the rolling shutter technique, but neither the method nor the code is provided to explain how. The audio example they provide demonstrates that their rolling shutter technique is not able to record intelligible speech. This technique may well be promising to explore further, if use of high frame rate cameras is limited.

Quality of detected motion and noise
(4.17) Davis et al. [1] relate the quality of the recovered signal to a signal-to-noise ratio that scales as $D_p(\omega)\sqrt{n_p}/\sigma_n$, where $\omega$ is the frequency, $D_p$ is the amplitude of the motion in the camera image in pixels, $n_p$ is the number of pixels across the image, and $\sigma_n$ is the standard deviation of the noise. This is as expected, as the "signal" is $D_p(\omega)$ and the "noise" is described by a per-pixel standard deviation $\sigma_n$ averaged over the number of pixels in the image in the direction of motion (proportional to $n_p$), giving a standard deviation of the average of $\sigma_n/\sqrt{n_p}$. Typically, they used $D_p$ values between $10^{-3}$ and $10^{-2}$. However, they did not say what signal-to-noise ratio was necessary to extract meaningful signals. There must also be other important parameters to consider, since their results were most sensitive up to 400 Hz [1, fig. 7c]. Most results were taken with about a 2 kHz frame rate and 700×700 pixels. Their attempt with 20 kHz and 192×192 pixels seems to have worked worse due to the increased noise (less light per frame) and lower resolution.
(4.18) Since Davis et al. [1] do not investigate different cameras, we turn to the results of D'Emilia et al. [2]. Since they use an inferior algorithm for motion detection, they appear to need several microns of motion for it to be detectable. They used two cameras:
• Camera A (AVT Marlin F-131b): 25 fps, 1280×1024 px
• Camera B (Olympus i-speed): 2000 fps, 1280×1024 px
(4.19) The two cameras were mounted in front of a target which vibrates at controlled frequencies and amplitudes. The largest displacement was 51 mm and the largest acceleration 883 m/s², in a frequency range of 10-2000 Hz. This paper attempts to find error bounds on the recovered amplitude. In general, they find the uncertainty depends on a large number of factors. For camera A, the experimental results showed that the vibration uncertainty is of the order 84 µm, or 3.4% of the amplitude, in the frequency range 10-70 Hz. For camera B, the experimental results showed a vibration uncertainty of 32 µm, 8.4% of the amplitude, in the frequency range 100-300 Hz, while an uncertainty of 13 µm could be achieved (13% of vibration amplitude) in the range 400-600 Hz. In the paper, camera B had an error bound much greater than 10% when considering a low contrast object, for frequencies of 300, 400 and 500 Hz. Consequently the authors did not analyse this. This could hint at possible problems when analysing video. The paper does not consider object illumination, which could be a large cause of uncertainty for everyday applications.
(4.20) It is unclear how these results should be applied to the MIT technique [1], although it is clear that the detectable signal depends strongly on the characteristics of the camera used. We would therefore propose further experiments using high-speed cameras to investigate the dependence of motion detection on important factors such as the amount of ambient light and the clarity of the image. We propose that a thin sheet or strip of some suitable material (such as LDPE) be held vertically, with the top able to be excited horizontally (for example by connecting it to a horizontal loudspeaker) and the bottom allowed to move freely, possibly with some added weight at the bottom. A high-speed camera would video the motion of the sheet from the side, so that the sheet would appear as a line oscillating left-right in the video. The MIT software would then be used to extract the motion of the sheet from the video. Direct measurements of the oscillation of the sheet could also be made by other means for comparison. Experiments could then be conducted with different cameras, different frame rates, different lighting conditions, different camera lenses and distances, and different amplitudes of oscillation, to map out the conditions necessary for successfully detecting motion, and the corresponding noise. This setup could be used with a second loudspeaker in the air, allowing the analysis of section 2 to be validated. Moreover, this setup could also be used to investigate the use of rolling shutters (see section 4.1), since each horizontal line of the video will see a slightly different part of the sheet at a slightly different time; this would be particularly interesting when the frame rate of the camera is lower than the natural frequency of the sheet.
Figure 9: Plot of the average intelligibility (on a percentage scale, with 100% being fully intelligible) of sample sentences with a given lowpass frequency and a given noise level. Based on a very small study of 108 sentences.

Intelligible speech
1. A king ruled the state in the early days.
2. The ship was torn apart on the sharp reef.
3. Sickness kept him home the third week.
4. The wide road shimmered in the hot sun.
5. The lazy cow lay in the cool grass.
6. Lift the square stone over the fence.
7. The rope will bind the seven books at once.
8. Hop over the fence and plunge in.
9. The friendly gang left the drug store.
Recordings of these sentences were then bandpassed, and white noise was added. The upper limit of the bandpass and the amplitude of the noise were varied, with the lower limit of the bandpass set to 300 Hz. In total, 12 people each listened to 9 sentences and attempted to identify the 5 important words, leading to an intelligibility score between 0 and 5 for each sentence. The results are summarized in figure 9.
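The degradation applied to the recordings can be sketched as follows. This is an illustrative sketch only: the report does not state which filter was used, so a brick-wall FFT band-pass is assumed here, and the function name `degrade`, its parameters, and the synthetic two-tone test signal are all hypothetical stand-ins for a real speech recording.

```python
import numpy as np

def degrade(signal, sample_rate, low_hz, high_hz, noise_rms, seed=0):
    """Band-pass a signal (assumed brick-wall FFT filter) and add white Gaussian noise."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Zero every frequency bin outside the pass band [low_hz, high_hz].
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    filtered = np.fft.irfft(spectrum, n=len(signal))
    # Add white noise with the requested RMS amplitude.
    rng = np.random.default_rng(seed)
    return filtered + rng.normal(0.0, noise_rms, size=len(signal))

# Synthetic stand-in for a recording: one tone below and one inside the pass band.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
low_tone = np.sin(2 * np.pi * 100 * t)           # removed by the 300 Hz lower limit
speech_band_tone = np.sin(2 * np.pi * 1000 * t)  # kept by a 300 Hz to 1.2 kHz band

clean = degrade(low_tone + speech_band_tone, sample_rate, 300.0, 1200.0, noise_rms=0.0)
noisy = degrade(low_tone + speech_band_tone, sample_rate, 300.0, 1200.0, noise_rms=0.1)
```

Varying `high_hz` and `noise_rms` over a grid would reproduce the two axes of the study; a real filter with a finite roll-off would behave somewhat differently near the band edges.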
It is clear that a larger sample is needed to get a well-converged average. However, it is also clear that there is a large amount of random variation, with the bottom right corner of figure 9 suggesting that the highest cutoff frequency (4 kHz) and the lowest noise level were necessary to give a repeatably intelligible signal (this being comparable to telephony quality). Based on this, we suggest aiming to capture at a minimum 300 Hz to 1.2 kHz for a reasonable chance of recognising speech. Not shown in this figure, but notable from our study, was that native speakers were better able to identify the speech against a noisy background than non-native speakers, even for those non-native speakers with otherwise excellent English.
The results of Davis et al. [1] that are available to listen to have very little noise, suggesting they have been post-processed to aid intelligibility. Indeed, Davis et al. [1] refer to a number of standard speech enhancement techniques which we have not investigated further here. A description of a possible mathematical basis for enhanced speech recovery using Bayesian inference is presented in appendix A.
(6.3) Object oscillations of the order of 0.1 µm need to be detectable in order to recover speech at conversational volumes. The MIT technique is able to extract motion from videos that is at least 1/1000th of a pixel, which gives an indication of the level of magnification of the image needed: at least 10 pixels per millimeter, and preferably 100 pixels per millimeter.
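The arithmetic behind the minimum image scale can be checked with a few lines of Python. This is a sketch under the stated figures only (0.1 µm of motion, 1/1000th of a pixel sensitivity); the function name and parameters are illustrative and not part of the MIT software.

```python
def required_pixels_per_mm(min_motion_um, subpixel_sensitivity_px):
    """Image scale at which a motion of min_motion_um spans subpixel_sensitivity_px pixels."""
    min_motion_mm = min_motion_um / 1000.0  # convert microns to millimeters
    return subpixel_sensitivity_px / min_motion_mm

# 0.1 micron oscillations, detectable once they reach 1/1000th of a pixel.
scale = required_pixels_per_mm(0.1, 1e-3)
print(f"minimum image scale: about {scale:.0f} pixels per millimeter")
```

The result is the quoted minimum of about 10 pixels per millimeter; the preferred figure of 100 pixels per millimeter corresponds to a tenfold margin on that minimum.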
(6.4) The investigation in section 5 suggests that, for intelligible speech, we would need to capture at least up to 1.2 kHz, and preferably higher (potentially 3.4 kHz for telephony quality sound). The MIT technique is able to detect frequencies up to about 850 Hz using a 2,200 fps camera, suggesting that at least a 3,100 fps camera is needed, and potentially a 9,000 fps camera. However, when the MIT researchers tried higher framerates, they found the increased noise made intelligibility harder. Clearly, understanding the signal to noise ratio of the captured sound is important.
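The frame rates quoted above follow if the highest recoverable frequency is assumed to scale linearly with frame rate, anchored to the reported 850 Hz at 2,200 fps. That linear scaling is an assumption made for this sketch, and the names below are illustrative.

```python
MIT_FPS = 2200.0     # frame rate used in the MIT experiment
MIT_MAX_HZ = 850.0   # approximate highest frequency recovered at that frame rate

def required_fps(target_hz, ref_fps=MIT_FPS, ref_max_hz=MIT_MAX_HZ):
    """Frame rate needed to reach target_hz, assuming linear scaling with frame rate."""
    return target_hz * ref_fps / ref_max_hz

for target_hz in (1200.0, 3400.0):
    print(f"{target_hz:.0f} Hz needs roughly {required_fps(target_hz):.0f} fps")
```

This gives roughly 3,100 fps for 1.2 kHz and 8,800 fps for 3.4 kHz, consistent with the rounded figures in the text. It ignores the increase in noise at higher frame rates noted above.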
(6.5) While we would have liked to be able to give explicit advice about the hardware requirements for obtaining intelligible speech using the MIT technique, to do so would need further experiments. The model of signal to noise ratio in the MIT paper [1] is experimentally derived, and does not account for changes in ambient lighting, camera framerate, camera pixel noise (ISO setting), or image clarity (e.g. as affected by imperfect camera lenses), nor does it describe the frequency dependence of the signal to noise ratio. A suggestion for a possible future experiment is described at the end of section 4.2.
(6.6) Whatever the specific requirements, a high speed, low noise camera is essential. Existing video footage is unlikely to be of sufficient quality or sufficiently high frame rate.
(6.7) While MIT report using the rolling shutter of a camera to capture frequencies much higher than the camera frame rate, this did not result in intelligible speech (or even in identifiable speech), and the MIT algorithm for doing this has not been made public. While a much more complicated and challenging problem, we see no fundamental reason why rolling shutters could not be taken advantage of if access to high speed cameras is limited.
(6.8) The MIT experiments depend on the motion of light objects (such as crisp packets or disposable cups), but resonances of heavier objects (e.g. curtains) could potentially also be exploited (section 2.4). Due to the drop in oscillating amplitude with increasing frequency (figure 2), it is likely that only the first few resonances will be useful, which could simplify the task of taking advantage of resonances; the MIT results also suggest that only the first few modes are important [1, figure 5]. The closer together the resonances are, the larger the resulting motion (figure 3).
(6.9) By looking at reflections in objects, much more subtle motion could be detectable, possibly including speech at conversational volumes. For example, using reflections in a bending 10 cm mirror viewed from 10 m gives a 400× magnification of motion.
(6.10) Vibrations of the camera were ignored by MIT, but are expected to be significant. The MIT technique could probably be adapted to be resilient to camera vibrations, by using the lowest resolution wavelets as a proxy for whole-scene motion, although this was not attempted here.
(6.11) We are sceptical of the MIT claim that optimized computer code could process video in real time. The technique is highly parallelizable, however, so could possibly be offloaded to a supercomputer for real-time processing.
(6.12) The effect of video compression artifacts was not considered in this work, nor by MIT. Modern video compression (such as the H.264 and H.265 codecs often used in mp4 files) uses motion estimation to enhance compression, and the motion estimation used is unlikely to be sufficiently accurate to recover the subtle motions needed for the MIT technique.
(6.13) Also not considered here or by MIT is aliasing, where frequencies higher than half the camera frame rate alias and become artifacts at lower frequencies. In audio recording, it is essential to use a low-pass filter before digitising the sound in order to eliminate aliasing, but this cannot be done with current high-speed cameras.
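To illustrate where an aliased tone ends up, the following sketch folds a frequency back into the range below half the frame rate. It assumes ideal instantaneous sampling of each frame (no averaging over the exposure time), and the function name is illustrative.

```python
def aliased_frequency(f_hz, fps):
    """Apparent frequency of a pure tone at f_hz when sampled at fps frames per second."""
    f_mod = f_hz % fps
    # Frequencies above half the frame rate fold back towards zero.
    return min(f_mod, fps - f_mod)

fps = 2200.0
for f_hz in (1000.0, 1500.0, 2000.0):
    print(f"{f_hz:.0f} Hz appears at {aliased_frequency(f_hz, fps):.0f} Hz")
```

With a 2,200 fps camera, a 1,000 Hz tone is below half the frame rate and is unchanged, whereas 1,500 Hz and 2,000 Hz tones would appear at 700 Hz and 200 Hz respectively, inside the band used for speech.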
(6.14) It may be possible to build cheap countermeasures against this technique. For example, small oscillators (such as the vibration units in mobile phones) could be stuck onto surfaces to generate small-amplitude white noise oscillations. These would be inaudible to people in the room, but would render the visual microphone technique impossible.
that only the lightest damped modes of F matter, so one could attempt a highly reduced description of F .
(A.1.6) Extending the model to allow the filter G and noise amplitudes L to vary slowly in time takes one out of the autonomous regime for the efficient Kalman filter solution, but the model is still a Gaussian process, so the inference may remain relatively feasible.