Audio Generation

It is also possible to use deep learning approaches to generate audio. Speech synthesis is by far the most studied application, but approaches to music composition and sound-effect generation have also been proposed. In this section, we describe some of the most interesting applications of deep learning approaches to audio generation.

Generating authentic-sounding artificial speech, or speech synthesis, has long been a focus of artificial intelligence researchers, and deep neural networks bring new approaches to this long-standing challenge. WaveNet (van den Oord et al. 2016), a deep autoregressive model developed by Google DeepMind, achieves state-of-the-art performance on text-to-speech generation, and the generated speech audio is rated as subjectively natural by human raters. This performance is achieved with a dilated CNN model that captures long-term temporal dependencies at a much lower computational cost than LSTM models.
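To make the idea of dilated convolutions concrete, the sketch below stacks a few causal 1D convolutions whose dilation doubles at each layer, the mechanism WaveNet uses to cover long temporal contexts cheaply. The channel counts, depth, and kernel size here are illustrative choices, not WaveNet's published configuration, and the gated activations and skip connections of the real model are omitted.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Toy stack of dilated causal 1D convolutions (WaveNet-style receptive field).

    Sizes are illustrative, not WaveNet's actual configuration."""
    def __init__(self, channels=32, n_layers=6, kernel_size=2):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(n_layers):
            dilation = 2 ** i                        # 1, 2, 4, 8, ... doubles each layer
            self.pads.append((kernel_size - 1) * dilation)
            self.layers.append(nn.Conv1d(channels, channels, kernel_size, dilation=dilation))

    def forward(self, x):                            # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.layers):
            # left-pad only, so each output sample depends on past samples alone
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x

# With 6 layers and kernel size 2, the receptive field already spans 64 time steps.
model = DilatedCausalStack()
out = model(torch.randn(1, 32, 1000))                # dummy input sequence
print(out.shape)                                     # torch.Size([1, 32, 1000])
```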

Text-to-speech synthesis reached a new milestone after the landmark of WaveNet (van den Oord et al. 2016) when Google researchers introduced Tacotron 2 (Shen et al. 2017). This system employs a sequence-to-sequence model to map textual character embeddings to spectrograms in the frequency domain; a modified WaveNet model then generates time-domain waveform samples from the spectrogram features. Compared with WaveNet, Tacotron 2 performs better at learning human pronunciation and its model is significantly smaller.
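The two-stage design is easy to picture as a pair of modules: an acoustic model that turns character ids into spectrogram frames, and a vocoder that turns frames into samples. The sketch below is a deliberately tiny stand-in for both stages (a single GRU instead of an attention-based decoder, a linear layer instead of a neural vocoder) and only illustrates the shapes flowing through a Tacotron 2 style pipeline.

```python
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Stand-in for the sequence-to-sequence stage: characters -> mel spectrogram."""
    def __init__(self, vocab_size=40, emb_dim=64, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, 128, batch_first=True)
        self.to_mel = nn.Linear(128, n_mels)         # real systems decode autoregressively

    def forward(self, char_ids):                     # (batch, text_len)
        h, _ = self.encoder(self.embed(char_ids))
        return self.to_mel(h)                        # (batch, text_len, n_mels)

class ToyVocoder(nn.Module):
    """Stand-in for the vocoder stage: mel frames -> waveform samples."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.upsample = nn.Linear(n_mels, hop)       # each mel frame -> hop audio samples

    def forward(self, mel):                          # (batch, frames, n_mels)
        return torch.tanh(self.upsample(mel)).flatten(1)

char_ids = torch.randint(0, 40, (1, 16))             # dummy "text"
mel = ToyAcousticModel()(char_ids)
wave = ToyVocoder()(mel)
print(mel.shape, wave.shape)                          # (1, 16, 80) and (1, 4096)
```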

Beyond text-to-speech (TTS) techniques, speech-to-speech (STS) translation has drawn attention in recent years. Google researchers introduced a direct STS translation tool named Translatotron (Jia et al. 2019). Traditionally, speech-to-speech translation is achieved in three steps (or models): speech-to-text transcription in the source language, text-to-text translation, and text-to-speech synthesis to generate audio in the target language. This cascade is well established, achieves convincing accuracy, and is widely deployed in commercial applications. Translatotron is the first attempt to merge the three steps into a single model and demonstrate its value. Although Translatotron's benchmark performance is slightly below that of a baseline model on the Spanish-to-English translation task, the direct approach is able to mimic the voice of the source speaker in the synthesized target speech.
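The contrast between the two designs can be spelled out in a few lines. In the sketch below, the four callables are placeholders for whole models rather than real APIs; it only records which representations each approach passes between its stages.

```python
def cascade_translate(source_audio, asr, mt, tts):
    """Classic three-step speech-to-speech translation pipeline."""
    source_text = asr(source_audio)      # speech -> text (source language)
    target_text = mt(source_text)        # text -> text (target language)
    return tts(target_text)              # text -> speech (target language)

def direct_translate(source_audio, speech_to_speech_model):
    """Translatotron-style: one model maps source speech straight to target
    speech, which is what lets it carry over the source speaker's voice."""
    return speech_to_speech_model(source_audio)
```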

As a side effect of the advances in TTS, it is now easy to generate a fake voice or speech for a target person. The AI startup Dessa released a speech synthesis model called RealTalk, which recreates a target speaker's voice with striking fidelity. Currently, details of the data set, models, and benchmarks are not publicly available, but anyone can try to tell the real voice from the fake one online.[1]

Rather than generating speech from text, deep learning approaches have also been used to generate sound effects based on video inputs. An artificial foley artist[2] described by Owens et al. (2015) can reproduce sound effects for simple silent videos based on an ensemble of CNN and LSTM models. A CNN model is trained to extract high-level image features from each video frame. A sequence of these image features (color and motion) is taken as input to an LSTM model, which is trained to produce an intermediate sound representation known as a cochleagram. In the final step, the cochleagram is converted to waveforms through an LSTM-based sound synthesis procedure. Although only applied in very simple environments, the results are impressive.
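The division of labour between the per-frame CNN and the sequence LSTM is the interesting part of this design, and the sketch below mimics it at toy scale: a small CNN turns each frame into a feature vector, and an LSTM maps the feature sequence to per-frame sound features standing in for the cochleagram. All layer sizes are illustrative, not the published configuration.

```python
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    """Toy per-frame feature extractor."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )

    def forward(self, frames):                       # (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.net(frames.flatten(0, 1))       # run the CNN on every frame
        return feats.view(b, t, -1)                  # (batch, time, feat_dim)

class SoundLSTM(nn.Module):
    """Toy sequence model: image features -> per-frame sound representation."""
    def __init__(self, feat_dim=128, n_bands=42):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 256, batch_first=True)
        self.to_bands = nn.Linear(256, n_bands)      # stand-in for cochleagram bands

    def forward(self, feats):
        h, _ = self.lstm(feats)
        return self.to_bands(h)                      # (batch, time, n_bands)

video = torch.randn(1, 30, 3, 64, 64)                # 30 dummy frames
cochleagram = SoundLSTM()(FrameCNN()(video))
print(cochleagram.shape)                             # torch.Size([1, 30, 42])
```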

Deep learning models can also be used to generate original music. DeepBach (Hadjeres and Pachet, 2016), for example, uses an LSTM-based approach to compose original chorales in the style of Bach. The model combines multiple LSTM and CNN models in an ensemble that, given a melody, can produce harmonies for the alto, tenor, and bass voices. Similar systems based on RNNs that generate original music in other styles have also been demonstrated, for example music in the style of Mozart[3] or traditional Irish music.[4]
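To show the shape of the harmonisation task (rather than DeepBach's actual method, which combines several models and samples with a pseudo-Gibbs procedure), the toy below reads a melody as a sequence of pitch ids and predicts a pitch for each of the three accompanying voices at every time step. The vocabulary size and layer widths are arbitrary choices.

```python
import torch
import torch.nn as nn

class ToyHarmonizer(nn.Module):
    """Toy melody-to-voices model; not DeepBach's architecture."""
    def __init__(self, n_pitches=60, emb=32, hidden=128, n_voices=3):
        super().__init__()
        self.embed = nn.Embedding(n_pitches, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.heads = nn.Linear(2 * hidden, n_voices * n_pitches)
        self.n_voices, self.n_pitches = n_voices, n_pitches

    def forward(self, melody):                       # (batch, time) pitch ids
        h, _ = self.lstm(self.embed(melody))
        logits = self.heads(h)                       # one pitch distribution per voice
        return logits.view(melody.size(0), -1, self.n_voices, self.n_pitches)

melody = torch.randint(0, 60, (1, 16))               # 16 melody steps
voices = ToyHarmonizer()(melody).argmax(-1)          # most likely pitch per voice
print(voices.shape)                                  # torch.Size([1, 16, 3])
```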

Image and Video Generation

Deepfake is a buzzword in the recent news press, a blend of deep learning and fake. Paul Barrett, adjunct professor of law at New York University, defines a deepfake as falsified video made by means of deep learning. Here we confine the concept of a deepfake to falsified human faces in images or video made by generative adversarial networks (GANs) (Goodfellow et al. 2014) or related AI techniques. The general goal of a deepfake is to transfer stylistic facial information from reference images or videos to synthetic copies.

In 2019, Hyperconnect released MarioNETte, one of the state-of-the-art face reenactment tools (Ha et al. 2019). Previous approaches suffer from identity-preservation problems on large unseen poses. MarioNETte integrates an image attention block, target feature alignment, and a landmark transformer, and these modifications lead to more realistic synthetic videos.

Beyond research publications, face reenactment tools are also available for smartphones. ZAO, a free deepfake face-swapping app, is able to place a user's face seamlessly and naturally into scenes from hundreds of movies and TV shows using just a single photograph.

Deepfake techniques are developing fast, and this is becoming a challenge for personal privacy and public security. Not only can humans fail to tell a faked portrait or video clip from the original, but advanced face recognition software is also being fooled. Korshunov and Marcel (2018) performed a study showing that state-of-the-art recognition systems based on VGG and FaceNet neural networks are vulnerable to deepfake videos, with false acceptance rates of 85.62% and 95.00% respectively. The best fake detection method they evaluated is based on visual quality metrics and shows an 8.97% error rate on high-quality deepfakes.

In order to improve fake detection techniques, Korshunov and Marcel (2018) released the first publicly available fake video data set, vidTIMIT. Tech giants have also joined this campaign. Recently, Google and collaborators released a deepfake detection data set with over 3,000 manipulated videos (Rossler et al. 2019), and Facebook and partner organizations started the Deepfake Detection Challenge (DFDC), committing over US$10 million to support this industry-wide effort.

Image generation refers to the process of automatically creating new images based on existing information sources. Deep learning has been applied in many image generation tasks, including image (and video) super-resolution, image colorization, image generation from text or other images, and so-called neural art.

Image super-resolution (ISR) is an image generation problem in which the resolution of a digital image is substantially increased through the application of algorithms. In recent years, Microsoft researchers have applied CNNs to this problem and achieved state-of-the-art restoration quality (Dong et al. 2014). Although deep CNNs significantly improve the accuracy and speed of ISR, restoring finer texture details remains a challenge. Ledig et al. (2016) proposed SRGAN for image super-resolution; SRGAN is capable of restoring photo-realistic natural images at a 4x upscaling factor. Recently, a team from Element AI developed HighRes-net, a deep learning model capable of combining multiple low-resolution satellite images into a single super-resolution image. Unlike other super-resolution models, which can add fake details to the final image, their model recovers the original details in the super-resolution version by aggregating information from multiple low-resolution inputs. As such, their model has wide applications, from automatic land management to mapping road networks.
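A hedged sketch of the CNN approach of Dong et al. is shown below: the low-resolution input is first upscaled with plain bicubic interpolation, and a small three-layer CNN then restores detail. The 9-1-5 filter sizes follow the widely cited SRCNN pattern, but the channel counts and the absence of any training loop are simplifications; SRGAN goes further by training a much deeper generator of this kind against a discriminator with an adversarial and perceptual loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySRCNN(nn.Module):
    """Minimal SRCNN-style network: bicubic upscaling followed by a 3-layer CNN."""
    def __init__(self, scale=4):
        super().__init__()
        self.scale = scale
        self.features = nn.Conv2d(3, 64, kernel_size=9, padding=4)      # patch extraction
        self.mapping = nn.Conv2d(64, 32, kernel_size=1)                 # non-linear mapping
        self.reconstruct = nn.Conv2d(32, 3, kernel_size=5, padding=2)   # reconstruction

    def forward(self, low_res):
        x = F.interpolate(low_res, scale_factor=self.scale,
                          mode="bicubic", align_corners=False)          # cheap upscaling first
        x = F.relu(self.features(x))
        x = F.relu(self.mapping(x))
        return self.reconstruct(x)                                      # high-resolution estimate

low_res = torch.rand(1, 3, 32, 32)
high_res = TinySRCNN(scale=4)(low_res)
print(high_res.shape)                                                   # torch.Size([1, 3, 128, 128])
```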

While it remains a very challenging task (and performing it at a human level is well beyond the current state of the art), deep learning has led to advances in the ability of systems to automatically generate images based on textual descriptions. Systems that can do this can be helpful in graphic design, animation, and architecture. RNNs are one of the successful approaches to automatically synthesizing images from text. Mansimov et al. (2015) introduced a seminal approach to image generation in which a bidirectional RNN learns the sequence (or alignment) of words in input captions, and a generative RNN learns the sequence of image patches from training images. The model successfully generates synthesized images from input captions, and some of the images are novel relative to the training set. However, the generated images often look blurry and need further refinement.

More recently, GANs have been demonstrated to be useful for image generation from text. Reed et al. (2016) introduced a text-conditional convolutional GAN architecture to address this challenge. In this design, both the generator network and the discriminator network are convolutional and are conditioned on an encoding of the input text. The GAN-generated images tend to look more natural than those produced using other methods.
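The sketch below illustrates one common way such conditioning is wired up, in the spirit of Reed et al.: the generator concatenates a text embedding with its noise vector, and the discriminator tiles the same embedding across its convolutional feature map before making its real/fake decision. The text encoder itself is omitted, and the layer sizes (and the tiny 16x16 output) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Toy text-conditional generator: noise + text embedding -> 16x16 image."""
    def __init__(self, z_dim=100, txt_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + txt_dim, 128, 4), nn.ReLU(),         # 1x1 -> 4x4
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(), # 4x4 -> 8x8
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),   # 8x8 -> 16x16
        )

    def forward(self, z, txt):
        zc = torch.cat([z, txt], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(zc)

class CondDiscriminator(nn.Module):
    """Toy text-conditional discriminator: judges an image given the text."""
    def __init__(self, txt_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),    # 16x16 -> 8x8
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 8x8 -> 4x4
        )
        self.judge = nn.Conv2d(128 + txt_dim, 1, 4)                         # real/fake score

    def forward(self, img, txt):
        h = self.conv(img)
        txt_map = txt[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))  # tile text over map
        return self.judge(torch.cat([h, txt_map], dim=1)).view(-1)

z, txt = torch.randn(2, 100), torch.randn(2, 128)    # noise and text embeddings
fake = CondGenerator()(z, txt)
score = CondDiscriminator()(fake, txt)
print(fake.shape, score.shape)                        # (2, 3, 16, 16) and (2,)
```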

Slightly different from generating images from text, it is also possible to generate new images from existing ones. For example, Google's Street View project has captured massive numbers of pictures, but an image from a required point of view may not be available. To solve this problem, Google researchers proposed DeepStereo (Flynn et al. 2015), in which CNN models are trained to predict new views based on available image sources, to quite good effect. Similarly, the Irish company Artomatix uses models based on CNNs to generate realistic-looking textures for 3D models based on existing images.

Framing image generation as an image-to-image translation problem, Isola et al. (2016) used conditional adversarial networks to generate photo-realistic images from edge maps or sketches.[5] Zhu et al. (2016) proposed generative visual manipulation methods for similar objectives to create more stylized images.[6] Going even further away from photo-realistic images, so-called neural art seeks to create stylistic representations of images. For example, Gatys et al. (2015) used CNNs to generate new paintings with template artistic styles. A sample output image is shown in Figure 5-5.
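The heart of the approach of Gatys et al. is a loss that separates content from style: content is measured by how close the generated image's CNN activations are to those of the content photo, while style is measured by Gram matrices (channel-to-channel correlations) of activations matched against the style painting. The sketch below implements that loss on random tensors standing in for VGG feature maps; a real implementation sums such terms over several network layers and optimises the image pixels themselves.

```python
import torch

def gram_matrix(features):                   # features: (channels, height, width)
    """Channel-by-channel correlations of a feature map; captures 'style'."""
    c, h, w = features.shape
    flat = features.view(c, h * w)
    return flat @ flat.t() / (c * h * w)

def style_content_loss(gen, content, style, style_weight=1e3):
    """Gatys-style objective on a single layer's activations."""
    content_loss = torch.mean((gen - content) ** 2)
    style_loss = torch.mean((gram_matrix(gen) - gram_matrix(style)) ** 2)
    return content_loss + style_weight * style_loss

# Random stand-ins for CNN activations of the generated, content, and style images.
gen = torch.rand(64, 32, 32, requires_grad=True)
content, style = torch.rand(64, 32, 32), torch.rand(64, 32, 32)
loss = style_content_loss(gen, content, style)
loss.backward()                               # gradients flow back toward the generated image
print(loss.item())
```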

  • [1] http://fakejoerogan.com/
  • [2] vis.csail.mit.edu
  • [3] www.wise.io/tech/asking-rnn-and-ltsm-what-would-mozart-write; www.hochart.fr/rnn/
  • [4] highnoongmt.wordpress.com/2015/08/07/the-infinite-irish-trad-session/; highnoongmt.wordpress.com/2015/05/22/lisls-stis-recurrent-neural-networks-for-folk-music-generation/
  • [5] Christopher Hesse has a demonstration of Isola's model at www.affinelayer.com/pixsrv/
  • [6] A demonstration is available at people.eecs.berkeley.edu/~junyanz/projects/gvm/
 