
AI Tech - Artificial Intelligence-based music video editing

Researchers at MIT's CSAIL design PixelPlayer, a deep-learning-based AI system that enables audio editing in music videos.


Introduction


Have you ever used YouTube or Dailymotion to learn how to play part of your favorite song on a guitar or another musical instrument? A team of researchers at the Massachusetts Institute of Technology’s CSAIL (Computer Science and Artificial Intelligence Laboratory) just made your life easier.



The researchers at CSAIL have designed a system called PixelPlayer that, given a music video of musical instruments being played, lets a user click on a region of the video and hear the sound associated with those pixels.

Essentially, if a trumpet is being played in a music video alongside another instrument, clicking on the trumpet makes PixelPlayer isolate the trumpet's sound from the video's audio track and play only that. The AI-based system omits the sounds of any other instruments accompanying the trumpet.

It doesn’t end there. As CSAIL's demo video shows, the artificial intelligence-based music video editing system can also be used to remix audio tracks. This is one more gift to mankind from the researchers toiling away at breakthroughs in computer science and artificial intelligence.

The last several months have seen AI researchers from Stanford and Adobe editing videos with AI, an AI-directed music video, and, last but not least, Google's NSynth Super, which creates new sounds using machine learning.

How it works

The PixelPlayer system designed at MIT leverages neural networks, also known as artificial neural networks. The neural networks used in artificial intelligence, machine learning, and deep learning are loosely inspired by their biological counterparts. The human brain and nervous system are composed of billions of tiny cells called neurons, which connect to one another to form networks. Computer scientists programmatically mimic these neurons and build networks of them inside computers.




These networks of "artificial neurons" inside computers are called "artificial neural networks", or simply "neural networks". The team at MIT created three neural networks for PixelPlayer: one analyzes the video, another analyzes the audio track, and a third, which the researchers describe as the "synthesizer" network, associates specific pixels with specific sound waves to separate the different sound sources. Together, the three networks produce fascinating results.
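To make the division of labor concrete, here is a heavily simplified, hypothetical sketch of how three such components could fit together. The real PixelPlayer uses trained deep networks; in this sketch each "network" is a stub with random weights, and the shapes and the softmax weighting are illustrative assumptions, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # assumed number of latent audio components

def video_net(frames):
    """Stand-in for the video network: map frames (T, H, W, 3) to a K-dim feature per pixel."""
    T, H, W, _ = frames.shape
    return rng.standard_normal((H, W, K))  # a real CNN would compute this from the frames

def audio_net(spectrogram):
    """Stand-in for the audio network: split a mixture spectrogram (F, T) into K components."""
    F, T = spectrogram.shape
    comps = rng.random((K, F, T))
    return comps * spectrogram  # each component is a piece of the mixture

def synthesizer(pixel_feature, components):
    """Stand-in for the synthesizer: weight the K components by the clicked pixel's feature."""
    w = np.exp(pixel_feature) / np.exp(pixel_feature).sum()  # softmax weights
    return np.tensordot(w, components, axes=1)  # (F, T) sound attributed to that pixel

frames = rng.random((8, 16, 16, 3))   # tiny fake video: 8 frames of 16x16 pixels
spec = rng.random((32, 10))           # tiny fake mixture spectrogram
feats = video_net(frames)
comps = audio_net(spec)
sound = synthesizer(feats[4, 7], comps)  # audio for a click on the pixel at (4, 7)
print(sound.shape)                       # (32, 10)
```

The point of the sketch is the data flow: pixels get feature vectors, the audio gets split into components, and the synthesizer ties the two together per pixel.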

Previous applications like this focused primarily on audio. Because the researchers at CSAIL worked with both audio and video, they gained a huge advantage over audio-only approaches. They further leveraged self-supervision, a category of deep learning, to get these results.

Deep learning primarily focuses on using neural networks in much the same way that networks of neurons in the human brain work. Think of it this way: how does the brain of a two-year-old child "learn"?

In essence, in deep learning, scientists and researchers apply what is known about the human brain's learning processes and mechanisms, especially the behavior of networks of neurons, to make computers mimic small portions of the brain. To a large extent, they teach computers what the human brain does. However, the number of artificial neurons involved at this point in time may be only a tiny fraction of those in a typical human brain.





However, a noteworthy point is that because the researchers at MIT leveraged "self-supervision" for their neural networks, it can be challenging to pinpoint which part of the networks is responsible for generating exactly which results. This situation is not uncommon in deep learning systems where self-supervision has been employed.
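One common self-supervised recipe for sound separation, sometimes called mix-and-separate, illustrates why no human labels are needed: mix the audio from two solo videos, then train the model to recover each original track, using the originals themselves as free training targets. The snippet below is a minimal sketch of how such training pairs could be built; the clips are fabricated placeholders, not real data.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_training_example(audio_a, audio_b):
    """Build one self-supervised example: the input is the mixture,
    the targets are the two original tracks (free labels)."""
    mixture = audio_a + audio_b
    targets = np.stack([audio_a, audio_b])
    return mixture, targets

# Two fake solo-instrument clips standing in for, say, trumpet and violin stems.
trumpet = rng.standard_normal(100)
violin = rng.standard_normal(100)
mixture, targets = make_training_example(trumpet, violin)

# By construction, the targets exactly account for the mixture,
# so a perfect separator would have zero reconstruction error.
error = np.abs(targets.sum(axis=0) - mixture).max()
print(error)  # 0.0
```

Because the "labels" are generated automatically from the data itself, the trained network's internal division of labor is never spelled out, which is exactly why attributing results to specific parts of it is hard.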

The AI-based music video editing system designed by the scientists at MIT first locates the image regions that produce sounds and then separates the input audio into a set of components, each representing the sound from a given pixel. This is how the system can pick out the sound of the trumpet (as in our example), distinguish it, and play it on its own while omitting the sounds of the other musical instruments.
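A common way to realize this separation step is spectrogram masking: keep only the time-frequency bins attributed to the clicked pixel's instrument and zero out the rest. The toy sketch below assumes invented mask values; in a real system a trained network would predict them.

```python
import numpy as np

mixture = np.array([[1.0, 2.0],
                    [3.0, 4.0]])        # tiny mixture spectrogram (F=2, T=2)

trumpet_mask = np.array([[1.0, 0.0],
                         [0.0, 1.0]])   # bins assigned to the clicked trumpet

trumpet_only = trumpet_mask * mixture          # the trumpet's share of the audio
other_sounds = (1.0 - trumpet_mask) * mixture  # everything else is omitted

print(trumpet_only)  # keeps bins (0,0) and (1,1), zeros the rest
# The two parts always add back up to the original mixture.
print(np.allclose(trumpet_only + other_sounds, mixture))  # True
```

The complementary masks guarantee nothing is lost: what is omitted for one click is exactly what another click would recover.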

The scientists trained the AI music video editing system, PixelPlayer, on just 60 hours of musical performance videos. The AI researchers at MIT's CSAIL firmly believe that with further training and research there is room to enhance and expand PixelPlayer's applicability.

Applications and Future Scope

The PixelPlayer system has opened new horizons for AI-assisted editing of music videos featuring instrumental performances. At present it allows users to change the audio mix of a recording. The AI researchers at MIT envision audio engineers using the system to improve the quality of old concert footage. Producers could use PixelPlayer to take specific instrument parts and preview how they would sound with other instruments (for example, an electric guitar swapped for an acoustic guitar). In the future, the team of scientists at MIT's CSAIL plans to improve quality, add more instruments, and work with more instrument combinations.
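Once the sources are separated, changing the audio mix reduces to a weighted re-sum of the stems. A toy sketch of that remixing step, with fabricated placeholder signals standing in for separated instrument tracks:

```python
import numpy as np

# Fake separated stems (a real system would produce these from the video's audio).
trumpet = np.array([0.5, -0.5, 0.5])
piano = np.array([0.2, 0.2, -0.2])

def remix(sources, gains):
    """Recombine separated stems with per-instrument gains."""
    return sum(g * s for g, s in zip(gains, sources))

# Boost the trumpet, cut the piano, and recombine.
new_mix = remix([trumpet, piano], gains=[1.5, 0.5])
print(new_mix)  # roughly [0.85, -0.65, 0.65]
```

The same mechanism underlies the instrument-swap idea: replace one stem with a different recording before re-summing.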





