In this post, we talk about our toy project to roughly categorise the quality of musical playing. We use Sampled CNN for this project.

This is not a technical post.

This article covers the following:

  • What we are doing
  • What we are going to do
  • What we will do


Yugang Hong, our CMO (Chief Marketing Officer, not Chief Music Officer), made a small dataset for this toy project.

There are 15 music samples, 5 for each of the classes A, B and C. Each sample is between 1m30s and 2m30s long.

Music sample 1
Music sample 2
Music sample 3


Music is very high-dimensional data, so we need to pre-process it before applying deep learning. In this project we resample the audio down to a sample rate of 500 (25200 in general) and split it into 10-second segments.
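The pre-processing step can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names, the 1000 Hz toy input and the choice to drop a short tail segment are all assumptions. The downsampling here is a naive block average, whereas a real pipeline would normally use a dedicated resampling routine with anti-alias filtering.

```python
def downsample(samples, source_rate, target_rate):
    """Naive downsampling: average each block of consecutive samples.

    Assumes source_rate is an integer multiple of target_rate.
    """
    if source_rate % target_rate != 0:
        raise ValueError("source_rate must be a multiple of target_rate")
    factor = source_rate // target_rate
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


def split_segments(samples, sample_rate, seconds):
    """Cut a signal into non-overlapping segments, dropping a short tail."""
    size = sample_rate * seconds
    return [
        samples[i:i + size]
        for i in range(0, len(samples) - size + 1, size)
    ]


# Toy input: 25 seconds of a fake signal at 1000 Hz (hypothetical rate).
raw = [float(i % 7) for i in range(25 * 1000)]
low = downsample(raw, source_rate=1000, target_rate=500)
segments = split_segments(low, sample_rate=500, seconds=10)
print(len(low), len(segments), len(segments[0]))
```

At a sample rate of 500, each 10-second segment holds 5000 values, which is what makes the input small enough for a toy experiment.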

When you input a piece of music, the model splits it into 10-second segments and merges the results from each segment.
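How the per-segment results are merged is not spelled out here, so the sketch below simply assumes that each segment produces a score per class and that the scores are averaged. The function names and the example scores are hypothetical.

```python
def merge_segment_scores(segment_scores):
    """Average per-segment class scores into one score per class."""
    if not segment_scores:
        raise ValueError("need at least one segment")
    classes = segment_scores[0].keys()
    count = len(segment_scores)
    return {c: sum(s[c] for s in segment_scores) / count for c in classes}


def predict_class(segment_scores):
    """Pick the class with the highest averaged score."""
    merged = merge_segment_scores(segment_scores)
    return max(merged, key=merged.get)


# Made-up scores for three 10-second segments of one piece.
scores = [
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.5, "B": 0.3, "C": 0.2},
    {"A": 0.2, "B": 0.6, "C": 0.2},
]
print(predict_class(scores))
```

Averaging keeps one unusual segment from deciding the label for the whole piece. A majority vote over segments would be another reasonable choice.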

Development Environment

We use a Deep Learning VM on Google Cloud.

We use an ‘NVIDIA K80’ GPU, whose detailed specifications are as follows:

  • 4992 NVIDIA CUDA cores with a dual-GPU design
  • Up to 2.91 teraflops double-precision performance with NVIDIA GPU Boost
  • Up to 8.73 teraflops single-precision performance with NVIDIA GPU Boost
  • 24 GB of GDDR5 memory
  • 480 GB/s aggregate memory bandwidth
  • ECC protection for increased reliability
  • Server-optimised to deliver the best throughput in the data center


We use Sampled CNN, published by KAIST in 2017. This model approaches music at the sample level rather than the frame level. By sampling only 2~3 frames at a time, we get more locality without increasing computation too much.
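To give a feel for the sample-level idea, here is a minimal sketch of stacked 1-D convolutions with very small filters. It assumes a filter length of 3 and a stride of 3, and it uses a fixed averaging filter in place of learned weights, so it is an illustration rather than the KAIST architecture itself.

```python
def conv1d(signal, kernel, stride):
    """Valid 1-D convolution (cross-correlation) with a given stride."""
    width = len(kernel)
    return [
        sum(signal[i + k] * kernel[k] for k in range(width))
        for i in range(0, len(signal) - width + 1, stride)
    ]


signal = [float(i) for i in range(81)]   # 81 raw samples
kernel = [1 / 3, 1 / 3, 1 / 3]           # hypothetical averaging filter

layer = signal
lengths = [len(layer)]
for _ in range(4):
    # Each layer looks at only 3 neighbouring values at a time.
    layer = conv1d(layer, kernel, stride=3)
    lengths.append(len(layer))
print(lengths)
```

Each layer shrinks the signal by a factor of 3, so a handful of cheap layers is enough for the top of the stack to summarise a long stretch of raw samples.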

Sample-based approach

Each layer gives a more specific representation of the music. This lets us distinguish one sample from another.

Resulting waveform of each filter

Result & Discussion

In the following diagram, we can see that each music sample is clustered along the ‘A-part’ axis. Because we split each piece into 10-second segments, there are 15~25 segments for each music sample.

Result of Quality of Playing

There are two main points of discussion.

  • We can check the quality of music if we have well-labelled music
    • Of course, we don’t know exactly what A-class music is… but at least we can cluster music!
  • We can’t utilize the ‘B part’ and ‘C part’ currently.
    • When we check the ‘B part’ and ‘C part’, almost all of them are silent parts
    • But across an entire piece, the silent parts are also important for A-class playing (e.g. dynamic scale)
    • We need to find a way to get more global context from the music
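One simple way to check how much of a group of segments is silent is to measure the RMS amplitude of each segment and count those below a threshold. The sketch below is only an assumption about how such a check could look: the function names, the threshold of 0.01 and the toy segments are made up.

```python
import math


def rms(segment):
    """Root-mean-square amplitude of one segment."""
    return math.sqrt(sum(x * x for x in segment) / len(segment))


def silent_fraction(segments, threshold):
    """Share of segments whose RMS falls below the threshold."""
    quiet = sum(1 for seg in segments if rms(seg) < threshold)
    return quiet / len(segments)


# Tiny made-up segments: two loud, two nearly silent.
loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.001, -0.001, 0.001, -0.001]
print(silent_fraction([loud, quiet, quiet, loud], threshold=0.01))
```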

For future work

In this project, we confirmed that we can cluster music with deep learning. This means we can find a proper representation of music!

However, we still need to focus on how to capture the global context of music. This needs more powerful computing and better models, so there will be a second toy project for that!