Saturday, September 9, 2017

Why you should try kaggle and how to do it

Overview


When people ask, “What should I do to study data science?”, one answer I sometimes hear is kaggle.

Personally, I agree that a learner should try kaggle as early as possible.
Why is kaggle so good for improving your knowledge of data science and machine learning? Here I summarize kaggle’s advantages from the viewpoint of studying.




What is kaggle?


kaggle is a competition site for data science and machine learning. It hosts competitions, some with prizes and some without.

Of course, there are other competition sites, and even sites that are not competition-oriented sometimes host a single competition. But I think kaggle is superior to those on the following points.
  • It is easy to communicate with other users.
  • Users can upload their files, and other users can check them.
  • There are so many users worldwide that, if you ask an appropriate question, you can expect a fast response.
kaggle is not only a competition site but also a communication site where you can discuss the competitions with others.
On many aspects, such as basic aggregation, modeling, and magic-number hacks, users can share their attempts on the site and study other approaches through the uploaded files and discussion.

The links below are to kaggle and its official blog.

Why is “Try kaggle” one of the answers to improving your knowledge?


As I wrote, kaggle has some remarkable points. But these days, there are so many books, online lectures, and other resources for studying data science. Why is kaggle important?

Books and online articles show only some parts of the whole process.


Over the past five years, many new books about statistics and machine learning have been published, and there are many good-quality articles online. Even when you try to do something difficult, in many cases those can meet your needs (of course, sometimes they can’t). When they are not enough, you can read the relevant papers. We can say it is not difficult to study data science in this kind of environment.
However, what those show is just some parts of the whole process. Of course, the role of a data scientist depends on the workplace and on personal taste, so it is a bit difficult to generalize. But as an example, we can describe the data science workflow as follows.

  • get data from database or somewhere
  • check the data with basic aggregation
  • try and apply appropriate pre-processing to the data
  • make model
  • evaluate model

These are just parts of the workflow. You could add items such as setting the business goal or making the model system-friendly, and you could also remove some items from the list. This is just an example.
In a nutshell, a data scientist’s workflow can be separated into many pieces.
But as far as I know, almost all the information we can get from books, online articles, and so on does not cover the process from beginning to end; it focuses on only some parts, especially “make model”.
Even for “make model”, some practical hacks are difficult to find.
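
To make the example workflow above concrete, here is a minimal sketch that walks through the five items in one script. It is only an illustration under my own assumptions: I assume pandas and scikit-learn are installed, the bundled iris toy dataset stands in for “get data from database or somewhere”, and the choice of scaling and logistic regression is arbitrary, not a recommendation.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Get data (a bundled toy dataset stands in for a database query).
iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus a "target" column

# 2. Check the data with basic aggregation.
summary = df.describe()
class_means = df.groupby("target").mean()
print(summary)
print(class_means)

# 3. Pre-process: split first, then scale using training statistics only.
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Make model.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# 5. Evaluate model on held-out data.
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"held-out accuracy: {accuracy:.3f}")
```

Most books would spend their pages on step 4. Steps 2 and 3 are where other people’s shared work tends to teach the most.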

On kaggle, through shared kernels, you can read and check the basic aggregation, pre-processing, and comments uploaded by other users. You can get sophisticated information from other people!

Actually, for me, this is the main purpose of kaggle.

You can get a score and a rank.


Each kaggle competition calculates a score for your output and gives you a rank. With those, you can view your model objectively.
Even without a competition, you can evaluate your own model by loss and accuracy, and you can learn how to make a better model.
But in that case, every model you evaluate and compare is made by you. So even when you compare models, you can only know whether one is better than your other models. By comparing with models made by other people, you can know more clearly how good your own model is.
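
As a small illustration of evaluating your own models by loss and accuracy, the sketch below compares two models with cross-validation. The dataset (scikit-learn’s bundled breast cancer data) and the two candidate models are my own arbitrary choices for the example, and I assume scikit-learn is installed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Two candidate models, both made by me.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in candidates.items():
    accuracy = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    # scikit-learn reports log loss negated so that higher is always better.
    log_loss = -cross_val_score(model, X, y, cv=5, scoring="neg_log_loss").mean()
    results[name] = {"accuracy": accuracy, "log_loss": log_loss}
    print(f"{name}: accuracy={accuracy:.3f}, log_loss={log_loss:.3f}")

best = max(results, key=lambda name: results[name]["accuracy"])
print("best of my own models:", best)
```

The output only tells you which of your own models wins. It says nothing about how far either one is from what other people achieve on the same data, and that is the gap a leaderboard fills.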

At which phase should you try kaggle?


I picked out some attractive points of kaggle from the viewpoint of studying. Next, you need to think about what kind of skills are necessary to try kaggle.
Here, I don’t consider a learning route such as “Study linear algebra, calculus, probability theory, and statistics. After that, understand machine learning algorithms by reading PRML[1]. Finally, you can try kaggle.” This route itself is not wrong. Especially if you are comfortable with mathematics, it is a solid way. But for many people, it is a hard route.
So, I take the position that you can start kaggle with minimum knowledge.
On this premise, I can say that the two points below are necessary for kaggle.
  • Knowledge about programming enough to deal with pre-processing
  • Knowledge about machine learning libraries

Knowledge about programming enough to deal with pre-processing


Which language?


When you tackle data science and machine learning, it is preferable to use one of the languages below.
  • Python
  • R
  • Julia
  • C++
I picked those four, but as of 2017, Python is in the leading position. The majority of kernels uploaded to kaggle are written in Python. To get information from those kernels, it is much better to have basic Python skills.
Of course, any language is okay. For example, I frequently use Go to write machine learning algorithms. But from the viewpoint of trying kaggle, Python can accelerate your learning and experimenting.

What kind of knowledge is necessary?


Usually, in kaggle competitions, we need to do pre-processing before making a model. In this phase, programming skill is necessary.

Although the difficulty of pre-processing depends on the data and on what you want to do, in many cases the necessary knowledge is limited.
Reading one or two basic books is usually enough to avoid getting stuck frequently.
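
To give a feel for how limited that knowledge often is, here is a minimal pre-processing sketch with pandas. The table, its column names, and the `preprocess` helper are all hypothetical and made up for this example; filling with the median and one-hot encoding are just two common choices, not the only ones.

```python
import pandas as pd

# Hypothetical raw data with a missing number and a missing category.
raw = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, 58.0],
        "city": ["Tokyo", "Osaka", None, "Tokyo"],
        "price": [7.25, 71.28, 8.05, 26.55],
    }
)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with missing values filled and categories encoded."""
    out = df.copy()
    # Fill the missing numeric value with the column median.
    out["age"] = out["age"].fillna(out["age"].median())
    # Treat a missing category as its own level.
    out["city"] = out["city"].fillna("unknown")
    # One-hot encode the categorical column into 0/1 columns.
    out = pd.get_dummies(out, columns=["city"], dtype=int)
    return out


clean = preprocess(raw)
print(clean)
```

Basic indexing, filling, and encoding like this cover a large share of what shared kernels do before the modeling step.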

Knowledge about machine learning libraries


First, you need to know that in most kaggle competitions, it is not necessary to write machine learning algorithms from scratch. Although sometimes no library does what you want and you need to write it yourself, the libraries are usually sufficient, or better than what you would write.
So, knowledge of machine learning libraries such as scikit-learn is necessary. The documentation of those libraries has many tips and much information on using them properly, but it is not always easy to read. I strongly recommend reading a book about machine learning such as Python Machine Learning. From such a book, you can get the necessary information efficiently.

kaggle is nice, but…


I have described the advantages of trying kaggle to improve your knowledge, and the skills necessary to do so. I want you to try kaggle’s competitions, even if they feel difficult. Reading many kernels is very informative, and you can work with a wide variety of data. Those will improve your knowledge and lead to confidence.
But to improve your skill, trying kaggle with minimum knowledge is not enough on its own.

By trying kaggle at an early phase, you get many opportunities to work with data, see practical hacks, and keep your motivation high. On the other hand, there is much knowledge that is difficult to get that way. One example is the mathematical aspect of the algorithms.

  1. PRML stands for Pattern Recognition and Machine Learning, one of the most popular machine learning textbooks.