Posts
Go Even Deeper with Char-Gram+CNN
deep-learning kaggle
This is a repost of my Kaggle kernel, which received several positive responses from community members who found it helpful. It is one of my kernels tackling the Toxic Comment Classification Challenge at Kaggle, which aims to identify and classify toxic online comments.
Do Pretrained Embeddings Give You The Extra Edge?
deep-learning kaggle
This is a repost of my Kaggle kernel, which received several positive responses from community members who found it helpful. It is one of my kernels tackling the Toxic Comment Classification Challenge at Kaggle, which aims to identify and classify toxic online comments.
Tackling Toxic Using Keras
deep-learning kaggle machine-learning
This is a repost of my Kaggle kernel, which received several positive responses from community members who found it helpful. It is one of my kernels tackling the Toxic Comment Classification Challenge at Kaggle, which aims to identify and classify toxic online comments.
How to build an Apache Spark Cluster with Hadoop HDFS and YARN
hadoop spark
In our earlier post, we built a pretty light two-node Apache Spark cluster without any Hadoop HDFS or YARN underneath. We didn’t point the Spark installation to any Hadoop distribution or set up “HADOOP_HOME” as an environment variable, and we deliberately set the “master” parameter to a Spark master node.
Do you know what are the most talked about topics in Singapore?
data-mining data-preprocessing data-visualization news toy-projects
There are tons of news publications in Singapore, from the traditional The Straits Times and Today to digital ones like ChannelNewsAsia.com and many more, where you can grab your latest news fix. But let’s face it. *Cough Cough*. It’s beginning to look like what they want to write, not what we want to read and know.
How to deploy a Hadoop/Spark Cluster with multiple machines
hadoop spark
When you take your machine learning models to production, especially in an enterprise setting, you will need them to give you fast and reliable responses. This is where Spark comes into the picture. Spark offers a reliable distributed/clustered computing framework that can sit on top of the Hadoop framework, and if you go the extra mile of configuring HDFS and YARN, you can achieve even more resiliency in your product. To start small, let’s begin with Spark and see how the other components fit in.
How to install Kaggle's Most Won Algorithm - XGBoost (Screenshots included)
machine-learning
If you are on this page, chances are you have heard of the incredible capability of XGBoost. Not only does it “boast” higher accuracy than similar boosted tree algorithms like GBM (Gradient Boosting Machine), thanks to a more regularized model formalization to control over-fitting, it has also enabled many Kaggle Masters to win Kaggle competitions. In fact, it’s probably the most popular machine learning algorithm in the data science space right now!
Titanic Disaster Use Case: Using Seaborn to gain insights
data-preprocessing data-visualization
There are tons of Python-based visualisation tools out there, but my favourite has to be Seaborn. Some would say using Seaborn is a form of cheating. Well, after all, Seaborn is just a wrapper around matplotlib, and instead of framing it as Seaborn vs. matplotlib, we should look at it as an upgraded, flashy version of the trusty old matplotlib library. The part I like about Seaborn is that it comes with a ready set of color palettes that not only makes your data visualisation look tasty but also shouts professionalism in just a line or two.
Practical Data Problems - How to read big data files efficiently
data-preprocessing
pandas’ read_table or read_csv is probably the number one method that comes to mind when you need to read rows of data into a DataFrame. After all, you can do that in just 2 lines:
Reality sucks - dealing with imbalanced data
data-preprocessing machine-learning
You stumble upon an intriguing cancer patient dataset that seems to be the last remaining puzzle piece in humanity’s war against cancer, one that will make this world a better place for everyone, and you excitedly download the dataset.