Whether data matters more than algorithms for complex problems has been debated for many years. If you work with Machine Learning, you have probably come across this question. In this article, I have tried to list some points that suggest data has an edge over algorithms.
In 2001, Microsoft researchers applied various ML algorithms to the problem of natural language disambiguation. They noticed that very different algorithms, even fairly simple ones, performed almost identically well once they were given enough data. This research suggests that people may want to consider spending time and money on corpus development rather than on algorithm building.
Point to be noted: it is not easy to get a huge amount of complex data cheaply, so don't abandon algorithms just yet.
Representative Data
Our training data must be representative of the cases that we want to generalize to. This becomes harder when the training set is too small, because the data will then have sampling noise, meaning it is nonrepresentative purely by chance. Even very large datasets can be nonrepresentative if the sampling method is flawed, which is termed Sampling Bias.
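One common way to keep a sample representative is stratified sampling, where each subgroup is sampled in proportion to its share of the population. The following is only a minimal sketch using the Python standard library; the `stratified_sample` helper, the `region` field, and the 800/200 population are hypothetical and chosen purely for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction of records from every group defined by `key`."""
    rng = random.Random(seed)

    # Bucket the records by group (stratum).
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    # Sample each group separately so its share of the sample
    # matches its share of the population.
    sample = []
    for members in groups.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 800 urban and 200 rural respondents.
population = [
    {"id": i, "region": "urban" if i < 800 else "rural"} for i in range(1000)
]

sample = stratified_sample(population, key=lambda r: r["region"], fraction=0.1)
rural_share = sum(r["region"] == "rural" for r in sample) / len(sample)
print(len(sample), rural_share)
```

A purely random draw of 100 records could easily end up with noticeably more or fewer rural respondents than the population's 20%. Sampling within each group removes that source of noise for the chosen attribute, although it cannot fix a population list that was itself collected in a biased way.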
Quality of Data
If the training data is full of outliers, noise, and errors, it becomes harder for the model to discover the underlying patterns. It is often well worth the effort to spend time cleaning up your training data, and most data scientists spend a significant amount of their time doing exactly that. If some instances are clearly outliers, it may help to simply discard them or fix the errors manually. If some instances are missing values for a few features, we must decide whether to discard those instances, ignore the affected feature, or fill in the missing values.
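The two cleaning steps described above can be sketched with the standard library alone. This is an illustration under assumptions, not a recommended pipeline: filling gaps with the median and dropping values outside 1.5 times the interquartile range are just two common heuristics, and the `ages` list is made-up data.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def drop_outliers_iqr(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical feature column: one missing value and one data-entry error.
ages = [23, 25, None, 27, 24, 26, 250, 25]

filled = fill_missing_with_median(ages)   # the gap becomes the median age
cleaned = drop_outliers_iqr(filled)       # the implausible 250 is discarded
print(filled)
print(cleaned)
```

The median is used for filling because, unlike the mean, it is barely affected by the outlier that is still present at that stage. Whether dropping or manually correcting an outlier is the better choice depends on the dataset.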
Relevant Features
Our model will be capable of learning only if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of an ML project is coming up with a good set of features to train on. This process is often termed Feature Engineering, which mainly consists of:
Feature Selection - Selecting the most useful features to train on from among the existing features.
Feature Extraction - Combining existing features to produce a more useful one.
Creating new features by gathering new data.
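The first two steps of the list above can be illustrated with a small sketch. It assumes a made-up housing dataset and uses absolute Pearson correlation with the target as a very simple selection heuristic; the `pearson` and `select_features` helpers and all column names are hypothetical, and real projects often rely on more robust selection methods.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def select_features(columns, target, top_k):
    """Rank features by absolute correlation with the target, keep the top_k."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical dataset: the price happens to depend on floor area.
columns = {
    "width_m": [5, 10, 4, 8, 6],
    "length_m": [10, 6, 12, 9, 5],
    "door_color_code": [1, 3, 1, 2, 3],
}
price = [100, 120, 96, 144, 60]

# Feature extraction: combine two existing features into a more useful one.
columns["area_m2"] = [
    w * l for w, l in zip(columns["width_m"], columns["length_m"])
]

# Feature selection: keep the two features most correlated with the target.
best = select_features(columns, price, top_k=2)
print(best)
```

In this toy data neither width nor length alone tracks the price well, but their product does, which is why the extracted feature ranks first. Correlation only captures linear relationships, so a low score does not by itself prove that a feature is useless.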