Common taxonomy of data types when considering machine learning
Categorical: qualitative
- Ordinal: values with an inherent order, where the distances between values are unknown and cannot be measured
- e.g. first/second/third, good/bad
- Nominal: values (text or numbers) with no order
- e.g. cat/dog, genre, ethnicity
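The ordinal/nominal distinction matters when categories are turned into numbers for a model. The following is a minimal sketch using only the Python standard library; the helper names (`encode_ordinal`, `encode_one_hot`) and the "ok" rating level are illustrative assumptions, not taken from any particular library. Ordinal labels are mapped to integer ranks so their order is kept, while nominal labels are one-hot encoded so that no order is implied.

```python
# Hypothetical rank mapping for an ordinal feature (the "ok" level is assumed).
RATING_ORDER = {"bad": 0, "ok": 1, "good": 2}

def encode_ordinal(values, order):
    """Map each ordinal label to its integer rank, preserving the order."""
    return [order[v] for v in values]

def encode_one_hot(values):
    """Map each nominal label to a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))  # sorted only to get a stable column order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

ratings = encode_ordinal(["good", "bad", "ok"], RATING_ORDER)
categories, pets = encode_one_hot(["cat", "dog", "cat"])

print(ratings)     # [2, 0, 1]
print(categories)  # ['cat', 'dog']
print(pets)        # [[1, 0], [0, 1], [1, 0]]
```

Note that the integer ranks only encode order: the gap between 0 and 1 is not meaningfully "the same" as the gap between 1 and 2, which is why ordinal distances are described as unknown.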
Numerical: quantitative
- Discrete: countable values, typically whole numbers
- e.g. step count
- Continuous: measured values that can take any value within a range, including fractions
- e.g. width, height
- Training data should be representative of the data the model will make predictions on
- Sampling noise: a small sample can be unrepresentative purely by chance, which leads to models that make imprecise predictions
- Sampling bias: a flawed sampling method makes some kinds of data more or less likely to appear in the sample than they are in the original data
- Discard outliers
- Ignore or impute missing values, or train models with and without the affected features and compare their performance
- Feature engineering: feature selection (choosing the most useful features), feature extraction (combining existing features into more useful ones, e.g. dimensionality reduction), feature creation (adding new features from new data)
- Simple algorithms trained on more data can outperform complex algorithms trained on smaller datasets (Banko and Brill, 2001; Halevy et al., 2009)
- Trade-off: the cost of acquiring and storing more data vs. the cost of tuning algorithms
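The notes on outliers and missing values can be sketched with the standard library alone. This is one possible approach under stated assumptions, not the only one: missing values are imputed with the median, and outliers are discarded with the common 1.5 × IQR rule. The function names and the height data are made up for illustration.

```python
import statistics

def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def discard_outliers(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Made-up heights in cm: one missing value and one implausible reading.
heights_cm = [160, 165, None, 170, 172, 168, 400]

filled = impute_missing(heights_cm)   # None becomes the median of the rest
cleaned = discard_outliers(filled)    # the 400 cm reading is dropped

print(filled)
print(cleaned)
```

The median is used for imputation here because it is less sensitive to the extreme reading than the mean would be. Whether to impute, drop, or compare models trained both ways remains a judgment call that depends on the dataset.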
References & Further Reading
- Banko, M. and Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 26–33.
- Halevy, A., Norvig, P. and Pereira, F. (2009). The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24(2), 8–12. doi:10.1109/MIS.2009.36.
- Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media.