Data in ML

Data Types, Quality & Quantity
data
maths
statistics
ML
Published

April 19, 2022

Types

A common taxonomy of data types in machine learning:

  • Categorical: qualitative

    • Ordinal: values with an inherent order, where the distances between them are unknown and cannot be measured
      • e.g. first/second/third, good/bad
    • Nominal: values (text or numbers) with no order
      • e.g. cat/dog, genre, ethnicity
  • Numerical: quantitative

    • Discrete: countable values, typically whole numbers
      • e.g. step count
    • Continuous: measured values that can take any real number within a range
      • e.g. width, height
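
The distinction between ordinal and nominal data matters when values are turned into numbers for a model. The following is a minimal sketch in plain Python, assuming a hypothetical three-level rating scale (the "ok" level and the rank values are illustrative, not from the taxonomy above): ordinal labels are mapped to integers that preserve their order, while nominal labels are one-hot encoded so that no order is implied.

```python
# Ordinal: the categories have an order, so map them to integers that preserve it.
# The categories and ranks here are illustrative assumptions.
ORDINAL_RANKS = {"bad": 0, "ok": 1, "good": 2}


def encode_ordinal(values, ranks=ORDINAL_RANKS):
    """Replace each ordinal label with its rank."""
    return [ranks[v] for v in values]


def encode_one_hot(values):
    """One-hot encode nominal labels so that no order is implied."""
    categories = sorted(set(values))  # sorted only to get a stable column order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


ratings = ["good", "bad", "ok", "good"]  # ordinal
pets = ["cat", "dog", "cat", "bird"]     # nominal

encoded_ratings = encode_ordinal(ratings)
categories, rows = encode_one_hot(pets)

print(encoded_ratings)
print(categories)
for row in rows:
    print(row)
```

Note that the integer ranks only record order; the gap between 0 and 1 is not claimed to equal the gap between 1 and 2.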

Quality

  • Training data should be representative of the data the model will make predictions on
  • Sampling noise: a small sample can be unrepresentative purely by chance, which leads to models that make imprecise predictions [1]
  • Sampling bias: some members of the population are more or less likely to be included in the sample than others, so the sample does not reflect the original data
  • Discard outliers that would otherwise distort the model
  • Ignore or impute missing values, or train models with and without those values and compare their performance
  • Feature engineering: feature selection (choosing the most useful features), feature extraction (combining features to make more useful ones, e.g. dimensionality reduction), feature creation (creating new features by gathering new data)
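
Two of the cleaning steps above, imputing missing values and discarding outliers, can be sketched with the standard library. This is one possible approach under stated assumptions, not the only one: missing values are represented as `None` and filled with the median, and outliers are defined by the common 1.5 × IQR rule. The `heights_cm` values are made up for illustration.

```python
import statistics


def impute_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def discard_outliers(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Made-up data: one missing value and one implausible reading.
heights_cm = [160, 165, None, 170, 172, 168, 999, 163]

filled = impute_missing(heights_cm)   # median is robust to the 999 reading
cleaned = discard_outliers(filled)

print(filled)
print(cleaned)
```

The median is used for imputation here because, unlike the mean, it is barely affected by the outlier that has not yet been removed.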

Quantity

  • Simple algorithms trained on more data can outperform complex algorithms trained on smaller datasets
  • Trade-off: the cost of acquiring and storing more data vs. the cost of tuning algorithms
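
One reason more data helps is that sampling noise shrinks as the sample grows. The sketch below illustrates this on a synthetic population; the population parameters, sample sizes, number of trials, and the helper name `spread_of_sample_means` are all assumptions chosen for the demonstration. It repeatedly draws random samples and measures how much the sample mean varies from draw to draw.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, trials=200, seed=0):
    """Standard deviation of the sample mean across repeated random samples."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


# Synthetic population (illustrative): 10,000 values around 170 with spread 10.
population_rng = random.Random(42)
population = [population_rng.gauss(170, 10) for _ in range(10_000)]

small = spread_of_sample_means(population, sample_size=10)
large = spread_of_sample_means(population, sample_size=1000)

print(f"spread of the mean with n=10:   {small:.2f}")
print(f"spread of the mean with n=1000: {large:.2f}")
```

Estimates from the larger samples should vary far less between draws, which is the chance variation that a model trained on a small sample inherits.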

References & Further Reading

Footnotes

  1. http://economistjourney.blogspot.com/2018/06/what-is-sampling-noise.html