Why One-Hot Encode Data in Machine Learning?

Updated: 06/10/2021 by Computer Hope

Categorical data is information that is divided into groups. For example, if an organisation or agency collects biodata on its employees, the resulting data is referred to as categorical. It is called categorical because it can be grouped according to the variables present in the biodata, such as sex or state of residence.

    Some examples include:
  • A “pet” variable with the values “dog” and “cat”.
  • A “color” variable with the values “red”, “green” and “blue”.
  • A “place” variable with the values “first”, “second” and “third”.

    How to Convert Categorical Data to Numerical Data?

    This involves two steps:
    • Integer Encoding
    • One-Hot Encoding
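
    The two steps can be sketched in plain Python. This is only a minimal illustration, not a library implementation: the sample list and the helper names integer_encode and one_hot_encode are hypothetical, and the choice to number categories in sorted order is an assumption. It reuses the “color” variable from the examples above.

```python
# Hypothetical sample data: the "color" variable from the examples above.
colors = ["red", "green", "blue", "green"]


def integer_encode(values):
    """Step 1: map each distinct category to an integer (sorted order is an assumption)."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


def one_hot_encode(codes, num_categories):
    """Step 2: turn each integer code into a vector containing a single 1."""
    vectors = []
    for code in codes:
        row = [0] * num_categories
        row[code] = 1  # only the position of this category is "hot"
        vectors.append(row)
    return vectors


codes, mapping = integer_encode(colors)
one_hot = one_hot_encode(codes, len(mapping))

print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(codes)    # [2, 1, 0, 1]
for color, row in zip(colors, one_hot):
    print(color, row)
```

    Each row has as many positions as there are categories, and exactly one of them is set to 1, so no artificial ordering such as red > green is implied by the numbers.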

    One-Hot Encoding

    For example:
    Consider data in which bikes, their corresponding categorical values, and their prices are given.
    Bike | Categorical value of Bike | Price

    The output after one-hot encoding the data is given as follows:


    You’re playing with ML models and you encounter the term “one-hot encoding” all over the place. You see the sklearn documentation for the one-hot encoder and it says “Encode categorical integer features using a one-hot aka one-of-K scheme.” It’s not all that clear, right? Or at least it was not for me.
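
    To make the sklearn description a bit more concrete, here is a minimal sketch of the encoder on a tiny example. It assumes scikit-learn is installed, and the sample colors data is hypothetical. The default settings are used and the sparse result is converted with toarray() so it can be printed.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data: one column holding the "color" variable.
# The encoder expects a 2-D input, so each value is wrapped in its own row.
colors = [["red"], ["green"], ["blue"], ["green"]]

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array.
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # the K categories the encoder found
print(encoded)                 # one row per sample, one column per category
```

    With K = 3 categories, every row has three columns and exactly one of them is 1, which is what “one-of-K” refers to.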