peterwashington Nov 01 2021 2 min read
Data Science

Reproducibility of machine learning models is crucial for a variety of reasons. In terms of research ethics, it is critical for other researchers to be able to replicate your results to test their validity. And if you ever want your machine learning models to be used by others, those users should be able to obtain the same results that you advertise for your model.

Reproducibility is clearly important, so how do we ensure that our machine learning process is reproducible? The biggest thing you can do is document your steps: your code should be written for maximum understanding by someone reading it for the first time. Elements of the machine learning process that should be documented include which machine learning method was used, which hyperparameters were selected (e.g., the “k” in k-nearest neighbors or the learning rate in gradient descent), which data were filtered out, and how the data were balanced. All of these decisions, and many more, can drastically alter the machine learning outcomes.
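One lightweight way to document these decisions is to record them in a machine-readable config file saved alongside the trained model. The sketch below is illustrative only: the field names, file name, and chosen values are hypothetical, not a prescribed standard.

```python
import json

# Hypothetical record of the kinds of decisions described above.
# All names and values here are illustrative examples.
config = {
    "model": "k-nearest neighbors",
    "hyperparameters": {"k": 5},  # the "k" in k-NN
    "data_filters": ["dropped rows with missing labels"],
    "balancing": "random undersampling of the majority class",
}

# Writing the config next to the trained model makes the run auditable.
with open("run_config.json", "w") as f:
    json.dump(config, f, indent=2)

# Anyone re-running the pipeline can reload the exact same choices.
with open("run_config.json") as f:
    restored = json.load(f)

assert restored == config
```

Because the file is plain JSON, it can be committed to version control with the code, so every experiment's settings travel with the exact code revision that produced them.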

Another crucial step for reproducibility is to use a consistent random seed. Nothing that a computer does is truly random. Computers instead use what is called a pseudo-random number generator: a deterministic mathematical method for producing a sequence of seemingly random numbers. Pseudo-random number generators are frequently used in data science and machine learning because they can quickly generate such sequences, and because they are reproducible. They are not truly random precisely because they are deterministic: the entire sequence is fixed by an initial value called the random seed. Random seeds are useful because if you feed the same seed into a pseudo-random number generator and start generating numbers, the sequence of generated numbers will be identical across runs. This is critical for creating reproducible machine learning pipelines.
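The seeding behavior described above can be demonstrated with Python's standard-library `random` module; the helper function below is just a minimal sketch of the idea, not part of any particular ML framework.

```python
import random

def pseudo_random_sequence(seed, n=5):
    """Generate n pseudo-random floats from a seeded generator."""
    # Seeding fixes the generator's internal state, so the "random"
    # sequence it produces is fully determined by the seed.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# The same seed always yields the same sequence across runs...
run_a = pseudo_random_sequence(seed=42)
run_b = pseudo_random_sequence(seed=42)
assert run_a == run_b

# ...while a different seed gives a different, but equally
# reproducible, sequence.
run_c = pseudo_random_sequence(seed=7)
assert run_a != run_c
```

In a real pipeline the same principle applies to every source of randomness you use (e.g., NumPy's generators or a library's `random_state`-style parameters): set each seed explicitly once, record it with your other documented settings, and the shuffles, splits, and initializations become repeatable.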