Pipeline in machine learning
- May 21
The main goal of building a pipeline is control. A well-organized pipeline makes the implementation more flexible.
Recently at work I needed to refactor my pipelines, and I decided to make some improvements along the way. I will share them with you in a few mini posts.
The first thing I did was switch to a new project structure: Cookiecutter Data Science.
This structure is logical, standardized, and flexible. All you need is to install it and start a project:
pip install cookiecutter
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science
The directory layout of your new project can be seen in the screenshot.
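For reference, here is a condensed sketch of the v1 template layout, based on the Cookiecutter Data Science documentation (details may vary between template versions):

├── Makefile
├── README.md
├── data
│   ├── external
│   ├── interim
│   ├── processed
│   └── raw
├── docs
├── models
├── notebooks
├── references
├── reports
│   └── figures
├── requirements.txt
└── src
    ├── data
    ├── features
    ├── models
    └── visualization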
For my projects I reworked the structure a little; for example, I do not need the src/features, reports, and references folders in computer vision projects.
You can also adapt the structure to your own tasks.
You can read in more detail here:
- Cookiecutter Data Science
Configurations for machine learning projects: Hydra
What is the problem, and why did I choose Hydra? When you launch Python scripts, the number of command-line arguments keeps growing, even though they can often be grouped. Here is an example of such a script:
import argparse

# model_names is assumed to be defined elsewhere (the list of available architectures)
parser = argparse.ArgumentParser()
parser.add_argument('data', metavar='DIR', help='path to dataset')
parser.add_argument('-a', '--arch', metavar='ARCH', default='resnet18', choices=model_names,
                    help='model architecture: ' + ' | '.join(model_names) + ' (default: resnet18)')
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')
parser.add_argument('--epochs', default=90, type=int, metavar='N',
                    help='number of total epochs to run')
A common solution for controlling this growing complexity is moving to configuration files. Configuration files can be hierarchical and can help reduce the complexity of the code that defines command-line arguments. But they have drawbacks of their own.
- During experiments you may need to run the application with different configuration parameters. At first you can simply edit the configuration file before each run, but you will soon find it hard to track which changes were associated with which run.
- Configuration files become monolithic. If you want your code to use different configuration parameters, say one set for the ImageNet dataset and one for CIFAR-10, you have two options: maintain two configuration files, or put both sets of parameters into a single configuration file (like the sketch below) and somehow use only what you need at run time.
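Here is a hypothetical sketch of such a monolithic config; the keys are made up for illustration:

# config.yaml: both datasets live in one file,
# and the code has to pick the right block at run time
active_dataset: imagenet
imagenet:
  path: /datasets/imagenet
cifar10:
  path: /datasets/cifar10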
The solution to all of the inconveniences above is Hydra.
Hydra lets you build a composition of configurations. Composition works both from configuration files and from the command line, and anything in the composed configuration can also be overridden on the command line.
Example of use:
File conf/config.yaml:
defaults:
  - dataset: cifar10
File conf/dataset/imagenet.yaml:
path: /datasets/imagenet
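Since config.yaml selects cifar10 by default, a matching conf/dataset/cifar10.yaml is assumed to exist as well:

path: /datasets/cifar10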
File app.py:

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config")
def my_app(cfg: DictConfig) -> None:
    # print the fully composed configuration
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    my_app()
On startup the default dataset parameter is used, but you can also override a parameter from the console: python app.py dataset.path=/datasets/cifar10
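With the configs above, running python app.py with no arguments would print something like the following (assuming the cifar10 file sketched earlier):

dataset:
  path: /datasets/cifar10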
Another cool feature is multirun: Hydra can run your function several times, creating a new configuration object each time. This is very convenient for sweeping over parameters without writing extra code. For example, we can run all 4 combinations (2 datasets x 2 optimizers):
python app.py --multirun dataset=imagenet,cifar10 optimizer=adam,nesterov
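Each of these runs gets its own working directory; by default Hydra writes them under a multirun/ folder, roughly like this (the exact layout depends on the Hydra version):

multirun/
└── <date>/
    └── <time>/
        ├── 0/   # dataset=imagenet optimizer=adam
        ├── 1/   # dataset=imagenet optimizer=nesterov
        ├── 2/   # dataset=cifar10 optimizer=adam
        └── 3/   # dataset=cifar10 optimizer=nesterov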
In this mini note I tried to describe the configuration problems you run into when writing pipelines, and some of the features Hydra offers.
To learn more about Hydra, I suggest reading and watching:
- Hydra - A Fresh Look At Configuration for Machine Learning Projects
- How to conduct experiments effectively, Roman Suvorov
- Arthur Kuzin: DL Pipelines Tips & Tricks
#machinelearning #artificialintelligence #ai #datascience #programming #technology #deeplearning #bigdata