Case study | Migrating a Caffe2 computer vision pipeline to Managed Spot Training in Amazon SageMaker

AWS_ AI developer community 2021-01-12 19:07:51


This article is co-authored by guest authors from SNCF and Olexya.

It describes how the French state-owned railway company Société Nationale des Chemins de fer Français (SNCF), with the help of its technology partner Olexya, used the ML services provided by AWS to research, develop, and deploy innovative computer vision solutions.

Background

SNCF was founded in 1938 and currently has more than 270,000 employees. SNCF Réseau, a subsidiary of SNCF, is responsible for managing and operating the railway network infrastructure. SNCF Réseau and its technology partner Olexya deployed a full set of innovative solutions intended to support infrastructure operations while maintaining a high level of infrastructure safety and quality. Field teams use computer vision to detect anomalies in the infrastructure.

SNCF Réseau's researchers have extensive ML experience, and one team had already developed an on-premises computer vision inspection model with the Caffe2 deep learning framework. The scientists then contacted SNCF Réseau's technology partner Olexya, who helped them provision GPU resources to support model iteration. To keep operating costs low and productivity high while preserving the full flexibility of the scientific code, Olexya decided to use Amazon SageMaker to orchestrate training and inference of the Caffe2 model.

The process involves the following steps:

  1. Create a custom Docker image.
  2. Configure training data reads through Amazon Simple Storage Service (Amazon S3) data channels.
  3. Achieve cost-effective training with Amazon SageMaker Spot GPU training.
  4. Achieve cost-effective inference with the Amazon SageMaker training API.

Create a custom Docker image

The team created a Docker image that packages the original Caffe2 code in compliance with the Amazon SageMaker Docker specification. Amazon SageMaker supports multiple data sources and has advanced integration with Amazon S3. Datasets stored in Amazon S3 can be automatically fetched into the training container running on Amazon SageMaker.

To process the training data available in Amazon S3 smoothly, Olexya had to make the training code read from the associated local path /opt/ml/input/data/<channel name>. Similarly, the model output location had to be set to /opt/ml/model. This way, once the training job finishes, Amazon SageMaker can compress the trained model artifacts and send them to Amazon S3.
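As a minimal sketch of this container contract, the Python below reads the files of one channel directory and writes an artifact to the model directory. The channel name `training`, the helper names, and the stub JSON "model" are assumptions made for illustration. The article does not show the actual Caffe2 training code.

```python
import json
from pathlib import Path

# Standard locations inside a SageMaker training container.
INPUT_DATA_ROOT = Path("/opt/ml/input/data")
MODEL_DIR = Path("/opt/ml/model")


def channel_dir(channel_name, root=INPUT_DATA_ROOT):
    """Return the local directory where SageMaker places a channel's data."""
    return Path(root) / channel_name


def train(channel_name="training", data_root=INPUT_DATA_ROOT, model_dir=MODEL_DIR):
    """Placeholder training step (hypothetical): list inputs, save a stub artifact.

    A real job would run the Caffe2 training loop here instead.
    """
    data_dir = channel_dir(channel_name, data_root)
    files = sorted(p.name for p in data_dir.iterdir() if p.is_file())

    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    artifact = model_dir / "model.json"
    # Files written under the model directory are what SageMaker compresses
    # and uploads to Amazon S3 when the job ends.
    artifact.write_text(json.dumps({"trained_on": files}))
    return artifact
```

The `data_root` and `model_dir` parameters exist only so the function can be exercised outside a container, for example against a temporary directory.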

Configure training data reads through Amazon S3 data channels

The original Caffe2 training code is parameterized through a detailed and flexible YAML configuration file, so researchers can change model settings directly without touching the scientific code. This external file can easily be kept outside the container and read into it at training time through a data channel. A data channel here means an Amazon S3 ARN that is passed to the Amazon SageMaker SDK at training time and whose contents are added to the Amazon SageMaker container when training starts. Olexya configured the data channels to read by copy (File mode), which is the default configuration in Amazon SageMaker. Data can also be streamed through Unix pipes (Pipe mode).
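The difference between the two modes shows up in where the training code finds its data. The sketch below maps each channel to a local path based on the input mode that SageMaker records in the container at /opt/ml/input/config/inputdataconfig.json. The channel names `training` and `config` are assumptions, since the article does not name the channels Olexya used, and the example JSON is a trimmed, hypothetical version of that file.

```python
import json
from pathlib import Path

INPUT_DATA_ROOT = Path("/opt/ml/input/data")


def channel_paths(input_data_config, epoch=0, root=INPUT_DATA_ROOT):
    """Map each channel name to the local path the training code should open.

    File mode: a directory holding a copy of the S3 objects.
    Pipe mode: a FIFO named <channel>_<epoch> that streams the data.
    """
    paths = {}
    for name, settings in input_data_config.items():
        # Treat a missing mode as File mode (an assumption of this sketch).
        mode = settings.get("TrainingInputMode", "File")
        if mode == "Pipe":
            paths[name] = Path(root) / f"{name}_{epoch}"
        else:
            paths[name] = Path(root) / name
    return paths


# Hypothetical, trimmed contents of inputdataconfig.json for a job with one
# channel for images and one for the YAML configuration file.
example_config = json.loads("""
{
  "training": {"TrainingInputMode": "File"},
  "config":   {"TrainingInputMode": "File"}
}
""")

for name, path in channel_paths(example_config).items():
    print(f"{name}: {path}")
```

Keeping the YAML file in its own channel means a new experiment only needs a new object in Amazon S3, not a rebuilt Docker image.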

Achieve cost-effective training with Amazon SageMaker Spot GPU training

The team provisioned the training infrastructure with ml.p3.2xlarge GPU-accelerated compute instances. Amazon SageMaker ml.p3.2xlarge instances are particularly well suited to deep learning computer vision workloads: each has one NVIDIA V100 GPU with 5,120 cores and 16 GB of high-bandwidth memory (HBM), which allows large models to be trained quickly.

In addition, the Amazon SageMaker training API was set up to use managed Spot Instances, which were reported to save 71% of the cost compared with Amazon SageMaker On-Demand Instance prices. Amazon SageMaker Managed Spot Training is an Amazon SageMaker option that uses Amazon Elastic Compute Cloud (Amazon EC2) Spot Instance resources for training. Amazon EC2 Spot Instances sell spare, unused Amazon EC2 compute capacity to customers at a steep discount. In Amazon SageMaker, the use of Spot Instances is fully managed by the service itself, and users can request it at any time by setting two training SDK parameters:

  • train_use_spot_instances=True, to request the use of Amazon SageMaker Spot Instances.
  • train_max_wait, to set the maximum acceptable waiting time, in seconds.
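A small helper can make these settings explicit. The sketch below only builds the keyword arguments and does not call AWS. The helper name and the 12-hour and 24-hour limits are hypothetical values, not figures from the project. The version switch reflects the SageMaker Python SDK v2 rename of these parameters to use_spot_instances, max_run, and max_wait.

```python
def spot_training_kwargs(max_run_seconds, max_wait_seconds, sdk_major_version=1):
    """Build the Spot-related keyword arguments for a SageMaker Estimator.

    SDK v1 uses the train_* names quoted above; SDK v2 drops the prefix.
    """
    if max_wait_seconds < max_run_seconds:
        # The wait budget covers both waiting for Spot capacity and the
        # training run itself, so it cannot be shorter than the run limit.
        raise ValueError("max wait must be greater than or equal to max run")
    if sdk_major_version >= 2:
        return {
            "use_spot_instances": True,
            "max_run": max_run_seconds,
            "max_wait": max_wait_seconds,
        }
    return {
        "train_use_spot_instances": True,
        "train_max_run": max_run_seconds,
        "train_max_wait": max_wait_seconds,
    }


# Hypothetical limits: up to 12 hours of training, 24 hours including waits.
kwargs = spot_training_kwargs(12 * 3600, 24 * 3600)
print(kwargs)

# The result would then be passed as **kwargs to sagemaker.estimator.Estimator
# together with the container image, IAM role, and instance settings
# (not executed here: it needs the sagemaker package and AWS credentials).
```

Because Spot capacity can be reclaimed, a long job generally benefits from checkpointing so that an interrupted run can resume instead of restarting.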

Achieve cost-effective inference with the Amazon SageMaker training API

In this research project, end users could tolerate inference interruptions and instantiation delays. To optimize costs further, the team therefore ran the inference code through the Amazon SageMaker training API, so that managed Amazon SageMaker Spot Instances could also be used for inference. Besides the cost advantage, using the training API also flattened the learning curve, because the same API is used in both the training and inference cycles.
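The article does not describe how the inference job was wired into the container. One hypothetical way to reuse a single image for both purposes is to dispatch on a hyperparameter, which SageMaker exposes to the container as string values in /opt/ml/input/config/hyperparameters.json. The `mode` key and the handler functions below are assumptions for illustration only.

```python
import json
from pathlib import Path

HYPERPARAMETERS_FILE = Path("/opt/ml/input/config/hyperparameters.json")


def read_mode(path=HYPERPARAMETERS_FILE, default="train"):
    """Return the hypothetical 'mode' hyperparameter, or the default if unset."""
    path = Path(path)
    if not path.exists():
        return default
    return json.loads(path.read_text()).get("mode", default)


def run_training():
    # Placeholder: the Caffe2 training loop would run here.
    return "training"


def run_inference():
    # Placeholder: would load a model supplied through a data channel
    # (an assumption) and write batch predictions to an output directory.
    return "inference"


def main(path=HYPERPARAMETERS_FILE):
    """Entry point: run the handler selected by the 'mode' hyperparameter."""
    handlers = {"train": run_training, "infer": run_inference}
    mode = read_mode(path)
    if mode not in handlers:
        raise ValueError(f"unknown mode: {mode!r}")
    return handlers[mode]()
```

With this layout, the same job submission code launches either kind of job and only the hyperparameters differ.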

Time and cost savings

Through these four steps, Olexya successfully migrated the on-premises Caffe2 deep computer vision inspection model to Amazon SageMaker for both training and inference. More impressively, the team learned the tooling in about three weeks, and the model's training cycle dropped from three days to ten hours. The team further estimated that Amazon SageMaker reduced total cost of ownership (TCO) by 71% compared with the GPU cluster originally available on premises. Other optimization techniques could reduce costs further, such as using Amazon SageMaker automatic model tuning for intelligent hyperparameter search, and mixed-precision training with deep learning frameworks that support it.

Besides SNCF Réseau, many AWS customers in the transportation and logistics industry have improved their business operations and capacity to innovate with the help of ML. Examples include:

  • Aramex, a logistics company based in Dubai, uses ML to solve address parsing and transit time prediction problems. The company uses 150 models and runs 450,000 prediction jobs per day.
  • Transport for New South Wales uses cloud services to predict passenger numbers across the entire transport network, in order to plan workforce and asset utilization better and thereby improve customer satisfaction.
  • Korean Air uses Amazon SageMaker to launch multiple innovation projects aimed at predictive maintenance of its aircraft fleet.


Amazon SageMaker supports the entire ML development cycle, from data annotation to production deployment to operational monitoring. As the work of Olexya and SNCF Réseau shows, Amazon SageMaker is framework-agnostic and can accommodate all kinds of deep learning workloads and frameworks. In addition to the prebuilt Docker images and SDK objects for Sklearn, TensorFlow, PyTorch, MXNet, XGBoost, and Chainer, users can bring custom Docker containers to run almost any framework, such as PaddlePaddle, CatBoost, R, and Caffe2. ML practitioners, don't hesitate: start testing Amazon SageMaker and share the experience and lessons you gather along the way!


Copyright statement
Author of this article: [AWS_ AI developer community]. Please include the original link when reprinting. Thank you.