.. _kernel-hpc:

Optional HPC support
====================

The optional ZOO-Kernel HPC support lets you use OGC WPS to invoke the
remote execution of OTB applications. The current implementation
relies on `OpenSSH <https://www.openssh.com/>`_ and the
`Slurm <http://slurm.schedmd.com/>`_ scheduler.

.. note::

   |slurm| `Slurm <http://slurm.schedmd.com/>`_ is an acronym for Simple Linux Utility for Resource Management. Learn more on the official `website <https://slurm.schedmd.com/overview.html>`__.

To execute an OGC WPS service using this HPC support, one should use
OGC WPS version 2.0.0 and an asynchronous request. Any attempt to
execute an HPC service synchronously will fail with the message "The
synchronous mode is not supported by this type of service". The
ZOO-Kernel is not solely responsible for the execution: it has to wait
for the execution on the HPC server to end before it can continue its
own execution, and transferring the data from the WPS server to the
cluster and downloading the data produced by the execution both take
time. Hence, when an OGC WPS client requests GetCapabilities or
DescribeProcess, only the "async-execute" mode is listed in the
jobControlOptions attribute of the HPC services.
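
As an illustration, a minimal WPS 2.0.0 asynchronous Execute request is
sketched below; the process identifier ``OTB.BandMath`` and the output
name are placeholders, the important part is the ``mode="async"``
attribute.

.. code-block:: guess

    <wps:Execute xmlns:wps="http://www.opengis.net/wps/2.0"
                 xmlns:ows="http://www.opengis.net/ows/2.0"
                 service="WPS" version="2.0.0"
                 response="document" mode="async">
      <ows:Identifier>OTB.BandMath</ows:Identifier>
      <!-- data inputs omitted, provide them as usual for the service -->
      <wps:Output id="out" transmission="reference"/>
    </wps:Execute>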

The sequence diagram below shows the interactions between the OGC WPS
server (ZOO-Kernel), the mail daemon (running on the OGC WPS server),
the callback service and the HPC server during the execution of an HPC
service. The dashed lines represent the behavior when the optional
callback service invocation has been activated. These invocations are
made asynchronously to lower their impact on the whole process, for
instance in case the callback service fails.

For now, the callback service is not a WPS service but an independent server.

|hpc_support|

.. |slurm| image:: https://slurm.schedmd.com/slurm_logo.png
   :height: 100px
   :width: 100px
   :scale: 45%
   :alt: Slurm logo

.. |hpc_support| image:: ../_static/hpc_schema.svg
   :scale: 65%
   :alt: HPC Support schema

Installation and configuration
------------------------------

Follow the steps described below in order to activate the ZOO-Project optional HPC support.

Prerequisites
.....................

* latest `ZOO-Kernel
  <http://zoo-project.org/trac/browser/trunk/zoo-project/zoo-kernel>`_
  trunk version
* `libssh2 <https://www.libssh2.org/>`_
* `MapServer <http://www.mapserver.org>`_
* access to a server with `Slurm <http://slurm.schedmd.com>`_
  and `OrfeoToolBox <https://www.orfeo-toolbox.org>`_.
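
As an indication only, on a Debian or Ubuntu based WPS server the
libssh2 development files can usually be installed as shown below;
package names may differ on your distribution, and MapServer can either
be installed from packages or built from source.

.. code-block:: guess

    # install the libssh2 headers and library (Debian/Ubuntu package name)
    sudo apt-get install libssh2-1-dev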

Installation steps
...........................

ZOO-Kernel
****************************

Compile ZOO-Kernel using the configuration options as shown below:

.. code-block:: guess

    cd zoo-kernel
    autoconf
    ./configure --with-hpc=yes --with-ssh2=/usr --with-mapserver=/usr --with-ms-version=7
    make
    sudo make install

Optionally, you can ask your ZOO-Kernel to invoke a callback service
which is responsible for recording the execution history and the data
produced. In such a case, add the ``--with-callback=yes`` option to the
configure command (see the example after the note below).

.. note::

   In case you need other languages to be activated, such as Python
   for example, please use the corresponding option(s).
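
As a sketch, a configure command enabling both the callback support and,
assuming Python services are also needed, the Python support could look
like the following (the ``/usr`` prefixes are examples):

.. code-block:: guess

    cd zoo-kernel
    autoconf
    ./configure --with-hpc=yes --with-ssh2=/usr --with-mapserver=/usr --with-ms-version=7 \
                --with-callback=yes --with-python=/usr
    make
    sudo make install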

FinalizeHPC WPS Service
****************************

To be informed that the remote OTB application has ended on the
cluster, one should invoke the FinalizeHPC service. It is responsible
for connecting to the HPC server over SSH and running an ``sacct``
command to extract detailed information about the sbatch job that was
run. If the ``sacct`` command succeeds and the job is no longer running
on the cluster, the information is stored in a local configuration file
containing a ``[henv]`` section definition, and the service connects to
the Unix domain socket (opened by the ZOO-Kernel that initially
scheduled the service through Slurm) to signal the end of the run on
the cluster. This lets the initial ZOO-Kernel continue its execution by
downloading the output data produced by the OTB application on the
cluster. This service should therefore be built and deployed on your
WPS server. You can use the following commands to do so.

.. code-block:: guess

    cd zoo-service/utils/hpc
    make
    cp cgi-env/* /usr/lib/cgi-bin
    mkdir -p /var/data/xslt/
    cp xslt/updateExecute.xsl /var/data/xslt/

You should also copy the

.. note::

   FinalizeHPC should be called from a daemon responsible for reading
   the mails sent by the cluster to the WPS server.
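
For reference, the kind of accounting query FinalizeHPC is expected to
run over SSH is sketched below; the job id ``12345`` is a placeholder
and the real field list comes from the ``remote_command_opt`` parameter
described in the next section.

.. code-block:: guess

    # query the Slurm accounting database for a finished job, in parsable form
    sacct -j 12345 -p --format=JobID,JobName,State,ExitCode,Elapsed,Start,End,NCPUS,NodeList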

Configuration steps
...............................

Main configuration file
****************************

When the HPC support is activated, you can use different HPC
configurations by adding a ``confId`` next to the usual
``serviceType=HPC`` in your zcfg file. To find out which configuration
a service should use, the ZOO-Kernel needs to know the options for
creating the relevant sbatch script.

Also, you can define multiple configurations to run the OTB application
on the cluster(s) depending on the size of the inputs. In the section
corresponding to your ``confId`` (``HPC_Sample`` in the example below),
you should define the thresholds for both raster
(``preview_max_pixels``) and vector (``preview_max_features``) inputs.
If the raster or the vector dataset exceeds the defined limit, the
``fullres_conf`` configuration is used; otherwise the ``preview_conf``
one is. For instance, with ``preview_max_pixels=820800``, a 1200x1000
raster (1,200,000 pixels) is dispatched to the ``fullres_conf``
configuration.

For each of these configurations, you have to define the parameters
used to connect to the HPC server, by providing ``ssh_host``,
``ssh_port``, ``ssh_user`` and ``ssh_key``. You should also set where
the input data will be stored on the HPC server, by defining
``remote_data_path`` (the default directory to store data),
``remote_persitent_data_path`` (the directory to store data
considered as shared, see below) and ``remote_work_path`` (the
directory used to store the SBATCH script created locally and then
uploaded by the ZOO-Kernel).
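
As an illustration, and reusing the ``ssh_user`` and ``ssh_host`` values
from the example configuration below, the key pair referenced by
``ssh_key`` could be created for the web server user (``www-data`` here,
adapt to your setup) and installed on the cluster as follows:

.. code-block:: guess

    # create the key directory and a password-less key pair owned by the web server user
    sudo mkdir -p /var/www/.ssh && sudo chown www-data /var/www/.ssh
    sudo -u www-data ssh-keygen -t rsa -f /var/www/.ssh/id_rsa -N ""
    # install the public key on the HPC server
    sudo -u www-data ssh-copy-id -i /var/www/.ssh/id_rsa.pub cUser@mycluster.org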

There are also multiple options you can use to control how your
applications are run through SBATCH. You can define them using
``jobscript_header``, ``jobscript_body`` and ``jobscript_footer``, or by
using ``sbatch_options_<SBATCH_OPTION>`` where ``<SBATCH_OPTION>``
should be replaced by a real sbatch option name, like ``workdir`` in the
following example. To create the SBATCH file, the ZOO-Kernel writes a
file starting with the content of the file pointed to by
``jobscript_header`` (if any; a default header is used otherwise),
followed by every option defined in ``sbatch_options_*`` plus a specific
one, ``job-name``. Then ``jobscript_body`` is appended (if any, usually
to load the required modules), then the ZOO-Kernel adds the invocation
of the OTB application and, finally, the optional ``jobscript_footer``,
if any.
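
To give an idea of the result, a generated script could look like the
sketch below, assuming the header and body files shown later in this
section and a hypothetical OTB application invocation; the exact
content depends entirely on your configuration.

.. code-block:: guess

    #!/bin/sh
    #SBATCH --ntasks=1
    # ... remaining lines taken from the jobscript_header file ...
    #SBATCH --workdir=/home/cUser/wps_executions/script
    #SBATCH --job-name=...   # job name added automatically by the ZOO-Kernel
    # lines taken from the jobscript_body file, typically module loading
    module load OTB/6.1-serial-24threads
    # invocation of the OTB application (hypothetical example)
    otbcli_BandMath -il input.tif -out out.tif -exp "im1b1*2"
    # optional lines taken from the jobscript_footer file, if defined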

Finally, ``remote_command_opt`` should contain all the fields you want
the ``sacct`` command run by the FinalizeHPC service to extract.
``billing_nb_cpu`` is used for billing purposes, to define a cost for
using a specific configuration (preview or fullres).

In addition to the specific ``HPC_<ID>`` section and the corresponding
fullres and preview ones, you should use the ``shared`` parameter of the
``[security]`` section to set the URLs from which the downloaded data
should be considered as shared. This means that, even if these resources
require authentication to be accessed, any authenticated user will be
allowed to access the cache file, even if it was created by somebody
else. Also, this shared cache won't contain any authentication
information in the cache file name, as is usually the case.

.. code-block:: guess

    [HPC_Sample]
    preview_max_pixels=820800
    preview_max_features=100000
    preview_conf=hpc-config-2
    fullres_conf=hpc-config-1

    [hpc-config-1]
    ssh_host=mycluster.org
    ssh_port=22
    ssh_user=cUser
    ssh_key=/var/www/.ssh/id_rsa.pub
    remote_data_path=/home/cUser/wps_executions/data
    remote_persitent_data_path=/home/cUser/wps_executions/datap
    remote_work_path=/home/cUser/wps_executions/script
    jobscript_header=/usr/lib/cgi-bin/config-hpc1_header.txt
    jobscript_body=/usr/lib/cgi-bin/config-hpc1_body.txt
    sbatch_options_workdir=/home/cUser/wps_executions/script
    sbatch_substr=Submitted batch job
    billing_nb_cpu=1
    remote_command_opt=AllocCPUS,AllocGRES,AllocNodes,AllocTRES,Account,AssocID,AveCPU,AveCPUFreq,AveDiskRead,AveDiskWrite,AvePages,AveRSS,AveVMSize,BlockID,Cluster,Comment,ConsumedEnergy,ConsumedEnergyRaw,CPUTime,CPUTimeRAW,DerivedExitCode,Elapsed,Eligible,End,ExitCode,GID,Group,JobID,JobIDRaw,JobName,Layout,MaxDiskRead,MaxDiskReadNode,MaxDiskReadTask,MaxDiskWrite,MaxDiskWriteNode,MaxDiskWriteTask,MaxPages,MaxPagesNode,MaxPagesTask,MaxRSS,MaxRSSNode,MaxRSSTask,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask,MinCPU,MinCPUNode,MinCPUTask,NCPUS,NNodes,NodeList,NTasks,Priority,Partition,QOS,QOSRAW,ReqCPUFreq,ReqCPUFreqMin,ReqCPUFreqMax,ReqCPUFreqGov,ReqCPUS,ReqGRES,ReqMem,ReqNodes,ReqTRES,Reservation,ReservationId,Reserved,ResvCPU,ResvCPURAW,Start,State,Submit,Suspended,SystemCPU,Timelimit,TotalCPU,UID,User,UserCPU,WCKey,WCKeyID

    [hpc-config-2]
    ssh_host=mycluster.org
    ssh_port=22
    ssh_user=cUser
    ssh_key=/var/www/.ssh/id_rsa.pub
    remote_data_path=/home/cUser/wps_executions/data
    remote_persitent_data_path=/home/cUser/wps_executions/datap
    remote_work_path=/home/cUser/wps_executions/script
    jobscript_header=/usr/lib/cgi-bin/config-hpc2_header.txt
    jobscript_body=/usr/lib/cgi-bin/config-hpc2_body.txt
    sbatch_options_workdir=/home/cUser/wps_executions/script
    sbatch_substr=Submitted batch job
    billing_nb_cpu=4
    remote_command_opt=AllocCPUS,AllocGRES,AllocNodes,AllocTRES,Account,AssocID,AveCPU,AveCPUFreq,AveDiskRead,AveDiskWrite,AvePages,AveRSS,AveVMSize,BlockID,Cluster,Comment,ConsumedEnergy,ConsumedEnergyRaw,CPUTime,CPUTimeRAW,DerivedExitCode,Elapsed,Eligible,End,ExitCode,GID,Group,JobID,JobIDRaw,JobName,Layout,MaxDiskRead,MaxDiskReadNode,MaxDiskReadTask,MaxDiskWrite,MaxDiskWriteNode,MaxDiskWriteTask,MaxPages,MaxPagesNode,MaxPagesTask,MaxRSS,MaxRSSNode,MaxRSSTask,MaxVMSize,MaxVMSizeNode,MaxVMSizeTask,MinCPU,MinCPUNode,MinCPUTask,NCPUS,NNodes,NodeList,NTasks,Priority,Partition,QOS,QOSRAW,ReqCPUFreq,ReqCPUFreqMin,ReqCPUFreqMax,ReqCPUFreqGov,ReqCPUS,ReqGRES,ReqMem,ReqNodes,ReqTRES,Reservation,ReservationId,Reserved,ResvCPU,ResvCPURAW,Start,State,Submit,Suspended,SystemCPU,Timelimit,TotalCPU,UID,User,UserCPU,WCKey,WCKeyID

    [security]
    attributes=Cookie,Cookies
    hosts=*
    shared=myhost.net/WCS

You can see below an example of a ``jobscript_header`` file.

.. code-block:: guess

    #!/bin/sh
    #SBATCH --ntasks=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --exclusive
    #SBATCH --distribution=block:block
    #SBATCH --partition=partName
    #SBATCH --mail-type=END                  # Mail events (NONE, BEGIN, END, FAIL, ALL)
    #SBATCH --mail-user=user@wps_server.net  # Where to send mail

You can see below an example of a ``jobscript_body`` file.

.. code-block:: guess

    # Load all the modules
    module load cv-standard
    module load cmake/3.6.0
    module load gcc/4.9.3
    module load use.own
    module load OTB/6.1-serial-24threads
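
No ``jobscript_footer`` example is shown here; if you define one, its
content is simply appended after the OTB invocation, for instance
(hypothetical content):

.. code-block:: guess

    # executed after the OTB application, e.g. to trace the end of the job
    echo "Job finished on $(hostname) at $(date)"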

In case you have activated the callback service, you should also have a
``[callback]`` section, in which you define: ``url``, the URL used to
invoke the callback service; ``prohibited``, the list of services (if
any) that should not trigger an invocation of the callback service;
and ``template``, pointing to the local ``updateExecute.xsl`` file used
to replace any input provided by value with a reference to the locally
published OGC WFS/WCS web services. The resulting Execute request is
the one provided to the callback service.

.. code-block:: guess

    [callback]
    url=http://myhost.net:port/callbackUpdate/
    prohibited=FinalizeHPC,Xml2Pdf,DeleteData
    template=/home/cUser/wps_dir/updateExecute.xsl

OGC WPS Services metadata
****************************

To produce the zcfg files corresponding to the metadata definition of
the WPS services, you can use the otb2zcfg tool. You will then need to
replace ``serviceType=OTB`` with ``serviceType=HPC`` and, optionally,
add one line containing the configuration identifier, for instance
``confId=HPC_Sample``.

Please refer to the `otb2zcfg
<./orfeotoolbox.html#services-configuration-file>`_ documentation to
learn how to use this tool.
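
For illustration, the top of an edited zcfg could look like the fragment
below; the service name and metadata are hypothetical, only the
``serviceType`` and ``confId`` lines matter here (the DataInputs and
DataOutputs blocks produced by otb2zcfg are kept unchanged and are
omitted).

.. code-block:: guess

    [BandMath]
     Title = BandMath
     Abstract = Sample OTB application exposed through the HPC support
     serviceType = HPC
     confId = HPC_Sample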

Using the HPC support, when you define one output, 1 to 3 inner
outputs are automatically created for the defined output:

download_link
    URL to download the generated output

wms_link
    URL to access the OGC WMS for this output (only in case
    ``useMapserver=true``)

wcs_link/wfs_link
    URL to access the OGC WCS or WFS for this output (only in case
    ``useMapserver=true``)

You can see below an example of the Output node resulting from the
definition of one output named ``out`` and typed as geographic imagery.

.. code-block:: guess

    <wps:Output>
      <ows:Title>Outputed Image</ows:Title>
      <ows:Abstract>Image produced by the application</ows:Abstract>
      <ows:Identifier>out</ows:Identifier>
      <wps:Output>
        <ows:Title>Download link</ows:Title>
        <ows:Abstract>The download link</ows:Abstract>
        <ows:Identifier>download_link</ows:Identifier>
        <wps:ComplexData>
          <wps:Format default="true" mimeType="image/tiff"/>
          <wps:Format mimeType="image/tiff"/>
        </wps:ComplexData>
      </wps:Output>
      <wps:Output>
        <ows:Title>WMS link</ows:Title>
        <ows:Abstract>The WMS link</ows:Abstract>
        <ows:Identifier>wms_link</ows:Identifier>
        <wps:ComplexData>
          <wps:Format default="true" mimeType="image/tiff"/>
          <wps:Format mimeType="image/tiff"/>
        </wps:ComplexData>
      </wps:Output>
      <wps:Output>
        <ows:Title>WCS link</ows:Title>
        <ows:Abstract>The WCS link</ows:Abstract>
        <ows:Identifier>wcs_link</ows:Identifier>
        <wps:ComplexData>
          <wps:Format default="true" mimeType="image/tiff"/>
          <wps:Format mimeType="image/tiff"/>
        </wps:ComplexData>
      </wps:Output>
    </wps:Output>