Cloud Storage Concepts


Cloud Storage Concepts: Blob Storage and Job Execution Nodes

Introduction

Two key concepts in cloud computing are Blob Storage and Job Execution Nodes. Understanding both is essential to understanding how tasks are executed in the cloud.

Blob Storage

Blob Storage is a service for storing large amounts of unstructured data, such as text or binary data, that can be accessed from anywhere in the world via HTTP or HTTPS. The term "Blob" stands for Binary Large Object; blobs can hold images, audio files, video files, log files, backup data, or any other data that can be read or written as a byte stream.

Purpose of Blob Storage

The primary purpose of Blob Storage is to allow large amounts of data to be stored and accessed remotely. It's highly scalable, meaning it can handle an increasing amount of data without a decrease in performance. In the context of job execution, Blob Storage is often used to store input data that a job needs to process, as well as the output data that a job produces.
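As a concrete illustration, the short sketch below uploads an input file to blob storage and downloads it again over HTTPS. It assumes an Azure Blob Storage account and the azure-storage-blob Python package; the connection string, container, and blob names are placeholders, not values prescribed by this article or by OneWorkflow.

# A minimal sketch, assuming Azure Blob Storage and the azure-storage-blob package.
# The connection string, container, and blob names below are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-connection-string>")
blob = service.get_blob_client(container="myworkspace",
                               blob="LoadCases/LoadCase1/File1.txt")

# Upload input data that a job will later process.
with open("Input/LoadCases/LoadCase1/File1.txt", "rb") as f:
    blob.upload_blob(f, overwrite=True)

# Download the data again, for example on a job execution node.
data = blob.download_blob().readall()
print(len(data), "bytes downloaded")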

Job Execution Nodes

A Job Execution Node is a computational resource in the cloud where jobs are executed. These nodes can be virtual machines or physical servers, and they are responsible for executing the tasks that make up a job.

Purpose of Job Execution Nodes

The purpose of a Job Execution Node is to provide a dedicated environment for executing a job. This environment can be customized to the needs of the job, including the operating system, installed software, and available resources (CPU, memory, disk space, etc.).

Connection Between Blob Storage and Job Execution Nodes

In a typical workflow, a job execution node will read input data from Blob Storage, perform some computation or processing on this data, and then write the output data back to Blob Storage. This allows the input and output data to be decoupled from the job execution, providing flexibility and scalability.
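The sketch below illustrates this cycle with an in-memory stand-in for a blob storage client; the class and method names are hypothetical and only mirror the read-process-write pattern described above.

# Illustrative only: InMemoryStorage is a hypothetical stand-in for a real
# blob storage client, used here to show the read-process-write cycle.
class InMemoryStorage:
    def __init__(self):
        self._blobs = {}

    def upload_blob(self, name, data):
        self._blobs[name] = data

    def download_blob(self, name):
        return self._blobs[name]


def run_job(storage, input_blob, output_blob):
    raw = storage.download_blob(input_blob)    # 1. read input from blob storage
    result = raw.upper()                       # 2. process the data (placeholder computation)
    storage.upload_blob(output_blob, result)   # 3. write output back to blob storage


storage = InMemoryStorage()
storage.upload_blob("LoadCases/LoadCase1/File1.txt", b"input data")
run_job(storage, "LoadCases/LoadCase1/File1.txt", "LoadCases/LoadCase1/Results.txt")
print(storage.download_blob("LoadCases/LoadCase1/Results.txt"))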

OneWorkflow: Local Job Execution

Building on the concepts of Blob Storage and Job Execution Nodes, OneWorkflow provides two ways to run a job locally:

  • In-place execution: The files stay where they are; nothing is copied.
  • Cloud-simulation mode: Temporary folders mimicking the blob storage and an execution node are created locally at a user-defined location.

In cloud-simulation mode, the temporary folders created at the user-defined location stand in for the blob storage and the execution node of a real cloud environment.

Here's a simple representation of how it works:

[ Workspace Folder ] <--> [ Temporary Blob Storage Folder ] <--> [ Temporary Job Folder (Execution Node) ]

In this setup, the Temporary Blob Storage Folder acts as an intermediary between the Workspace Folder and the Temporary Job Folder, which emulates an Execution Node. The process begins when the Temporary Blob Storage Folder receives input files from the Workspace Folder. These files are then sent to the Temporary Job Folder for processing.

Once the job is completed, the results are sent back to the Temporary Blob Storage Folder. From there, they are loaded back into the Workspace Folder, completing the cycle.

The cloud-simulation mode is particularly useful because it emulates the behavior of a cloud run before the workflow is executed in the actual cloud environment. This allows workflows to be tested and debugged thoroughly on the local machine, saving time and resources while ensuring a smooth transition to the cloud.
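The sketch below only illustrates how the choice between the two local execution modes might be expressed in a run configuration. The class and parameter names are hypothetical placeholders, not the OneWorkflow API; the actual client configuration should be taken from the OneWorkflow documentation and examples.

# Hypothetical configuration sketch: the class and parameter names are
# placeholders, not the actual OneWorkflow API.
from dataclasses import dataclass

@dataclass
class LocalRunConfig:
    workspace_path: str          # root of the workspace folder
    cloud_simulation: bool       # False = in-place execution, True = cloud-simulation mode
    temp_folder: str = ""        # where the temporary blob/job folders are created

# In-place execution: files are used where they already are.
in_place = LocalRunConfig(workspace_path=r"C:\MyWorkspace", cloud_simulation=False)

# Cloud-simulation mode: temporary blob and job folders are created under temp_folder.
simulated = LocalRunConfig(workspace_path=r"C:\MyWorkspace",
                           cloud_simulation=True,
                           temp_folder=r"C:\Temp")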

Understanding the Workspace Folder Structure

To make this more concrete, let's look at how a workspace folder is organized. The following layout is one possible configuration, and it is the one adopted in our examples:

[ Workspace Folder ]
│
├───Input
│   └───LoadCases
│       ├───LoadCase1
│       │   ├───File1.txt
│       │   └───File2.ini
│       └───LoadCase2
│           ├───File1.json
│           └───File2.txt
└───CommonFiles
    ├───File1.json
    └───File2.txt

Let's break down the key components of this structure.

Input: This directory is where all the input files for the simulation are stored. These files are used to set up and run the simulation.

LoadCases: This is a subdirectory within the Input directory. It contains different load cases for the simulation. Each load case is a unique set of conditions under which the simulation is run. For example, LoadCase1 and LoadCase2 could represent different environmental conditions or different sets of input parameters.

CommonFiles: This directory contains files that are common to all load cases. These could include configuration files, shared data files, or any other files that need to be accessible to all parts of the simulation.
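If you want to experiment with this layout, the short sketch below creates it with empty placeholder files. The root folder name MyWorkspace is arbitrary and not prescribed by the examples.

# Creates the example workspace layout with empty placeholder files.
# The root folder name "MyWorkspace" is arbitrary.
from pathlib import Path

layout = {
    "Input/LoadCases/LoadCase1": ["File1.txt", "File2.ini"],
    "Input/LoadCases/LoadCase2": ["File1.json", "File2.txt"],
    "CommonFiles": ["File1.json", "File2.txt"],
}

root = Path("MyWorkspace")
for folder, files in layout.items():
    directory = root / folder
    directory.mkdir(parents=True, exist_ok=True)
    for name in files:
        (directory / name).touch()   # empty placeholder file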

Structure of Workspace in Cloud-Simulation

The paths for the Temporary Blob Directory and Temporary Job Directory are determined by the workflow's temporary folder settings. Typically, the Blob Directory is named oc_random_name_blob and the Job Directory is named oc_random_name_job, where random_name is a placeholder for a randomly generated string. For example, if the temporary folder is set to C:\Temp, these directories might be named oc_xabd2_blob and oc_dderx_job, respectively. Depending on the workflow configuration, these directories may be deleted automatically after the workflow finishes.

Let's take a look at a sample directory structure when a workflow is configured to run in cloud-simulation mode:

[ Workspace Folder ] <-------> [ Temp Blob Directory ] <-------> [ Temp Job Directory ]
│                              |                                 |
├───Input                      └───MyWorkspaceId                 └───MyWorkspaceId
│   └───LoadCases                  ├───LoadCases                     ├───jobpreparation
│       ├───LoadCase1              │   ├───LoadCase1                 │   ├───File1.txt
│       │   ├───File1.txt          │   │   ├───File1.txt             │   └───File2.ini
│       │   └───File2.ini          │   │   ├───File2.ini             └───LoadCaseRandomID
│       └───LoadCase2              │   |   └───Results.txt               ├───LoadCases
│           ├───File1.json         │   └───LoadCase2                     |   ├───LoadCase1
│           └───File2.txt          |       ├───File1.json                |   |   ├───File1.txt
├───CommonFiles                    │       ├───File2.txt                 |   |   ├───File2.ini
│   ├───File1.json                 │       └───Results.txt               |   |   └───Results.txt
│   └───File2.txt                  └───CommonFiles                       |   └───LoadCase2
└───LoadCases                          ├───File1.json                    |       ├───File1.json
    ├───LoadCase1                      └───File2.txt                     |       ├───File2.txt
    │   ├───File1.txt                                                    |       └───Results.txt
    │   └───File2.ini                                                    └───CommonFiles
    └───LoadCase2                                                            ├───File1.json
        ├───File1.json                                                       └───File2.txt
        └───File2.txt

When operating in cloud-simulation mode, our workflow examples follow a specific sequence of operations. Before the workflow is submitted for execution, a utility script copies the LoadCases folder, which contains the original input files, from the Input folder to the root directory of the workspace.
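In plain Python, that pre-submission copy amounts to something like the following; the workspace path is a placeholder, and the actual examples perform this step through a utility script.

# A minimal sketch of the pre-submission copy step; the workspace path is a
# placeholder, and the real examples use a utility script for this.
import shutil
from pathlib import Path

workspace = Path(r"C:\MyWorkspace")
shutil.copytree(workspace / "Input" / "LoadCases",
                workspace / "LoadCases",
                dirs_exist_ok=True)   # overwrite an existing copy if present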

The copied LoadCases folder plays a central role during execution: it is where result files generated in the Temp Blob Directory during the workflow run are collected. This keeps the original Input folder untouched, while the copied LoadCases folder acts as a temporary workspace that can be safely discarded once the analysis is complete, keeping the overall workspace clean and organized.

At the start of the workflow execution, the framework uploads the LoadCases and CommonFiles folders to a temporary blob storage area identified by the user-assigned workspace-id.

During execution, the framework copies the required files and folders from blob storage to the temp job directory under LoadCaseRandomId, unless the user specifies a different LoadCaseID. This ensures that each job run has access to the files it needs without altering the original data, preserving the integrity of the blob storage area.

When the run completes, the framework moves the result files for each load case back to the corresponding load-case folder in the blob storage area. Finally, it copies these result files to the user's workspace folder.
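The sketch below traces these file movements with plain Python. It is an illustration of what the framework does internally in cloud-simulation mode, not code the user writes, and the paths and folder names are assumptions based on the example structure above.

# Illustration only: this mirrors what the framework does internally in
# cloud-simulation mode. All paths and folder names are assumptions based
# on the example structure above.
import shutil
from pathlib import Path

workspace = Path(r"C:\MyWorkspace")
blob_dir = Path(r"C:\Temp\oc_xabd2_blob") / "MyWorkspaceId"
job_dir = Path(r"C:\Temp\oc_dderx_job") / "MyWorkspaceId" / "LoadCaseRandomID"

# 1. Upload: copy LoadCases and CommonFiles from the workspace to blob storage.
for folder in ("LoadCases", "CommonFiles"):
    shutil.copytree(workspace / folder, blob_dir / folder, dirs_exist_ok=True)

# 2. Job preparation: copy the required files from blob storage to the job directory.
for folder in ("LoadCases", "CommonFiles"):
    shutil.copytree(blob_dir / folder, job_dir / folder, dirs_exist_ok=True)

# ... the job runs and writes Results.txt into each load-case folder ...

# 3. After the run: move the results back to blob storage, then to the workspace.
for results in job_dir.glob("LoadCases/*/Results.txt"):
    relative = results.relative_to(job_dir)
    shutil.copy2(results, blob_dir / relative)
    shutil.copy2(results, workspace / relative)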