\chapter{Outcomes of the IO-SEA project}
The different outcomes are directly related to the different work packages of the project, which is why they are
presented work package by work package.
As a reminder, the technical Work Packages are:
\begin{itemize}
\item \textbf{Work Package \#1~:} \textit{Use Case and Co-design~:} this WP primarily characterizes the use
cases and provides a detailed understanding of the requirements, which is then used to develop an initial
blueprint of the IO-SEA system, including all the components of the infrastructure. The work package also
works on adapting these use cases to fit the IO-SEA architecture.
\item \textbf{Work Package \#2~:} \textit{Data Workflow and Ephemeral~:} this WP focuses on providing an
on-demand data access environment suitable for the needs of applications and workflows within the IO-SEA
project. The on-demand data access environment will be ephemeral, that is, provisioned and
available only during the runtime of the workflow. At the end of the workflow, the data access environment is
torn down and any data and state are flushed to the backend storage tiers.
\item \textbf{Work Package \#3~:} \textit{Instrumentation and Monitoring~:} this WP focuses on collecting and
analysing application and workflow I/O behaviour, as well as I/O infrastructure health. The information will
be used to make optimal dimensioning and placement decisions when launching new instances of workflows.
\item \textbf{Work Package \#4~:} \textit{Hierarchical Storage Management Feature~:} this WP aims to
provide Hierarchical Storage Management (HSM) for the IO-SEA architecture, making it capable of managing
storage tiers ``from NVMe to tape'', including various types of devices (NVRAM, flash, SSD, HDD and tape). This is
achieved by combining two object stores: the Mero object store from Seagate and Phobos, a tape-based open-source
object store developed by CEA.
\item \textbf{Work Package \#5~:} \textit{Application Interfaces~:} this WP focuses on the interfaces exposed
by the data placement and storage system. Requirements identified in WP1 will help shape the API that will
be offered to applications. This API will enable applications to interact with the system, describing data
flow as managed by WP2 and storage policies used by WP4. The use-cases described in WP1 will be adapted to
use the API.
\end{itemize}
\section{Outcomes from WP1}
WP1 has been a crucial team within the IO-SEA project. The IO-SEA proposal put the emphasis on the co-design
approach, and those efforts were concentrated in this work package.
During the course of IO-SEA, strong interactions existed between the different WPs, which definitely did not
work ``in silo mode'' but operated as a single team. Figure \ref{fig:white-board} shows a screen capture
of a whiteboard produced during one of those brainstorming sessions.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{FIGS/white-board.png}
\caption[Co-design whiteboard]{ Screen capture of IO-SEA whiteboard co-design brainstorm session}
\label{fig:white-board}
\end{figure}
Several key ideas came out of this co-design effort, in particular:
\begin{itemize}
\item \textbf{Interactive data access:~} In all phases of the project, it is important for scientific use
cases to have interactive access to the data sets and namespaces used in the workflow, e.g., to stage initial
data during setup; to check output, peek at metadata or even begin analysis while the workflow is running; and to
run interactive data analysis tools post-workflow.
\item \textbf{Prompted tape archival/retrieval:~} The experience of scientific users at large HPC centres is
that the time to move large datasets to or from tape can sometimes be measured in days. Therefore it is
important that users have an interface to trigger data retrieval from tape long before a workflow is
queued.
\item \textbf{Tape-disk explicit copy:~} Much of the discussion regarding the IO-SEA storage hierarchy has
used the phrase ``data movement''. Although the tape archive is a tier in the storage hierarchy (the slowest but
most capacious tier), the use cases clarified that they rarely ``move'' data to or from the tape archive systems;
that normally happens only in response to suddenly needing to make space on a nearly-full disk. Rather, the
usual pattern is that generated data has both near-term and long-term uses: a copy is
saved to tape for safe-keeping, while the bulk may remain on disk for further immediate access. Likewise,
when older data is retrieved from tape for new analysis, it is usually copied to disk; the tape archive is
not erased. This clarification has resulted in the inclusion of copy functionality in the datamover.
\item \textbf{Pre-fetching to data node:~} Similarly, users in environments where compute-node time is
charged against a budget expressed the need to be able to pre-fetch files to the data nodes so that
the compute nodes do not idle when the allocation starts. Again, the solution provided in response was the
datamovers, which will also allow users to prompt data movement to the flash storage.
\item \textbf{DASI-to-POSIX staging application:~} Several scientific use-case tasks expressed interest in
exploring the proposed DASI as a method of organizing stored scientific data. However, integrating DASI
directly into an application generally requires significant re-writes of its I/O libraries. In cases where the
application is a large, widely-used community code, this is impractical and beyond the scope of the IO-SEA
project. Instead, the use-case tasks have requested a stand-alone application that can translate a POSIX path
to DASI semantic keys and copy a file back and forth between DASI and POSIX storage. This will enable us to
explore the advantages of DASI with a low barrier to entry. Work Package~5 has indicated they intend to
produce such an application in the coming months in response to this request. In anticipation of this,
Section 5.1 provides an example of how this might be integrated into a workflow with the GBF ephemeral
service. A minimal sketch of how such a staging tool could look is given after this list.
\end{itemize}
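The following is a minimal sketch of such a staging tool. It assumes a pydasi-style \verb|Dasi| object exposing
\verb|archive()| and \verb|retrieve()| calls (the actual DASI Python API may differ) and a purely illustrative
mapping from POSIX path components to semantic keys; it is not the WP5 application itself.
\begin{verbatim}
# stage_dasi.py -- hypothetical POSIX <-> DASI staging sketch (not the WP5 tool;
# the Dasi methods and the path-to-keys mapping are assumptions)
import os

def path_to_keys(posix_path):
    """Illustrative mapping: .../<project>/<experiment>/<variable>.dat
    becomes a set of semantic keys. A real deployment would use a schema."""
    parts = os.path.normpath(posix_path).split(os.sep)
    return {"project": parts[-3], "experiment": parts[-2],
            "variable": os.path.splitext(parts[-1])[0]}

def stage_in(dasi, posix_path):
    """Copy a POSIX file into DASI under its derived semantic keys."""
    with open(posix_path, "rb") as f:
        dasi.archive(path_to_keys(posix_path), f.read())

def stage_out(dasi, posix_path):
    """Copy the data identified by the derived keys back to a POSIX file."""
    data = dasi.retrieve(path_to_keys(posix_path))
    with open(posix_path, "wb") as f:
        f.write(data)
\end{verbatim}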
\section{Outcomes from WP2}
The main outcomes of the second work package are related to the creation of the Workflow Manager. Since
ephemeral services are strongly associated with compute jobs, the following scheme is respected:
\begin{enumerate}
\item allocate the compute nodes to run the simulation jobs
\item allocate the data nodes in order to run the ephemeral services
\item start the ephemeral services
\item make the ephemeral services accessible to the compute nodes, for example by mounting an NFS share
exposed via an NFS ephemeral service
\item run the simulation
\item if several steps are involved in the simulation, run the other steps of the simulation
\item end the simulation
\item flush the data to the persistent storage (typically the object stores)
\item stop the ephemeral services
\item release the data nodes and the compute nodes
\end{enumerate}
WP2 carried out all the development needed to implement this scheme. The main outcome is the Workflow Manager,
which makes it possible to manage the allocation of the temporary IO servers, or ephemeral servers, implemented in
IO-SEA. While compute jobs run on compute nodes, ephemeral services run on data nodes, which
are part of dedicated modules within the MSA (see \ref{fig:msa}).
The Workflow Manager operates as described in figure \ref{fig:wfm}. The concept of a compute job is replaced by
the more complete concept of a workflow. A workflow is made of different steps, in which simulation programs
as well as ephemeral services are run.
\begin{figure}[H]
\centering
\includegraphics[width=0.35\textwidth]{FIGS/wfm.png}
\caption[Workflow Manager Architecture]{Workflow Manager Architecture}
\label{fig:wfm}
\end{figure}
The workflow contains explicit dependencies: for example, an ephemeral IO server must already be started
before actual IO operations can be performed. The workflow is described by a
workflow description file, written in JSON; a hypothetical example of such a description is sketched below.
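The following sketch builds a small, purely hypothetical workflow description and writes it out as JSON. The
field names are invented for the illustration and do not reflect the actual WFM file format.
\begin{verbatim}
# build_workflow.py -- hypothetical workflow description (field names are
# invented, not the actual IO-SEA WFM format)
import json

workflow = {
    "name": "demo-simulation",
    "steps": [
        {"name": "nfs-ephemeral", "type": "ephemeral-service",
         "service": "NFS", "nodes": {"data": 2}},
        {"name": "simulation", "type": "compute-job",
         "command": "./run_simulation.sh", "nodes": {"compute": 64},
         "depends_on": ["nfs-ephemeral"]},   # the IO server must be up first
        {"name": "flush", "type": "data-movement",
         "action": "flush-to-object-store", "depends_on": ["simulation"]},
    ],
}

with open("workflow.json", "w") as f:
    json.dump(workflow, f, indent=2)
\end{verbatim}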
The Workflow Manager is a key component inside IO-SEA: it is the central place where all the pieces are put
together. The user requirements gathered by WP1 drove most of the development of this feature, which was carried
out in close collaboration between the work packages. The WFM starts and uses the ephemeral servers developed
by WP4 as well as the user interfaces provided by WP5.
Inside the Workflow Manager resides an orchestrator whose task is to allocate and start all the executables
involved in a workflow. From this perspective, the orchestrator interacts closely with the compute centre's
resource scheduler (such as Slurm). Figure \ref{fig:it4i} shows how such a framework was deployed on the
OpenStack cloud hosted at IT4I, providing practical proof that the Workflow Manager works.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{FIGS/it4i.png}
\caption[Workflow Manager at IT4I]{Instances deployed at IT4I OpenStack Cloud}
\label{fig:it4i}
\end{figure}
The Workflow Manager comes with its own API and a set of command-line utilities that help manage
workflow-related operations efficiently (such as ``start''/``stop'' or ``pause''/``resume'').
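As a purely illustrative sketch of how such operations could be scripted, the snippet below drives a
hypothetical \verb|wfm| command; the command name and its sub-commands are assumptions, not the actual IO-SEA
tooling.
\begin{verbatim}
# drive_wfm.py -- drives a hypothetical "wfm" command-line tool
# (command name and sub-commands are assumptions, not the actual IO-SEA CLI)
import subprocess

def wfm(*args):
    """Run one hypothetical wfm sub-command and fail loudly on error."""
    subprocess.run(["wfm", *args], check=True)

wfm("start", "--description", "workflow.json")   # start the whole workflow
wfm("pause", "demo-simulation")                  # suspend it temporarily
wfm("resume", "demo-simulation")                 # resume execution
wfm("stop", "demo-simulation")                   # tear everything down
\end{verbatim}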
\section{Outcomes from WP3}
The third Work Package of the IO-SEA project focuses on instrumentation and monitoring, plus the
development of tools to extract useful information from the collected data.
During the three-year run of IO-SEA, WP3 succeeded in developing a fine-grained instrumentation framework able to
collect data from hardware and software probes. In particular, the WP3 tooling makes it possible to observe and
measure the behaviour of applications. A fine understanding of the mechanisms involved in the different core
components of IO-SEA, mostly used in WP2 and WP4, makes it possible to put probes at exactly the right
locations, exposing highly relevant data.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{FIGS/io_instr.png}
\caption[I/O Monitoring architecture]{ The Architecture of the I/O monitoring framework}
\label{fig:instr}
\end{figure}
This data is collected in databases and can later be displayed via interactive
GUIs. Such integration has for example been done in tools like LLview or the ParaStation HealthChecker. Giving a
clear overview is valuable, but it is not the only benefit of WP3: the instrumentation databases are mined by an
AI framework in order to build a \textit{recommendation system}. This new feature helps identify patterns
in user applications, opening the path to optimising those applications. Within the scope of the IO-SEA
project, such a recommendation system could automatically add ``hints'' to objects or datasets, which WP4 and
the HSM feature could then use as an input to their internal policies.
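As a minimal sketch of what such a hint could look like, the snippet below applies a simple rule that flags
datasets which have not been read for a while as candidates for tape. The access-record and hint formats are
invented for the illustration and do not reflect the actual WP3/WP4 interfaces.
\begin{verbatim}
# suggest_hints.py -- toy recommendation rule (record and hint formats are
# invented, not the actual IO-SEA interfaces)
import time

COLD_AFTER_DAYS = 90  # assumption: data untouched for 90 days is "cold"

def suggest_hints(access_records):
    """access_records: iterable of dicts such as
    {"dataset": "climate/run42", "last_read": <unix timestamp>}"""
    now = time.time()
    hints = []
    for rec in access_records:
        idle_days = (now - rec["last_read"]) / 86400
        if idle_days > COLD_AFTER_DAYS:
            # a hint that an HSM policy engine could consume
            hints.append({"dataset": rec["dataset"], "hint": "cold",
                          "suggested_tier": "tape"})
    return hints
\end{verbatim}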
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{FIGS/recom.png}
\caption[Recommendation system]{ The general architecture of the recommendation system}
\label{fig:recom}
\end{figure}
A side effect of this work is the ``IO Traces Collection'' initiative, supported by both the ADMIRE and
IO-SEA projects. This initiative offers a non-intrusive instrumentation framework whose well-formatted output is
to be stored in a common database, hosted on the internet and available to everyone. Researchers could use this
resource to work on optimisation, since it gives them a large zoo of many different application behaviours.
\section{Outcomes from WP4}
Several outcomes come from WP4. During the three years of the project, it became obvious that clean
management of a generic HSM made it necessary to have a defined API and a defined architecture. This API, named
HESTIA (Hierarchical Storage Tiers Interface for Applications), makes this possible. HESTIA is not tied to any
feature of a given object store. In IO-SEA, it brings all the tools needed to implement the HSM feature between
Phobos and another object store.
The HESTIA API models and formalises ways to use various storage technologies like NVMe, SSD, HDD, or tape, to
allow object stores to handle every type of device within the same storage hierarchy. It makes IO-SEA
capable of dealing with the complete lifecycle of the data inside a single system that can manage most
existing storage devices, from NVMe to tapes. To achieve this, it leverages two open-source
object stores: CORTX-Motr on top of Phobos (with the necessary abstractions to support additional backends).
Thus, HSM mechanisms are implemented in two products, making them capable of managing data movements between
them. Figure \ref{fig:hestia} depicts how the different components are organised within the HESTIA
framework.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{FIGS/hestia.png}
\caption[HESTIA software architecture]{ The architecture of the HESTIA framework}
\label{fig:hestia}
\end{figure}
Hestia is an open source project that can be found at \url{https://git.ichec.ie/io-sea-internal/}.
HESTIA manages the data movements and triggers some of them internally (e.g.\ migrating data to tape once it has
become cold enough). The Robinhood software (available at
\url{https://github.com/cea-hpc/robinhood}) is the policy engine used to achieve that goal.
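To make the kind of tiered semantics HESTIA formalises more concrete, the following is a minimal sketch of a
put/copy/release interface over several tiers. The class and method names are invented for the illustration and
do not correspond to the actual HESTIA API.
\begin{verbatim}
# tiered_store.py -- illustrative HSM-style interface (names invented,
# not the actual HESTIA API)

class TieredObjectStore:
    """Toy model of an object store spanning several tiers."""

    def __init__(self, tiers):
        # one dict of object_id -> bytes per tier
        self.tiers = {name: {} for name in tiers}

    def put(self, object_id, data, tier="nvme"):
        self.tiers[tier][object_id] = data

    def copy(self, object_id, src_tier, dst_tier):
        # explicit copy between tiers: the source copy is kept (see the WP1 outcome)
        self.tiers[dst_tier][object_id] = self.tiers[src_tier][object_id]

    def release(self, object_id, tier):
        # free space on a fast tier once a copy exists on a slower one
        del self.tiers[tier][object_id]

store = TieredObjectStore(["nvme", "hdd", "tape"])
store.put("sim/output-001", b"...", tier="nvme")
store.copy("sim/output-001", "nvme", "tape")   # safe-keeping copy on tape
store.release("sim/output-001", "nvme")        # reclaim fast storage later
\end{verbatim}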
PHOBOS, or \textit{Parallel Heterogeneous OBject Store}, is an object store developed at CEA and available as
open-source software at \url{https://github.com/cea-hpc/phobos}. PHOBOS existed before the beginning of IO-SEA,
but was strongly improved during the three years of the project. PHOBOS is designed to manage large volumes of
data on various kinds of storage technologies, from SSDs to tapes. Phobos can efficiently handle very large
datasets on inexpensive media without sacrificing scalability, performance, or fault-tolerance requirements.
Phobos is designed to allow the easy integration of new modules for layouts, such as mirroring and erasure coding,
or I/O adapters. Its architecture can be seen in figure \ref{fig:phobos}.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{FIGS/phobos.png}
\caption[PHOBOS architecture]{ The architecture of the PHOBOS software}
\label{fig:phobos}
\end{figure}
KVSNS is a library that makes it possible to implement POSIX semantics on top of the services provided by an
object store and a key-value store. KVSNS was originally developed during the SAGE and SAGE2 projects, but
that version was not capable of managing object stores with only a CRUD\footnote{Create, Read, Update, Delete}
interface: the object store had to be capable of performing random I/O operations within objects, just as POSIX
does via \verb|pread()| or \verb|pwrite()| from the libc. Inside IO-SEA, KVSNS gained a new feature, making it
capable of dealing with CRUD-only object stores. This required the introduction of a local filesystem used as
a cache and a way to fetch and evict entries from/to this cache. This logic should not be a bottleneck, so it is
implemented as a fully parallel and distributed service, which led to the development of the GRH feature.
Its layers are described by figure \ref{fig:kvsns}.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{FIGS/KVSNS_architecture.png}
\caption[KVSNS layers]{ The layers of the KVSNS library}
\label{fig:kvsns}
\end{figure}
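The cache-based mechanism can be illustrated with a minimal sketch: a POSIX-style \verb|pwrite()| on top of a
CRUD-only object store becomes a whole-object get, a local modification, and a whole-object put. The
\verb|store| interface below is a stand-in, not the actual KVSNS code.
\begin{verbatim}
# crud_pwrite.py -- illustrative read-modify-write through a local cache
# (the "store" get/put interface is a stand-in, not the actual KVSNS/GRH code)
import os

CACHE_DIR = "/tmp/kvsns-cache"          # assumed local cache location

def cached_pwrite(store, object_id, data, offset):
    """Emulate pwrite() on an object store that only supports whole-object CRUD."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    local = os.path.join(CACHE_DIR, object_id.replace("/", "_"))

    if not os.path.exists(local):        # fetch the whole object into the cache
        with open(local, "wb") as f:
            f.write(store.get(object_id))

    with open(local, "r+b") as f:        # the random write is done locally...
        f.seek(offset)
        f.write(data)

    with open(local, "rb") as f:         # ...then the full object is put back
        store.put(object_id, f.read())
\end{verbatim}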
GRH receives requests issued by the KVSNS library itself, or externally. It creates an asynchronous request id
that can be used by the client to query the status of the request asynchronously. GRH implements workers and
task queues and is based on the Celery framework, which is strongly parallel. GRH could be used outside its
original scope; in particular, it can be used to implement a data mover. The overall architecture of GRH is
described by figure \ref{fig:grh}.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{FIGS/GRH.png}
\caption[GRH architecture]{ The design of the GRH component}
\label{fig:grh}
\end{figure}
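As a rough illustration of the Celery pattern GRH relies on, the snippet below submits a data-movement request
as a task whose id is returned to the client for later status polling. The task name and broker URL are
placeholders, not GRH's actual configuration.
\begin{verbatim}
# grh_like_tasks.py -- Celery-style asynchronous task sketch
# (task name and broker URL are placeholders, not GRH's actual configuration)
from celery import Celery

app = Celery("grh_like", broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def copy_object(object_id, src_tier, dst_tier):
    # a worker would perform the actual data movement here
    return {"object": object_id, "from": src_tier, "to": dst_tier, "done": True}

# client side: submit the request and keep the id for asynchronous status queries
result = copy_object.delay("sim/output-001", "cache", "object-store")
request_id = result.id                                # handed back to the caller
print(copy_object.AsyncResult(request_id).status)     # e.g. PENDING, SUCCESS
\end{verbatim}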
IO-SEA strongly relies on object stores. Implementing the IO-SEA architecture from scratch is easy, as new
data is inserted directly into the object store, but what about data already stored in standard
distributed and parallel file systems? It is important to provide the tools required to
migrate data from previously existing storage systems. One of the promises of the IO-SEA proposal was to
provide such a framework, and this goal was achieved, as shown in figure \ref{fig:migration-path}.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{FIGS/migration-path.png}
\caption[Migration Path]{ The migration path in the IO-SEA}
\label{fig:migration-path}
\end{figure}
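A minimal sketch of such a migration step, assuming a generic put-style object-store client (not the actual
IO-SEA migration tooling), walks an existing POSIX tree and ingests each file as an object whose id is derived
from its relative path.
\begin{verbatim}
# migrate_tree.py -- illustrative POSIX-to-object-store migration walk
# (the "store" put() interface is assumed, not the actual IO-SEA tooling)
import os

def migrate_tree(store, root):
    """Ingest every regular file under 'root' as an object keyed by its relative path."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            object_id = os.path.relpath(path, root)     # e.g. "run42/output.nc"
            with open(path, "rb") as f:
                store.put(object_id, f.read())
\end{verbatim}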
\section{Outcomes from WP5}
The main outcome of WP5 is the \textit{Data Access Storage Interface}, or \textbf{DASI}.
DASI (available at \url{https://github.com/io-sea/dasi}) is a semantic interface for data, where the data is
indexed and uniquely identified by sets of scientifically-meaningful metadata keys. DASI is modular and is
compatible with multiple backends (i.e., object stores or POSIX) through diverse frontends (Python, C++, C).
Internally, DASI is based on software developed by ECMWF, named FDB2, which has been developed in previous EU
projects (NEXTGenIO), but has been heavily adapted and extended to be agnostic of and configurable for
different scientific domains. Leveraging FDB in this way means that many backends are already supported by DASI,
including a high-performance POSIX backend, a Ceph object storage backend, and an NVRAM backend. During the
IO-SEA project, a backend will also be created for CORTX. Applications will have seamless access to these
backends once they have been modified to support DASI. The DASI architecture is shown in figure \ref{fig:dasi},
together with its dependencies on other work done in the scope of IO-SEA.
\begin{figure}[H]
\centering
\includegraphics[width=0.75\textwidth]{FIGS/dasi.png}
\caption[DASI architecture]{DASI and its interactions with the other elements of IO-SEA}
\label{fig:dasi}
\end{figure}
DASI is domain-agnostic, and is configured for each scientific domain using a schema. The schema
defines the metadata keys which index and identify the data within a domain.
\paragraph{}
DASI comes with a specific library and its own API, but it also includes a POSIX interface based on the KVSNS
library and the nfs-ganesha server. It is interfaced with existing middleware like SIONlib, but it has
dedicated interfaces to standards like MPI or HDF5.
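As an illustration only, a schema for a weather-like domain might declare keys such as date, parameter and
level, and data would then be archived and retrieved against those keys. The snippet below assumes a
pydasi-style \verb|Dasi| class with \verb|archive()| and \verb|retrieve()| methods and an invented
configuration; the real DASI API and schema syntax may differ.
\begin{verbatim}
# dasi_example.py -- illustrative use of a pydasi-style interface
# (configuration, key names and method signatures are assumptions;
#  the real DASI API and schema syntax may differ)
from dasi import Dasi

config = """
schema:
  - [date, parameter, level]   # invented schema: the keys identifying each datum
catalogue: toc
store: file
"""

dasi = Dasi(config)

key = {"date": "20240101", "parameter": "temperature", "level": "500"}
dasi.archive(key, b"...gridded field bytes...")   # store data under its keys

for result in dasi.retrieve(key):                 # query the same keys back
    print(result)
\end{verbatim}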