WEBVTT
Kind: captions
Language: en

00:00:02.800 --> 00:00:05.600
Hello,

00:00:03.679 --> 00:00:07.359
my name is Peter Amstutz and I'm a

00:00:05.600 --> 00:00:08.480
developer on the Arvados open source

00:00:07.359 --> 00:00:10.320
project.

00:00:08.480 --> 00:00:11.920
Today I'm going to give a technical

00:00:10.320 --> 00:00:14.400
overview of some of the features and

00:00:11.920 --> 00:00:16.320
components of the Arvados platform.

00:00:14.400 --> 00:00:20.320
This presentation covers Arvados

00:00:16.320 --> 00:00:20.320
features up to the 2.1 release.

00:00:21.039 --> 00:00:24.960
First I want to say a little bit about

00:00:22.960 --> 00:00:28.240
the company behind Arvados,

00:00:24.960 --> 00:00:28.720
Curii. Curii is an open source software

00:00:28.240 --> 00:00:31.679
company

00:00:28.720 --> 00:00:32.719
serving the life sciences. Our customers

00:00:31.679 --> 00:00:36.079
include pharma,

00:00:32.719 --> 00:00:38.160
small biotechs and academic labs. Curii

00:00:36.079 --> 00:00:39.920
offers support and development services

00:00:38.160 --> 00:00:41.520
for the Arvados platform,

00:00:39.920 --> 00:00:43.040
as well as engaging in research and

00:00:41.520 --> 00:00:44.160
development on machine learning for

00:00:43.040 --> 00:00:46.640
genomics

00:00:44.160 --> 00:00:49.120
and around open biomedical data and

00:00:46.640 --> 00:00:49.120
standards.

00:00:50.800 --> 00:00:55.920
What is Arvados? Arvados is an open

00:00:53.920 --> 00:00:58.079
source platform for managing,

00:00:55.920 --> 00:00:59.600
processing and sharing genomic and other

00:00:58.079 --> 00:01:03.120
large scientific and

00:00:59.600 --> 00:01:03.840
biomedical data. We'll talk about what

00:01:03.120 --> 00:01:05.600
this means

00:01:03.840 --> 00:01:07.280
in terms of specific features and

00:01:05.600 --> 00:01:09.040
capabilities in a minute,

00:01:07.280 --> 00:01:11.600
but first let me tell you about some of

00:01:09.040 --> 00:01:13.760
our high level goals.

00:01:11.600 --> 00:01:16.400
We wanted to build a platform that could

00:01:13.760 --> 00:01:18.240
be deployed in a variety of environments

00:01:16.400 --> 00:01:19.439
such as various cloud providers and

00:01:18.240 --> 00:01:22.320
on-premise HPC

00:01:19.439 --> 00:01:23.040
clusters. We also wanted to make it

00:01:22.320 --> 00:01:25.360
possible

00:01:23.040 --> 00:01:26.799
to access and use data located on

00:01:25.360 --> 00:01:28.320
multiple clusters

00:01:26.799 --> 00:01:32.159
that may be running in different

00:01:28.320 --> 00:01:34.560
locations, environments or organizations.

00:01:32.159 --> 00:01:35.280
Another high-level goal is to work at

00:01:34.560 --> 00:01:38.320
the scale

00:01:35.280 --> 00:01:40.240
of modern biomedical research.

00:01:38.320 --> 00:01:42.159
Instruments such as sequencers and

00:01:40.240 --> 00:01:44.479
microscopes can produce hundreds of

00:01:42.159 --> 00:01:46.479
gigabytes of data in a single run,

00:01:44.479 --> 00:01:47.680
and produce hundreds of terabytes of

00:01:46.479 --> 00:01:50.159
data annually,

00:01:47.680 --> 00:01:51.040
so Arvados needs to work at petabyte

00:01:50.159 --> 00:01:53.439
scale

00:01:51.040 --> 00:01:55.600
and manage thousands of cores to process

00:01:53.439 --> 00:01:58.479
all that data.

00:01:55.600 --> 00:02:00.159
Finally, we want a complete record of

00:01:58.479 --> 00:02:02.320
everything done on the system

00:02:00.159 --> 00:02:04.640
so we can determine what was done,

00:02:02.320 --> 00:02:07.439
confirm or reproduce results,

00:02:04.640 --> 00:02:08.000
track the origin of data and be able to

00:02:07.439 --> 00:02:11.120
verify

00:02:08.000 --> 00:02:11.599
that a data set has not been modified, or

00:02:11.120 --> 00:02:13.840
have a

00:02:11.599 --> 00:02:16.879
history of changes when it

00:02:13.840 --> 00:02:16.879
has been modified.

00:02:21.200 --> 00:02:24.640
The key capabilities that

00:02:23.280 --> 00:02:26.560
Arvados gives you

00:02:24.640 --> 00:02:28.400
are the ability to manage storage and

00:02:26.560 --> 00:02:30.160
compute at scale

00:02:28.400 --> 00:02:31.920
and the ability to integrate those

00:02:30.160 --> 00:02:33.519
capabilities into your existing

00:02:31.920 --> 00:02:37.120
infrastructure and applications

00:02:33.519 --> 00:02:39.200
using Arvados APIs. The main

00:02:37.120 --> 00:02:41.360
unit of data management is an Arvados

00:02:39.200 --> 00:02:43.440
collection, which is a self-contained set

00:02:41.360 --> 00:02:46.560
of files and directories.

00:02:43.440 --> 00:02:49.360
A collection can be identified by a name

00:02:46.560 --> 00:02:51.360
assigned by the user, by a universally

00:02:49.360 --> 00:02:54.000
unique database identifier,

00:02:51.360 --> 00:02:56.000
or by a portable data hash, also called a

00:02:54.000 --> 00:02:58.560
content address.

00:02:56.000 --> 00:03:00.080
The portable data hash is an immutable

00:02:58.560 --> 00:03:02.000
identifier computed

00:03:00.080 --> 00:03:04.959
based on the content and structure of

00:03:02.000 --> 00:03:07.360
the files making up the collection,

00:03:04.959 --> 00:03:08.239
and if any of those files change you get

00:03:07.360 --> 00:03:11.200
a different

00:03:08.239 --> 00:03:12.000
portable data hash. Because the portable

00:03:11.200 --> 00:03:13.519
data hash

00:03:12.000 --> 00:03:15.920
is based on the content of the

00:03:13.519 --> 00:03:18.480
collection, you can verify

00:03:15.920 --> 00:03:19.440
that the actual data has the expected

00:03:18.480 --> 00:03:22.159
hash

00:03:19.440 --> 00:03:23.760
as well as easily determine if a copy of

00:03:22.159 --> 00:03:25.680
the collection is the same as the

00:03:23.760 --> 00:03:28.239
original.

00:03:25.680 --> 00:03:29.599
Access to collections is controlled, and

00:03:28.239 --> 00:03:31.840
collections

00:03:29.599 --> 00:03:33.120
can be shared with users and groups at

00:03:31.840 --> 00:03:36.400
different levels of

00:03:33.120 --> 00:03:36.400
read-only or write access.

00:03:36.640 --> 00:03:40.879
Once data is loaded into the system, you

00:03:39.040 --> 00:03:42.959
probably want to analyze it.

00:03:40.879 --> 00:03:44.799
Arvados provides a complete workflow

00:03:42.959 --> 00:03:46.080
engine that's closely integrated with

00:03:44.799 --> 00:03:48.560
the storage system

00:03:46.080 --> 00:03:49.599
to ensure efficient data movement, as

00:03:48.560 --> 00:03:51.760
well as handling

00:03:49.599 --> 00:03:53.360
the scheduling of compute jobs across

00:03:51.760 --> 00:03:55.120
multiple nodes,

00:03:53.360 --> 00:03:57.760
and keeps a record of everything that's

00:03:55.120 --> 00:04:01.200
been done so it's easy to repeat

00:03:57.760 --> 00:04:04.080
previous computations. Finally,

00:04:01.200 --> 00:04:06.000
Arvados provides a complete API making

00:04:04.080 --> 00:04:06.959
it possible to integrate with existing

00:04:06.000 --> 00:04:09.599
systems

00:04:06.959 --> 00:04:12.319
as well as build new applications on top

00:04:09.599 --> 00:04:12.319
of Arvados.

00:04:13.280 --> 00:04:20.000
The Arvados storage system is called "Keep".

00:04:17.120 --> 00:04:20.560
As I mentioned, it organizes sets of

00:04:20.000 --> 00:04:23.919
files

00:04:20.560 --> 00:04:25.919
into a collection. A collection

00:04:23.919 --> 00:04:27.840
can have additional user metadata

00:04:25.919 --> 00:04:30.000
associated with it in the form of

00:04:27.840 --> 00:04:32.160
searchable key-value properties,

00:04:30.000 --> 00:04:34.320
and records the history of changes made

00:04:32.160 --> 00:04:36.960
to that collection.

00:04:34.320 --> 00:04:38.880
To store data in a collection, the files

00:04:36.960 --> 00:04:41.600
are broken up into a set of blocks

00:04:38.880 --> 00:04:42.160
up to 64 megabytes in size, which are

00:04:41.600 --> 00:04:44.400
hashed

00:04:42.160 --> 00:04:47.120
to get an identifier which is used to

00:04:44.400 --> 00:04:49.280
store and retrieve the data block.

00:04:47.120 --> 00:04:51.680
The identifier can be used to validate

00:04:49.280 --> 00:04:54.479
the content of the block.

00:04:51.680 --> 00:04:56.000
It also provides for deduplication as

00:04:54.479 --> 00:04:57.360
the same data block

00:04:56.000 --> 00:05:00.080
can be referenced by multiple

00:04:57.360 --> 00:05:02.080
collections. This means

00:05:00.080 --> 00:05:05.360
that it is cheap to copy and modify

00:05:02.080 --> 00:05:08.840
collections as only the block identifiers

00:05:05.360 --> 00:05:10.000
need to be copied and not the actual

00:05:08.840 --> 00:05:12.400
data.

00:05:10.000 --> 00:05:15.120
On the back end, the actual data blocks can

00:05:12.400 --> 00:05:18.720
be stored on a conventional file system,

00:05:15.120 --> 00:05:22.160
in S3 buckets or in Azure Blob Storage,

00:05:18.720 --> 00:05:24.800
using the block hash as the file name.

00:05:22.160 --> 00:05:28.320
On the front end, Keep

00:05:24.800 --> 00:05:31.360
provides a variety of access options:

00:05:28.320 --> 00:05:32.160
you can use the Arvados SDK which uses

00:05:32.160 --> 00:05:36.720
the block level API and reassembles

00:05:34.720 --> 00:05:38.800
files on the client side,

00:05:36.720 --> 00:05:40.320
but you can also use higher level access

00:05:38.800 --> 00:05:43.039
methods including

00:05:40.320 --> 00:05:43.680
a file system in user space or FUSE

00:05:43.039 --> 00:05:46.160
mount

00:05:43.680 --> 00:05:49.520
which lets you access Keep more or less

00:05:46.160 --> 00:05:53.199
like a regular POSIX file system,

00:05:49.520 --> 00:05:56.560
or over HTTP using WebDAV or using an

00:05:53.199 --> 00:05:58.639
AWS S3 compatible API

00:05:56.560 --> 00:06:00.880
where an Arvados collection acts like a

00:05:58.639 --> 00:06:03.039
bucket and enables applications that

00:06:00.880 --> 00:06:07.840
already support object storage

00:06:03.039 --> 00:06:07.840
to access files in Arvados.

00:06:08.880 --> 00:06:12.880
The Arvados compute layer is called

00:06:11.280 --> 00:06:15.039
"Crunch".

00:06:12.880 --> 00:06:17.840
Requests to run a compute job are

00:06:15.039 --> 00:06:20.319
submitted through the Arvados API.

00:06:17.840 --> 00:06:22.560
This includes the command line to run,

00:06:20.319 --> 00:06:25.120
the container image to use,

00:06:22.560 --> 00:06:28.160
required hardware resources, and what

00:06:25.120 --> 00:06:30.400
input files are required by the job.

00:06:28.160 --> 00:06:31.600
Crunch handles translating that job

00:06:30.400 --> 00:06:35.360
request into an

00:06:31.600 --> 00:06:38.639
HPC batch submission or on the cloud

00:06:35.360 --> 00:06:39.520
by using cloud providers' APIs to request

00:06:38.639 --> 00:06:42.560
a new compute

00:06:39.520 --> 00:06:42.960
instance on demand, running the job and

00:06:42.560 --> 00:06:44.639
then

00:06:42.960 --> 00:06:46.880
shutting down the compute instance when

00:06:44.639 --> 00:06:49.199
it's no longer needed.

00:06:46.880 --> 00:06:50.400
Crunch keeps a record of every job that

00:06:49.199 --> 00:06:52.400
has been submitted

00:06:50.400 --> 00:06:53.759
and all of the inputs, outputs, and

00:06:52.400 --> 00:06:56.240
container images

00:06:53.759 --> 00:06:59.759
are Keep collections

00:06:56.240 --> 00:07:02.560
identified by portable data hashes.

00:06:59.759 --> 00:07:03.440
The same portable data hash means you

00:07:02.560 --> 00:07:06.880
have the same

00:07:03.440 --> 00:07:09.360
file structure and content.

00:07:06.880 --> 00:07:11.440
If a job which is identical to a

00:07:09.360 --> 00:07:13.360
previous job is submitted,

00:07:11.440 --> 00:07:14.880
meaning it was submitted with an

00:07:13.360 --> 00:07:19.440
identical container image,

00:07:14.880 --> 00:07:23.199
input files, and command line, Arvados

00:07:19.440 --> 00:07:24.639
will recognize that it is identical to a

00:07:23.199 --> 00:07:27.440
previous job

00:07:24.639 --> 00:07:29.840
and instead of redundantly

00:07:27.440 --> 00:07:32.479
re-running the job it will simply reuse

00:07:29.840 --> 00:07:35.280
the result from the past run.

00:07:32.479 --> 00:07:37.599
This is especially useful if you need to

00:07:35.280 --> 00:07:39.599
stop and restart a workflow

00:07:37.599 --> 00:07:41.120
as it will quickly move through all the

00:07:39.599 --> 00:07:44.240
reused steps

00:07:41.120 --> 00:07:47.599
that have previously been executed.

00:07:44.240 --> 00:07:49.919
Crunch also collects complete logs,

00:07:47.599 --> 00:07:52.080
information about the compute node, and

00:07:49.919 --> 00:07:55.360
extensive metrics about the job,

00:07:52.080 --> 00:07:56.000
such as moment-to-moment CPU usage, RAM

00:07:55.360 --> 00:07:59.440
usage,

00:07:56.000 --> 00:08:00.160
I/O and so forth, enabling you to easily

00:07:59.440 --> 00:08:02.240
diagnose

00:08:00.160 --> 00:08:03.199
common problems such as out-of-memory

00:08:02.240 --> 00:08:06.800
conditions

00:08:03.199 --> 00:08:10.000
or optimize cost by determining the most

00:08:06.800 --> 00:08:13.840
compact and least expensive node size

00:08:10.000 --> 00:08:13.840
that fits a job.

00:08:14.639 --> 00:08:19.520
The native workflow language of Arvados

00:08:17.120 --> 00:08:22.879
is Common Workflow Language.

00:08:19.520 --> 00:08:25.759
CWL is an open standard for describing

00:08:22.879 --> 00:08:26.000
computational data analysis workflows

00:08:25.759 --> 00:08:27.840
which

00:08:26.000 --> 00:08:30.319
is supported by a number of different

00:08:27.840 --> 00:08:32.479
vendors and software platforms.

00:08:30.319 --> 00:08:34.399
The Arvados project has been involved in

00:08:32.479 --> 00:08:36.800
the development of CWL

00:08:34.399 --> 00:08:39.039
since its inception, and offers robust

00:08:36.800 --> 00:08:43.399
CWL support.

00:08:39.039 --> 00:08:46.399
If you would like to learn more, go to

00:08:43.399 --> 00:08:46.399
commonwl.org.

00:08:48.320 --> 00:08:53.360
Day-to-day use of Arvados typically

00:08:50.800 --> 00:08:55.519
involves using the Arvados Workbench web

00:08:53.360 --> 00:08:57.839
application.

00:08:55.519 --> 00:08:59.120
Workbench lets you search and browse

00:08:57.839 --> 00:09:00.800
collections,

00:08:59.120 --> 00:09:03.200
start and monitor the progress of

00:09:00.800 --> 00:09:06.160
workflows, create projects,

00:09:03.200 --> 00:09:08.480
upload data, share data with other users,

00:09:06.160 --> 00:09:10.640
and a variety of other features.

00:09:08.480 --> 00:09:11.600
There's also a suite of command line

00:09:10.640 --> 00:09:13.279
tools

00:09:11.600 --> 00:09:14.720
that are capable of doing everything

00:09:13.279 --> 00:09:17.120
that can be done through the web UI.

00:09:19.519 --> 00:09:23.519
Arvados offers software development kits

00:09:21.680 --> 00:09:26.720
for several different languages,

00:09:23.519 --> 00:09:30.560
currently Python, Go, R,

00:09:26.720 --> 00:09:32.959
Ruby and Java. The SDKs

00:09:30.560 --> 00:09:34.880
make it easy to access the underlying

00:09:32.959 --> 00:09:37.519
REST APIs

00:09:34.880 --> 00:09:39.120
as well as direct access to data stored

00:09:37.519 --> 00:09:41.600
in Keep.

00:09:39.120 --> 00:09:42.240
In addition, software can access files in

00:09:41.600 --> 00:09:45.200
Keep

00:09:42.240 --> 00:09:50.560
through WebDAV and S3 compatible APIs

00:09:45.200 --> 00:09:52.880
offered by Arvados.

00:09:50.560 --> 00:09:54.959
Arvados takes security and access control

00:09:52.880 --> 00:09:57.279
very seriously.

00:09:54.959 --> 00:09:58.000
Access to API endpoints requires a

00:09:57.279 --> 00:09:59.760
client

00:09:58.000 --> 00:10:02.079
to present an access token that

00:09:59.760 --> 00:10:04.720
identifies the user.

00:10:02.079 --> 00:10:06.560
All traffic is encrypted by default

00:10:04.720 --> 00:10:08.640
using TLS

00:10:06.560 --> 00:10:10.399
and Arvados can be easily configured for

00:10:08.640 --> 00:10:13.519
data

00:10:10.399 --> 00:10:16.240
to be encrypted at rest. Arvados

00:10:13.519 --> 00:10:19.120
supports various single sign-on systems

00:10:16.240 --> 00:10:21.200
including LDAP, OpenID Connect and

00:10:19.120 --> 00:10:23.760
Google accounts.

00:10:21.200 --> 00:10:24.399
Data uploaded to Arvados is private by

00:10:23.760 --> 00:10:26.160
default

00:10:24.399 --> 00:10:29.760
but can be shared with other users or

00:10:26.160 --> 00:10:29.760
groups at different access levels.

00:10:31.760 --> 00:10:38.160
A major feature of Arvados is federation.

00:10:35.519 --> 00:10:40.720
Arvados clusters are able to communicate

00:10:38.160 --> 00:10:43.040
with other clusters in a federation

00:10:40.720 --> 00:10:44.880
in order to enable the user to log in

00:10:43.040 --> 00:10:46.079
with a consistent identity and

00:10:44.880 --> 00:10:48.399
credentials,

00:10:46.079 --> 00:10:50.079
and search and access data across

00:10:48.399 --> 00:10:53.279
multiple clusters in different

00:10:50.079 --> 00:10:55.760
regions or organizations.

00:10:53.279 --> 00:10:56.320
Federation enables you to use Arvados to

00:10:55.760 --> 00:10:58.959
create a

00:10:56.320 --> 00:11:00.959
data commons in which data can be both

00:10:58.959 --> 00:11:02.000
shared widely among users in an

00:11:00.959 --> 00:11:04.320
organization,

00:11:02.000 --> 00:11:08.399
or between organizations, while still

00:11:04.320 --> 00:11:08.399
having controlled, audited access.

00:11:11.519 --> 00:11:17.800
In April 2020, a week-long online

00:11:15.200 --> 00:11:19.200
BioHackathon was organized to see how

00:11:17.800 --> 00:11:21.040
bioinformatics

00:11:19.200 --> 00:11:23.360
could help in the fight against SARS-CoV-2.

00:11:23.360 --> 00:11:29.519
One of the projects to emerge from this

00:11:26.480 --> 00:11:34.800
was the public sequence resource PubSeq

00:11:29.519 --> 00:11:36.800
located at covid19.genenetwork.org.

00:11:34.800 --> 00:11:39.360
The vision of this resource was to

00:11:36.800 --> 00:11:41.600
provide viral sequence data

00:11:39.360 --> 00:11:43.839
in a place where, unlike other

00:11:41.600 --> 00:11:45.440
repositories that supply data but no

00:11:43.839 --> 00:11:47.360
compute capability,

00:11:45.440 --> 00:11:49.600
scientists could easily run their own

00:11:47.360 --> 00:11:51.760
custom batch analysis.

00:11:49.600 --> 00:11:54.079
Using Arvados we were able to go from

00:11:51.760 --> 00:11:55.279
concept to working prototype in five

00:11:54.079 --> 00:11:57.440
days.

00:11:55.279 --> 00:11:59.760
Today this resource has over thirty

00:11:57.440 --> 00:12:01.519
thousand viral sequences available

00:11:59.760 --> 00:12:05.440
along with metadata and a number of

00:12:01.519 --> 00:12:05.440
workflows for processing the data.

00:12:08.079 --> 00:12:12.320
To sum up, Arvados is a complete software

00:12:11.440 --> 00:12:14.639
platform

00:12:12.320 --> 00:12:15.360
for managing data and compute on that

00:12:14.639 --> 00:12:18.079
data

00:12:15.360 --> 00:12:20.160
without compromising on scale, security,

00:12:18.079 --> 00:12:22.839
or control.

00:12:20.160 --> 00:12:25.120
Curii has set up a demo instance at

00:12:22.839 --> 00:12:27.200
playground.arvados.org

00:12:25.120 --> 00:12:28.639
where you can get a feel for what it is

00:12:27.200 --> 00:12:31.480
like to use Arvados

00:12:28.639 --> 00:12:34.800
as a regular user. Follow the

00:12:31.480 --> 00:12:37.279
documentation link on the arvados.org

00:12:34.800 --> 00:12:38.480
front page to get to the Arvados

00:12:37.279 --> 00:12:41.760
documentation,

00:12:38.480 --> 00:12:43.360
which includes our user guide, and if you

00:12:41.760 --> 00:12:44.480
think you would like to set up Arvados

00:12:43.360 --> 00:12:48.000
for yourself,

00:12:44.480 --> 00:12:49.680
our installation guide. Arvados is open

00:12:48.000 --> 00:12:51.440
source and we welcome and encourage

00:12:49.680 --> 00:12:54.560
community users.

00:12:51.440 --> 00:12:56.480
Follow the community link on arvados.org

00:12:54.560 --> 00:12:57.839
to get information about the Arvados

00:12:56.480 --> 00:13:00.560
discussion forum,

00:12:57.839 --> 00:13:02.560
chat, video calls, and other community

00:13:00.560 --> 00:13:04.320
events.

00:13:02.560 --> 00:13:06.079
Finally, if you're interested in

00:13:04.320 --> 00:13:10.200
professional support and development

00:13:06.079 --> 00:13:13.839
services, please contact us at info@curii.com

00:13:10.200 --> 00:13:19.920
for more information.

00:13:13.839 --> 00:13:19.920
And with that, thank you for your time.

