Arvados 2.7.0 Release Notes
September 21, 2023
The Arvados team is pleased to announce Arvados 2.7.0. Highlights of this release include a new container logging system, scalability and performance improvements, and user interface improvements. We recommend that new and existing installations of 2.6.3 or earlier upgrade to 2.7.0. See Upgrading Arvados for instructions.
New Container Logging System
Arvados 2.7.0 introduces an entirely new system to view logs from running containers. This new API lets clients retrieve logs directly from the running container, rather than reading log objects stored on the API server. This greatly reduces API server load, network traffic, and database storage requirements for clusters with heavy compute load: live logs are sent only to the clients that ask for them, instead of being sent from every running container. This also makes it easier for clients to provide users with a consistent and complete view of all logs available, whether a container is running or finished.
The new logging API was implemented in #19889, #20319, and #20647.
Workbench 2 support for this new API was implemented in #20219. The interface is the same as before, but the logs shown should be much more complete.
Log viewing at the command line was added as the arvados-client logs command in #18790.
With the introduction of this API, the default configuration for Containers.Logging.LimitLogBytesPerJob is now 0. This functionally disables old log record creation in Crunch. Those code paths are still available for now, but are expected to be deprecated and removed in future releases of Arvados. #20894
Workbench
Workbench 2 is now the default for new installs, and Workbench 1 is deprecated. All new development is going into Workbench 2, which means that Workbench 1 cannot take advantage of new APIs like the container logging API described above. Our user guide has been updated to give people instructions based on Workbench 2, not Workbench 1. #20497, #20688, #20731, #20850, #20890
Workbench supports a richer set of copy and move operations when you select multiple files from a collection view. You can copy or move those files elsewhere within the same collection, to a new collection, to an existing collection, or to a separate collection for each selection. The underlying operations are handled by the API server so they’re fast and efficient. #20031
The process view links to the collection that contains the CWL workflow definition that was run when applicable. #20513
The process view shows both complete workflow and single-container cost in a single, simplified line item. #20454
Workflows correctly display inputs with an optional enum type. #19359
Added a Delete action to the workflow definitions menu. #20477, #20899
New users can be directed to complete a user profile on first login. The user profile, including required fields, is configurable by the administrator. #18946, #20913
Breadcrumbs in Workbench render even when the parent project is not visible in the left-hand navigation menu. #19991
Searching and filtering on the Workbench “Shared with me” view now works as intended. #20617
Scalability and Performance Improvements
The API server prioritizes requests that come from interactive clients, processing them before any others. This improves the responsiveness of the system to users using clients like Workbench while the cluster is under heavy compute load. An “interactive” request is one with the Origin header set. Both versions of Workbench and keep-web set this. #20602
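The prioritization rule can be illustrated with a short sketch. This is not the API server's implementation, which these notes do not describe. It only assumes that a request counts as interactive when it carries a non-empty Origin header, and it uses a hypothetical RequestQueue class to show interactive requests being served first while arrival order is kept within each class.

```python
import heapq
import itertools


def is_interactive(headers):
    """Treat a request as interactive when it has a non-empty Origin header."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    return any(name.lower() == "origin" and value for name, value in headers.items())


class RequestQueue:
    """Hypothetical queue that serves interactive requests before others."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, request_id, headers):
        # Lower number sorts first: 0 for interactive, 1 for everything else.
        priority = 0 if is_interactive(headers) else 1
        heapq.heappush(self._heap, (priority, next(self._arrival), request_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = RequestQueue()
queue.push("crunch-update-1", {"Authorization": "Bearer example"})
queue.push("workbench-list", {"Origin": "https://workbench.example"})
queue.push("crunch-update-2", {})
queue.push("keep-web-get", {"origin": "https://collections.example"})

served = [queue.pop() for _ in range(len(queue))]
print(served)
```

In this sketch the two requests with an Origin header are dequeued ahead of the two without one, regardless of arrival order.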
The installer can deploy nginx load balancing in front of multiple controller nodes. This provides an easy way to deploy Arvados clusters with more availability and scale. #20610
Improved the scalability of the Crunch cloud dispatcher by supporting a list of subnet IDs in Containers.CloudVMs.DriverParameters.SubnetIDs. If an attempt to create a compute node fails because a subnet is full, the dispatcher will retry the request in the next subnet in the list, cycling through it as needed. #20755
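The retry behavior can be sketched as follows. This is an illustration only, not the dispatcher's code: SubnetFullError, create_instance_with_fallback, and fake_create are hypothetical names, and the real dispatcher's error detection and retry timing are not described in these notes.

```python
class SubnetFullError(Exception):
    """Hypothetical error raised when a subnet has no free addresses."""


def create_instance_with_fallback(subnet_ids, create_in_subnet, start=0):
    """Try each subnet once, starting at index `start` and wrapping around.

    Returns (result, index_of_subnet_used). Raises SubnetFullError if
    every subnet in the list is full.
    """
    if not subnet_ids:
        raise ValueError("subnet_ids must not be empty")
    for offset in range(len(subnet_ids)):
        index = (start + offset) % len(subnet_ids)
        try:
            return create_in_subnet(subnet_ids[index]), index
        except SubnetFullError:
            continue  # this subnet is full; move on to the next one
    raise SubnetFullError("all configured subnets are full")


# Stand-in for a cloud API call: only subnet-c has room.
def fake_create(subnet_id):
    if subnet_id != "subnet-c":
        raise SubnetFullError(subnet_id)
    return f"instance-in-{subnet_id}"


instance, used = create_instance_with_fallback(
    ["subnet-a", "subnet-b", "subnet-c"], fake_create, start=1
)
print(instance, used)
```

Returning the index of the subnet that worked lets a caller start the next attempt there, so a full subnet is not retried first every time.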
Improved the responsiveness of container update cascades (such as cancelling a running workflow with many children) by optimizing the SQL and reducing back and forth between the database and the API server. #20457, #20529, #20472
Improved the performance of keep-web when users are writing many small files. #20559
Improved the performance of keep-web’s S3 API when listing directories with more than 1,000 items. #20726
Improved the responsiveness of keep-web under heavy cluster load by using a shorter timeout on API requests. #20425
The default Containers.CloudVMs.SupervisorFraction for the Crunch cloud dispatcher has been changed to 0.50 (50%) to allow more progress when the dispatcher has just started and there are many workflows waiting to run. #20894
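As a rough illustration of what the fraction means, the sketch below computes a supervisor cap from a concurrency limit. It assumes the cap is the fraction multiplied by the dispatcher's current concurrency limit, rounded down. How the dispatcher actually rounds, and whether it enforces a minimum, is not stated in these notes, and max_supervisors is a hypothetical helper.

```python
import math


def max_supervisors(concurrency_limit, supervisor_fraction=0.50):
    """Assumed cap on supervisor containers: fraction of the limit, rounded down."""
    if not 0 <= supervisor_fraction <= 1:
        raise ValueError("supervisor_fraction must be between 0 and 1")
    return math.floor(concurrency_limit * supervisor_fraction)


for limit in (4, 7, 20):
    print(limit, max_supervisors(limit))
```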
API
The /arvados/v1/groups/contents API supports a select parameter like many others. Clients can use this to request only the fields they need and reduce load on the API server. #20470
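A minimal sketch of building such a request URL follows. It assumes that select is passed as a JSON-encoded array in the query string, as with other Arvados array parameters. The host name and the contents_url helper are hypothetical, and a real request also needs an API token.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def contents_url(api_host, select):
    """Build a groups/contents URL that asks only for the listed fields."""
    # Assumption: the array parameter is sent as a JSON-encoded string.
    query = urlencode({"select": json.dumps(select)})
    return f"https://{api_host}/arvados/v1/groups/contents?{query}"


url = contents_url("arvados.example.com", ["uuid", "name", "modified_at"])
print(url)

# Decode the query string again to show what the server would receive.
print(json.loads(parse_qs(urlsplit(url).query)["select"][0]))
```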
An API token with usage limited by scopes can always make a request to GET /arvados/v1/api_client_authorizations/current (that is, retrieve its own record). This makes authentication across clusters in a federation more reliable. #20750
When the API server is authenticating a remote user, if it fails to get the current user record from the original cluster for any reason, but already has a record of the user in its local database, it will use that local record for the session. This improves the reliability of authentication across clusters in a federation, and makes it easier to issue tokens with more limited scopes (e.g., tokens intended for collection sharing). #20750
The API server accepts SSH public keys in any format recognized by OpenSSH. This means it accepts ECDSA and ED25519 keys in addition to RSA and DSA. (Note that Workbench 2 has a separate validation that has not been updated yet.) #20241
Optimized default values for several scale-related settings in the installer to account for changes in API server behavior. #20680
When a project that contains running container requests is trashed, any containers that are running to fulfill the request are cancelled. #20877
The API server returns a more appropriate status when a request for a collection by portable data hash receives a mix of error responses from different clusters in a federation, so clients have better information about whether or not they can retry the request. In particular, the return code is 404 Not Found if all clusters return 404; 422 Unprocessable Entity if all clusters return a 4xx error; or 502 Bad Gateway if any cluster returns a 5xx error. #20425
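The status selection described above can be written as a small function. This is a sketch of the stated rule only, not the API server's code. federated_error_status is a hypothetical name, and the handling of empty input or non-error statuses is an assumption, since the notes cover only error responses.

```python
def federated_error_status(statuses):
    """Combine per-cluster error statuses into one status for the client."""
    if not statuses:
        raise ValueError("need at least one response status")
    if any(not 400 <= status <= 599 for status in statuses):
        # Assumption: this sketch only covers 4xx and 5xx responses.
        raise ValueError("only 4xx and 5xx statuses are covered")
    if any(status >= 500 for status in statuses):
        return 502  # Bad Gateway: at least one cluster had a server error
    if all(status == 404 for status in statuses):
        return 404  # Not Found: no cluster has the collection
    return 422  # Unprocessable Entity: all client errors, not all 404


for example in ([404, 404], [404, 403], [404, 503]):
    print(example, federated_error_status(example))
```

Checking for 5xx first keeps the three cases from overlapping: once no server error is present, every status is 4xx, so the only remaining question is whether all of them are 404.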
Improved the performance of API requests with a “property exists” filter by optimizing the underlying SQL query. #20858
Improved the performance of API requests that list collections by name by adding a database index on this column. In our experience this is a common user query. #14070
Crunch
Crunch supports the Linux kernel cgroups v2 API. You can now deploy Crunch on more modern distributions with full compute usage reporting without turning on the older cgroups v1 API. #17244
If a container request does not specify preemptible compute node instances, then Arvados will no longer reuse unfinished containers that used preemptible instances. This situation can occur when a user notices that preemptible instances are failing before Arvados finishes retrying, and resubmits their workflow with preemptible instance use disabled. The change ensures that Arvados runs new containers on reserved instances as the user intended, rather than reusing the preemptible containers that the user expects to fail. #20606
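The new rule can be expressed as a simple predicate. This sketch covers only the preemptible check described above; Arvados applies other reuse criteria that are not shown, and preemptible_rule_allows_reuse is a hypothetical name.

```python
def preemptible_rule_allows_reuse(request_preemptible, container_preemptible,
                                  container_finished):
    """Return False only in the case this change blocks.

    The change blocks reuse of an unfinished container that runs on a
    preemptible instance when the new request did not ask for preemptible
    instances. Every other combination is left to the other reuse checks.
    """
    if container_finished:
        return True  # the rule only concerns unfinished containers
    if container_preemptible and not request_preemptible:
        return False
    return True


# A user resubmits with preemptible instances disabled while the earlier
# preemptible container is still running: it is not reused.
print(preemptible_rule_allows_reuse(False, True, False))
```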
The Crunch cloud dispatcher’s internal concurrency limit more closely follows the known cloud quota, to avoid excess thrashing around the limit. A new configuration setting Containers.CloudVMs.InitialQuotaEstimate provides the initial value used by the dispatcher at startup. #20667
The Crunch cloud dispatcher waits longer after hitting a cloud quota limit, to reduce request thrashing and increase the chances that the next attempt to create a compute node will succeed. #20457
crunchstat-summary reports a warning when a category of statistics is not available from a container’s logs, to help the user understand why a graph is empty. This can occur when compute nodes are not configured with cgroup statistics accounting that Crunch can read. #20705
The arvados-server cloudtest diagnostic respects the Containers.CloudVMs.DeployPublicKey setting, so the test more closely mirrors Crunch’s own behavior. #20649
If the Crunch cloud dispatcher encounters an SSH authentication error, it logs the error immediately to aid debugging, rather than waiting for the boot probe timeout. #20649
If the Crunch cloud dispatcher times out waiting for a successful boot probe on a newly created instance, it logs the last error in addition to error output from the boot probe command. It also suggests using arvados-server cloudtest to help diagnose the problem. #20649
Improved performance in the Crunch cloud dispatcher and reduced load on the API server by optimizing several queries. #20601
Improved performance in arvados-cwl-runner and reduced load on the API server by optimizing several queries. #20652
SDKs
The writeFile function in the R SDK has been extended to take collectionUUID and fileFormat arguments. This makes the function more extensible, allowing you to write to an existing collection rather than a new one, or to write a file with a particular format but nonstandard extension. Thanks to AnetaSta22 for this contribution. #20660
The collection list method in the Java SDK now supports the include_old_versions and include_trash arguments of the Arvados API. Thanks to Krzysztof Majewski for this contribution. #20664
Optimized the Python SDK’s thread pool that prefetches data from Keep to scale better when fetching from thousands of Collection objects. #20637
Added the arvados.api_resources module to the Python SDK. It documents the API provided by the Arvados API client object, such as the one returned by arvados.api('v1'). This documentation should help developers make fuller use of the Python SDK. You can view the documentation on the web or in pydoc (e.g., run pydoc arvados.api_resources on a system with the Python SDK installed). #18799
Updated installation documentation for all the Arvados Python SDKs and tools to recommend installing inside a virtualenv as best practice following the adoption of PEP 668. #20543
Expanded installation documentation for Arvados client tools in the user guide. #20684
Deployment
The installer can reduce cluster downtime by performing rolling upgrades when a cluster is deployed with a load balancer and multiple controllers. #20680
The installer’s Terraform tools can deploy into existing cloud infrastructure (VPC, subnets, etc.) instead of creating a completely new stack. #20482
Administrators can configure which resources are managed by arvados-login-sync: user accounts, group memberships, SSH keys, and Arvados API tokens. Arvados clusters in environments that already have infrastructure to manage some of these resources can configure arvados-login-sync to disregard them and prevent conflicts. A flag for each resource is in the Users section of the configuration. #20663
Updated the default configuration for arvados-login-sync to avoid managing security-sensitive groups on Debian- and Red Hat-based distributions. If you are granting users access to groups like sudo or wheel through Arvados, you may need to configure Users.SyncIgnoredGroups with your own list. #20663
Improved the security of the installer by using a separate file to configure cluster secrets. This file can be managed in more secure environments to better protect these secrets during the deployment process. #20665
Added configuration options to the installer for administrators to adjust:
- several nginx and Passenger settings that need to be tuned to match cluster size and load - #20468
- how long Prometheus retains data - #20889
- names of the Arvados PostgreSQL database, database role, Keep’s S3 bucket, and Keep’s IAM role - #20889
Improved scalability in the installer by configuring more nginx settings based on CONTROLLER_MAX_QUEUED_REQUESTS. #20594
Improved availability in the installer by configuring nginx to allow a few more connections than the API server is willing to handle. This ensures metrics are available even when the API server has no more capacity for requests. #20474
Expanded the installer documentation to cover different certificate modes, optional encryption of the TLS certificate key, and Keep’s S3 backends. The documentation for the previous manual rolling upgrade process has been removed now that the installer natively supports rolling upgrades. #20888, #20889
Improved the reporting of arvados-client diagnostics by extending the test container to make Arvados API requests. This lets users know if compute nodes have trouble making API requests. To do this, arvados-client builds a tiny Arvados Docker image and uses it to run the test container. If you cannot build this image in your environment, you can select which Docker image is used for diagnostics with the -docker-image option. #20612
API Deprecations
With the release of Arvados 2.7.0, we are formally announcing the deprecation of some older APIs. These are scheduled to be removed in a future major Arvados release. The following API resources and their associated endpoints are all deprecated:
- jobs, job_tasks, pipeline_instances, and pipeline_templates: These resources were all used by the previous version of Crunch. They have been replaced by containers, container_requests, and workflows.
- keep_disks: Replaced by keep_services.
- nodes: This resource was used by the previous version of Crunch. Crunch now better integrates with the underlying dispatcher, so it no longer needs to duplicate this information.
- repositories: This was meant to support workflow development in the previous version of Crunch. CWL workflows let you deploy software in container images, and arvados-cwl-runner records Git metadata for registered workflows, so this functionality is no longer useful.
- humans, specimens, and traits: These resources were originally intended to hold metadata for specific kinds of samples. They have been replaced by project and collection properties, which are more flexible and can be enforced with metadata vocabularies.
In addition, when Arvados returns api_client_authorizations, the fields api_client_id, user_id, and default_owner_uuid are all deprecated. The first two are internal fields that are not useful to clients. default_owner_uuid has never been implemented and we have no plans to do so.
Updated the Arvados API documentation to announce these deprecations. #20840, #20951
Some classes and functions in the Python SDK that were built around these APIs, or have been replaced by new functionality in the Python standard library, have been deprecated as well. Calling them will emit a DeprecationWarning with a suggested alternative where possible. Their docstrings note this information too. #20839
The Keep S3 driver version 2 became the default driver in Arvados 2.5.0. The version 1 driver has been removed completely from this release. #19620
Bug Fixes and Minor Enhancements
Fixed a websockets server bug which caused it to stop sending updates under high load. #20507
Improved the reporting of various services with a new configuration option Users.AuditLogs.RequestQueueDumpDirectory. If a service is near its configured maximum of concurrent requests, it will write a JSON file to this directory with details about the request queue. This can help diagnose performance problems even when the problem is difficult to catch in real time. #20475
The browser back button correctly navigates to the previous panel after visiting a collection by portable data hash. #19793
The “Trash” view in Workbench shows all items in the trash, not only those owned by the current user. #20603
Fixed a Workbench 2 bug where certain users with “manage” permission on an object were not able to access the sharing UI. #20829
Filtering Processes on “Queued” status lists containers in both “Queued” and “Locked” state. #20845
When you launch a workflow or other process from Workbench 2, it is submitted with the usual default priority 500, rather than the lowest possible priority 1. #20882
Fixed an issue where the API controller could not serve its cached discovery document in some network configurations. Thanks to George Chlipala for contributing this fix. #20919
Fixed a bug where arvados-cwl-runner would crash with an IndexError message when there was exactly one file in a set of related inputs. #20462
The arvados-client shell command reads connection settings from ~/.config/arvados/settings.conf like other client tools. #20757
Documented the collection metadata property arv:workflowMain. #20374
Improved the scalability of the Crunch cloud dispatcher by recalculating the number of allowed supervisor containers after hitting a cloud concurrency limit. #20601, #20667
Fixed a bug where Crunch would retry containers for a workflow that had been cancelled. #20614
Fixed some consistency issues in the installer to prevent “unbound variable” errors. #20889
Dependency Updates and Development Improvements
Arvados 2.7.0 runs on Go 1.20.6 and Ruby 2.7.7. We also upgraded various libraries and services that Arvados works with. #20325, #20735
We publish Arvados packages that are built on Rocky 8. We expect these packages to be compatible with any distribution based on RHEL 8. Note the installer has not been updated to support these distributions yet; that work is coming in a future release. #20797, #20844, #20822, #20878
The web documentation for our Python SDK is built using pdoc, instead of its pdoc3 fork. #20853
Prevented some deprecation warnings coming from regular expressions and use of the pipes module in the Python SDK. #20343, #20710
Fixed a crash in arvados-docker-cleaner by updating the docker library to prevent dependency conflicts. #20754
arvados-docker-cleaner now uses version 1.35 of the Docker API to better match other Crunch tools. #20754
Improved the reliability of the arvados-client Debian package by declaring its dependency on fuse. This client tool has long depended on the FUSE library; this just lets the package manager know so the library can be installed if necessary. #20619