Arvados 2.6.0 Release Notes
April 6, 2023
The Arvados team is pleased to announce Arvados 2.6.0. This release includes many improvements to Arvados’ performance and reliability under heavy compute load, along with a variety of other new features and bug fixes. We recommend that new and existing installations of 2.5.0 or earlier upgrade to 2.6.0. See Upgrading Arvados for upgrade instructions.
New features and enhancements
Workbench 2
Workbench 2 now provides a dedicated view for registered workflows, which lets you easily view and refer to their details. #19482
The Workbench 2 sharing dialog has numerous usability improvements (#19294, #20085):
- “All users” has been added as a top-level sharing option to easily share with everyone on the cluster but not anonymous users.
- All sharing changes are saved immediately.
- The button to add a sharing permission runs parallel to the button to remove one.
- If the user starts searching for a user or group to share with, then abandons the search, the input will be cleared to reflect that no change has been made.
- Only users’ names are listed to provide a more friendly view. Additional details about the user are available through a tooltip when you need them.
Workbench 2 now filters workflow intermediate and log collections out of the default project view to reduce clutter. You can browse these collections by selecting them from the Type filter pulldown. #19295
The container process action menu in Workbench 2 now includes “Copy and re-run process.” This creates a copy of the process in draft state so you can run it again. #15557
Workbench 2 now provides a “Cancel” button for container processes that are queued but have not yet started running. #20000
Workbench 2 now provides a “Run” button to start container processes in the “draft” or “on hold” states. #20000
Workbench 2 now reports container process status as “Cancelling” when the user has requested that a process be cancelled but Crunch has not yet shut it down. #19295
Workbench 2 now reports container process status as “Reused” when Arvados reuses results from a previous workflow run. #19295
Administrators can now configure Workbench 2 to display a banner message to users. This can be used to inform users about upcoming cluster maintenance or configuration changes. Refer to the Workbench configuration documentation for details about how to set this up on your cluster. #18368
Administrators can now configure Workbench 2 to display custom tooltips. These can provide users with tailored guidance about your site’s intended workflows and procedures. Refer to the Workbench configuration documentation for details about how to set this up on your cluster. #19836
Crunch
When running on AWS, Crunch now obtains instance pricing information from AWS APIs, and uses that to calculate container costs. This especially improves the accuracy of reported costs for containers run on spot instances. #19320
Crunch periodically updates running container records with their cost so far. This provides users with an estimate of the cost of containers that haven’t finished yet, as well as containers that abort unexpectedly. #19967
Shortly after Crunch starts a container, it will create a log collection that includes information about the compute node. Workbench 2 already uses this collection to display information about the compute node allocated for the container, so now this information will be available sooner. #19886
When Crunch dispatches a container to an AWS spot instance, if AWS announces a planned interruption of the instance, that information will be recorded in the container logs and runtime status. #19961
Crunch now logs the first time a running container uses more than 90%, 95%, or 99% of its requested memory. This can help users diagnose if a container likely failed because it ran out of memory. #19986
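The threshold logging described above can be pictured with a small sketch. This is only an illustration of the idea, not Crunch's implementation: the function name, the sample format, and the integer comparison are all assumptions.

```python
# Hypothetical sketch: report the first sample at which memory use exceeds
# each percentage of the requested amount. Not taken from the Crunch source.
THRESHOLD_PERCENTS = (90, 95, 99)


def first_crossings(samples, requested_bytes, percents=THRESHOLD_PERCENTS):
    """Return {percent: index of the first sample above that percent}."""
    crossed = {}
    for index, used_bytes in enumerate(samples):
        for percent in percents:
            # Integer arithmetic avoids floating-point rounding at the boundary.
            if percent not in crossed and used_bytes * 100 > percent * requested_bytes:
                crossed[percent] = index
    return crossed
```

Each threshold is recorded once, so a container that hovers around 95% of its request would produce a single entry for that threshold rather than one per sample.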
Crunch now logs the maximum usage it recorded for each resource after a container finishes running. This should help users hone their resource requests for a workflow. #19986
Documentation
The Python SDK cookbook has been expanded with organization by subject, background discussion for each recipe, and clearer examples. #19792
The Python SDK install instructions have been reorganized so they’re easier to follow. #19926
The container request API documentation now includes much more detail about how container cancellation works. #19624
SDKs
The R SDK includes a new writeFile function that can write to an existing collection, rather than creating a new one every time. #20214
API server and controller
All of the API server’s internal id columns have been migrated from 32-bit to 64-bit integers to provide more room for growth. Note that running this migration on a large production instance may take several hours. #19890 #20074
The API server automatically deduplicates permission links as they are created and updated. As a consequence, these API operations may now return an existing link. There is also a migration to deduplicate existing links. This migration could take a while to run if you have many duplicate links already, but this shouldn’t be common. #18693 #19954
LDAP configuration now includes a MinTLSVersion setting. You can set this to allow all Arvados systems to negotiate LDAP connections that use a version of TLS older than what’s recommended (currently TLS 1.2) if that’s the most your server supports. #19896
When Arvados retries a container, it synthesizes a new set of scheduling parameters from all outstanding container requests, so that the new container gets the maximum resources requested among them. This means Arvados acts as expected when a user sees that a container is likely to fail and submits a new request with more generous scheduling parameters before it actually fails. #19917
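As a rough illustration of what "most generous" could mean when combining requests, here is a minimal sketch. The field names and, in particular, the combination rules (treating a max_run_time of 0 as unlimited, and keeping preemptible only when every request allows it) are assumptions made for this example, not rules taken from the Arvados source.

```python
# Hypothetical sketch of merging scheduling parameters from several
# outstanding container requests. The rules below are assumptions.
def merge_scheduling_parameters(requests):
    """Combine a non-empty list of scheduling parameter dicts."""
    if not requests:
        raise ValueError("at least one container request is required")
    max_run_times = [r.get("max_run_time", 0) for r in requests]
    return {
        # Assumed: a single non-preemptible request makes the result non-preemptible.
        "preemptible": all(r.get("preemptible", False) for r in requests),
        # Assumed: 0 means "no limit", which is more generous than any limit.
        "max_run_time": 0 if 0 in max_run_times else max(max_run_times),
    }
```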
Containers now have a log method that provides WebDAV access to running container logs. Future releases will include client tools that use this endpoint for more performant log viewing. #19889
Salt installer
The Salt installer now provides cluster monitoring by integrating with Salt’s Prometheus and Grafana formulas. Arvados nodes and services will be configured to publish metrics to a local Prometheus server, and those can be browsed with Grafana. #16379
The multi-node Salt installer now supports deploying to AWS with TLS private keys encrypted with a passphrase. nginx retrieves the passphrase securely from AWS Secrets Manager. #20035
The default deployment strategy used by the Terraform + Salt install has been adjusted to require fewer public IPv4 addresses. In particular, this means Arvados can now be installed in a fresh AWS account without modifying the installer or needing to request additional public IPv4 address quota. #20270
Scalability and reliability improvements
arvados-cwl-runner now uploads workflows to a collection that includes its dependencies, rather than a single JSON document. This is faster and the uploaded workflow stays much closer to the original source, which simplifies debugging. #19385
arvados-cwl-runner supports a new workflow extension arv:OutOfMemoryRetry. If a workflow step has this hint defined, and fails because the tool ran out of memory, arvados-cwl-runner will automatically retry the step once with a request for more RAM in its runtime constraints. The extension can define how much additional memory to request and how to detect out-of-memory errors from the tool. See our CWL extensions documentation for full details. #19975
The Go SDK can now automatically retry requests that encounter temporary failures. Retries are delayed with exponential backoff, limited by the duration of Client.Timeout. #19972
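The retry pattern described above, exponential backoff bounded by an overall timeout, can be sketched as follows. This is a generic Python illustration of the technique, not the Go SDK's code: TemporaryError, call_with_retry, and the delay values are all hypothetical.

```python
import time


class TemporaryError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(func, timeout, initial_delay=0.1, max_delay=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Call func until it succeeds or the overall timeout has elapsed."""
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        try:
            return func()
        except TemporaryError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # Out of time: surface the last temporary failure.
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
            # Double the delay after each failed attempt, up to max_delay.
            delay = min(delay * 2, max_delay)
```

The clock and sleep parameters exist only so the behavior can be exercised without real waiting; a caller would normally leave them at their defaults.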
If the Crunch dispatcher receives a 503 error response from the API server, it reduces the number of API requests it puts in flight at one time to allow the API server time to recover. This limit gradually increases over time without an error. #19973
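One common way to get this kind of behavior is an additive-increase, multiplicative-decrease limit. The sketch below assumes that approach purely for illustration; the class name, the halving on error, and the one-step recovery are not taken from the dispatcher's source.

```python
# Hypothetical sketch of an adaptive cap on in-flight API requests.
class AdaptiveLimit:
    def __init__(self, maximum, minimum=1):
        self.maximum = maximum
        self.minimum = minimum
        self.limit = maximum

    def on_response(self, status):
        """Adjust the cap after each API response."""
        if status == 503:
            # Assumed policy: halve the cap so the server can recover.
            self.limit = max(self.minimum, self.limit // 2)
        elif self.limit < self.maximum:
            # Assumed policy: recover one slot per non-error response.
            self.limit += 1
        return self.limit
```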
If the Crunch dispatcher receives InsufficientFreeAddressesInSubnet or InsufficientVolumeCapacity errors from EC2 when it tries to create new compute nodes, it treats those like hitting other quota limits, and will pause trying to create new nodes. #20188
CloudVMs configuration now includes a MaxInstances setting. This limits the number of compute nodes created by arvados-dispatch-cloud to ensure your compute capacity does not grow beyond what your API server can support. #18075
CloudVMs configuration now includes a SupervisorFraction setting. This limits the number of instances created out of MaxInstances to run workflow supervisor processes like arvados-cwl-runner to ensure they do not take so many compute node resources that they collectively bottleneck each other. #20182
API configuration now includes a LogCreateRequestFraction setting. This limits the number of concurrent requests out of MaxConcurrentRequests that can be log create requests. Log create requests that come in when the server is at this limit will receive a 503 Service Unavailable response. This ensures capacity is available for cluster administration even when the API server is under heavy log load from running containers. Crunch logs that receive this response will be discarded. #20200
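To make the arithmetic concrete, the following sketch shows how a fraction of the concurrent request limit might translate into a cap on log create requests. The function names, the rounding down, and the 0.5 fraction used as a default are assumptions for illustration only; check the configuration reference for the real default and semantics.

```python
import math


def log_create_slots(max_concurrent_requests, log_create_fraction):
    """Number of concurrent requests that may be log creates (assumed to round down)."""
    return math.floor(max_concurrent_requests * log_create_fraction)


def admit_log_create(active_log_creates, max_concurrent_requests=64,
                     log_create_fraction=0.5):
    """Return the HTTP status a new log create request would get in this sketch."""
    limit = log_create_slots(max_concurrent_requests, log_create_fraction)
    return 200 if active_log_creates < limit else 503
```

With 64 concurrent requests and an illustrative fraction of 0.5, the 33rd simultaneous log create request would be refused while other request types could still be served.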
The default API configuration for MaxConcurrentRequests has been changed from 0 (unlimited) to 64. With more deployment experience, we believe this limit is appropriate for most new installs, and is easy to increase as clusters grow. #20200
The default Collections configuration for BalancePeriod has been changed from 10 minutes to 6 hours. With more deployment experience, we believe this default will still provide sufficient block balancing and cleanup for most clusters, while leaving more resources available for other Arvados work. #20227
The API server had background logic to keep priority consistent across related containers. For example, if you cancelled an entire workflow by setting the priority of your original container request to 0, this logic would set priority 0 on all the container requests it spawned as well. We diagnosed several performance problems in this code, so Arvados 2.6.0 includes a more performant implementation in the controller. #20183 #20240
Several large database queries throughout the API server have been optimized to work in small batches and/or select the specific data fields they need to reduce memory requirements. #20223
The controller now caches the API server discovery document and serves it directly to clients. #20187
Workbench 2 now copies and updates collections using the replace_files API option. This provides better performance when modifying large collections. #20029
Bug fixes and smaller changes
Workbench 2
When you search a project in Workbench 2 and view the details of one of the result items, Workbench 2 now retains your search and view. #19865
The advanced search dialog in Workbench 2 no longer requires you to select a project to search. #19908 #19969
When you advance through pages of subprocesses on a process page, then reload the process page in your browser, Workbench 2 now remembers and displays the page of subprocesses you were viewing. #20252
Fixed a bug in Workbench 2 where sorting a project listing did not work correctly for some data columns. #19988
Workbench 2 now revalidates caches when displaying collection contents to avoid showing users an out-of-date listing. #19899
If your Workbench 2 session expires and you log back in from that page, you will be returned to the page you were previously viewing. This is true whether your session expired due to inactivity or because your underlying authorization token is no longer valid. #19715
Workbench 2 no longer displays a “Not Found” error when it fails to load resources associated with a container process. #19900
Workbench 2 now displays process status as “Unknown” when it does not have this information available. Previously it would show “Cancelled” in this case. #19273
Fixed a bug where Workbench 2 would construct invalid WebDAV URLs for collections when the cluster was not configured with wildcard certificates. #20089
When a user edits an object’s description to be empty, Workbench 2 will now explicitly update the API object’s description field to null. Previously it would update the description with a contentless HTML skeleton, which prevented the API.FreezeProjectRequiresDescription setting from being enforced as intended. #19930
Fixed a bug where Workbench 2 would show status incorrectly for some container processes in a very long list. #20251
API server and controller
The controller now reads requests sent as multipart form data. Workbench 2 sometimes sent requests encoded this way, so those requests are now handled properly. #19597
If the controller encounters an error when it tries to validate an OIDC token, it now returns a 5xx error so the client knows it can retry. Previously it returned a 401 Unauthorized response, which was indistinguishable from an invalid token. #19907
Fixed several bugs in the API server’s configuration reload thread that made it unreliable. #20137 #20198
The API server should now recognize all system properties, so you can define a strict vocabulary without having to redefine any system properties. System properties are documented in the Metadata vocabulary API documentation. #19980
Improved the trusted clients match detection to support configured URLs that explicitly specify their scheme’s default port number. #20264
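The underlying idea is that a URL with an explicit default port and the same URL without a port refer to the same origin. The sketch below shows one way to compare URLs on that basis using Python's standard library; it is an illustration of the concept, and the function names are hypothetical rather than part of Arvados.

```python
from urllib.parse import urlsplit

# Default ports for the schemes this sketch knows about.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return (scheme, host, port) with the scheme's default port filled in."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, parts.hostname, port)


def same_origin(url_a, url_b):
    """True when both URLs share a scheme, host, and effective port."""
    return origin(url_a) == origin(url_b)
```

Under this comparison, https://workbench.example.com:443/ and https://workbench.example.com/ match, while the same host on port 8443 does not.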
A database migration included in Arvados 2.5.0 has been adjusted so it can run on PostgreSQL 11. #19993
SDKs
Fixed a bug in the R SDK that could prevent you from fetching multiple files from a collection in succession. Thanks to Konrad Rudolph for this fix. #20295
Keep
Fixed a bug in keep-web where it would try to update a collection just because another client sorted the manifest differently than keep-web would. This issue could cause users to receive Unauthorized error responses if they had read-only permission to a collection when keep-web sent the update request. #20083
Fixed a bug in keep-web where it did not redirect unauthenticated users to Workbench 2 if the cluster was not configured with an anonymous token. #19963
In previous releases arv-mount reported a generic “I/O error” if you tried to create a file directly inside a project directory. Now it reports “Operation not supported” to clarify that the problem is with the request and not the system. #19897
When the same volume is available from multiple Keep stores, and keep-balance wants to trash blocks on that volume, it will now select one Keep store at random to receive the trash request. It previously sent the request to all Keep stores, which could cause S3-backed stores to detect a race condition and discard the trash request. #20242
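The selection step can be pictured as choosing one server per shared volume instead of broadcasting. This is a minimal sketch with hypothetical names and data shapes, not keep-balance's actual logic.

```python
import random


def pick_trash_targets(volume_stores, rng=random):
    """Choose one Keep store per volume to receive trash requests.

    volume_stores maps a volume identifier to the (non-empty) collection of
    Keep stores that mount it. Both the mapping and the names are assumptions.
    """
    # Sorting first makes the choice reproducible when rng is seeded.
    return {volume: rng.choice(sorted(stores))
            for volume, stores in volume_stores.items()}
```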
keep-balance now has an option that limits it to working on blocks with a specific checksum prefix. For now this is intended primarily as a way for us to instrument potential keep-balance scaling strategies. #19923
Crunch
Fixed a bug in arvados-dispatch-cloud where it did not properly install crunch-run after an upgrade, causing it to abandon running containers. #20235
Dependencies
The Python SDK’s dependency on google-api-python-client has been upgraded to version 2.1.0+. This makes it easier to install alongside other libraries that use that package. #19895
The Prometheus client library used by Arvados has been upgraded to version 1.14.0 to address a security vulnerability in earlier versions. #20121
Fixed a bug in the build scripts that would generate inconsistent version numbers on old commits from release branches where their nearest tag is older than their base merge commit. #19937