Raphael Dussin
6 min read · Jan 23, 2021

ROMS is one of the most popular numerical models for regional and coastal oceanography. One of ROMS' particularities is that it uses a terrain-following vertical coordinate, in which the "layers" depend on the bathymetry (ocean depth) and on the free surface (the deviation of the sea surface elevation from the geoid), which evolves in time. To analyze or plot ROMS output, the depth of the vertical layers must therefore be computed from the bathymetry, the free surface, and a fixed set of parameters describing the vertical coordinate transformation and stretching. Several packages, such as pyroms or CROCO_tools, are available to help users perform these computations, along with many other tasks such as extracting and plotting horizontal or vertical slices of data.

Xarray has been a game changer for oceanographers and climate scientists, and xgcm provides the framework extension that supports the staggered grids used by General Circulation Models. Unfortunately, ROMS netcdf outputs do not play well with xarray and xgcm. One of my latest simulations with ROMS is a 2 km model of the US West Coast with 1440x450x50 grid points. The physics is coupled to a biogeochemistry model (36 tracers), which produces a large number of prognostic and diagnostic variables. ROMS also writes runtime parameters as 0D netcdf variables in the output files (for bookkeeping). This results in two sets of output files (avg and dia) containing respectively 358 and 400 netcdf variables of various sizes. Trying to open 10 years of data, saved every 5 days, with xarray.open_mfdataset simply does not work. Profiling (using %prun) on a subset of model outputs shows that a very large fraction of the time is spent opening and inspecting netcdf variables. The second problem comes from the dimension names (at least in the Rutgers version of the model), which require significant modifications to be properly understood by xgcm.
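For reference, such a profiling session looks like this in IPython/Jupyter (the file pattern is hypothetical):

```python
import xarray as xr

# IPython magic: profile where the time is spent while opening a subset of files
%prun ds = xr.open_mfdataset("ocean_avg_00*.nc", combine="by_coords")
```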

Because of these issues, the proposed solution takes an extreme step: converting the roughly 5 TB of netcdf files into a consolidated zarr store. In this zarr store, the metadata for the dataset is written to a single .zmetadata text file, which solves the problem of opening and inspecting hundreds of netcdf files, each containing hundreds of variables. The data itself is stored in 3D chunks (longitude, latitude, vertical levels) of about 125 MB. The first part of this post describes the conversion process; for readability, it is split into a discussion of the various functions and a main script that calls them sequentially. The second part shows a couple of examples of how to use xarray and xgcm to compute and plot maps and sections from the newly created dataset.

ROMS writes a lot of small variables, such as runtime parameters, to the netcdf files. This really cripples the opening of a dataset consisting of a large number of files with xarray.open_mfdataset. Although keeping all the runtime parameters as variables in the zarr store should not make a big difference in performance when opening the store, it would significantly inflate the number of variables in the xarray.Dataset, which is already close to 200! The runtime parameters can always be retrieved from the model's namelist (e.g. ocean_CCS2.in). Of all the 0D/1D variables, the only ones that need to be kept in the dataset are those describing the vertical coordinate and the time axis.
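A minimal sketch of such a filter (the helper name and the exact list of kept variables are mine; adjust them to your configuration):

```python
import xarray as xr

# 0D/1D variables describing the vertical coordinate and the time axis
KEEP = {"s_rho", "s_w", "Cs_r", "Cs_w", "hc", "Vtransform", "ocean_time"}

def drop_scalar_variables(ds: xr.Dataset) -> xr.Dataset:
    """Drop the 0D/1D runtime-parameter variables written by ROMS."""
    to_drop = [
        name for name, var in ds.data_vars.items()
        if var.ndim <= 1 and name not in KEEP
    ]
    return ds.drop_vars(to_drop)
```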

The grid discretization in ROMS includes many "exterior" grid points, used either for the open boundary schemes or for better parallel performance of the code, that are not needed to analyze the data. Removing these points results in a nicely formatted symmetric grid, similar to what is used in MOM6.
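A sketch of this trimming, assuming the standard Rutgers dimension names (the exact index ranges depend on the model configuration):

```python
import xarray as xr

def trim_exterior_points(ds: xr.Dataset) -> xr.Dataset:
    """Drop the outermost rho rows/columns so the grid becomes symmetric:
    Nx x Ny centers and (Nx+1) x (Ny+1) corners, as in MOM6."""
    selection = {
        dim: slice(1, -1)
        for dim in ("xi_rho", "eta_rho", "eta_u", "xi_v")
        if dim in ds.dims
    }  # u keeps all xi_u points, v keeps all eta_v points, psi is untouched
    return ds.isel(selection)
```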

On the Arakawa C-grid, variables such as density (rho), zonal (u) and meridional (v) velocities, and vorticity (psi) are located on different grid points (cell centers, faces, and corners). ROMS defines dimensions for all four point types (e.g. xi_u, eta_u), which creates duplicate information (e.g. xi_rho = xi_v). Other models, such as MITgcm or MOM6, only define two dimension sets, for the center and corner points, which is the data model supported by xgcm. The next function performs a reduction and renaming of the dimension sets, using the MOM6 naming convention (xh, yh for centers and xq, yq for corners), although this is a completely personal choice.
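A possible implementation of the renaming (it assumes the exterior points were already trimmed, so that paired dimensions have matching lengths):

```python
import xarray as xr

def rename_dimensions(ds: xr.Dataset) -> xr.Dataset:
    """Collapse the four ROMS dimension sets into two, using the MOM6
    convention: xh/yh for cell centers, xq/yq for cell corners."""
    rename = {
        "xi_rho": "xh", "eta_rho": "yh",  # centers
        "xi_psi": "xq", "eta_psi": "yq",  # corners
        "xi_u": "xq", "eta_u": "yh",      # u faces: corner in x, center in y
        "xi_v": "xh", "eta_v": "yq",      # v faces: center in x, corner in y
    }
    return ds.rename({k: v for k, v in rename.items() if k in ds.dims})
```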

Although the number of dimension sets drops from four to two, longitude and latitude remain defined on all four point types and keep their original names. To plot against spatio-temporal coordinates with xarray.plot, the variables for longitude, latitude, and the vertical coordinate need to be declared as coordinates of the xarray.Dataset.
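For example (the coordinate names follow the Rutgers conventions):

```python
import xarray as xr

GRID_COORDS = [
    "lon_rho", "lat_rho", "lon_u", "lat_u",
    "lon_v", "lat_v", "lon_psi", "lat_psi",
]

def promote_coordinates(ds: xr.Dataset) -> xr.Dataset:
    """Promote position variables to coordinates so xarray.plot can use them."""
    return ds.set_coords([c for c in GRID_COORDS if c in ds.data_vars])
```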

With all the functions established, the main script can now sequentially process the hundreds of netcdf output files, for which a list is created using glob. In the Rutgers version of ROMS, the horizontal grid coordinates can be written to a separate grid file, and these grid variables need to be included in the zarr store. Some less relevant variables (e.g. spherical, hraw) and redundant ones (e.g. Cs_r, Cs_w, hc) can be dropped, and only interior points are kept so that the grid variables are consistent with the rest of the data variables. Once the grid variables are merged with the data variables in a temporary dataset, the final chunking is chosen and the data is either written or appended to the zarr store, named CCS2_store.
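Putting it all together, the main loop could look roughly like this (file names are hypothetical, and the helpers are the sketches above):

```python
import glob
import xarray as xr

# grid file: drop less relevant/redundant variables, keep interior points
grid = xr.open_dataset("CCS2_grd.nc")
grid = grid.drop_vars(["spherical", "hraw", "Cs_r", "Cs_w", "hc"], errors="ignore")
grid = promote_coordinates(rename_dimensions(trim_exterior_points(grid)))

for n, fname in enumerate(sorted(glob.glob("ocean_avg_*.nc"))):
    ds = xr.open_dataset(fname, chunks={"ocean_time": 1})
    ds = drop_scalar_variables(ds)
    ds = promote_coordinates(rename_dimensions(trim_exterior_points(ds)))
    out = xr.merge([ds, grid])
    # one 3D (longitude, latitude, vertical levels) chunk per time record
    out = out.chunk({"ocean_time": 1})
    if n == 0:
        out.to_zarr("CCS2_store", mode="w", consolidated=True)
    else:
        out.to_zarr("CCS2_store", mode="a", append_dim="ocean_time",
                    consolidated=True)
```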

The script takes hours to run since it is very heavy in I/O operations, but that is a one-time cost: after the full reformatting, opening the dataset takes just a second.
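Reopening is then a one-liner, since the consolidated metadata spares xarray from touching any data chunks:

```python
import xarray as xr

ds = xr.open_zarr("CCS2_store", consolidated=True)
```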

This second part demonstrates how easy it now is to plot variables from the newly created dataset, on top of it being much more performant. To analyze the data, additional information about the depth of the layers z_rho and of the interfaces z_w is required. Since they depend on the free surface zeta, which is time-dependent, the depth arrays are 4D. Using the equations defining the vertical coordinate described here, xarray can build the depth arrays lazily (i.e. the data is not loaded into memory and is only computed as needed) thanks to the dask layer. Since xarray cannot plot against a coordinate containing NaN (short for Not A Number, found here on land points), the missing values are set to an arbitrary minimal depth. Finally, xgcm provides the right interpolation procedure to build the depth arrays at the U and V velocity points, z_u and z_v, as well as the thickness of the layers dz, obtained by differencing the interface depths along the vertical axis.
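A sketch of the lazy depth computation, assuming the "new" ROMS vertical transform (Vtransform = 2); the helper name is mine:

```python
import xarray as xr

def compute_z(ds: xr.Dataset, s: str = "s_rho", cs: str = "Cs_r") -> xr.DataArray:
    """Depth of layers (s_rho/Cs_r) or interfaces (s_w/Cs_w), Vtransform = 2:
    z = zeta + (zeta + h) * (hc * s + C(s) * h) / (hc + h)."""
    s_coord = (ds["hc"] * ds[s] + ds[cs] * ds["h"]) / (ds["hc"] + ds["h"])
    z = ds["zeta"] + (ds["zeta"] + ds["h"]) * s_coord
    # xarray cannot plot against a coordinate containing NaN: give land
    # points an arbitrary minimal depth instead
    return z.fillna(0.0)

z_rho = compute_z(ds, "s_rho", "Cs_r")  # 4D, dask-backed, computed lazily
z_w = compute_z(ds, "s_w", "Cs_w")
```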

An analysis script starts by opening the zarr store, building an xgcm grid object, and computing the depth arrays. The xgcm grid object is what allows grid-aware operations, such as interpolation or differentiation of gridded variables, to be performed. Each axis (X, Y, Z) receives the information about the relative location of the dimensions along it. Note that the vertical axis is defined in terms of s_rho and s_w, which are the vertical dimensions, and not z_rho and z_w, which are coordinates but not dimensions. In this regional model, the domain is not periodic. With the knowledge of the grid staggering, the depth arrays at all points can be computed using the function previously described.
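The setup could look like this (the axis positions follow the symmetric grid built during the conversion; "outer" means the corner dimensions have one more point than the centers):

```python
from xgcm import Grid

grid = Grid(
    ds,
    coords={
        "X": {"center": "xh", "outer": "xq"},
        "Y": {"center": "yh", "outer": "yq"},
        "Z": {"center": "s_rho", "outer": "s_w"},
    },
    periodic=False,  # regional domain
)

# depths at velocity points and layer thicknesses, all lazy
z_u = grid.interp(z_rho, "X", boundary="extend")
z_v = grid.interp(z_rho, "Y", boundary="extend")
dz = grid.diff(z_w, "Z")
```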

The first example demonstrates how to plot a section. Now that all the coordinates are properly defined, xarray can plot any variable against its own time-varying depth. The plot shows the "meridional" velocity (following the model's axes) v for a given time and y index. The keyword arguments x='lon_v' and y='z_v' select which coordinates to plot against; the other arguments control the color palette and other cosmetic features.
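Stripped to its essentials, and with hypothetical time/y indices, the section plot is:

```python
# attach the matching slice of the 4D depth as a coordinate, then plot
v = ds["v"].isel(ocean_time=0, yq=100)
v = v.assign_coords(z_v=z_v.isel(ocean_time=0, yq=100))
v.plot(x="lon_v", y="z_v", cmap="RdBu_r")
```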

The second example shows how to create a temperature map at a given depth. Because of the structure of the vertical coordinate, a surface of constant depth (a geopotential) intersects multiple model layers. Hence the value at a given depth results from vertically interpolating the data points, located at z_rho, onto the target depth. xgcm.transform performs linear or conservative interpolation of the temperature data along the Z axis to the target depth of -100 meters, using the 4D depth array to define the interpolation weights.
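A sketch using xgcm.transform (names follow the conventions used above):

```python
import numpy as np

# the 4D depth array, attached as a coordinate, defines the weights
temp = ds["temp"].assign_coords(z_rho=z_rho)
temp_100m = grid.transform(
    temp, "Z", np.array([-100.0]), target_data=temp.z_rho, method="linear"
)
temp_100m.isel(ocean_time=0).squeeze().plot(x="lon_rho", y="lat_rho")
```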

Hopefully this post convinced the reader of the benefits of moving to zarr-based storage to leverage the power of xarray/dask and xgcm, allowing efficient computations and easy production of diagnostics and plots from ROMS outputs.
