EDSS and Models-3 I/O:
REQUIREMENTS, DESIGN, AND IMPLEMENTATION
Carlie J. Coats, Jr., Ph.D.
Environmental Programs,
MCNC North Carolina Supercomputing Center
Copyright 1992-2001 MCNC
INTRODUCTION
EDSS is a system designed to support decision making and related
activities (such as modeling) for both regulation and research
concerning environmental issues. That means that EDSS is more than
just a model: more, even, than a family of models together with a
graphics and analysis package attached to them. Models-3 is the EPA's
National Environmental Research Laboratory's interoperable air quality
counterpart, which shares many models, parts, etc., with EDSS. Within
both of these, however, lies a common thread: tools which can create
and/or access the data and distill it in ways useful to decision
making, and do so in a timely fashion. Decisions need to be based
upon facts -- upon environmental data which may be
the result of observations, or the result of modeling (whether air
quality modeling or economic modeling or the like). With the initial
emphasis on air quality issues, EDSS will initially contain at least a
family of air quality models, together with meteorology and emissions
models to support them, impact models (economic, ecosystem, etc.) to
assess their effects, and analysis tools to examine their results.
Undistilled environmental data is so voluminous, however, that just
"looking at" the numbers (or any substantial subset thereof)
is not a profitable enterprise for the decision maker. As Hamming so
aptly put it, "The purpose of computing is insight, not
numbers."
The voluminous data associated with environmental issues leads us
to two different but related subjects which need to be handled by
different levels of the EDSS system:
- data management; and
- data access.
Data management is concerned with operations which affect data
sets as a whole: indexing, archiving, and migration, while data
access is concerned with the ways that tools access data extracted
from data sets for use within themselves, whether for the purpose
of modeling or for the purpose of analyzing it in order to attain
insight. This paper is concerned with the second issue, data access;
data management in a way that supports the needs of EDSS is the
subject of a separate paper. Data access -- the ways that programs
access data from data sets -- raises three issues:
- How do programs refer to data sets?
- What operations need to be performed, and therefore what
subroutines perform these operations?
- What data structures (and other common assumptions) are
needed for the interfaces to these subroutines?
Specifying these three items cleanly and in a modular, re-usable
fashion leads to a data access interface for programs, what the
current computer jargon will call an input/output applications
programming interface, or I/O API. This document is concerned
with describing the requirements, design, and implementation of the
Models-3/EDSS I/O API. To some extent, this description will use
the language of object-oriented programming, although this treatment
is not thoroughgoing in that regard. (Some would say that we are
taking an object-based rather than an object-oriented view.) In O-O
terms, this combination of operations, or methods, to be performed on
data, together with the structures for storing it, is said to form a
class. From the point of view of a user of a class, what is important
is what the operations are, rather than how they are carried out -- a
concept called encapsulation -- which emphasizes the distinction
between externally visible or public interfaces and internal or
private implementations.
The idea of using a generalized class as a foundation for more
specific subclasses specialized for a particular purpose is called
inheritance.
It should be realized that the present implementations may not scale
to the long term, since it is unclear at present what the shape of
computing will be when it is dominated by massively parallel systems
with many thousands of processors. The one thing which seems clear
at present is that parallel systems researchers are still groping
for the "right" way to exploit such parallelism. With sufficient
forethought, however, the requirements analysis and interface design
may survive even if the implementation does not.
I. REQUIREMENTS
A: Objective
The objective of the Models-3/EDSS I/O API is to provide a single
generalized I/O structure class (with subclasses) with the data
structures and data access operations to fulfill the needs of Models-3
and EDSS. Initially, this class needs to serve the needs of meteorology,
emissions, and air quality models, preprocessors, postprocessors, analysis
and visualization programs, and such other computational tools as are
used within Models-3 and EDSS, for both regulatory and research
applications.
B: Systems Requirements
The I/O API must be callable from at least FORTRAN and C.
It must be stable over time and should be upward compatible with successive
versions of EDSS models and framework, and capable of fulfilling
Models-3 and EDSS' needs for the foreseeable future. It must be
compatible with at least the following platforms, operating systems,
and other software systems:
- POSIX
- Cray / UNICOS
- UNIX
- TCP/IP/FTP
- NFS and AFS
The data must be portable across machines; ideally, they should
be transparently accessible across a heterogeneous distributed
network using remote-mounted file systems such as NFS and AFS.
Files must contain sufficient self-description that they can
"stand alone." In particular, reading they should
not depend upon the availability of additional external
"grid description " or "file dictionary" software.
The I/O API must be both friendly for research modelers and also
have adequate integrity to serve for regulatory applications. In
particular, it must maintain "chain of custody" adequate to
stand up in court for the data it manipulates. See the
Models-3 Coding Standards Document for
some of the relevant coding standards, in order to achieve this level
of integrity.
C: Data Types Supported
For purposes of further analysis, we assume that data is organized
into logical files which are assumed to be
multiple-variable data sets having a common origin and common
data structure type (common
time-stepping assumptions, grid dimensions, etc.). On the other hand,
emissions data shows that variables of different
basic data types -- INTEGER,
REAL, and DOUBLE PRECISION -- may be required in the same file.
Preliminary investigation suggested that at least 60 variables per file
may be necessary (later experience has expanded this, so that the
current limit is 120 for versions through 3.0, and 2048 for versions 3.1
and later), and that it is useful for files to be able to hold data for
periods of at least a year in duration. Whether the logical files are
implemented as physical files or as tables in a scientific database or
by some other means is an implementation issue which may in fact change
over time; what is important for analysis here is the nature of the
interface to these logical files. A systems analysis of meteorology,
emissions, and air quality modeling finds a number of particular types
of data which must be supported. These particular types of data can
be fitted into several generalized data types which need to be
supported by the I/O API, in keeping with the object-based
methodology we are employing. These particular data types are:
- terrain-height data
- land-use data
- demographic data
- meteorology observations:
- surface
- upper-air profiles
- satellite (this is a future item not currently in use)
- radar (also a future item)
- air quality monitor observations
- deposition monitor observations
- emissions input data:
- point-source
- area-source
- mobile-source
- natural-source (biogenic, lightning strike)
- processed (modeled) meteorology data
- processed (modeled) emissions data
- initial condition data
- boundary condition data
- geospatial transform (sparse) matrices for emissions and other modeling
- air quality concentration and deposition data
- intermediate-stage modeling data
- diagnostic
- other research
- Added Nov. 2001: Additional types of data that
it may be useful to support are the following (either of which may
be time-independent or time-dependent):
- variables defined on geospatial coverages
- variables defined on finite element (unstructured) grids
Note that the usual GIS formats and access methods do NOT
provide efficient access to time-stepped variables
defined on geospatial coverages.
During the process of both input (of emissions) and analysis, we
are frequently concerned with aggregated data, data which has been
combined using such operations as averages, maxima, and minima
applied to particular subsets of the data. The I/O API must support
the results of such aggregations. Two important types of aggregation
operations are:
- temporally aggregated data (daily-maximum ozone concentrations
or annual-total sulfate depositions, for example)
- geographically aggregated data (such as state or county totals)
Analysis of the types of data arising and the operations applied to
them shows that the system must support the following:
- Read and write operations must do the status checking necessary
to determine whether they are successful, and maintain audit trails of
all the relevant operations as they do so. In particular, failing
and/or incomplete read and write operations must be flagged as
unsuccessful, and the nature of this failure logged.
- both time-independent
data and time-stepped data, with
time step granularity ranging
from very small (~ 1 second) to
very large (~ 1 year), and should correctly support dates for
the year 2000 and beyond. Climate modeling requires
support for dates 1970 and before. Source-attribution
modeling may require writing data in other than chronological
order.
- multiple layers of multiple variables per data set
(at least 60 variables for full-chemistry aerosol modeling).
- at least the following general data structure types: gridded,
grid-boundary, ID-referenced, vertical profile, grid-nest, sparse
matrix, and custom, each described in detail in the DESIGN section
below;
- Added Nov. 2001: An additional datatype that
we have prototyped, and that may be useful, is that of
geospatial-element cell
complexes (GECC), which is a type efficiently
supporting both (time-stepped and time-independent)
geospatial coverages and finite element data, and modeled
after the data structures used in the (pure) mathematical
field of geometric topology.
All files must contain all the information necessary to access the data
contained in them. This is important both for analysis, where it
permits unified tools supporting a variety of files, and for sharing
data with others: the only data needing to be transported to a
colleague's system is the file itself, and not a collection of
auxiliary files as with some current models.
At least the following sorts of descriptive
information are required:
- file description (text about this type of file)
- file type (gridded, grid-boundary, ID-referenced, profile,
gridnest, sparse matrix, or custom)
- time step (or 0 for time-independent)
- starting date and time (relevant if time step nonzero)
- number of variables, their names, units designations, and descriptions
- update description (date and time of update; name of updating
program; text concerning the computational model run which supplied
the data in this file)
- coordinate system type and specifications (e.g.,
"Lambert, with defining angles 30N, 60N, 90W, and
center at 40N, 90W").
- data structure dimensionality (depends upon file type) --
e.g., for gridded:
- number of columns
- number of rows
- number of levels
- horizontal grid geometry,
if relevant:
- location (X,Y) of the grid origin (SW corner)
- cell-size (DX, DY)
- Coordinate system and horizontal grid geometry specifications
must be sufficiently precise to support (the ill-conditioned
arithmetic in) geospatial transforms for very-high-resolution
(e.g., 10-meter) modeling. In particular,
REAL*4
representation is not adequate.
- Coordinate system and horizontal grid geometry specifications
must be coded so as to support questions like "Is this grid a
properly implemented nest into that one?"
- vertical grid geometry,
if relevant (if number of levels is greater than 1):
- type of vertical coordinate system:
- hydrostatic sigma-P
- nonhydrostatic sigma-P
- sigma-Z
- pressure
- height above sea level
- height above ground
- other
- array of layer surfaces;
- (for sigma-coordinates only:) the model-top
D: Functional Requirements
The I/O API must support the access needs of both the environmental
models used to simulate situations of interest to decision makers and
also the analysis and visualization tools used to distill insight
from the model inputs and outputs. Of particular concern is the
fact that Models-3 and EDSS will contain families of air quality models
with interchangeable-part science modules which implement the simulation
of the various relevant physical processes -- horizontal or vertical
advection, convective mixing, deposition, chemistry, etc., at a
variety of scales. Supporting both the model structure and the
analysis tools suggests that the view of the data presented to
the programmer by the I/O API should be selective random access
in terms compatible with model usage (i.e., access by file,
variable, layer, date and time, with possible further selection
by grid location). The I/O API should automate routine activities
such as the logging of I/O transactions to the extent feasible.
Examination of the initial air quality model prototypes added two
further requirements: time-interpolation to a particular date and
time should be provided as an additional operation; and the system
should verify consistency between the file structure as requested
by the caller and the file structure as recorded in the file itself,
e.g., by an additional buffer-size argument. The desired operations
are the following:
- start up the system;
- create a new file,
according to a caller-supplied specification;
- open an existing file,
either for input-only or for input/output;
- get the description
of an (existing) file;
- read data from a file, with at least the following variants:
- read the data for a
specified date and time, variable, and layer (where
"all" is a valid layer or variable specifier);
- read a subrectangle of
gridded data for a specified variable, date and time, and
ranges of rows, columns, and layers;
- time-interpolate gridded,
boundary, or custom data for a specified variable to a
specified date and time (time interpolation of
ID-referenced data might not be well-defined if,
for example, differing sets of sites occur at adjacent
time steps);
- write data for a specified
variable, date and time to a file; and
- shut the system down,
flushing all data to disk.
E: Performance Requirements
In order to support both analysis and modeling, the system must support
operations on multiple (at least 20) simultaneously open files. Model
nesting and model intercomparison imply that the system must support
simultaneous access to files for different domains (something not
possible with the current (1990's) generation of some models). Access
should be by meaningful name or meaningful value rather than by
arbitrary index values (so that the caller asks for "O3" by
name for example, rather than needing to know whether ozone is variable
# 17, and requesting that). As used by calling programs, file names
themselves should be "logical names" in the sense that they
are properties of the program using them, do not depend upon particular
physical file names in the file system, and permit simultaneous and
independent execution of different instances of the same program on the
same machine without interference with each other (so that different
runs of the same air quality model might be executing simultaneously on
the same machine, for example). Using only the globally-visible
namespace provided by the file system makes this impossible -- or
difficult, at best -- in many instances.
II. DESIGN
The design of the Models-3/EDSS I/O API is given here in terms of its
externally visible properties, i.e., in terms of the conventions used,
the public INCLUDE-file interfaces, and the function-call interfaces for
the public routines in the I/O API. This section documents these externally
visible properties from the FORTRAN
programmer's point of view, rather than from that of the
C programmer, which is documented in a separate section.
A: Conventions
There are a number of data structuring and manipulation conventions
used consistently throughout the Models-3 and EDSS systems, and which
affect the I/O API. Among these are the representation of object-names,
grids, dates, times, and time-deltas. Object names are (blank-padded)
FORTRAN CHARACTER strings of length at most 16. Case is significant.
Horizontal coordinate
systems are named entities, with map projections
taken from a short list of types: Lat-Lon, Lambert conformal,
Mercator, and Stereographic. Because of the ill-conditioned nature
of arithmetic relating to coordinate transformations, descriptive
parameters which completely specify the coordinate systems are kept
in 8-byte REALs. For all of these except Lat-Lon (for which the
parameters are ignored), three parameters determine the map
projection itself, and two additional parameters specify the
coordinate-system origin relative to that projection.
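For illustration (a sketch only: the names P_ALP3D, P_BET3D, P_GAM3D,
XCENT3D, and YCENT3D follow the FDESC3.EXT naming conventions and
should be checked against that INCLUDE file), the Lambert example
"defining angles 30N, 60N, 90W, and center at 40N, 90W" from the
requirements section would be specified as:
P_ALP3D = 30.0D0        !  first secant latitude
P_BET3D = 60.0D0        !  second secant latitude
P_GAM3D = -90.0D0       !  central meridian (90 W)
XCENT3D = -90.0D0       !  longitude of the coordinate-system origin
YCENT3D = 40.0D0        !  latitude of the coordinate-system origin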
Horizontal grids
are named entities, for purposes of unambiguous identification.
For many models, it suffices to deal with regular grids, which are
completely characterized by the specification of a horizontal coordinate
system and four additional parameters which specify the grid origin
(lower-left corner) and the cell-size. Irregular grids are specified
by grid-geometry files, which are gridded files specifying cell location
and extent on a cell-by-cell basis.
Vertical grids
are presumed to be irregularly-spaced and are characterized by the
following:
- vertical coordinate type, from a short list:
- hydrostatic sigma-P
- nonhydrostatic sigma-P
- sigma-Z
- Z (m above ground)
- H (m above sea level)
- eta
- specified by a geometry file
- other
- value of model-top (sigma-coordinates only)
- number NLEVS of levels
- array VLEVEL( 0:NLEVS ) of values for the levels
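(For a hydrostatic sigma-P grid with NLEVS = 3, for example, VLEVEL
might contain the layer-surface values 1.0, 0.8, 0.5, 0.0, running
from sigma = 1 at the surface to sigma = 0 at the model top; the
particular values here are illustrative only.)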
Dates and times
are stored as integers coding the Julian date and (24-hour) time
using the formulas
JDATE = 1000 * YEAR + DAY
JTIME = 100 * (100 * HOUR + MINUTE) + SECOND
= 10000 * HOUR + 100 * MINUTE + SECOND
where the year is specified using all four digits, the day number is
between 1 and 365 or 366 (depending upon leap year), hour is between
0 and 23, and minutes and seconds are between 0 and 59. For example,
the date Feb. 2, 1993 is coded as the integer 1993033, and the time
3:46:53 PM as 154653. When finer-grained resolution is required,
this two-integer representation is supplemented by a third component
which is a REAL between 0.0 and 1.0 representing fractions of a second.
This representation satisfies the granularity requirement of one-second
resolution, gives exact and machine-independent calculation of record
numbers within datasets, etc., correctly handles dates and times both
before 1970 and after 2000, and is easy for modelers to interpret and
manipulate within, e.g., a debugger. Time-deltas are stored using the
same conventions as times, except that they may have arbitrarily large
hours-fields, and may be either positive or negative (in the latter
case, all three fields are negative or zero: -333 means a time step
backwards by three minutes and thirty-three seconds). A variety of
utility routines are available for manipulating dates, times, and
time deltas, and which handle arbitrary time deltas correctly.
We recommend that the convention be adopted that all times are given
in GMT; however, this policy is by no means required by the system.
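For example, a minimal sketch of stepping a date and time forward
under this representation (NEXTIME is one of the date-time utility
routines mentioned above; its exact name and argument order should
be checked against the utility documentation):
INTEGER JDATE, JTIME
...
JDATE = 1993033                  !  Feb. 2, 1993 (day 33 of 1993)
JTIME = 154653                   !  3:46:53 PM
CALL NEXTIME( JDATE, JTIME, 13000 )      !  advance by 1:30:00
!  JDATE is still 1993033; JTIME is now 171653 (5:16:53 PM)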
B: Files -- Logical Names and Physical Names
Rather than forcing the programmer to deal with hard-coded file
names or hard-coded unit numbers, the I/O API introduces the
concept of logical file names .
The modeler can define his or her own
logical names, which then become properties of the program.
Then at run-time the EDSS process manager (or the user who
writes his own shell-scripts) uses the UNIX setenv command
(or the VMS ASSIGN command) to connect up the logical names
to the physical file name of any "real" file desired.
For programming purposes, the significant facts are that the
names should not contain blanks (except as padding at the end:
'foo '
is OK; 'f oo'
is not), and that, when used in subroutine calls, they are
FORTRAN character strings at most 16 characters long.
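A minimal illustration (the logical name QUX and the physical path
here are purely hypothetical): the run-script line
setenv QUX /tmp/mydir/mydata.ncf
causes I/O API routines in the program which refer to the logical
file name 'QUX' to operate on the physical file /tmp/mydir/mydata.ncf;
a second run of the same program can bind 'QUX' to a different
physical file without recompiling or relinking.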
C: Data Structures for Input and Output
Each logical file has header attributes describing itself, and a
sequence of time steps divided into logical data records accessed
by variable and layer.
Dates and times and time-steps are
represented as indicated in the preceding section. All layers of all
variables are assumed to have the same time-step and
data type (gridded, boundary, etc.)
structure. There are three categories of
time step structure
presently in use:
- time-independent files have time step = 0;
the date and time arguments to access functions are ignored when
these access functions are applied to time-independent files;
- time-stepped files have time step > 0
with the time step indicated;
- restart or circular-buffer files, which have
time step < 0 with actual time step the absolute value of the
time step indicated, store exactly two active time steps of data
(the "even step" and the "odd step") and may be
used either for communications buffers or as restart-data files, at
a considerable savings in space over a normally time-stepped file used
for the same purpose.
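A sketch of the record arithmetic these conventions imply (SECSDIFF
and TIME2SEC are date-time utility routines of the sort mentioned in
section A above; treat the names as assumptions here): for a
time-stepped file with starting date and time SDATE:STIME and time
step TSTEP, the data for date and time JDATE:JTIME belongs to time
step number
NSTEP = 1 + SECSDIFF( SDATE, STIME, JDATE, JTIME )
&           / TIME2SEC( TSTEP )
and for restart (circular-buffer) files the record goes to the
"even" or "odd" slot according to the parity of NSTEP.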
There are currently eight types of data structure supported,
although the system is designed to permit the addition of extra types
in an upward-compatible fashion. The present grid-nest type was actually
implemented as a test of this extensibility. Each type except dictionary
has additional layer structure and array dimensionality structure as well.
Indexes for these are subscripted according to FORTRAN conventions
(i.e., starting with 1). Layers are counted from bottom to top vertically;
rows are counted from bottom (south) to top (north), and columns are
counted from left (west) to right (east) horizontally. The data structure
types are identified by "magic number" parameters defined in
INCLUDE-file PARMS3.EXT.
Together with the magic-number values, the types are:
- Type -1:
custom User-defined REAL data with
one logical record
per variable, layer, and time step, with structure interpreted by
the user. Record size (in words) is stored as the number of columns.
This type of file may be used to handle situations otherwise
unanticipated by the present requirements analysis.
- Type 0:
dictionary The "reusable"
portions of a file description, with a named-record structure
(mapping onto the variables referenced by READ3()) to index the file
descriptions. This type should be considered as a tentative prototype
step in file type management rather than a complete and lasting
solution. The fields in such a description are:
- file type ID (custom, dictionary, etc.)
- time step
- number of variables
- number of layers
- number of rows, or maximum number of ID-referenced data sites
- number of columns, or custom words per record, or maximum number
of profile levels
- boundary thickness in cells (used for boundary files only)
- coordinate type ID (lat-lon, Lambert, Mercator, etc.)
- coordinate specification parameters
- grid name
- grid specification parameters
- file description
- list of variable names
- list of units designations for variables
- list of variable descriptions
- Type 1:
gridded (usually regularly-spaced) gridded
data having one
logical record per time step, variable, and layer, with memory
layout as in the FORTRAN declaration
REAL ARRAY( NCOLS, NROWS )
- Type 2:
boundary boundary data has one
logical record per time
step, variable, and layer. Its structure is defined in terms of
a thickened grid perimeter proceeding counterclockwise from the
SW (1,1) corner. The array size for one layer of data is computed
in terms of the dimensions of the corresponding gridded data grid
and the additional thickness parameter
NTHIK
according
to the following formula
2 * |NTHIK| * (NCOLS + NROWS + 2*NTHIK)
where NTHIK > 0
indicates an external boundary and
NTHIK < 0
indicates an internal boundary. It has component subarrays along
each edge of the grid, each layer of which is structured as follows:
REAL SOUTH( NCOLS + NTHIK, NTHIK )
REAL EAST ( NTHIK, NROWS + NTHIK )
REAL NORTH( NCOLS + NTHIK, NTHIK )
REAL WEST ( NTHIK, NROWS + NTHIK )
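For example, with NCOLS = 100, NROWS = 80, and NTHIK = 1, each layer
of boundary data contains 2*1*(100 + 80 + 2*1) = 364 cells: SOUTH and
NORTH contribute 101 cells each, and EAST and WEST 81 cells each.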
- Type 3:
iddata ID-referenced data has one
logical record per
time step. Note that such data as county-aggregation files may
be treated as a special case by encoding the site-IDs as, e.g.,
FIPS codes. Note also that location parameters
must be explicitly treated as variables if they are stored in such
a file. The data records are structured as follows (where MAX is
the file attribute maximum number of sites):
- number of actual sites
INTEGER NSITES
- array of site ID's
INTEGER ID( MAX )
- array of data
REAL DATA( MAX, NLAYS, NVARS )
- Type 4:
profile For geographically scattered
vertical profile arrays of rawinsonde data referenced by ID or by
location. Note that location
is DOUBLE PRECISION and is treated as potentially time-dependent
(to match the behavior of rawinsonde profiles, which the NWS moves
around from time to time). The data has one logical record per
time step, structured as indicated below (where MXLVL is the maximum
number of vertical levels):
- number of actual sites
INTEGER NSITES
- array of site ID's
INTEGER ID( MAX )
- array of site level counts
INTEGER NLVL( MAX )
- array of site X-locations
DOUBLE PRECISION X( MAX )
- array of site Y-locations
DOUBLE PRECISION Y( MAX )
- array of site Z-locations
DOUBLE PRECISION Z( MAX )
- array of data
REAL DATA( MXLVL, MAX, NLAYS, NVARS )
- Type 5:
grid-nest or multiple-grid is a data type
implemented largely as a test
of how extensible the system was in terms of new data types. Its
structure is somewhat similar to profile, except that each time
step has a potentially varying number of regular grids, each of
which has a time-dependent 2-D dimensionality, location, and cell
size. The description of the storage order (which is quite tedious)
is omitted here for the sake of brevity.
- Type 6:
sparse matrix uses so-called
"skyline-transpose representation" to store
sparse matrices for use by the new emissions model (and
possibly other programs that need it).
The data has one logical record per time step, as indicated
below, where MXROW is the number of rows in the matrix and
MXCOL is the maximum number of active columns per row.
- number of active cols per row
INTEGER NC( MXROW )
- subscripts for active cols
INTEGER IC( MXCOL, MXROW )
- coefficients for active cols
REAL CC( MXCOL, MXROW )
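As an illustration of how this representation is traversed (a
minimal sketch written against the declarations above, not a routine
belonging to the API itself), a matrix-vector product Y = M*X visits
only the active columns of each row:
SUBROUTINE MATVEC( MXCOL, MXROW, NC, IC, CC, X, Y )
INTEGER MXCOL, MXROW
INTEGER NC( MXROW )           !  number of active cols per row
INTEGER IC( MXCOL, MXROW )    !  subscripts for active cols
REAL    CC( MXCOL, MXROW )    !  coefficients for active cols
REAL    X( * )                !  input vector
REAL    Y( MXROW )            !  result vector Y = M*X
INTEGER R, K
DO R = 1, MXROW               !  for each row, accumulate over
Y( R ) = 0.0                  !  the active columns only:
DO K = 1, NC( R )
Y( R ) = Y( R ) + CC( K, R ) * X( IC( K, R ) )
END DO
END DO
RETURN
END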
D: Public Include-file Structures
There are three public INCLUDE files
in the FORTRAN interface to the I/O API. They are the following:
- PARMS3.EXT contains dimensioning parameters
and the standard file-type, coordinate-system-type,
"All Layers", etc., token values for the FORTRAN
interface to the I/O API.
- FDESC3.EXT contains FORTRAN data structures
(COMMONs) for a Models-3/EDSS I/O API file description, and is used
to give name syntax for passing file description data between routines
OPEN3 and DESC3 and their callers. Requires PARMS3.EXT for dimensioning.
- IODECL3.EXT contains declarations and usage
comments for the public routines in the FORTRAN I/O API.
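A typical declaration section for a program using the FORTRAN
interface therefore begins as follows (a sketch; how the INCLUDE
paths are resolved is up to the build environment):
INCLUDE 'PARMS3.EXT'        !  dimensioning parameters and tokens
INCLUDE 'FDESC3.EXT'        !  file-description COMMONs (uses PARMS3.EXT)
INCLUDE 'IODECL3.EXT'       !  declarations for the I/O API functions
...(declarations and code using the I/O API)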
E: Public Call Interfaces and Specifications
Except for INIT3(), which is an INTEGER
function, the routines
in the I/O API are LOGICAL functions which return .TRUE. exactly when they
succeed (and .FALSE. otherwise). In the examples below, the names
(FNAME for logical file name, VNAME for variable name, PNAME for program
name, CNAME for calling-routine's name) are CHARACTER*(*) of length at
most 16, STATUS is INTEGER, ARRAY is the output buffer for data access
routines, dates and times follow Models-3/EDSS conventions described above,
and LOGDEV is the INTEGER FORTRAN unit number for the program's log file.
From the functional point of view there are four groups of routines.
- INIT3(), OPEN3(), and DESC3() are related to initialization,
- READ3(), XTRACT3(), and INTERP3() are related to data
retrieval,
- WRITE3() is related to data storage, and
- SHUT3() is related to system shutdown.
Note that for time-independent
files , the date and time arguments are ignored by the data access
routines. Data sets are "stateless" in the sense that access
operations may be done in any (meaningful) order -- a given time step of
a variable may be read many times, time steps may be read or written in
reverse (or even random) order, etc.
Integer function INIT3() initializes the
entire state for the I/O API, and returns the unit number for the
log-file (which will be attached
to the file whose logical name is 'LOGFILE' if one exists, and to
standard output otherwise). INIT3() may (should) be called multiple
times by application routines and programs in order to get the log-file's
unit number. A typical call to INIT3() might look like the following:
LOGDEV = INIT3()
IF ( LOGDEV .LT. 0 ) THEN
...(can't proceed; probably couldn't open the log
... file. Stop the program.)
END IF
Logical function OPEN3 opens files
according to the requested status, and writes a file summary to the
program log . For those files opened
for writing, it sets the update info in the file header. May be called
multiple times with multiple files; if called repeatedly for a file
already open, it returns .TRUE. unless the request is for READ/WRITE
and the file is already open for READONLY. Legal values for STATUS
are given in PARMS3.EXT: 1 for READONLY, 2 for READ/WRITE/UPDATE
of existing files, 3 for READ/WRITE for new files, and 4 for READ/WRITE
of unknown (whether new or old) files. A typical call looks like:
IF( .NOT. OPEN3( FNAME, STATUS, PNAME ) ) THEN
...process the error: OPEN3 failed.
END IF
Logical function DESC3 puts all the
descriptive data for the specified file into the standard file
description data structures in
FDESC3.EXT .
A typical call looks like:
IF( .NOT. DESC3( FNAME ) ) THEN
...process the error: DESC3 failed.
END IF
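After a successful DESC3() call, the caller picks the description
apart from the FDESC3.EXT COMMONs; for example (the names NCOLS3D
and NROWS3D follow the FDESC3 naming conventions and should be
checked against FDESC3.EXT itself):
IF( .NOT. DESC3( FNAME ) ) THEN
...process the error: DESC3 failed.
END IF
WRITE( LOGDEV,* ) 'Grid is ', NCOLS3D, ' cols by ', NROWS3D, ' rows'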
Logical function INTERP3 provides
encapsulated read-and-time-interpolate functionality for gridded and
boundary data to EDSS programs. It reads enough data from the specified
file to interpolate all layers of the single specified variable to the
specified date and time, after checking that the specified record-size
is correct for that file. Internally it uses its own data buffers to
optimize the read-operations. Note that for time-independent data,
"interpolate" is taken to mean "copy" and the
date and time are irrelevant. A typical call looks like:
IF( .NOT. INTERP3( FNAME, VNAME, CNAME, DATE, TIME,
& RECSIZE, ARRAY ) ) THEN
...process the error: INTERP3 failed.
END IF
Logical function READ3 reads data from
the specified file for the specified date and time, variable, and layer.
If the file is a dictionary file, the variable name is used as the
dictionary-entry index. Tokens ALLAYS3 and ALLVAR3 from PARMS3.EXT
may be used to read all layers or all variables for the time step,
respectively. A typical call looks like:
IF( .NOT. READ3( FNAME, VNAME, LAYER, DATE, TIME, ARRAY ) ) THEN
...process the error: READ3 failed.
END IF
Logical function XTRACT3 reads data
from the specified gridded file for the specified date and time,
variable, and ranges of rows, columns, and layers. The row, column,
layer range may be shrunk down as far as a single cell, or may be
expanded to include the entire 3-D grid (although XTRACT3 may be
less efficient than READ3 at reading the entire grid). Token ALLVAR3
from PARMS3.EXT may be used to read all variables for the time
step. A typical call looks like:
IF( .NOT. XTRACT3( FNAME, VNAME, LAY0, LAY1, ROW0, ROW1,
& COL0, COL1, DATE, TIME, ARRAY ) ) THEN
...process the error: XTRACT3 failed.
END IF
Logical function WRITE3 writes either
an individual variable (for GRIDDED, BOUNDARY, or CUSTOM files only),
or an entire time step (all variables, all layers) of data for the
specified date and time to the specified file. To write an entire
time step, VNAME should be 'ALL'. A typical call looks like:
IF( .NOT. WRITE3( FNAME, VNAME, DATE, TIME, ARRAY ) ) THEN
...process the error: WRITE3 failed.
END IF
Logical function SHUT3 flushes all
open files to disk and then closes them. (Failure probably indicates
some unrecoverable file-system error, but the user at least should be
notified when that happens.) A typical call looks like:
IF( .NOT. SHUT3( ) ) THEN
... SHUT3 failed.
END IF
III. IMPLEMENTATION
The first two Models-3/EDSS I/O API implementations are built on top of
UCAR's
netCDF library.
It is largely a modeler-oriented wrapper around netCDF calls, and
constructs files with particular structure defined in terms of sets
of attributes as indicated above. For the most part, the implementation
is written in FORTRAN, and uses a number of lower-level subroutines to
manage the details of its operation. There is a matching set of C
routines, which are for the most part wrappers around the Fortran
routines. The interface consists of 65 FORTRAN-77
routines, 5 FORTRAN INCLUDE-files, 26 C routines, and three C include
files, with about 14000 lines of code. In three particular places
it was necessary to do multi-language programming for the Fortran bindings.
First, it was necessary to write wrappers callable from FORTRAN around the
getenv() and time() system calls in order
to evaluate logical names and to get the current wall-clock time.
In addition, because of the necessity to do dynamic memory allocation
for the buffers used by INTERP3 (which, it should be noted, requires
a more general notion of dynamic allocation than that available in
Fortran 90), it is implemented as one module written in two parts --
a FORTRAN part responsible for managing the file name and variable-name
interface, and a C part responsible for buffer management and interpolation.
We have also implemented C interfaces with semantics matching
the FORTRAN interface for use by graphics and analysis programs.
(Presently, some EDSS visualization programs use a C module which
directly calls the netCDF C API in order to read EDSS data sets --
a potential source of inconsistency as EDSS expands and develops further.)
IV. LIMITATIONS
A major limitation of the present implementation is the limits
imposed by 32-bit addressing within most UNIX file systems. Model
management and data indexing within Models-3 and EDSS would both be
far easier if it were possible to keep the outputs of entire episodes within
single files, rather than being forced to "chunk" the episodes
into shorter segments just to fit within the 2 GB limits of most file
systems (or the even more stringent necessity of fitting within
"small" (less than 1 GB) physical devices). Consider that
the primary output file for a single ozone episode might have the
following dimensions:
30 days, at
24 (hourly) time steps per day, for
60 variables, on a grid with
100 columns
100 rows
25 layers, for a total data volume of
43.2 GB, assuming single precision (4 bytes per number) storage.
NOTE: For hydrological applications, the I/O API
has been used for much larger data sets than these (and appropriately
designed I/O API based analysis and visualization tools were
routinely used with them):
33 years, at
4 (6-hourly) time steps per day, for
8 variables, on a grid with
2760 columns
3320 rows
1 layer, for a total data volume of
3.53 TB
Another major limitation has to do with massively parallel
supercomputers, for which the "correct" I/O semantics
is a matter of research as of this writing, rather than a matter
of settled practice.
NOTE added OCT. 24, 1997:
Various prototypes for domain
decomposition data parallel models have been implemented and we are
evaluating them as part of the MCNC Environmental Program's
Practical
Parallel Project.
V. FUTURE EXTENSIONS
A: Data Types
One obvious kind of future extension is in the set of data types
supported. There are several candidates, none of which is yet
sufficiently developed that we can specify them in detail. A first
candidate is new data types designed to better structure emissions
data in connection with EDSS improvements to emissions modeling.
A second candidate is a data type designed to deal with finite-element
or finite-volume data on unstructured meshes. A third candidate is
exchange-flux matrices to support air quality models incorporating
the results of generalized-chemistry research being performed by
Prof. Harvey Jeffries of the University of North Carolina at Chapel Hill.
B: Communication for Parallel Computing
NOTE added OCT. 24, 1997:
The following has been
implemented and we are evaluating it as part of the
Practical
Parallel Project.
Another possible kind of future extension is in structuring
communication and coordination for parallel programs. If the
I/O API had two modes -- a communications mode in addition to
the existing file storage mode -- it could be used to structure
well-engineered coupled models and parallel models in the
following fashion: In the communications mode, the read
operations must be selective by simulation-time (as they
are now), and must block (i.e., suspend the execution of
their calling process) until the data for the time requested
becomes available. One would then construct coupled or
parallel models by building an ordinary program for each
component, capable of execution as a stand-alone model when
the I/O API is used in file storage mode. When the programs
are executed at the same time, the coupled models would use
the communications mode of the I/O API to exchange data.
The scheduling for coupled models is performed implicitly
by the operating system (using the blocking nature of the
read operations to determine the order of execution), without
the developer having to construct an explicit scheduler for
the processes being simulated. This methodology for constructing
coupled models requires the right sort of underlying interprocess
communications tools upon which to build, and does incur the
corresponding communications overheads (which, one hopes, are
small in comparison to the computational overheads of the
component models themselves). However, it does seem to offer
several advantages:
- It supports good software engineering principles (modularization
and encapsulation), since each of the components must deal with
only a single sort of simulation.
- It makes for easier re-use of code, since each component is a
functioning environmental model in its own right.
- It leads to smaller and simpler software systems, since scheduling
is supplied by the operating system (and its interaction with the
I/O API), and the developer need not worry about interactions
between the component simulations.
- It provides for the decomposition of the modeling system into
explicitly parallel components (which may possibly be distributed
to different host machines, if the underlying communications layer
permits it). Hence it provides one approach to the use of MIMD
massively parallel machines.
One important potential application is the construction of
nested atmospheric models, possibly several levels of nesting
deep. In such a nest-model system, there is an explicit nest
interaction science-process module in all except the highest
resolution models, which is responsible at every model time
step for aggregating nest results over the model's grid, and
then broadcasting boundary conditions for all the models nested
within it. The remainder of the science process modules (and
the remainder of the individual models themselves) are otherwise
unchanged. The one requirement for synchronization is that the
high resolution models' time steps divide exactly into the time
step of the parent in which they are nested. If this approach
is used, the same component models could be used for both one-way
and two-way nesting (they need not even know whether they are
operating one-way or two-way!).
Another family of applications is the coupling of different types of
environmental models -- perhaps meteorology, emissions, and air quality
at first -- possibly at high-resolution time scales that are impractical
otherwise because the data volume would overwhelm all available disk
space if the data were stored there. If, however, the meteorology
data volume is kept in temporary communication memory rather than
on disk, the problem is avoided.
Another application of a communication mode of the I/O API might be to
use it to achieve domain-decomposition parallelism for the distributed
execution of environmental models: First, decompose the geographic
domain into subdomains. On each of the subdomains, run a copy of the
environmental model, and a master modeling program whose task is to
assemble the results from the subdomains into a coherent whole on the
entire domain, and then to broadcast boundary conditions to each of
the subdomain models. This may well be the paradigm by which we get
air quality models to efficiently use the resources of MPP machines
while at the same time writing well-engineered, maintainable systems.
Added 1997:
This extension, the
coupling mode of the
I/O API, has been developed under the aegis of the MCNC
Environmental Programs
Practical Parallel Computing Strategies
Project, a project partially funded by US EPA. It has proved
very useful for constructing coupled modeling systems, such as that
used for MCNC numerical air quality
forecasting and for coupled
hydrological-meteorological modeling.
C: Other Extensions
At some point, it might be worthwhile to implement C++ interfaces
with a full-blown class structure for files, variables, layers,
dates and times, etc., which fully supports the structure of the
data. Since the requirements analysis and the design were object
based (with inheritance implemented in terms of call hierarchy and
"cut, paste, and edit" instead of the implementation language),
it should be possible to do so. It would, however, be a nontrivial task :-).
Added Nov. 2001: MCNC Environmental Modeling
Center has prototyped a
geospatial-element cell complex
(GECC) datatype that efficiently supports both
(time-stepped and time-independent) geospatial coverages and finite
element data, on cell complexes with either time-stepped or
time-independent node coordinates.