Preparing Input Data#
Overview:#
Pywr-DRB is designed to simulate water resource operations using multiple different input datasets.
To perform a simulation with a given streamflow dataset, some pre-processing steps are required to prepare the other necessary model inputs.
This notebook is designed to help demonstrate this preprocessing workflow.
Links:#
NHMv1.0 Source Dataset
NWMv2.1 Dataset
Pre-requisite Reading:#
Supplemental information for Hamilton, Amestoy and Reed (2024)
Content:#
The Input-Data-Retrieval repository
NHMv1.0 and NWMv2.1 data retrieval
Inflow scaling regression
Pywr-DRB Preprocessing
Subtract upstream catchment inflows
Disaggregate DRBC demands
Extrapolate NYC and NJ diversions
Predict inflows and diversions
Key scripts
1.0 The Input-Data-Retrieval repository#
This repository is located under the Pywr-DRB GitHub organization page. It was designed to store the code needed to extract and retrieve streamflow from the National Hydrologic Model version 1.0 (NHMv1.0; nhmv10) and the National Water Model version 2.1 (NWMv2.1; nwmv21).
Since its creation, it has gradually come to house more of the input data preparation process. Data prepared and stored here now includes:
USGS gauge streamflow data which is retrieved from the NWIS.
NHMv1.0 (NHM-PRMS) modeled streamflow data which is extracted from a CONUS scale dataset.
NWMv2.1 modeled streamflow which is extracted from a CONUS scale dataset.
Scaled gauge streamflow, which uses NHMv1.0 and NWMv2.1 to estimate gauge-to-catchment scaling relationships at some reservoirs in the DRB.
These data are stored in the datasets/ folder of this repository, within the subfolders USGS/, NHMv10/, and NWMv21/, respectively. Some of these streamflow sets are also exported directly to the Pywr-DRB/input_data/ folder.
The work done in this repository is not meant to be replicated by a new Pywr-DRB user, partly because the original NHM and NWM source datasets are very large and inconvenient to download. Instead, all of the data necessary for running Pywr-DRB is already stored in the Input-Data-Retrieval/datasets/ folder and the Pywr-DRB/input_data/ folder.
However, there may be times when a user wants to add new nodes to the Pywr-DRB model that are not currently available in datasets/. In that case, they will need to go through the data retrieval process themselves. Regardless of whether you need to replicate the work, it is important to understand how the input data is sourced.
1.1 NHMv1.0 and NWMv2.1 data retrieval:#
Before we can extract any NHM/NWM streamflow data, we need to download the sources. These are publicly available datasets published at the links below:
National Hydrologic Model Precipitation-Runoff Modeling System (NHM-PRMS; NHMv1.0) Daily Streamflow (92GB): Hay, L.E., and LaFontaine, J.H., 2020, Application of the National Hydrologic Model Infrastructure with the Precipitation-Runoff Modeling System (NHM-PRMS),1980-2016, Daymet Version 3 calibration: U.S. Geological Survey data release, https://doi.org/10.5066/P9PGZE0S. Download here: ScienceBase
National Water Model V2.1 (NWMv2.1) NWIS Retrospective (2GB): Blodgett, D.L., 2022, National Water Model V2.1 retrospective for selected NWIS gage locations, (1979-2020): U.S. Geological Survey data release, https://doi.org/10.5066/P9K5BEJG. Download here: ScienceBase
Pause: Take some time here and go to the NHM and NWM links above. No need to download anything, but read the abstract summary of each of the datasets. Put together some basic notes about each of the different datasets. What are their timespans? What information is used to generate these model predictions? Anything else you think is interesting?
1.2 Inflow Scaling Regression#
For some reservoirs, we have observation gauges on the inflow streams upstream of the reservoir. However, these gauges might not capture all of the inflow entering the reservoir, especially for a large reservoir like Pepacton with many different inflow streams. In these cases, we want to scale the gauged inflow data to estimate the total inflow.
Goal: Modify the observed inflow data to improve the mass-balance estimate of total inflow, while keeping the daily accuracy of the observed timeseries.
We can use the hydrologic models, NHM/NWM, to estimate the scaling relationship.
In NHM/NWM, the total inflow into the reservoir is modeled at the catchment outlet, termed the hydrologic response unit (\(HRU\)). This flow can be represented as the sum of upstream flows at modeled points of interest (POI) multiplied by some daily scaling coefficient (\(c_t\)):
$\(Q_t = c_t\sum^n_i q_{i,t}\)$
Using the NHM/NWM model output, we can calculate the scaling coefficient for every day as the ratio:
$\(c_t = \frac{Q_t}{\sum^n_i q_{i,t}}\)$
Ultimately, the goal is to use the value of \(c_t\) to scale the observed flow timeseries at the gauges, such that we can estimate the true inflow into the reservoir:
$\(\hat{Q}_t = c_t\sum^n_i q_{i,t}\)$
where \(q_{i,t}\) are now the observed gauge flows.
However, we cannot use the value of \(c_t\) from the hydrologic models directly, since these don’t align well with the observed series, and we want to prioritize observed data here.
Linear regression models are used to predict the daily inflow scaling coefficient (\(\hat{c_t}\)) for a specific season \(s\) and day \(t\) using the relationship:
$\(\hat{c_t} = \beta_{0,s} + \beta_{1,s} x_t\)$
where \(\beta_{0,s}\) is the regression intercept for season \(s\), \(\beta_{1,s}\) is the regression slope for season \(s\), and the predictor \(x_t\) is the log of the sum of the \(k\)-day rolling mean observed flows upstream of the reservoir. The value of \(k\) days was determined through trial and error, with \(k=3\) resulting in the best model fit when considering the model’s R-squared values.
$\(x_t = \log(\sum^n_i \bar{q}_{i,t})\)$
Prior to fitting the linear regression models, the hydrologic model (NHM or NWM) flow estimates are used to calculate \(X = \set{x_1,\dots,x_t}\) and \(C = \set{c_1, \dots, c_t}\). These data are then used as model training data. Once the regression models have been fit, the observed gage flows are used to calculate observed \(x_t\) which is used to predict \(\hat{c_t}\). It is assumed that inflows into the reservoir are always greater than or equal to the observed gauge inflow, and the scaling coefficient is capped at a lower bound such that \(\hat{c_t} \geq 1\). The final estimate of the total inflow into the reservoir during timestep \(t\) is then the product of the estimated scaling coefficient multiplied by the sum of observed inflows on that day: $\(\hat{Q}_t = \hat{c_t}\sum_{i}^{n} q_{i,t}\)$
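The regression workflow above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the function names, the season encoding, and the use of np.polyfit for the seasonal fits are all assumptions.

```python
import numpy as np
import pandas as pd

def fit_seasonal_scaling_models(hru_flow, poi_flows, k=3):
    """Fit one linear model per season predicting c_t from x_t.

    hru_flow:  pd.Series of modeled total reservoir inflow (HRU outlet).
    poi_flows: pd.DataFrame of modeled flows at upstream points of interest.
    k:         rolling-mean window (days) used to smooth the predictor.
    """
    total_poi = poi_flows.sum(axis=1)
    c = hru_flow / total_poi                       # daily scaling coefficient
    x = np.log(total_poi.rolling(k).mean())        # predictor x_t
    season = (hru_flow.index.month % 12) // 3      # 0=DJF, 1=MAM, 2=JJA, 3=SON
    models = {}
    for s in range(4):
        mask = (season == s) & x.notna() & c.notna()
        slope, intercept = np.polyfit(x[mask], c[mask], 1)
        models[s] = (intercept, slope)             # (beta_0s, beta_1s)
    return models

def predict_scaled_inflow(models, obs_poi_flows, k=3):
    """Apply fitted models to observed gauge flows to estimate total inflow."""
    total_obs = obs_poi_flows.sum(axis=1)
    x = np.log(total_obs.rolling(k).mean())
    season = (total_obs.index.month % 12) // 3
    c_hat = pd.Series(
        [models[s][0] + models[s][1] * xv for s, xv in zip(season, x)],
        index=total_obs.index,
    )
    c_hat = c_hat.clip(lower=1.0)  # total inflow >= observed gauge inflow
    return c_hat * total_obs       # Q_hat_t = c_hat_t * sum of observed flows
```

Note the lower bound of 1 on the predicted coefficient, matching the assumption that reservoir inflow is never less than the sum of observed gauge inflows.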
We repeat this process using both NHM and NWM, so that we end up with two different versions of the scaled inflow data, which uses scaling information from each of the respective models.
Currently, the reservoirs with scaled inflows are:
Cannonsville
Pepacton
Neversink
FE Walter
Beltzville
The create_hybrid_modeled_observed_datasets() function takes the scaled inflow data and combines it with the raw NHM/NWM data. This function exports the gage_flow_{nhm/nwm}_withObsScaled and catchment_inflow_{nhm/nwm}_withObsScaled CSV files to the Pywr-DRB/input_data/ directory so that they can be used for simulation.
The following code will create the hybrid dataset:
import sys
path_to_pywrdrb = '../'
sys.path.append(path_to_pywrdrb)
from pywrdrb.pre.prep_input_data_functions import read_csv_data, create_hybrid_modeled_observed_datasets
from pywrdrb.utils.directories import input_dir
start_date = '1983/10/01'
end_date = '2016/12/31'
# Read in the NHM modeled streamflow data
df_nhm = read_csv_data(f'{input_dir}modeled_gages/streamflow_daily_nhmv10_mgd.csv',
start_date, end_date, units = 'mgd', source = 'nhm')
# Create the hybrid dataset using scaled observed inflows where available
# and the NHM modeled streamflow data where needed
create_hybrid_modeled_observed_datasets('nhmv10', df_nhm.index)
2.0 Pywr-DRB Preprocessing#
The following sections outline each of the pre-processing steps necessary to prepare the input data for Pywr-DRB. Some of these processes must be repeated for any new dataset (e.g., subtract upstream catchment inflows) while some others are only done once and can be used for multiple datasets (e.g., demand data).
2.1 Subtract upstream catchment inflows#
Pywr-DRB is designed to take marginal catchment inflows as model inputs. However, most datasets contain total streamflows, so we designed a function to calculate the marginal catchment flow from the total flow at different nodes.
The pywrdrb.pre.prep_input_data_functions.py module contains this function, called subtract_upstream_catchment_inflows(). This function takes a pd.DataFrame containing total flow at each of the Pywr-DRB nodes and iteratively removes upstream flows.
Two important pieces of information used in this function are the upstream_nodes_dict and the downstream_node_lags, which are stored in pywrdrb.utils.pywr_drb_node_data. Take a look at both of these before going further.
from pywrdrb.utils.pywr_drb_node_data import upstream_nodes_dict, downstream_node_lags
def subtract_upstream_catchment_inflows(inflows):
"""
Subtracts upstream catchment inflows from the input inflows timeseries.
Inflow timeseries are cumulative. For each downstream node, this function subtracts the flow into all upstream nodes so
that it represents only the direct catchment inflows into this node. It also accounts for time lags between distant nodes.
Args:
inflows (pandas.DataFrame): The inflows timeseries dataframe.
Returns:
pandas.DataFrame: The modified inflows timeseries dataframe with upstream catchment inflows subtracted.
"""
inflows = inflows.copy()
for node, upstreams in upstream_nodes_dict.items():
for upstream in upstreams:
lag = downstream_node_lags[upstream]
if lag > 0:
inflows[node].iloc[lag:] -= inflows[upstream].iloc[:-lag].values
### subtract same-day flow without lagging for first lag days, since we don't have data before 0 for lagging
inflows[node].iloc[:lag] -= inflows[upstream].iloc[:lag].values
else:
inflows[node] -= inflows[upstream]
### if catchment inflow is negative after subtracting upstream, set to 0
inflows[node].loc[inflows[node] < 0] = 0
### delTrenton node should have zero catchment inflow because coincident with DRCanal
### -> make sure that is still so after subtraction process
inflows['delTrenton'] *= 0.
return inflows
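To make the lag-and-subtract logic of subtract_upstream_catchment_inflows() concrete, here is a self-contained toy example. The two node names, the topology dictionaries, and the 1-day lag are hypothetical, not real Pywr-DRB nodes; the whole-column assignment is a deliberate change from the function above to sidestep pandas chained-assignment pitfalls.

```python
import pandas as pd

upstream_nodes_dict = {'downstreamB': ['upstreamA']}   # hypothetical topology
downstream_node_lags = {'upstreamA': 1}                # hypothetical lag (days)

inflows = pd.DataFrame({
    'upstreamA':   [10.0, 12.0, 11.0, 13.0],   # already a marginal inflow
    'downstreamB': [25.0, 28.0, 27.0, 30.0],   # total flow, includes upstream
}, index=pd.date_range('2000-01-01', periods=4, freq='D'))

marginal = inflows.copy()
for node, upstreams in upstream_nodes_dict.items():
    for upstream in upstreams:
        lag = downstream_node_lags[upstream]
        vals = marginal[node].to_numpy().copy()
        ups = marginal[upstream].to_numpy()
        if lag > 0:
            vals[lag:] -= ups[:-lag]   # subtract lagged upstream flow
            vals[:lag] -= ups[:lag]    # first `lag` days: same-day subtraction
        else:
            vals -= ups
        # Whole-column assignment avoids SettingWithCopyWarning-style issues
        marginal[node] = pd.Series(vals, index=marginal.index).clip(lower=0)

# marginal['downstreamB'] is now [15.0, 18.0, 15.0, 19.0]
```

On day 2, for example, the marginal inflow is 27 minus day 1's upstream flow of 12, giving 15.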
2.2 Disaggregate DRBC demands#
The disaggregate_DRBC_demands() function is designed to disaggregate DRBC (Delaware River Basin Commission) water demand data to align with Pywr-DRB catchments.
The core of this process lies in the application of geometric operations to align and adjust catchment areas, followed by the statistical disaggregation of water demand data based on the proportional areas of these catchments. The method effectively redistributes aggregated data to a finer spatial resolution aligned with the model’s hydrological catchment areas.
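The proportional-area redistribution at the heart of this step can be sketched in a few lines. The region name, catchment names, overlap areas, and demand value below are all hypothetical; the real script derives the overlap areas from geometric operations on the actual DRBC and Pywr-DRB boundaries.

```python
# Aggregated demand (MGD) reported for a hypothetical DRBC planning region
aggregated_demand = {'regionX': 120.0}

# Hypothetical overlap areas (sq mi) of each model catchment with regionX,
# as would be produced by intersecting the two sets of polygons
overlap_area = {'catchA': 30.0, 'catchB': 50.0, 'catchC': 20.0}

# Split the regional demand among catchments in proportion to overlap area
total_area = sum(overlap_area.values())
catchment_demand = {
    catch: aggregated_demand['regionX'] * area / total_area
    for catch, area in overlap_area.items()
}
# catchA: 36.0, catchB: 60.0, catchC: 24.0 (sums back to 120.0)
```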
This is only done once, and the same demand data is currently used for each simulation.
See the pywrdrb.pre.disaggregate_DRBC_demands.py script to see this in action.
2.3 Extrapolate NYC and NJ diversions#
The NYC diversions from Cannonsville, Pepacton, and Neversink Reservoirs into Rondout Reservoir and the Delaware Aqueduct from 2000-2021 were acquired from ODRM (Office of the Delaware River Master, 2021). The NJ diversion is approximated as the Delaware and Raritan Canal gage flow at Port Mercer, NJ, which has observations from 1991-present. A two-step extrapolation approach is used to fill the NYC 1983-1999 period and the NJ 1983-1990 period. The two-step procedure combines linear regression and nearest neighbor bootstrapping to extrapolate the diversion timeseries based on the observed relationship between diversions and basin inflows.
The first step in the extrapolation procedure is to build a linear regression model that predicts total monthly diversions as a function of total log monthly inflows. A separate regression model is built for each season (December-February, March-May, June-August, September-November) due to differences in seasonal water use patterns.
The next step is to disaggregate the monthly diversions into daily values using a nearest neighbor bootstrapping approach. For each monthly prediction in the extrapolation period, we find the nearest neighbor from the training period in the 2D space of monthly log-inflows vs. diversions using the Euclidean norm. The daily diversion profile from this neighbor is used to disaggregate the predicted monthly diversions into daily diversions.
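The nearest-neighbor disaggregation step can be sketched as follows. The training data, the two illustrative monthly profiles, and the function name are hypothetical; in practice there would be one (log-inflow, diversion, daily-profile) triple per training month.

```python
import numpy as np

# Hypothetical training months: (log monthly inflow, monthly diversion,
# daily profile of fractions that sum to 1)
training = [
    (6.2, 400.0, np.array([0.04] * 10 + [0.03] * 20)),  # front-loaded month
    (5.8, 350.0, np.array([1 / 30.0] * 30)),            # uniform month
]

def disaggregate_month(pred_log_inflow, pred_monthly_diversion, training):
    """Find the nearest training month in (log-inflow, diversion) space by
    Euclidean distance, then reuse its daily profile to split the predicted
    monthly diversion into daily values."""
    dists = [np.hypot(x - pred_log_inflow, d - pred_monthly_diversion)
             for x, d, _ in training]
    _, _, profile = training[int(np.argmin(dists))]
    profile = profile / profile.sum()        # ensure fractions sum to 1
    return pred_monthly_diversion * profile  # daily diversion timeseries

daily = disaggregate_month(5.9, 360.0, training)
```

Here the predicted month is closest to the second training month, so its uniform profile is used and the 360-unit monthly diversion is spread evenly across 30 days.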
Since diversions depend on streamflow, this extrapolation can be done uniquely for every dataset. In practice, we currently create only one version, using USGS observation data as the predictor for diversions.
See the pywrdrb.pre.extrapolate_NYC_NJ_diversions.py script to see this in action.
2.4 Predict inflows and diversions#
When implementing the FFMP operations, the NYC reservoirs must consider future flow conditions at Trenton, since releases from the NYC reservoirs take roughly 4 days to reach Trenton.
To help with this, we first generate predictions of the future streamflow 4, 3, 2, and 1-day ahead in time. These predictions are then used during simulation.
The pywrdrb.pre.predict_inflows_diversions.py script uses a statistical approach for predicting future timeseries data, specifically catchment inflows and interbasin diversions. The focus is on using linear regression techniques to project future values based on historical data.
The core method involves predicting future time series (catchment inflows or diversions) using linear regression. The prediction is based on a lagged relationship: the future value at a time t+lag is predicted using the value at time t. Options are included for using a logarithmic transformation of the data, removing zero values, and adding a constant term in the regression model.
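A minimal sketch of the lagged-regression idea is below. The function names and the use of np.polyfit are illustrative assumptions; the actual script also supports removing zero values and omitting the constant term.

```python
import numpy as np
import pandas as pd

def fit_lag_model(series, lag, use_log=True):
    """Return (slope, intercept) for predicting series[t + lag] from series[t],
    optionally fitting in log space."""
    y = series.iloc[lag:].to_numpy()   # future values
    x = series.iloc[:-lag].to_numpy()  # current values, `lag` steps earlier
    if use_log:
        x, y = np.log(x), np.log(y)
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def predict_ahead(value_t, lag_model, use_log=True):
    """Predict the value `lag` days ahead from today's value."""
    slope, intercept = lag_model
    x = np.log(value_t) if use_log else value_t
    yhat = slope * x + intercept
    return np.exp(yhat) if use_log else yhat
```

In the real workflow, one such model is fit for each needed lead time (1, 2, 3, and 4 days ahead) and each predicted quantity.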
This must be done for each dataset individually.
3.0 Key scripts#
All of the pre-processing steps described above are executed by a single script: pywrdrb.prep_input_data.py.
Rather than explaining this process here, I think it is best to leave it to you to go through this code and take notes on the workflow.
Activity:#
Assume that you want to use a new streamflow dataset for Pywr-DRB.
You have a dataset with total streamflow at each of the different locations in Pywr-DRB ready to go. However, you need to go through some pre-processing steps before you can run any simulations.
Write out each step in the preprocessing workflow necessary to set you up for a new Pywr-DRB simulation. At each step, make notes on what other data is necessary and how it is used.
The pywrdrb.prep_input_data.py workflow will be helpful here.