Accessing Data#

Overview#

Several datasets are of interest when working with pywrdrb, including:

  • Pywr-DRB output data ("output") : Results from pywrdrb simulations (e.g., streamflow at model nodes, reservoir storage, flow targets, etc.)

  • Observation data ("obs") : Measured USGS gauge streamflow and reservoir storage timeseries

  • Hydrologic model data ("nhmv10", "nwmv21") : Outputs from external hydrologic models such as NHMv1.0 or NWMv2.1.

The pywrdrb.Data class provides methods for loading and accessing all of these datatypes.

pywrdrb.Data() Functionality#

Arguments#

The keyword arguments are:

  • output_dir (str): Directory for output files.

  • input_dir (str): Directory for input files.

  • datatypes (list): List of data types to load. Options: ‘output’, ‘obs’, ‘nhmv10’, ‘nwmv21’ (default []).

  • results_sets (list): List of results sets to load (default [])

  • output_filenames (list): List of pywrdrb output files to load, including their paths. Only necessary when ‘output’ is in datatypes (default None).

  • units (str): Units for the results. (default ‘MG’).

  • print_status (bool): Whether to print status updates during load (default False).

Keyword arguments can be provided either at initialization or when calling the load() method, i.e.:

data = pywrdrb.Data(**kwargs)
data.load()

or:

data = pywrdrb.Data()
data.load(**kwargs)
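
The two call patterns can be pictured with a small stand-in class. This is only a sketch, not the pywrdrb implementation: the class name SketchData and the rule that arguments passed to load() override earlier values are assumptions made for illustration. The defaults are copied from the argument list above, with the directory arguments omitted.

```python
class SketchData:
    """Hypothetical stand-in illustrating init-or-load keyword handling."""

    # Defaults copied from the argument list above (directories omitted)
    DEFAULTS = {
        "datatypes": [],
        "results_sets": [],
        "output_filenames": None,
        "units": "MG",
        "print_status": False,
    }

    def __init__(self, **kwargs):
        # Start from the defaults, then apply anything given at initialization
        self.options = {**self.DEFAULTS, **kwargs}

    def load(self, **kwargs):
        # Assumption: arguments given to load() override earlier values
        self.options.update(kwargs)
        return self.options


# Pattern 1: keyword arguments at initialization
a = SketchData(datatypes=["obs"], results_sets=["major_flow"])
a.load()

# Pattern 2: keyword arguments at load time
b = SketchData()
b.load(datatypes=["obs"], results_sets=["major_flow"])

print(a.options == b.options)
```

Under these assumptions, both patterns end with the same set of options.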

A note on results_sets#

The results_sets option specifies which types of variables to load. For example, to access streamflow at major Pywr-DRB model nodes, use results_sets = ['major_flow']. Similarly, results_sets = ["res_storage"] returns reservoir storage timeseries.

You can find descriptions of each results_set type in pywrdrb.utils.results_sets.

Different datatypes have different valid results_set options.

  • "output" data can access all results_set options

  • "obs" data can only access results_set in ['major_flow', 'reservoir_downstream_gage', 'res_storage']

  • "nhmv10" and "nwmv21" can only access results_set in ['major_flow', 'reservoir_downstream_gage']

Fortunately, the pywrdrb.Data class will identify valid results_set options for each datatype, and only process those valid options.
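
The filtering described above can be sketched as a lookup table plus a small helper. The names VALID_RESULTS_SETS and filter_results_sets are hypothetical and are not part of pywrdrb. Only the lists of valid options come from this page.

```python
# Valid results_sets per datatype, as listed above.
# "output" accepts every results_set, so it is represented by None here.
VALID_RESULTS_SETS = {
    "output": None,
    "obs": ["major_flow", "reservoir_downstream_gage", "res_storage"],
    "nhmv10": ["major_flow", "reservoir_downstream_gage"],
    "nwmv21": ["major_flow", "reservoir_downstream_gage"],
}


def filter_results_sets(datatype, requested):
    """Return the requested results_sets that are valid for a datatype."""
    valid = VALID_RESULTS_SETS[datatype]
    if valid is None:
        return list(requested)
    return [rs for rs in requested if rs in valid]


requested = ["major_flow", "res_storage"]
for datatype in ["output", "obs", "nhmv10"]:
    print(datatype, filter_results_sets(datatype, requested))
```

With this request, "output" and "obs" keep both results sets, while "nhmv10" keeps only 'major_flow'.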

Accessing Data#

Given that there are multiple datasets of interest, the pywrdrb.Data class uses a hierarchical structure for storing this data within the pywrdrb.Data object.

After calling pywrdrb.Data().load(), results are stored on the object as nested dictionaries, where results_set is the name of a loaded results set (e.g., major_flow):

data = pywrdrb.Data()
data.load(**kwargs)

data.results_set[datatype_label][scenario_number] -> pd.DataFrame
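
To make the nesting concrete, the sketch below walks a plain dictionary with the same shape. Placeholder strings stand in for the pd.DataFrame objects, and list_entries is a hypothetical helper, not a pywrdrb function.

```python
def list_entries(results_set):
    """Return (datatype_label, scenario_number) pairs in a nested results dict."""
    return [
        (label, scenario)
        for label, scenarios in results_set.items()
        for scenario in scenarios
    ]


# Same shape as data.major_flow; strings stand in for pd.DataFrame objects
major_flow = {
    "obs": {0: "<DataFrame>"},
    "nhmv10": {0: "<DataFrame>"},
}

print(list_entries(major_flow))

# Indexing follows results_set[datatype_label][scenario_number]
print(major_flow["obs"][0])
```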

A note on the dataset labeling…#

In the pywrdrb.Data class, the dataset_label for "output" datatypes is derived from the output filenames. This label serves as a key in the final stored data, allowing users to identify and access results associated with custom pywrdrb simulations or scenarios.

For example, if you run a pywrdrb simulation and store the results in output_file = "<path>/my_simulation.hdf5", then these simulation results can be accessed using:

data = pywrdrb.Data()
data.load(
    datatypes = ['output'],
    output_filenames = [output_file],
    results_sets = ['major_flow']
    )

# Access custom output data
data.major_flow["my_simulation"][0]
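
As a rough illustration of how a label can follow from a filename, the sketch below strips the directory and extension using pathlib. This is an assumption consistent with the my_simulation example above, not necessarily how pywrdrb derives the label. The helper name label_from_filename and the example path are hypothetical.

```python
from pathlib import Path


def label_from_filename(output_filename):
    """Return the filename without its directory or extension."""
    return Path(output_filename).stem


# Hypothetical path, written with forward slashes
print(label_from_filename("results/my_simulation.hdf5"))
```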

# Set path to access pywrdrb
import sys
path_to_pywrdrb = '../'
sys.path.append(path_to_pywrdrb)


from pywrdrb import Data
from pywrdrb.utils.directories import output_dir

# List of datatypes to load
# Options: 'obs', 'output', 'nhmv10', 'nwmv21'
datatypes = ['output', 'obs', 'nhmv10']

# List of pywrdrb output files, with path
# NOTE: This is only necessary if 'output' is in datatypes
output_files = [
    rf"{output_dir}drb_output_nhmv10.hdf5",
]

# List of results sets to load 
# See pywrdrb.utils.results_sets for available results sets
results_sets = [
    'major_flow',
    'res_storage'
]
# Initialize Data object and load data
data = Data(print_status=True)

data.load(datatypes=datatypes, 
          output_filenames=output_files, 
          results_sets=results_sets
          )

# show an example of the major_flow results set for `drb_output_nhmv10` output file
data.major_flow['drb_output_nhmv10'][0].head(2)

Loading output data...
Loading major_flow data from drb_output_nhmv10
Loading res_storage data from drb_output_nhmv10
Loading obs data...
Loading major_flow data from observations
Loading res_storage data from observations
Loading nhmv10 data...
Loading major_flow data from nhmv10

              01417000    01425000    01433500    01436000     01447800   01449800   01463620    01470960   delDRCanal  delLordville  delMontague   delTrenton  outletAssunpink  outletSchuylkill
1983-10-01  245.248604  192.037241  702.126510  216.100859  1296.612713  30.641998  77.324137  981.138761   872.861166    626.737443  1050.155248   804.943656       110.874962        329.747828
1983-10-02  157.896171  150.911779  716.300505   60.483957  1298.492457  32.867673  77.732775  982.026219  1837.957440    514.753462  2386.712148  1772.532316       132.337899        688.332292

# check the datatype keys stored in the data object
list(data.major_flow.keys())

['drb_output_nhmv10', 'obs', 'nhmv10']

import matplotlib.pyplot as plt

plt.figure(figsize=(4,4))
plt.scatter(data.major_flow['drb_output_nhmv10'][0].loc[:, 'delTrenton'], 
            data.major_flow['nhmv10'][0].loc[:, 'delTrenton'], alpha=0.3)
plt.ylabel('NHMv1.0 flow at Trenton')
plt.xlabel('Pywr-DRB simulated flow using NHMv1.0 as input')
plt.yscale('log')
plt.xscale('log')
plt.show()

[Figure: scatter plot of Pywr-DRB simulated flow using NHMv1.0 as input (x-axis) versus NHMv1.0 flow at Trenton (y-axis), both on log scales]

Exporting custom datasets#

Once a pywrdrb.Data object has been populated with data, you can export its full contents to a custom HDF5 file. This custom dataset can then be re-loaded later without repeating the original pywrdrb.Data.load() workflow shown above.

The code block below shows how the previously generated data object can be exported and re-loaded.

# Custom dataset filename
custom_datafile = rf"{output_dir}custom_dataset.hdf5"

# Export the data object to a custom dataset
data.export(custom_datafile)

# Make a new data object and load the custom dataset
data2 = Data(print_status=True)
data2.load_from_export(custom_datafile)


# Show that the data object contents are the same
data2.major_flow.keys() == data.major_flow.keys()
data2.major_flow['drb_output_nhmv10'][0].equals(data.major_flow['drb_output_nhmv10'][0])
True

Technical detail (To be completed…)#

The pywrdrb.Data class relies on several underlying classes and functions:

  • AbstractDataLoader: The base class defining shared methods for parsing arguments, validating datasets, and handling attributes.

    • Output(AbstractDataLoader): Loads Pywr-DRB simulation results.

    • Observation(AbstractDataLoader): Loads observational data.

    • HydrologicModelFlow(AbstractDataLoader): Loads hydrologic model outputs.

  • get_pywrdrb_results(): Handles loading results from simulation output files; used in Output(AbstractDataLoader).

  • get_base_results(): Loads observation or hydrologic model data; used in Observation and HydrologicModelFlow classes.
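
The relationship between the base class and its loaders can be sketched with Python's abc module. Only the class names AbstractDataLoader and Observation come from the list above. The method set_data, the constructor signature, and the placeholder data are assumptions for illustration and may not match the actual pywrdrb code.

```python
from abc import ABC, abstractmethod


class AbstractDataLoader(ABC):
    """Sketch of a base class holding logic shared by all loaders."""

    def __init__(self, results_sets=None):
        self.results_sets = list(results_sets or [])

    def set_data(self, results):
        # Hypothetical helper: store each results_set as an attribute,
        # e.g. self.major_flow
        for name, value in results.items():
            setattr(self, name, value)

    @abstractmethod
    def load(self):
        """Each subclass fetches its own data and stores it."""


class Observation(AbstractDataLoader):
    def load(self):
        # Placeholder standing in for a call such as get_base_results()
        results = {rs: {"obs": {0: "<DataFrame>"}} for rs in self.results_sets}
        self.set_data(results)


obs = Observation(results_sets=["major_flow"])
obs.load()
print(obs.major_flow)
```

Declaring load() as abstract means the base class cannot be instantiated directly, so every loader must supply its own loading logic while sharing the attribute handling.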

Output Class#

The Output class consolidates the loading process into a single load() method, automatically validating inputs, handling scenarios, and storing results inside the Output object.

Output.load() uses get_pywrdrb_results() to fetch data for all specified models and results_sets. It manages datetime indexing and scenarios, and then stores the results as attributes within the class.

pywrdrb.post.get_pywrdrb_results()#

get_pywrdrb_results() reads data from Pywr-DRB output files (HDF5) stored in the output directory and returns a dictionary of results along with a datetime index.

This function is designed to handle different result types (results_sets), such as "res_storage", "major_flow", "res_release", and more. This function makes it easy to extract specific variables, but it requires specifying each results_set separately and can be slow when constructing date indexes.

Observation Class#

TBC.

HydrologicModelFlow Class#

TBC.