Data Import

library(amerifluxr)
library(pander)

amerifluxr is a programmatic interface to the AmeriFlux. This vignette demonstrates examples to import data and metadata downloaded from AmeriFlux, and to parse and clean data for further use. A companion vignette for site selection is available as well.

Download data

AmeriFlux data and metadata can be downloaded using amf_download_base() and amf_download_bif(). Users will need to create a personal AmeriFlux account here before download.

The following downloads AmeriFlux flux/met data (aka BASE data product) from a single site: US-CRT.

## When running, replace user_id and user_email with a real AmeriFlux account
floc2 <- amf_download_base(
  user_id = "my_user",
  user_email = "my_email@mail.com",
  site_id = "US-CRT",
  data_product = "BASE-BADM",
  data_policy = "CCBY4.0",
  agree_policy = TRUE,
  intended_use = "other",
  intended_use_text = "amerifluxr package demonstration",
  verbose = TRUE,
  out_dir = tempdir()
)

The downloaded file is a zipped file saved in tempdir() (e.g., AMF_{SITE_ID}_BASE-BADM_{VERSION}.zip), which contains a BASE data file (e.g., AMF_{SITE_ID}_BASE_{RESOLUTION}_{VERSION}.csv, RESOLUTION = HH (half-hourly) or HR (hourly)) and a metadata file (aka BADM data product, e.g., AMF_{SITE_ID}_BIF_{VERSION}.xlsx). The amf_download_base() also returns the file path to the downloaded file, which can be used later to read the file into R.

The following downloads a single file containing all AmeriFlux sites’ metadata (i.e., BADM data product) for sites under the CC-BY-4.0 data use policy.

## When running, replace user_id and user_email with a real AmeriFlux account
floc1 <- amf_download_bif(
  user_id = "my_user",
  user_email = "my_email@mail.com",
  data_policy = "CCBY4.0",
  agree_policy = TRUE,
  intended_use = "other",
  intended_use_text = "amerifluxr package demonstration",
  out_dir = tempdir(),
  verbose = TRUE,
  site_w_data = TRUE
)

The downloaded file is a Excel file saved to tempdir() (e.g., AMF_{SITES}_BIF_{POLICY}_{VERSION}.xlsx, SITES = AA-Net (all registered sites) or AA-Flx (all sites with flux/met data available); POLICY = CCBY4 (shared under AmeriFlux CC-BY-4.0 data use policy) or LEGACY (shared under AmeriFlux Legacy data use policy)). Similarly, the amf_download_bif() also returns the file path to the downloaded file, which can be used later to read the file into R.

For this vignette, we will use following example data files [files are truncated to limit package size, for demonstration purposes only].

# An example of BASE zipped files downloaded for US-CRT site
floc2 <- system.file("extdata", "AMF_US-CRT_BASE-BADM_2-5.zip", package = "amerifluxr")

# An example of unzipped BASE files from the above zipped file
floc3 <- system.file("extdata", "AMF_US-CRT_BASE_HH_2-5.csv", package = "amerifluxr")

# An example of all sites' BADM data
floc1 <- system.file("extdata", "AMF_AA-Flx_BIF_CCBY4_20201218.xlsx", package = "amerifluxr")

BASE data product

Import data

The amd_read_base() imports a BASE file, either from a zipped file or an unzipped comma-separated file (.csv). The parse_timestamp parameter can be used if additional time-keeping columns (e.g., year, month, day, hour) are desired.

# read the BASE from a zip file, without additional parsed time-keeping columns
base1 <- amf_read_base(
  file = floc2,
  unzip = TRUE,
  parse_timestamp = FALSE
)
pander::pandoc.table(base1[c(1:3),])
Table continues below
TIMESTAMP_START TIMESTAMP_END CO2 H2O FC NEE_PI CH4 FCH4 H
2.011e+11 2.011e+11 NA NA NA NA NA NA NA
2.011e+11 2.011e+11 NA NA NA NA NA NA NA
2.011e+11 2.011e+11 NA NA NA NA NA NA NA
Table continues below
LE G_1_1_1 G_2_1_1 WD WS USTAR ZL MO_LENGTH W_SIGMA V_SIGMA
NA 27.45 37.32 NA NA NA NA NA NA NA
NA 26.92 34.11 NA NA NA NA NA NA NA
NA 33.71 41.3 NA NA NA NA NA NA NA
Table continues below
U_SIGMA T_SONIC T_SONIC_SIGMA PA RH TA TS_1_1_1 TS_2_1_1
NA NA NA NA 92.34 11.18 3.468 3.131
NA NA NA NA 87.63 11.69 3.606 3.155
NA NA NA NA 83.31 12.37 3.727 3.277
WTD SWC NETRAD PPFD_IN SW_IN SW_OUT LW_IN LW_OUT P
-0.9133 45.13 7.064 0 0 0 368.5 360.6 0
-0.8939 45.05 2.375 0 0 0 365.8 361.4 0.254
-0.8885 45 2.923 0 0 2.812 369.7 365.2 0

# read the BASE from a csv file, with additional parsed time-keeping columns
base2 <- amf_read_base(
  file = floc3,
  unzip = FALSE,
  parse_timestamp = TRUE
)
pander::pandoc.table(base2[c(1:3), c(1:10)])
Table continues below
YEAR MONTH DAY DOY HOUR MINUTE TIMESTAMP
2011 1 1 1 0 15 2011-01-01 00:15:00
2011 1 1 1 0 45 2011-01-01 00:45:00
2011 1 1 1 1 15 2011-01-01 01:15:00
TIMESTAMP_START TIMESTAMP_END CO2
2.011e+11 2.011e+11 NA
2.011e+11 2.011e+11 NA
2.011e+11 2.011e+11 NA

Parse and interpret data

The details of the BASE data product’s format and variable definitions can be found on AmeriFlux website. In short, the BASE data product contains flux, meteorological, and soil observations that are reported at regular intervals of time, generally half-hourly or hourly, for a certain time period. TIMESTAMP_START and TIMESTAMP_END columns (i.e., YYYYMMDDHHMM 12 digits) denote the starting and ending time of each reporting interval (i.e., row).

All other variables use the format of {base name}_{qualifier}, e.g., FC_1, CO2_1_1_1. Base names indicate fundamental quantities that are either measured or calculated / derived. Qualifiers are suffixes appended to variable base names that provide additional information (e.g., gap-filling, position) about the variable. In some cases, qualifiers are omitted if only one variable is provided for a site.

The amf_variable() retrieves the latest list of base names and default units. For sites that have relatively fewer variables and less complicated qualifiers, the users could easily interpret variables and qualifiers. The amf_variable() also returns the expected maximal and minimal values based on physically plausible ranges or network reported values.

# get a list of latest base names and units. 
FP_ls <- amf_variables()
pander::pandoc.table(FP_ls[c(11:20), ])
Table continues below
  Name Description Units Min
8 DBH Diameter of tree measured at breast height (1.3m) with continuous dendrometers cm 0
9 LEAF_WET Leaf wetness, range 0-100 % 0
10 SAP_DT Difference of probes temperature for sapflow measurements deg C -10
11 SAP_FLOW Sap flow mmolH2O m-2 s-1 NA
12 T_BOLE Bole temperature deg C -50
13 T_CANOPY Temperature of the canopy and/or surface underneath the sensor deg C -50
14 FETCH_70 Distance at which cross-wind integrated footprint cumulative probability is 70% m 0
15 FETCH_80 Distance at which cross-wind integrated footprint cumulative probability is 80% m 0
16 FETCH_90 Distance at which cross-wind integrated footprint cumulative probability is 90% m 0
17 FETCH_FILTER Footprint quality flag (i.e., 0, 1): 0 and 1 indicate data measured when wind coming from direction that should be discarded and kept, respectively nondimensional 0
  Max
8 500
9 100
10 10
11 NA
12 70
13 70
14 10000
15 12000
16 15000
17 1

Alternatively, the amf_parse_basename() can programmatically parse the the variable names into base names and qualifiers. This function can be helpful for sites with many variables and relatively complicated qualifiers, as a prerequisite for handling data from many sites. The function returns a data frame with information about each variable’s base name, qualifier, and whether a variable is gap-filled, layer-aggregated, or replicate aggregated.

# parse the variable name
basename_decode <- amf_parse_basename(var_name = colnames(base1))
pander::pandoc.table(basename_decode[c(1, 2, 3, 4, 6, 11, 12),])
Table continues below
  variable_name basename qualifier_gf qualifier_pi
1 TIMESTAMP_START TIMESTAMP_START NA NA
2 TIMESTAMP_END TIMESTAMP_END NA NA
3 CO2 CO2 NA NA
4 H2O H2O NA NA
6 NEE_PI NEE NA _PI
11 G_1_1_1 G NA NA
12 G_2_1_1 G NA NA
Table continues below
  qualifier_pos qualifier_ag layer_index H_index V_index
1 NA NA NA NA NA
2 NA NA NA NA NA
3 NA NA NA NA NA
4 NA NA NA NA NA
6 NA NA NA NA NA
11 _1_1_1 NA NA 1 1
12 _2_1_1 NA NA 2 1
Table continues below
  R_index is_correct_basename is_pi_provide is_gapfill is_fetch
1 NA TRUE FALSE FALSE FALSE
2 NA TRUE FALSE FALSE FALSE
3 NA TRUE FALSE FALSE FALSE
4 NA TRUE FALSE FALSE FALSE
6 NA TRUE TRUE FALSE FALSE
11 1 TRUE FALSE FALSE FALSE
12 1 TRUE FALSE FALSE FALSE
Table continues below
  is_layer_aggregated is_layer_SD is_layer_number
1 FALSE FALSE FALSE
2 FALSE FALSE FALSE
3 FALSE FALSE FALSE
4 FALSE FALSE FALSE
6 FALSE FALSE FALSE
11 FALSE FALSE FALSE
12 FALSE FALSE FALSE
Table continues below
  is_replicate_aggregated is_replicate_SD is_replicate_number
1 FALSE FALSE FALSE
2 FALSE FALSE FALSE
3 FALSE FALSE FALSE
4 FALSE FALSE FALSE
6 FALSE FALSE FALSE
11 FALSE FALSE FALSE
12 FALSE FALSE FALSE
  is_quadruplet
1 FALSE
2 FALSE
3 FALSE
4 FALSE
6 FALSE
11 TRUE
12 TRUE

Data filtering

While BASE data products are quality-checked before release, the data may not be filtered for all outliers. The amf_filter_base() can be use to filter the data based on the expected physically ranges (i.e., obtained through amf_variables()). By default, a ±5% buffer is applied to account for possible edge values near the lower and upper bounds, which are commonly observed for certain variables like radiation, relative humidity, and snow depth.

# filter data, using default physical range +/- 5% buffer
base_f <- amf_filter_base(data_in = base1)

Measurement height information

Measurement height information contains height/depth and instrument model information of the BASE data products. The info can be downloaded directly using the amf_var_info() function. The function returns a data frame for all available sites, and can be subset using the “Site_ID” column. The “Height” column refers to the distance from the ground surface in meters. Positive values are heights, and negative values are depths. See the web page for explanation.

# obtain the latest measurement height information
var_info <- amf_var_info()

# subset the variable by target Site ID
var_info <- var_info[var_info$Site_ID == "US-CRT", ]
pander::pandoc.table(var_info[c(1:10), ])
Table continues below
  Site_ID Variable Start_Date Height Instrument_Model
4561 US-CRT CH4 NA 2 GA_OP-LI-COR LI-7700
4562 US-CRT CO2 NA 2 GA_OP-LI-COR LI-7500
4563 US-CRT FC NA 2 GA_OP-LI-COR LI-7500
4564 US-CRT FCH4 NA 2 GA_OP-LI-COR LI-7700
4565 US-CRT G_1_1_1 NA -0.1 SOIL_H-Plate
4566 US-CRT H NA 2 SA-Campbell CSAT-3
4567 US-CRT H2O NA 2 GA_OP-LI-COR LI-7500
4568 US-CRT MO_LENGTH NA 2 NA
4569 US-CRT LE NA 2 GA_OP-LI-COR LI-7500
4570 US-CRT NEE_PI NA NA GA_OP-LI-COR LI-7500
  Instrument_Model2 Comment BASE_Version
4561 NA NA 5-5
4562 NA NA 5-5
4563 SA-Campbell CSAT-3 NA 5-5
4564 SA-Campbell CSAT-3 NA 5-5
4565 NA NA 5-5
4566 NA NA 5-5
4567 NA NA 5-5
4568 NA NA 5-5
4569 SA-Campbell CSAT-3 NA 5-5
4570 SA-Campbell CSAT-3 NA 5-5

BADM data product

Import BADM data

Biological, Ancillary, Disturbance, and Metadata (BADM) are non-continuous information that describe and complement continuous flux and meteorological data (e.g., BASE data product). BADM include general site description, metadata about the sensors and their setup, maintenance and disturbance events, and biological and ecological data that characterize a site’s ecosystem. See link for details.

The amf_read_bif() can be used to import the BADM data file. The function returns a data frame for all available sites, and can subset using the “SITE_ID” column.

# read the BADM BIF file, using an example data file
bif <- amf_read_bif(file = floc1)

# subset by target Site ID
bif <- bif[bif$SITE_ID == "US-CRT", ]
pander::pandoc.table(bif[c(1:15), ])
Table continues below
SITE_ID GROUP_ID VARIABLE_GROUP VARIABLE
US-CRT 12764 GRP_ACKNOWLEDGEMENT ACKNOWLEDGEMENT
US-CRT 12764 GRP_ACKNOWLEDGEMENT ACKNOWLEDGEMENT_COMMENT
US-CRT 12765 GRP_CLIM_AVG MAT
US-CRT 12765 GRP_CLIM_AVG MAP
US-CRT 12765 GRP_CLIM_AVG CLIMATE_KOEPPEN
US-CRT 27000537 GRP_COUNTRY COUNTRY
US-CRT 15683 GRP_DOI DOI
US-CRT 15683 GRP_DOI DOI_CITATION
US-CRT 15683 GRP_DOI DOI_DATAPRODUCT
US-CRT 88075 GRP_DOI_CONTRIBUTOR DOI_CONTRIBUTOR_DATAPRODUCT
US-CRT 88075 GRP_DOI_CONTRIBUTOR DOI_CONTRIBUTOR_NAME
US-CRT 88075 GRP_DOI_CONTRIBUTOR DOI_CONTRIBUTOR_ROLE
US-CRT 88075 GRP_DOI_CONTRIBUTOR DOI_CONTRIBUTOR_ORDINAL
US-CRT 88075 GRP_DOI_CONTRIBUTOR DOI_CONTRIBUTOR_EMAIL
US-CRT 88075 GRP_DOI_CONTRIBUTOR DOI_CONTRIBUTOR_INSTITUTION
DATAVALUE
Supported by NOAA (NA10OAR4170224) & NSF (NSF1034791)
Acknowledgement Walter Berger for fully support
10.1
849
Dfa
USA
10.17190/AMF/1246156
Jiquan Chen, Housen Chu (2011-2013) AmeriFlux US-CRT Curtice Walter-Berger cropland, Dataset. https://doi.org/10.17190/AMF/1246156
AmeriFlux
AmeriFlux
Jiquan Chen
Author
1
University of Toledo / Michigan State University

# get a list of all BADM variable groups and variables
unique(bif$VARIABLE_GROUP)

[1] “GRP_ACKNOWLEDGEMENT” “GRP_CLIM_AVG” “GRP_COUNTRY”
[4] “GRP_DOI” “GRP_DOI_CONTRIBUTOR” “GRP_DOI_ORGANIZATION” [7] “GRP_DOM_DIST_MGMT” “GRP_FLUX_MEASUREMENTS” “GRP_HEADER”
[10] “GRP_HEIGHTC” “GRP_IGBP” “GRP_LAND_OWNERSHIP”
[13] “GRP_LOCATION” “GRP_NETWORK” “GRP_REFERENCE_PAPER”
[16] “GRP_SHIPPING_ADDRESS” “GRP_SITE_CHAR” “GRP_SITE_DESC”
[19] “GRP_SITE_FUNDING” “GRP_STATE” “GRP_TEAM_MEMBER”
[22] “GRP_TOWER_POWER” “GRP_TOWER_TYPE” “GRP_URL”
[25] “GRP_URL_AMERIFLUX” “GRP_UTC_OFFSET”

length(unique(bif$VARIABLE))

[1] 59

As shown above, BADM data contain information from a variety of variable groups (i.e., GRP_{BADM_GROUPS}). Browse the definitions of all available variable groups here.

To get the BADM data for a certain variable group, use amf_extract_badm() function. The function also renders the data format (i.e., display all variables by columns) for human readability.

# extract the FLUX_MEASUREMENTS group
bif_flux <- amf_extract_badm(bif_data = bif, select_group = "GRP_FLUX_MEASUREMENTS")
pander::pandoc.table(bif_flux)
Table continues below
GROUP_ID SITE_ID FLUX_MEASUREMENTS_METHOD FLUX_MEASUREMENTS_VARIABLE
12767 US-CRT Eddy Covariance CO2
12782 US-CRT Eddy Covariance H2O
12786 US-CRT Eddy Covariance H
12787 US-CRT Eddy Covariance CH4
Table continues below
FLUX_MEASUREMENTS_DATE_START FLUX_MEASUREMENTS_DATE_END
20110101 20131231
20110101 20131231
20110101 20131231
20110511 20120520
FLUX_MEASUREMENTS_OPERATIONS
Continuous operation
Continuous operation
Continuous operation
Continuous operation

# extract the HEIGHTC (canopy height) group
bif_hc <- amf_extract_badm(bif_data = bif, select_group = "GRP_HEIGHTC")
pander::pandoc.table(bif_hc)
GROUP_ID SITE_ID HEIGHTC HEIGHTC_STATISTIC HEIGHTC_DATE
88202 US-CRT 0.12 Mean 20121121
88203 US-CRT 1.125 Mean 20120928
88204 US-CRT 0.375 Mean 20120627
88205 US-CRT 0 Mean 20121002
88206 US-CRT 0.52 Mean 20130515
88207 US-CRT 0.15 Mean 20110705
88208 US-CRT 0 Mean 20121001
88209 US-CRT 0 Mean 20111024
88210 US-CRT 1.14 Mean 20130703
88211 US-CRT 0.35 Mean 20130716
88212 US-CRT 0.45 Mean 20110726
88213 US-CRT 0.975 Mean 20110819
88214 US-CRT 0 Mean 20120520
88215 US-CRT 0.975 Mean 20120820
88216 US-CRT 0.18 Mean 20121214
88217 US-CRT 0.18 Mean 20130104
88218 US-CRT 0.675 Mean 20110810
88219 US-CRT 0.72 Mean 20130605
88220 US-CRT 1.125 Mean 20120912
88221 US-CRT 0.3 Mean 20130422
88222 US-CRT 0.075 Mean 20110617
88223 US-CRT 0.18 Mean 20130212
88224 US-CRT 0.18 Mean 20130319
88225 US-CRT 0.225 Mean 20120612
88226 US-CRT 1 Mean 20130620
88227 US-CRT 1.125 Mean 20111007
88228 US-CRT 0.3 Mean 20130816
88229 US-CRT 0.35 Mean 20130725
88230 US-CRT 0.525 Mean 20120720
88231 US-CRT 0.18 Mean 20130124
88232 US-CRT 0 Mean 20110611
88233 US-CRT 0.06 Mean 20121022

Note: amf_extract_badm() returns all columns in characters. Certain groups of BADM variables contain columns of time stamps (i.e., ISO format) and data values, and need to be converted before further use.

# convert HEIGHTC_DATE to POSIXlt
bif_hc$TIMESTAMP <- strptime(bif_hc$HEIGHTC_DATE, format = "%Y%m%d", tz = "GMT")

# convert HEIGHTC column to numeric
bif_hc$HEIGHTC <- as.numeric(bif_hc$HEIGHTC)

# plot time series of canopy height
plot(bif_hc$TIMESTAMP, bif_hc$HEIGHTC, xlab = "TIMESTAMP", ylab = "canopy height (m)")

Last, the contacts of the site members and data DOI can be obtained from the BADM data. The AmeriFlux data policy requires proper attribution (e.g., data DOI).

In some case, for example, using data shared under Legacy Data Policy for publication, data users are required to contact data contributors directly, so that they have the opportunity to contribute substantively and become a co-author.

# get a list of contacts
bif_contact <- amf_extract_badm(bif_data = bif, select_group = "GRP_TEAM_MEMBER")
pander::pandoc.table(bif_contact)
Table continues below
GROUP_ID SITE_ID TEAM_MEMBER_NAME TEAM_MEMBER_ROLE
12777 US-CRT Housen Chu FluxContact
12784 US-CRT Jiquan Chen PI
TEAM_MEMBER_EMAIL TEAM_MEMBER_INSTITUTION TEAM_MEMBER_ADDRESS
University of Toledo / University of California, Berkeley NA
University of Toledo / Michigan State University 202 Manly Miles Bldg. 1405 South Harrison Road Michigan State University, East Lansing, MI 48823

# get data DOI
bif_doi <- amf_extract_badm(bif_data = bif, select_group = "GRP_DOI")
pander::pandoc.table(bif_doi)
Table continues below
GROUP_ID SITE_ID DOI
15683 US-CRT 10.17190/AMF/1246156
DOI_CITATION DOI_DATAPRODUCT
Jiquan Chen, Housen Chu (2011-2013) AmeriFlux US-CRT Curtice Walter-Berger cropland, Dataset. https://doi.org/10.17190/AMF/1246156 AmeriFlux