Data Import • amerifluxr

library(amerifluxr)
library(pander)

amerifluxr is a programmatic interface to the AmeriFlux. This vignette demonstrates examples to import data and metadata downloaded from AmeriFlux, and to parse and clean data for further use. A companion vignette for site selection is available as well.

Download data

AmeriFlux data and metadata can be downloaded using amf_download_base() and amf_download_bif(). Users will need to create a personal AmeriFlux account here before download.

The following downloads AmeriFlux flux/met data (aka BASE data product) from a single site: US-CRT.

## When running, replace user_id and user_email with a real AmeriFlux account
floc2 <- amf_download_base(
  user_id = "my_user",
  user_email = "my_email@mail.com",
  site_id = "US-CRT",
  data_product = "BASE-BADM",
  data_policy = "CCBY4.0",
  agree_policy = TRUE,
  intended_use = "other",
  intended_use_text = "amerifluxr package demonstration",
  verbose = TRUE,
  out_dir = tempdir()
)

The downloaded file is a zipped file saved in tempdir() (e.g., AMF_{SITE_ID}_BASE-BADM_{VERSION}.zip), which contains a BASE data file (e.g., AMF_{SITE_ID}_BASE_{RESOLUTION}_{VERSION}.csv, RESOLUTION = HH (half-hourly) or HR (hourly)) and a metadata file (aka BADM data product, e.g., AMF_{SITE_ID}_BIF_{VERSION}.xlsx). The amf_download_base() also returns the file path to the downloaded file, which can be used later to read the file into R.

The following downloads a single file containing all AmeriFlux sites’ metadata (i.e., BADM data product) for sites under the CC-BY-4.0 data use policy.

## When running, replace user_id and user_email with a real AmeriFlux account
floc1 <- amf_download_bif(
  user_id = "my_user",
  user_email = "my_email@mail.com",
  data_policy = "CCBY4.0",
  agree_policy = TRUE,
  intended_use = "other",
  intended_use_text = "amerifluxr package demonstration",
  out_dir = tempdir(),
  verbose = TRUE,
  site_w_data = TRUE
)

The downloaded file is a Excel file saved to tempdir() (e.g., AMF_{SITES}_BIF_{POLICY}_{VERSION}.xlsx, SITES = AA-Net (all registered sites) or AA-Flx (all sites with flux/met data available); POLICY = CCBY4 (shared under AmeriFlux CC-BY-4.0 data use policy) or LEGACY (shared under AmeriFlux Legacy data use policy)). Similarly, the amf_download_bif() also returns the file path to the downloaded file, which can be used later to read the file into R.

For this vignette, we will use following example data files [files are truncated to limit package size, for demonstration purposes only].

# An example of BASE zipped files downloaded for US-CRT site
floc2 <- system.file("extdata", "AMF_US-CRT_BASE-BADM_2-5.zip", package = "amerifluxr")

# An example of unzipped BASE files from the above zipped file
floc3 <- system.file("extdata", "AMF_US-CRT_BASE_HH_2-5.csv", package = "amerifluxr")

# An example of all sites' BADM data
floc1 <- system.file("extdata", "AMF_AA-Flx_BIF_CCBY4_20201218.xlsx", package = "amerifluxr")

BASE data product

Import data

The amd_read_base() imports a BASE file, either from a zipped file or an unzipped comma-separated file (.csv). The parse_timestamp parameter can be used if additional time-keeping columns (e.g., year, month, day, hour) are desired.

# read the BASE from a zip file, without additional parsed time-keeping columns
base1 <- amf_read_base(
  file = floc2,
  unzip = TRUE,
  parse_timestamp = FALSE
)
pander::pandoc.table(base1[c(1:3),])

Table continues below
TIMESTAMP_START	TIMESTAMP_END	CO2	H2O	FC	NEE_PI	CH4	FCH4	H
2.011e+11	2.011e+11	NA	NA	NA	NA	NA	NA	NA
2.011e+11	2.011e+11	NA	NA	NA	NA	NA	NA	NA
2.011e+11	2.011e+11	NA	NA	NA	NA	NA	NA	NA

Table continues below
LE	G_1_1_1	G_2_1_1	WD	WS	USTAR	ZL	MO_LENGTH	W_SIGMA	V_SIGMA
NA	27.45	37.32	NA	NA	NA	NA	NA	NA	NA
NA	26.92	34.11	NA	NA	NA	NA	NA	NA	NA
NA	33.71	41.3	NA	NA	NA	NA	NA	NA	NA

Table continues below
U_SIGMA	T_SONIC	T_SONIC_SIGMA	PA	RH	TA	TS_1_1_1	TS_2_1_1
NA	NA	NA	NA	92.34	11.18	3.468	3.131
NA	NA	NA	NA	87.63	11.69	3.606	3.155
NA	NA	NA	NA	83.31	12.37	3.727	3.277

WTD	SWC	NETRAD	SW_OUT	LW_IN	LW_OUT	P
-0.9133	45.13	7.064	0	368.5	360.6	0
-0.8939	45.05	2.375	0	365.8	361.4	0.254
-0.8885	45	2.923	2.812	369.7	365.2	0


# read the BASE from a csv file, with additional parsed time-keeping columns
base2 <- amf_read_base(
  file = floc3,
  unzip = FALSE,
  parse_timestamp = TRUE
)
pander::pandoc.table(base2[c(1:3), c(1:10)])

Table continues below
YEAR	MONTH	DAY	DOY	HOUR	MINUTE	TIMESTAMP
2011	1	1	1	0	15	2011-01-01 00:15:00
2011	1	1	1	0	45	2011-01-01 00:45:00
2011	1	1	1	1	15	2011-01-01 01:15:00

TIMESTAMP_START	TIMESTAMP_END	CO2
2.011e+11	2.011e+11	NA
2.011e+11	2.011e+11	NA
2.011e+11	2.011e+11	NA

Parse and interpret data

The details of the BASE data product’s format and variable definitions can be found on AmeriFlux website. In short, the BASE data product contains flux, meteorological, and soil observations that are reported at regular intervals of time, generally half-hourly or hourly, for a certain time period. TIMESTAMP_START and TIMESTAMP_END columns (i.e., YYYYMMDDHHMM 12 digits) denote the starting and ending time of each reporting interval (i.e., row).

All other variables use the format of {base name}_{qualifier}, e.g., FC_1, CO2_1_1_1. Base names indicate fundamental quantities that are either measured or calculated / derived. Qualifiers are suffixes appended to variable base names that provide additional information (e.g., gap-filling, position) about the variable. In some cases, qualifiers are omitted if only one variable is provided for a site.

The amf_variable() retrieves the latest list of base names and default units. For sites that have relatively fewer variables and less complicated qualifiers, the users could easily interpret variables and qualifiers. The amf_variable() also returns the expected maximal and minimal values based on physically plausible ranges or network reported values.

# get a list of latest base names and units. 
FP_ls <- amf_variables()
pander::pandoc.table(FP_ls[c(11:20), ])

Table continues below
	Name	Description	Units	Min
8	DBH	Diameter of tree measured at breast height (1.3m) with continuous dendrometers	cm	0
9	LEAF_WET	Leaf wetness, range 0-100	%	0
10	SAP_DT	Difference of probes temperature for sapflow measurements	deg C	-10
11	SAP_FLOW	Sap flow	mmolH2O m-2 s-1	NA
12	T_BOLE	Bole temperature	deg C	-50
13	T_CANOPY	Temperature of the canopy and/or surface underneath the sensor	deg C	-50
14	FETCH_70	Distance at which cross-wind integrated footprint cumulative probability is 70%	m	0
15	FETCH_80	Distance at which cross-wind integrated footprint cumulative probability is 80%	m	0
16	FETCH_90	Distance at which cross-wind integrated footprint cumulative probability is 90%	m	0
17	FETCH_FILTER	Footprint quality flag (i.e., 0, 1): 0 and 1 indicate data measured when wind coming from direction that should be discarded and kept, respectively	nondimensional	0

	Max
8	500
9	100
10	10
11	NA
12	70
13	70
14	10000
15	12000
16	15000
17	1

Alternatively, the amf_parse_basename() can programmatically parse the the variable names into base names and qualifiers. This function can be helpful for sites with many variables and relatively complicated qualifiers, as a prerequisite for handling data from many sites. The function returns a data frame with information about each variable’s base name, qualifier, and whether a variable is gap-filled, layer-aggregated, or replicate aggregated.

# parse the variable name
basename_decode <- amf_parse_basename(var_name = colnames(base1))
pander::pandoc.table(basename_decode[c(1, 2, 3, 4, 6, 11, 12),])

Table continues below
	variable_name	basename	qualifier_gf	qualifier_pi
1	TIMESTAMP_START	TIMESTAMP_START	NA	NA
2	TIMESTAMP_END	TIMESTAMP_END	NA	NA
3	CO2	CO2	NA	NA
4	H2O	H2O	NA	NA
6	NEE_PI	NEE	NA	_PI
11	G_1_1_1	G	NA	NA
12	G_2_1_1	G	NA	NA

Table continues below
	qualifier_pos	qualifier_ag	layer_index	H_index	V_index
1	NA	NA	NA	NA	NA
2	NA	NA	NA	NA	NA
3	NA	NA	NA	NA	NA
4	NA	NA	NA	NA	NA
6	NA	NA	NA	NA	NA
11	_1_1_1	NA	NA	1	1
12	_2_1_1	NA	NA	2	1

Table continues below
	R_index	is_correct_basename	is_pi_provide	is_gapfill	is_fetch
1	NA	TRUE	FALSE	FALSE	FALSE
2	NA	TRUE	FALSE	FALSE	FALSE
3	NA	TRUE	FALSE	FALSE	FALSE
4	NA	TRUE	FALSE	FALSE	FALSE
6	NA	TRUE	TRUE	FALSE	FALSE
11	1	TRUE	FALSE	FALSE	FALSE
12	1	TRUE	FALSE	FALSE	FALSE

Table continues below
	is_layer_aggregated	is_layer_SD	is_layer_number
1	FALSE	FALSE	FALSE
2	FALSE	FALSE	FALSE
3	FALSE	FALSE	FALSE
4	FALSE	FALSE	FALSE
6	FALSE	FALSE	FALSE
11	FALSE	FALSE	FALSE
12	FALSE	FALSE	FALSE

Table continues below
	is_replicate_aggregated	is_replicate_SD	is_replicate_number
1	FALSE	FALSE	FALSE
2	FALSE	FALSE	FALSE
3	FALSE	FALSE	FALSE
4	FALSE	FALSE	FALSE
6	FALSE	FALSE	FALSE
11	FALSE	FALSE	FALSE
12	FALSE	FALSE	FALSE

	is_quadruplet
1	FALSE
2	FALSE
3	FALSE
4	FALSE
6	FALSE
11	TRUE
12	TRUE

Data filtering

While BASE data products are quality-checked before release, the data may not be filtered for all outliers. The amf_filter_base() can be use to filter the data based on the expected physically ranges (i.e., obtained through amf_variables()). By default, a ±5% buffer is applied to account for possible edge values near the lower and upper bounds, which are commonly observed for certain variables like radiation, relative humidity, and snow depth.

# filter data, using default physical range +/- 5% buffer
base_f <- amf_filter_base(data_in = base1)

Measurement height information

Measurement height information contains height/depth and instrument model information of the BASE data products. The info can be downloaded directly using the amf_var_info() function. The function returns a data frame for all available sites, and can be subset using the “Site_ID” column. The “Height” column refers to the distance from the ground surface in meters. Positive values are heights, and negative values are depths. See the web page for explanation.

# obtain the latest measurement height information
var_info <- amf_var_info()

# subset the variable by target Site ID
var_info <- var_info[var_info$Site_ID == "US-CRT", ]
pander::pandoc.table(var_info[c(1:10), ])

Table continues below
	Site_ID	Variable	Start_Date	Height	Instrument_Model
5150	US-CRT	CH4	NA	2	GA_OP-LI-COR LI-7700
5151	US-CRT	CO2	NA	2	GA_OP-LI-COR LI-7500
5152	US-CRT	FC	NA	2	GA_OP-LI-COR LI-7500
5153	US-CRT	FCH4	NA	2	GA_OP-LI-COR LI-7700
5154	US-CRT	G_1_1_1	NA	-0.1	SOIL_H-Plate
5155	US-CRT	G_2_1_1	NA	-0.1	SOIL_H-Plate
5156	US-CRT	H	NA	2	SA-Campbell CSAT-3
5157	US-CRT	H2O	NA	2	GA_OP-LI-COR LI-7500
5158	US-CRT	LE	NA	2	GA_OP-LI-COR LI-7500
5159	US-CRT	LW_IN	NA	2	RAD-Pyrrad-SW+LW

	Instrument_Model2	Comment	BASE_Version
5150	NA	NA	5-5
5151	NA	NA	5-5
5152	SA-Campbell CSAT-3	NA	5-5
5153	SA-Campbell CSAT-3	NA	5-5
5154	NA	NA	5-5
5155	NA	NA	5-5
5156	NA	NA	5-5
5157	NA	NA	5-5
5158	SA-Campbell CSAT-3	NA	5-5
5159	NA	NA	5-5

BADM data product

Import BADM data

Biological, Ancillary, Disturbance, and Metadata (BADM) are non-continuous information that describe and complement continuous flux and meteorological data (e.g., BASE data product). BADM include general site description, metadata about the sensors and their setup, maintenance and disturbance events, and biological and ecological data that characterize a site’s ecosystem. See link for details.

The amf_read_bif() can be used to import the BADM data file. The function returns a data frame for all available sites, and can subset using the “SITE_ID” column.

# read the BADM BIF file, using an example data file
bif <- amf_read_bif(file = floc1)

# subset by target Site ID
bif <- bif[bif$SITE_ID == "US-CRT", ]
pander::pandoc.table(bif[c(1:15), ])

Table continues below
SITE_ID	GROUP_ID	VARIABLE_GROUP	VARIABLE
US-CRT	12764	GRP_ACKNOWLEDGEMENT	ACKNOWLEDGEMENT
US-CRT	12764	GRP_ACKNOWLEDGEMENT	ACKNOWLEDGEMENT_COMMENT
US-CRT	12765	GRP_CLIM_AVG	MAT
US-CRT	12765	GRP_CLIM_AVG	MAP
US-CRT	12765	GRP_CLIM_AVG	CLIMATE_KOEPPEN
US-CRT	27000537	GRP_COUNTRY	COUNTRY
US-CRT	15683	GRP_DOI	DOI
US-CRT	15683	GRP_DOI	DOI_CITATION
US-CRT	15683	GRP_DOI	DOI_DATAPRODUCT
US-CRT	88075	GRP_DOI_CONTRIBUTOR	DOI_CONTRIBUTOR_DATAPRODUCT
US-CRT	88075	GRP_DOI_CONTRIBUTOR	DOI_CONTRIBUTOR_NAME
US-CRT	88075	GRP_DOI_CONTRIBUTOR	DOI_CONTRIBUTOR_ROLE
US-CRT	88075	GRP_DOI_CONTRIBUTOR	DOI_CONTRIBUTOR_ORDINAL
US-CRT	88075	GRP_DOI_CONTRIBUTOR	DOI_CONTRIBUTOR_EMAIL
US-CRT	88075	GRP_DOI_CONTRIBUTOR	DOI_CONTRIBUTOR_INSTITUTION

DATAVALUE
Supported by NOAA (NA10OAR4170224) & NSF (NSF1034791)
Acknowledgement Walter Berger for fully support
10.1
849
Dfa
USA
10.17190/AMF/1246156
Jiquan Chen, Housen Chu (2011-2013) AmeriFlux US-CRT Curtice Walter-Berger cropland, Dataset. https://doi.org/10.17190/AMF/1246156
AmeriFlux
AmeriFlux
Jiquan Chen
Author
1
jqchen@msu.edu
University of Toledo / Michigan State University


# get a list of all BADM variable groups and variables
unique(bif$VARIABLE_GROUP)

[1] “GRP_ACKNOWLEDGEMENT” “GRP_CLIM_AVG” “GRP_COUNTRY”
[4] “GRP_DOI” “GRP_DOI_CONTRIBUTOR” “GRP_DOI_ORGANIZATION” [7] “GRP_DOM_DIST_MGMT” “GRP_FLUX_MEASUREMENTS” “GRP_HEADER”
[10] “GRP_HEIGHTC” “GRP_IGBP” “GRP_LAND_OWNERSHIP”
[13] “GRP_LOCATION” “GRP_NETWORK” “GRP_REFERENCE_PAPER”
[16] “GRP_SHIPPING_ADDRESS” “GRP_SITE_CHAR” “GRP_SITE_DESC”
[19] “GRP_SITE_FUNDING” “GRP_STATE” “GRP_TEAM_MEMBER”
[22] “GRP_TOWER_POWER” “GRP_TOWER_TYPE” “GRP_URL”
[25] “GRP_URL_AMERIFLUX” “GRP_UTC_OFFSET”

length(unique(bif$VARIABLE))

[1] 59

As shown above, BADM data contain information from a variety of variable groups (i.e., GRP_{BADM_GROUPS}). Browse the definitions of all available variable groups here.

To get the BADM data for a certain variable group, use amf_extract_badm() function. The function also renders the data format (i.e., display all variables by columns) for human readability.

# extract the FLUX_MEASUREMENTS group
bif_flux <- amf_extract_badm(bif_data = bif, select_group = "GRP_FLUX_MEASUREMENTS")
pander::pandoc.table(bif_flux)

Table continues below
GROUP_ID	SITE_ID	FLUX_MEASUREMENTS_METHOD	FLUX_MEASUREMENTS_VARIABLE
12767	US-CRT	Eddy Covariance	CO2
12782	US-CRT	Eddy Covariance	H2O
12786	US-CRT	Eddy Covariance	H
12787	US-CRT	Eddy Covariance	CH4

Table continues below
FLUX_MEASUREMENTS_DATE_START	FLUX_MEASUREMENTS_DATE_END
20110101	20131231
20110101	20131231
20110101	20131231
20110511	20120520

FLUX_MEASUREMENTS_OPERATIONS
Continuous operation
Continuous operation
Continuous operation
Continuous operation


# extract the HEIGHTC (canopy height) group
bif_hc <- amf_extract_badm(bif_data = bif, select_group = "GRP_HEIGHTC")
pander::pandoc.table(bif_hc)

GROUP_ID	SITE_ID	HEIGHTC	HEIGHTC_STATISTIC	HEIGHTC_DATE
88202	US-CRT	0.12	Mean	20121121
88203	US-CRT	1.125	Mean	20120928
88204	US-CRT	0.375	Mean	20120627
88205	US-CRT	0	Mean	20121002
88206	US-CRT	0.52	Mean	20130515
88207	US-CRT	0.15	Mean	20110705
88208	US-CRT	0	Mean	20121001
88209	US-CRT	0	Mean	20111024
88210	US-CRT	1.14	Mean	20130703
88211	US-CRT	0.35	Mean	20130716
88212	US-CRT	0.45	Mean	20110726
88213	US-CRT	0.975	Mean	20110819
88214	US-CRT	0	Mean	20120520
88215	US-CRT	0.975	Mean	20120820
88216	US-CRT	0.18	Mean	20121214
88217	US-CRT	0.18	Mean	20130104
88218	US-CRT	0.675	Mean	20110810
88219	US-CRT	0.72	Mean	20130605
88220	US-CRT	1.125	Mean	20120912
88221	US-CRT	0.3	Mean	20130422
88222	US-CRT	0.075	Mean	20110617
88223	US-CRT	0.18	Mean	20130212
88224	US-CRT	0.18	Mean	20130319
88225	US-CRT	0.225	Mean	20120612
88226	US-CRT	1	Mean	20130620
88227	US-CRT	1.125	Mean	20111007
88228	US-CRT	0.3	Mean	20130816
88229	US-CRT	0.35	Mean	20130725
88230	US-CRT	0.525	Mean	20120720
88231	US-CRT	0.18	Mean	20130124
88232	US-CRT	0	Mean	20110611
88233	US-CRT	0.06	Mean	20121022

Note: amf_extract_badm() returns all columns in characters. Certain groups of BADM variables contain columns of time stamps (i.e., ISO format) and data values, and need to be converted before further use.

# convert HEIGHTC_DATE to POSIXlt
bif_hc$TIMESTAMP <- strptime(bif_hc$HEIGHTC_DATE, format = "%Y%m%d", tz = "GMT")

# convert HEIGHTC column to numeric
bif_hc$HEIGHTC <- as.numeric(bif_hc$HEIGHTC)

# plot time series of canopy height
plot(bif_hc$TIMESTAMP, bif_hc$HEIGHTC, xlab = "TIMESTAMP", ylab = "canopy height (m)")

Last, the contacts of the site members and data DOI can be obtained from the BADM data. The AmeriFlux data policy requires proper attribution (e.g., data DOI).

In some case, for example, using data shared under Legacy Data Policy for publication, data users are required to contact data contributors directly, so that they have the opportunity to contribute substantively and become a co-author.

# get a list of contacts
bif_contact <- amf_extract_badm(bif_data = bif, select_group = "GRP_TEAM_MEMBER")
pander::pandoc.table(bif_contact)

Table continues below
GROUP_ID	SITE_ID	TEAM_MEMBER_NAME	TEAM_MEMBER_ROLE
12777	US-CRT	Housen Chu	FluxContact
12784	US-CRT	Jiquan Chen	PI

TEAM_MEMBER_EMAIL	TEAM_MEMBER_INSTITUTION	TEAM_MEMBER_ADDRESS
chu.housen@gmail.com	University of Toledo / University of California, Berkeley	NA
jqchen@msu.edu	University of Toledo / Michigan State University	202 Manly Miles Bldg. 1405 South Harrison Road Michigan State University, East Lansing, MI 48823


# get data DOI
bif_doi <- amf_extract_badm(bif_data = bif, select_group = "GRP_DOI")
pander::pandoc.table(bif_doi)

Table continues below
GROUP_ID	SITE_ID	DOI
15683	US-CRT	10.17190/AMF/1246156

DOI_CITATION	DOI_DATAPRODUCT
Jiquan Chen, Housen Chu (2011-2013) AmeriFlux US-CRT Curtice Walter-Berger cropland, Dataset. https://doi.org/10.17190/AMF/1246156	AmeriFlux