amerifluxr is a programmatic interface to AmeriFlux. This vignette shows how to query a list of target sites based on the sites’ general information and on the availability of metadata and data. A companion vignette, Data import, is also available.

Get a site list with general info

AmeriFlux data are organized by site, so a data query typically begins with site search and selection. A full list of AmeriFlux sites with general info can be obtained using the amf_site_info() function.

Convert the site list to a data.table for easier manipulation. Also see link for variable definitions.

# get a full list of sites with general info
sites <- amf_site_info()
sites_dt <- data.table::as.data.table(sites)

pander::pandoc.table(sites_dt[c(1:3), ])
Table continues below
SITE_ID SITE_NAME COUNTRY STATE IGBP
AR-CCa Carlos Casares agriculture Argentina Buenos Aires CRO
AR-CCg Carlos Casares grassland Argentina Buenos Aires GRA
AR-Cel CELPA Mar Chiquita BA Argentina Buenos Aires WET
Table continues below
TOWER_BEGAN URL_AMERIFLUX TOWER_END
2012 https://ameriflux.lbl.gov/sites/siteinfo/AR-CCa NA
2018 https://ameriflux.lbl.gov/sites/siteinfo/AR-CCg NA
2018 https://ameriflux.lbl.gov/sites/siteinfo/AR-Cel 2018
Table continues below
LOCATION_LAT LOCATION_LONG LOCATION_ELEV CLIMATE_KOEPPEN MAT MAP
-35.62 -61.32 83 Cfa 16.1 1060
-35.92 -61.19 84 Cfa 16.1 1060
-37.7 -57.42 1 Cfb 14 926
DATA_POLICY DATA_START DATA_END
LEGACY 2012 2020
LEGACY 2018 2020
LEGACY NA NA

The site list provides a quick summary of all registered sites and sites with available data.

It’s often important to understand the data use policy under which the data are shared. In 2021, the AmeriFlux community moved to the AmeriFlux CC-BY-4.0 License. Most site PIs now share their sites’ data under the CC-BY-4.0 license. Data for some sites are shared under the historical AmeriFlux data-sharing policy, now called the AmeriFlux Legacy Data Policy.

Check link for data use policy and attribution guidelines.

# total number of registered sites
pander::pandoc.table(sites_dt[, .N])
634

# total number of sites with available data
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N])
469

# get number of sites with available data, grouped by data use policy
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = .(DATA_POLICY)])
DATA_POLICY N
LEGACY 92
CCBY4.0 377
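
As a quick worked example, the two counts above can be turned into shares of the 469 sites with available data. This is a minimal base R sketch that hard-codes the counts printed above rather than querying AmeriFlux; the names policy_n and policy_share are illustrative only.

```r
# counts copied from the table above (not queried live)
policy_n <- c("LEGACY" = 92, "CCBY4.0" = 377)

# share of sites with available data under each data policy
policy_share <- policy_n / sum(policy_n)
print(round(policy_share, 3))
```

With these counts, roughly four out of five sites with available data are shared under CC-BY-4.0.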

Sites can be further grouped by IGBP land cover class.

# get a summary table of sites grouped by IGBP
pander::pandoc.table(sites_dt[, .N, by = "IGBP"])
IGBP N
CRO 129
GRA 85
WET 114
DNF 1
EBF 12
WSA 12
MF 16
ENF 101
DBF 59
OSH 40
WAT 10
CSH 13
URB 15
BSV 7
CVM 11
SAV 8
SNO 1

# get a summary table of sites with available data, grouped by IGBP
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = "IGBP"])
IGBP N
CRO 85
GRA 67
WET 61
DNF 1
WSA 8
EBF 8
ENF 94
DBF 55
MF 12
OSH 32
CSH 11
BSV 5
WAT 8
CVM 8
URB 7
SAV 6
SNO 1

# get a summary table of sites with available data,
#  grouped by data use policy & IGBP
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = .(IGBP, DATA_POLICY)][order(IGBP)])
IGBP DATA_POLICY N
BSV CCBY4.0 3
BSV LEGACY 2
CRO LEGACY 15
CRO CCBY4.0 70
CSH LEGACY 5
CSH CCBY4.0 6
CVM CCBY4.0 8
DBF CCBY4.0 49
DBF LEGACY 6
DNF CCBY4.0 1
EBF CCBY4.0 6
EBF LEGACY 2
ENF CCBY4.0 78
ENF LEGACY 16
GRA LEGACY 14
GRA CCBY4.0 53
MF CCBY4.0 8
MF LEGACY 4
OSH CCBY4.0 25
OSH LEGACY 7
SAV CCBY4.0 6
SNO CCBY4.0 1
URB CCBY4.0 5
URB LEGACY 2
WAT CCBY4.0 8
WET CCBY4.0 44
WET LEGACY 17
WSA CCBY4.0 6
WSA LEGACY 2
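
The long table above can be easier to scan as a cross-tabulation, with one row per IGBP class and one column per data policy. The sketch below assumes the data.table package is installed and uses a few rows copied from the output above instead of a live query; igbp_policy and igbp_wide are illustrative names.

```r
library(data.table)

# a few rows copied from the table above (not queried live)
igbp_policy <- data.table(
  IGBP        = c("BSV", "BSV", "CRO", "CRO", "CVM"),
  DATA_POLICY = c("CCBY4.0", "LEGACY", "LEGACY", "CCBY4.0", "CCBY4.0"),
  N           = c(3, 2, 15, 70, 8)
)

# one row per IGBP class, one column per data policy;
# combinations absent from the long table are filled with 0
igbp_wide <- dcast(igbp_policy, IGBP ~ DATA_POLICY, value.var = "N", fill = 0)
print(igbp_wide)
```

The fill = 0 argument makes the missing combination explicit: CVM has no row for LEGACY in the long table, so it appears as 0 in the wide one.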

Once the selection criteria are decided, users can query a target site list based on them, e.g., IGBP, data availability, data policy, and geolocation.


# get a list of cropland and grassland sites with available data,
#  shared under the CC-BY-4.0 data policy,
#  located between 30 and 50 degrees N in latitude;
#  return a site list with site ID, name, and data starting/ending year
crop_ls <- sites_dt[IGBP %in% c("CRO", "GRA") &
                      !is.na(DATA_START) &
                      LOCATION_LAT > 30 &
                      LOCATION_LAT < 50 &
                      DATA_POLICY == "CCBY4.0",
                    .(SITE_ID, SITE_NAME, DATA_START, DATA_END)]
pander::pandoc.table(crop_ls[c(1:10),])
SITE_ID  SITE_NAME                                        DATA_START  DATA_END
CA-ER1   Elora Research Station                           2015        2021
US-A32   ARM-SGP Medford hay pasture                      2015        2017
US-A74   ARM SGP milo field                               2015        2017
US-AR1   ARM USDA UNL OSU Woodward Switchgrass 1          2009        2012
US-AR2   ARM USDA UNL OSU Woodward Switchgrass 2          2009        2012
US-ARb   ARM Southern Great Plains burn site- Lamont      2005        2006
US-ARc   ARM Southern Great Plains control site- Lamont   2005        2006
US-ARM   ARM Southern Great Plains site- Lamont           2003        2023
US-Bi1   Bouldin Island Alfalfa                           2016        2023
US-Bi2   Bouldin Island corn                              2017        2023
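
A common follow-up criterion is record length. The sketch below derives the number of calendar years spanned by DATA_START and DATA_END and keeps sites spanning at least five. It assumes the data.table package is installed, uses rows copied from the table above rather than a live query, and the 5-year threshold and the names crop_demo, N_YEARS, and long_ls are illustrative choices, not part of amerifluxr.

```r
library(data.table)

# rows copied from the table above (not queried live)
crop_demo <- data.table(
  SITE_ID    = c("CA-ER1", "US-A32", "US-AR1", "US-ARM", "US-Bi1"),
  DATA_START = c(2015, 2015, 2009, 2003, 2016),
  DATA_END   = c(2021, 2017, 2012, 2023, 2023)
)

# span in calendar years, counting both the first and the last year
crop_demo[, N_YEARS := DATA_END - DATA_START + 1]

# keep sites whose record spans at least 5 calendar years (illustrative threshold)
long_ls <- crop_demo[N_YEARS >= 5][order(-N_YEARS)]
print(long_ls)
```

The span is only an upper bound on usable data: it says nothing about gaps within those years, which is what the data availability query later in this vignette addresses.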

Get metadata availability

In some cases, users may want to know if certain types of metadata are available for the selected sites. The amf_list_metadata() function provides a quick summary of metadata availability before actually downloading the data and metadata.

By default, amf_list_metadata() returns a full site list with the number of available entries for every BADM group. See the AmeriFlux website for definitions of all BADM groups.

# get metadata availability for all sites
metadata_aval <- data.table::as.data.table(amf_list_metadata())
pander::pandoc.table(metadata_aval[c(1:3), c(1:10)])
Table continues below
SITE_ID GRP_ACKNOWLEDGEMENT GRP_CLIM_AVG GRP_COUNTRY GRP_DOI
AR-CCa 1 1 1 1
AR-CCg 1 1 1 1
AR-Cel 1 1 1 0
Table continues below
GRP_DOI_CONTRIBUTOR GRP_DOI_ORGANIZATION GRP_DOM_DIST_MGMT
1 2 1
1 2 2
0 0 1
GRP_FLUX_MEASUREMENTS GRP_HEADER
2 1
2 1
6 1

The site_set parameter of amf_list_metadata() can be used to subset the sites of interest.

metadata_aval_sub <- data.table::as.data.table(amf_list_metadata(site_set = crop_ls$SITE_ID))

# down-select cropland & grassland sites by a BADM group of interest,
#  e.g., canopy height (GRP_HEIGHTC)
crop_ls2 <- metadata_aval_sub[GRP_HEIGHTC > 0, .(SITE_ID, GRP_HEIGHTC)][order(-GRP_HEIGHTC)]
pander::pandoc.table(crop_ls2[c(1:10), ])
SITE_ID GRP_HEIGHTC
US-Ne2 196
US-Tw3 162
US-Twt 133
US-Ne3 128
US-Ne1 119
US-Bi1 112
US-Var 105
US-Snd 70
US-Bi2 54
US-ARM 45
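
The metadata counts carry only SITE_ID, so it can help to join them back to the site list to see site names alongside the counts. The sketch below assumes the data.table package is installed and uses rows copied from the two tables above in place of live queries; site_demo, height_demo, and height_named are illustrative names.

```r
library(data.table)

# rows copied from the two tables above (not queried live)
site_demo <- data.table(
  SITE_ID   = c("US-ARM", "US-Bi1", "US-Bi2"),
  SITE_NAME = c("ARM Southern Great Plains site- Lamont",
                "Bouldin Island Alfalfa",
                "Bouldin Island corn")
)
height_demo <- data.table(
  SITE_ID     = c("US-Bi1", "US-Bi2", "US-ARM"),
  GRP_HEIGHTC = c(112, 54, 45)
)

# attach site names to the metadata counts by matching on SITE_ID
height_named <- merge(height_demo, site_demo, by = "SITE_ID")
print(height_named[order(-GRP_HEIGHTC)])
```

With real query results, the same merge() call would apply to crop_ls and crop_ls2, since both carry a SITE_ID column.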

Get data availability

Users can use amf_list_data() to query the availability of specific variables in the flux/met data (the so-called BASE data product). amf_list_data() provides a quick summary of variable availability per site and year before the data are downloaded.

By default, amf_list_data() returns a full site list of variable availability (data percentages per year) for all variables. The site_set parameter of amf_list_data() can be used to subset the sites of interest.

# get data availability for selected sites
data_aval <- data.table::as.data.table(amf_list_data(site_set = crop_ls2$SITE_ID))
pander::pandoc.table(data_aval[c(1:10), ])
Table continues below
SITE_ID VARIABLE BASENAME GAP_FILLED Y1990 Y1991 Y1992 Y1993
US-AR1 CO2 CO2 FALSE 0 0 0 0
US-AR1 FC FC FALSE 0 0 0 0
US-AR1 G G FALSE 0 0 0 0
US-AR1 H H FALSE 0 0 0 0
US-AR1 H2O H2O FALSE 0 0 0 0
US-AR1 LE LE FALSE 0 0 0 0
US-AR1 LW_IN LW_IN FALSE 0 0 0 0
US-AR1 LW_OUT LW_OUT FALSE 0 0 0 0
US-AR1 NETRAD NETRAD FALSE 0 0 0 0
US-AR1 P P FALSE 0 0 0 0
Table continues below
Y1994 Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001 Y2002 Y2003
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Table continues below
Y2004 Y2005 Y2006 Y2007 Y2008 Y2009 Y2010 Y2011 Y2012
0 0 0 0 0 0.5905 0.9866 0.9941 0.6654
0 0 0 0 0 0.6082 0.976 0.9886 0.6621
0 0 0 0 0 0.6421 0.9965 0.9999 0.9961
0 0 0 0 0 0.6123 0.9867 0.9938 0.6666
0 0 0 0 0 0.6092 0.971 0.9792 0.6633
0 0 0 0 0 0.6101 0.9816 0.9936 0.6647
0 0 0 0 0 0.6416 0.9965 0.9999 0.9961
0 0 0 0 0 0.6416 0.9965 0.9999 0.9961
0 0 0 0 0 0.5447 0.9964 0.9996 0.996
0 0 0 0 0 0.6422 0.9965 0.9999 0.9961
Y2013 Y2014 Y2015 Y2016 Y2017 Y2018 Y2019 Y2020 Y2021 Y2022 Y2023
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0

The variable availability can be used to subset sites that have certain variables in specific years. The BASENAME column gives each variable’s base name (i.e., the name without position qualifiers) and can be used to get a coarse-level variable availability.

See the AmeriFlux website for definitions of base names and qualifiers.
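
The availability table is wide, with one column per year, which makes year-based filtering awkward. A minimal sketch of reshaping it to long format follows. It assumes the data.table package is installed and uses two rows copied from the output above, restricted to 2009-2012, rather than a live query; aval_demo, aval_long, YEAR, and AVAILABILITY are illustrative names.

```r
library(data.table)

# two rows copied from the table above, restricted to 2009-2012
aval_demo <- data.table(
  SITE_ID  = c("US-AR1", "US-AR1"),
  VARIABLE = c("FC", "LE"),
  Y2009    = c(0.6082, 0.6101),
  Y2010    = c(0.9760, 0.9816),
  Y2011    = c(0.9886, 0.9936),
  Y2012    = c(0.6621, 0.6647)
)

# wide (one column per year) to long (one row per site, variable, and year)
aval_long <- melt(aval_demo,
                  id.vars = c("SITE_ID", "VARIABLE"),
                  variable.name = "YEAR",
                  value.name = "AVAILABILITY")

# turn the "Y2009"-style labels into integer years
aval_long[, YEAR := as.integer(sub("^Y", "", as.character(YEAR)))]

# site-variable-years with more than 90% availability
print(aval_long[AVAILABILITY > 0.9])
```

In long format, conditions on the year (for example YEAR >= 2010) become ordinary row filters instead of column selections.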

# down-select cropland & grassland sites based on the available wind speed (WS) and
# friction velocity (USTAR) data in 2015-2018, regardless of their qualifiers
data_aval_sub <- data_aval[BASENAME %in% c("WS", "USTAR"),
                           .(SITE_ID, BASENAME, Y2015, Y2016, Y2017, Y2018)]

# calculate mean availability of WS and USTAR in each site and each year
data_aval_sub <- data_aval_sub[, lapply(.SD, mean), 
                               by = .(SITE_ID),
                               .SDcols = c("Y2015", "Y2016", "Y2017", "Y2018")]

# sub-select sites whose mean WS and USTAR availability
#  over 2015-2018 exceeds 75%
crop_ls3 <- data_aval_sub[(Y2015 + Y2016 + Y2017 + Y2018) / 4 > 0.75]
pander::pandoc.table(crop_ls3)
SITE_ID Y2015 Y2016 Y2017 Y2018
US-ARM 0.5772 0.9871 0.9683 0.9826
US-Ne1 0.77 0.7861 0.756 0.7167
US-Ne2 0.7636 0.7878 0.7594 0.7442
US-SRG 0.9669 0.9851 0.9775 0.9997
US-Tw3 0.9689 0.9569 0.9763 0.4005
US-Var 0.9983 1 0.9455 1
US-Wkg 0.9973 0.9909 0.9965 0.9848
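
A four-year mean can hide a weak year: in the table above, US-Tw3 passes the mean criterion despite 0.4005 in 2018, and US-ARM passes despite 0.5772 in 2015. The sketch below applies a stricter, per-year rule as an alternative. It assumes the data.table package is installed and uses the values printed above rather than a live query; aval_mean and strict_ls are illustrative names.

```r
library(data.table)

# values copied from the table above (not queried live)
aval_mean <- data.table(
  SITE_ID = c("US-ARM", "US-Ne1", "US-Ne2", "US-SRG", "US-Tw3", "US-Var", "US-Wkg"),
  Y2015 = c(0.5772, 0.7700, 0.7636, 0.9669, 0.9689, 0.9983, 0.9973),
  Y2016 = c(0.9871, 0.7861, 0.7878, 0.9851, 0.9569, 1.0000, 0.9909),
  Y2017 = c(0.9683, 0.7560, 0.7594, 0.9775, 0.9763, 0.9455, 0.9965),
  Y2018 = c(0.9826, 0.7167, 0.7442, 0.9997, 0.4005, 1.0000, 0.9848)
)

# stricter rule: the lowest of the four yearly values must exceed 75%
strict_ls <- aval_mean[pmin(Y2015, Y2016, Y2017, Y2018) > 0.75]
print(strict_ls$SITE_ID)
```

Which rule is appropriate depends on the analysis: the mean suits pooled multi-year statistics, while the per-year minimum suits year-by-year comparisons.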

Finally, users sometimes look for sites with multiple measurements of similar variables (e.g., multilevel wind speed or soil temperature). The VARIABLE column in the variable availability can be used to get a fine-level variable availability.


# down-select cropland & grassland sites by available wind speed (WS) data,
#  using the mean availability of each WS variable during 2015-2018
data_aval_sub2 <- data_aval[BASENAME %in% c("WS"),
                            .(SITE_ID, VARIABLE, Y2015_2018 = (Y2015 + Y2016 + Y2017 + Y2018) / 4)]

# count the WS variables per site, for sites that
#  have any WS data during 2015-2018
data_aval_sub2 <- data_aval_sub2[Y2015_2018 > 0,
                                 .(.N, Y2015_2018 = mean(Y2015_2018)),
                                 by = .(SITE_ID)]

# keep sites with more than one WS variable
crop_ls4 <- data_aval_sub2[N > 1, ]
pander::pandoc.table(crop_ls4)
SITE_ID N Y2015_2018
US-ARM 3 0.8766
US-Ne1 4 0.7027
US-Ne2 4 0.709
US-Ne3 4 0.7287
US-Wkg 2 0.9942
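
To see which positions the multiple WS variables correspond to, the position qualifier can be split off the variable name. The base R sketch below assumes the AmeriFlux convention that a name such as WS_1_2_1 ends in three integer indices for horizontal position, vertical position, and replicate (H, V, R); that reading, the regular expression, and the names vars, pattern, and qual are assumptions made for illustration. The variable names are of the kind shown in the VARIABLE column of the data summary later in this vignette.

```r
# example variable names; the _H_V_R reading (horizontal, vertical,
# replicate) is an assumption, see the AmeriFlux website for definitions
vars <- c("WS_1_1_1", "WS_1_2_1", "WS_1_3_1", "USTAR_1_1_1")

# match a trailing "_<int>_<int>_<int>" so base names that
# themselves contain underscores (e.g., LW_IN) are still handled
pattern <- "^(.*)_([0-9]+)_([0-9]+)_([0-9]+)$"

qual <- data.frame(
  VARIABLE = vars,
  BASENAME = sub(pattern, "\\1", vars),
  H = as.integer(sub(pattern, "\\2", vars)),
  V = as.integer(sub(pattern, "\\3", vars)),
  R = as.integer(sub(pattern, "\\4", vars))
)
print(qual)

# number of distinct vertical positions per base name
print(tapply(qual$V, qual$BASENAME, function(v) length(unique(v))))
```

Names that do not end in three integer indices would not match this pattern, so a real workflow should filter with grepl(pattern, vars) first.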

A companion function, amf_plot_datayear(), visualizes the data availability in an interactive figure. It is strongly advised to subset the sites, variables, and/or years for faster processing and better visualization.

#### not evaluated, to reduce vignette size
# plot data availability for WS & USTAR
#  for selected sites in 2015-2018
amf_plot_datayear(
  site_set = crop_ls4$SITE_ID,
  var_set = c("WS", "USTAR"),
  nonfilled_only = TRUE,
  year_set = c(2015:2018)
)

Get data summary

In addition, users can use amf_summarize_data() to query summary statistics of specific variables in the BASE data. amf_summarize_data() provides summary statistics (e.g., percentiles) for each variable before the data are downloaded.

By default, amf_summarize_data() returns variable summary (selected percentiles) for all variables and sites. The site_set and var_set parameters can be used to subset the sites or variables of interest.

## get data summary for selected sites & variables
data_sum <- amf_summarize_data(site_set = crop_ls3$SITE_ID,
                               var_set = c("WS", "USTAR"))
pander::pandoc.table(data_sum[c(1:10), ])
Table continues below
  SITE_ID VARIABLE BASENAME GAP_FILLED DATA_RECORD
4165 US-ARM WS_1_1_1 WS FALSE 353556
4168 US-ARM USTAR_1_1_1 USTAR FALSE 353556
4221 US-ARM WS_1_2_1 WS FALSE 353556
4224 US-ARM USTAR_1_2_1 USTAR FALSE 353556
4248 US-ARM WS_1_3_1 WS FALSE 353556
4251 US-ARM USTAR_1_3_1 USTAR FALSE 353556
10404 US-Ne1 USTAR_1_1_1 USTAR FALSE 175320
10476 US-Ne1 WS_1_1_1 WS FALSE 175320
10477 US-Ne1 WS_1_2_1 WS FALSE 175320
10478 US-Ne1 WS_1_3_1 WS FALSE 175320
Table continues below
  DATA_MISSING Q01 Q05 Q10 Q15 Q20
4165 23542 0.5092 1.06 1.444 1.749 2.021
4168 23086 0.02901 0.05339 0.07707 0.1014 0.1271
4221 32209 0.8247 1.725 2.389 2.894 3.331
4224 31081 0.03152 0.05623 0.07908 0.1017 0.1259
4248 51797 0.988 2.126 3.002 3.689 4.278
4251 45785 0.03332 0.05838 0.08043 0.1016 0.1243
10404 15578 0.024 0.049 0.071 0.093 0.116
10476 118560 0.55 0.94 1.19 1.37 1.53
10477 10493 0.8 1.2 1.49 1.72 1.95
10478 11666 0.52 0.77 0.99 1.19 1.39
Table continues below
  Q25 Q30 Q35 Q40 Q45 Q50 Q55
4165 2.274 2.531 2.796 3.075 3.361 3.669 3.999
4168 0.1533 0.1794 0.2052 0.2305 0.2556 0.281 0.3068
4221 3.732 4.113 4.476 4.838 5.207 5.575 5.951
4224 0.1517 0.1788 0.2056 0.2325 0.2597 0.2868 0.3142
4248 4.799 5.289 5.763 6.229 6.685 7.143 7.611
4251 0.1495 0.1775 0.2063 0.2361 0.2657 0.2951 0.3253
10404 0.14 0.164 0.188 0.213 0.238 0.263 0.289
10476 1.68 1.81 1.96 2.1 2.26 2.44 2.63
10477 2.17 2.4 2.64 2.89 3.16 3.44 3.75
10478 1.59 1.79 2 2.22 2.45 2.71 3.01
Table continues below
  Q60 Q65 Q70 Q75 Q80 Q85 Q90
4165 4.353 4.736 5.155 5.628 6.174 6.836 7.701
4168 0.3334 0.3613 0.3919 0.4253 0.4635 0.509 0.5668
4221 6.34 6.757 7.216 7.738 8.362 9.123 10.16
4224 0.342 0.3711 0.4012 0.4341 0.4714 0.5148 0.5697
4248 8.09 8.577 9.086 9.631 10.23 10.96 11.94
4251 0.3567 0.3886 0.4227 0.4596 0.5006 0.5491 0.6111
10404 0.315 0.343 0.373 0.406 0.4438 0.49 0.551
10476 2.85 3.1 3.38 3.69 4.04 4.48 5.06
10477 4.1 4.49 4.93 5.42 6 6.7 7.62
10478 3.34 3.73 4.17 4.69 5.28 5.98 6.87
  Q95 Q99
4165 8.963 11.3
4168 0.6528 0.8272
4221 11.71 14.47
4224 0.6519 0.8205
4248 13.43 16.32
4251 0.706 0.9164
10404 0.645 0.852
10476 5.93 7.61
10477 9.01 11.65
10478 8.19 10.61
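
Two quantities are easy to derive from the summary: the fraction of missing records and the interquartile range. The sketch below assumes the data.table package is installed, and it assumes that DATA_RECORD and DATA_MISSING are counts of total and missing records, which is inferred from the column names rather than stated above. It uses four rows copied from the tables above instead of a live query; sum_demo, MISSING_FRAC, and Q_RANGE are illustrative names.

```r
library(data.table)

# values copied from the tables above (not queried live)
sum_demo <- data.table(
  SITE_ID      = c("US-ARM", "US-ARM", "US-Ne1", "US-Ne1"),
  VARIABLE     = c("WS_1_1_1", "USTAR_1_1_1", "USTAR_1_1_1", "WS_1_1_1"),
  DATA_RECORD  = c(353556, 353556, 175320, 175320),
  DATA_MISSING = c(23542, 23086, 15578, 118560),
  Q25          = c(2.274, 0.1533, 0.14, 1.68),
  Q75          = c(5.628, 0.4253, 0.406, 3.69)
)

# fraction of records that are missing, and the interquartile range
sum_demo[, `:=`(MISSING_FRAC = DATA_MISSING / DATA_RECORD,
                Q_RANGE      = Q75 - Q25)]

# list the variables with the largest missing fraction first
print(sum_demo[order(-MISSING_FRAC),
               .(SITE_ID, VARIABLE, MISSING_FRAC = round(MISSING_FRAC, 3), Q_RANGE)])
```

Under that reading of the columns, WS_1_1_1 at US-Ne1 stands out, with about two thirds of its records missing, whereas the other three variables are each missing less than 10%.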

Alternatively, a companion function, amf_plot_datasummary(), provides an interactive visualization of the data summary.

#### not evaluated, to reduce vignette size
## plot data summary of USTAR for selected sites
amf_plot_datasummary(
  site_set = crop_ls3$SITE_ID,
  var_set = c("USTAR")
)
#### not evaluated, to reduce vignette size
## plot data summary of WS for selected sites,
#  including clustering information
amf_plot_datasummary(
  site_set = crop_ls3$SITE_ID,
  var_set = c("WS"),
  show_cluster = TRUE
)

With a target site list in hand, users can download these sites’ data and metadata using the site IDs. See Data import for data download and import examples.