Basic data analyses

Author

Irena Axmanová

In this tutorial, we will show you how to sort and filter the data. We will also train how to append traits, indicator values or information about the status of species and calculate proportions of selected plant groups, community weighted and unweighted means or other summary statistics. If you are interested in a more general overview of the functions, you are encouraged to check also our study materials here.

library(tidyverse)
library(janitor)

2.1 Data import

We will play a bit with measured variables from our plots and try few useful functions. For more see study materials and links there. We will first read the data newly, to be sure we know what we are working with. At this point we can actually clean the environment, remove everything.

2.1.1 env dataset

First I want to add some more variables that were measured separately. In addition this file is already filtered to a subset I want to use.

env_extra<- read_csv("data/axmanova_forest_env_extra.csv")%>%      
  clean_names()

Rows: 65 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): ForestTypeName
dbl (21): Releve_nr, ForestType, Herbs, Juveniles, CoverE1, Biomass, Soil_de...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

For the env dataset, I want to import the clean version saved in the previous script, but I will directly filter it to the same subset as in the dataset above. I can do it with semi_join. But I decided to merge both datasets at this point - to keep only the matching rows, but all the information I want. For more on joins, see here.

env<- read_csv("data/env.csv") %>%       
  inner_join(env_extra %>%                                 
               select (releve_nr, forest_type, forest_type_name,                                                soil_ph=p_h_k_cl, biomass)%>%                           
               unite(forest, forest_type, forest_type_name))

New names:
Rows: 137 Columns: 21
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(6): coverscale, field_nr, country, author, syntaxon, locality dbl (15): ...1,
releve_nr, date, altitude, exposition, inclinatio, cov_trees...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(releve_nr)`
• `` -> `...1`

2.1.2 spe dataset

I want to work with the spe dataset, and I want to keep information about layers, as I want to focus on the herb-layer.

spe <- read_csv ("data/spe_merged_covers.csv")%>%    
  semi_join(env)

Rows: 4581 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): species
dbl (3): releve_nr, layer, cover_perc

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(releve_nr)`

2.1.3 traits

I also want to add some information about traits. So I will check what is available in my folder.

list.files("data")

 [1] "axmanova_forest_env_extra.csv"       "cover_20230402.csv"                 
 [3] "CzechVeg_tutorial.zip"               "data.zip"                           
 [5] "env.csv"                             "indicator_values.csv"               
 [7] "nomenclature_20251108.csv"           "nomenclature_check.csv"             
 [9] "plant_height.csv"                    "readme.txt"                         
[11] "remarks.cdx"                         "remarks.dbf"                        
[13] "spe.csv"                             "spe_merged_covers.csv"              
[15] "spe_merged_covers_across_layers.csv" "species.csv"                        
[17] "status.csv"                          "traits.csv"                         
[19] "tvabund.cdx"                         "tvabund.csv"                        
[21] "tvabund.dbf"                         "tvabund.xlsx"                       
[23] "TvAdmin.cdx"                         "TvAdmin.dbf"                        
[25] "tvhabita.cdx"                        "tvhabita.csv"                       
[27] "tvhabita.dbf"                        "tvhabita.xlsx"                      
[29] "tvwin.dbf"

I will import plant height data, Ellenberg-type indicator values and status (origin and Red List categories). All these data are available in the PLADIAS database of the Czech flora and vegetation. I used the datasets from the download section and adjusted the format for easier data handling.

plant_height<- read_csv("data/plant_height.csv")

Rows: 3258 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): species
dbl (3): height_min, height_max, height_mean

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

indicator_values <- read_csv("data/indicator_values.csv")

Rows: 3076 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): species
dbl (6): eiv_light, eiv_temperature, eiv_moisture, eiv_reaction, eiv_nutrien...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

status <- read_csv("data/status.csv")

Rows: 5752 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, origin, redlist

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.2 Basic data handling

Select extracts columns/variables based on their names or position. Select can be also used in combination with stringr package to identify the pattern in the names: try several options: starts_with, ends_with or more general one contains. With the use of select in your script you can order the variables always in the same way e.g. ID, forest type, species number, productivity, even during import. And you can also rename the variables using select, to get exactly what you need.

env_extra %>% 
  select(ID=releve_nr, forest_type=forest_type_name, species_nr=herbs, productivity=biomass)

# A tibble: 65 × 4
      ID forest_type         species_nr productivity
   <dbl> <chr>                    <dbl>        <dbl>
 1     1 oak hornbeam forest         26         12.8
 2     2 oak forest                  13          9.9
 3     3 oak forest                  14         15.2
 4     4 oak forest                  15         16  
 5     5 oak forest                  13         20.7
 6     6 oak forest                  16         46.4
 7     7 oak forest                  17         49.2
 8     8 oak hornbeam forest         21         48.7
 9     9 oak hornbeam forest         15         13.8
10    10 oak forest                  14         79.1
# ℹ 55 more rows

Distinct is a function that takes your data and remove all the duplicate rows, keeping only the unique ones. There are many cases where you will really appreciate this elegant and easy way. For example, I want a list of unique PlotIDs= releve_nr, unique combinations of two categories etc. Here I want to prepare list of forest types codes and names.

env_extra %>%
  arrange(forest_type) %>%
  distinct(forest_type,forest_type_name)

# A tibble: 4 × 2
  forest_type forest_type_name   
        <dbl> <chr>              
1           1 oak forest         
2           2 oak hornbeam forest
3           3 ravine forest      
4           4 alluvial forest

When we have a large dataset, we sometimes need to create a subset of the rows/cases by filter. First we have to define upon which variable we are going to filter the rows (e.g. Forest type, soil pH…) and which values are acceptable and which are not. Here I keep only plots from a type that exactly match.

env_extra %>% 
  filter(forest_type_name =="alluvial forest")

# A tibble: 11 × 22
   releve_nr forest_type forest_type_name herbs juveniles cover_e1 biomass
       <dbl>       <dbl> <chr>            <dbl>     <dbl>    <dbl>   <dbl>
 1       101           4 alluvial forest     28         6       90    91.1
 2       103           4 alluvial forest     35         5       80   114. 
 3       104           4 alluvial forest     25        11       85   188. 
 4       110           4 alluvial forest     37         4       95   126. 
 5       111           4 alluvial forest     26        10       85    84.8
 6       113           4 alluvial forest     37        10       70    74.5
 7       125           4 alluvial forest     25         8       95   123. 
 8       127           4 alluvial forest     53         6        0   176  
 9       129           4 alluvial forest     31         7       85   100. 
10       131           4 alluvial forest     59         3       95   163. 
11       132           4 alluvial forest     60         7       90   287. 
# ℹ 15 more variables: soil_depth_categ <dbl>, p_h_k_cl <dbl>, slope <dbl>,
#   altitude <dbl>, canopy_e3 <dbl>, radiation <dbl>, heat <dbl>,
#   trans_dir <dbl>, trans_dif <dbl>, trans_tot <dbl>, eiv_light <dbl>,
#   eiv_moisture <dbl>, eiv_soilreaction <dbl>, eiv_nutrients <dbl>, twi <dbl>

or those that are listed

env_extra %>% 
  filter(forest_type_name %in% c("alluvial forest", "ravine forest"))

# A tibble: 21 × 22
   releve_nr forest_type forest_type_name herbs juveniles cover_e1 biomass
       <dbl>       <dbl> <chr>            <dbl>     <dbl>    <dbl>   <dbl>
 1       101           4 alluvial forest     28         6       90    91.1
 2       103           4 alluvial forest     35         5       80   114. 
 3       104           4 alluvial forest     25        11       85   188. 
 4       105           3 ravine forest       12        10       80    65.5
 5       106           3 ravine forest       16         0       75    33.6
 6       108           3 ravine forest       28         7       70    82.3
 7       110           4 alluvial forest     37         4       95   126. 
 8       111           4 alluvial forest     26        10       85    84.8
 9       113           4 alluvial forest     37        10       70    74.5
10       114           3 ravine forest       12         6       50    71.5
# ℹ 11 more rows
# ℹ 15 more variables: soil_depth_categ <dbl>, p_h_k_cl <dbl>, slope <dbl>,
#   altitude <dbl>, canopy_e3 <dbl>, radiation <dbl>, heat <dbl>,
#   trans_dir <dbl>, trans_dif <dbl>, trans_tot <dbl>, eiv_light <dbl>,
#   eiv_moisture <dbl>, eiv_soilreaction <dbl>, eiv_nutrients <dbl>, twi <dbl>

Or I want to filter based on some values. If I have at the same time NAs in the variable, I should specify if I want to keep these rows or not. In the example below I will keep all rows/plots where biomass is higher than 80g/m2 but I will also keep all the NAs, i.e. plots where it was not measured.

env_extra %>% 
  filter(biomass>80 |is.na(biomass))

# A tibble: 17 × 22
   releve_nr forest_type forest_type_name    herbs juveniles cover_e1 biomass
       <dbl>       <dbl> <chr>               <dbl>     <dbl>    <dbl>   <dbl>
 1       101           4 alluvial forest        28         6       90    91.1
 2       103           4 alluvial forest        35         5       80   114. 
 3       104           4 alluvial forest        25        11       85   188. 
 4       108           3 ravine forest          28         7       70    82.3
 5       110           4 alluvial forest        37         4       95   126. 
 6       111           4 alluvial forest        26        10       85    84.8
 7       120           2 oak hornbeam forest    36         6       75    98.5
 8       121           2 oak hornbeam forest    30        12       80    84.7
 9       122           3 ravine forest          29         6       60   102. 
10       124           2 oak hornbeam forest    34        15       75   102. 
11       125           4 alluvial forest        25         8       95   123. 
12       127           4 alluvial forest        53         6        0   176  
13       128           3 ravine forest          20         4       70   121. 
14       129           4 alluvial forest        31         7       85   100. 
15       130           3 ravine forest          18         5       85   238  
16       131           4 alluvial forest        59         3       95   163. 
17       132           4 alluvial forest        60         7       90   287. 
# ℹ 15 more variables: soil_depth_categ <dbl>, p_h_k_cl <dbl>, slope <dbl>,
#   altitude <dbl>, canopy_e3 <dbl>, radiation <dbl>, heat <dbl>,
#   trans_dir <dbl>, trans_dif <dbl>, trans_tot <dbl>, eiv_light <dbl>,
#   eiv_moisture <dbl>, eiv_soilreaction <dbl>, eiv_nutrients <dbl>, twi <dbl>

You can also filter plots in a specified range of some value.

env_extra %>% 
  filter((biomass >= 40 & biomass <= 80) | is.na(biomass))

# A tibble: 20 × 22
   releve_nr forest_type forest_type_name    herbs juveniles cover_e1 biomass
       <dbl>       <dbl> <chr>               <dbl>     <dbl>    <dbl>   <dbl>
 1         6           1 oak forest             16         3       60    46.4
 2         7           1 oak forest             17         5       70    49.2
 3         8           2 oak hornbeam forest    21         1       70    48.7
 4        10           1 oak forest             14         4       75    79.1
 5        18           2 oak hornbeam forest    13         3       85    72.2
 6        86           2 oak hornbeam forest    32         5       40    46.9
 7        87           2 oak hornbeam forest    34         9       65    42.9
 8        99           2 oak hornbeam forest    18         6       85    64.8
 9       102           2 oak hornbeam forest    37        13       65    57.4
10       105           3 ravine forest          12        10       80    65.5
11       109           2 oak hornbeam forest    25         9       50    60.7
12       112           2 oak hornbeam forest    32        14       60    58.4
13       113           4 alluvial forest        37        10       70    74.5
14       114           3 ravine forest          12         6       50    71.5
15       115           3 ravine forest          23         2       70    54.6
16       116           3 ravine forest          16         3       45    64.5
17       117           2 oak hornbeam forest    31         7       55    54.7
18       118           2 oak hornbeam forest     8         5       45    66.2
19       123           3 ravine forest          34        11       70    53.6
20       126           2 oak hornbeam forest    27         7       65    72.9
# ℹ 15 more variables: soil_depth_categ <dbl>, p_h_k_cl <dbl>, slope <dbl>,
#   altitude <dbl>, canopy_e3 <dbl>, radiation <dbl>, heat <dbl>,
#   trans_dir <dbl>, trans_dif <dbl>, trans_tot <dbl>, eiv_light <dbl>,
#   eiv_moisture <dbl>, eiv_soilreaction <dbl>, eiv_nutrients <dbl>, twi <dbl>

Similarly, you can for example filter plots based on their plot size. The example below is not working on our dataset, but it comes useful in large EVA exports. Here I keep grassland plots in the range 10-100 m2, forest plots between 100-1000 m2 and I keep the plots without specified plot size.

env %>% 
  filter(
    (habitat == "grassland" & plot_size >= 10 & plot_size <= 100) |
    (habitat == "forest" & plot_size >= 100 & plot_size <= 1000) |
    is.na(plot_size)
  )

2.3 Summarise

I will a bit explore the data and ask about which factors influence species richness. Here are just few examples, we will read together the scripts and you will adopt them later to other tasks. First I want to produce a table with summary statistics like min, mean, max for two variables plant height, and biomass per forest type. The first one is a trait, so we need to append it to species file and calculate community means, while the other is a variable measured in each plot.

Combine and summarise different variables / explanation will be added

spe %>% 
  filter(layer == "6") %>%
  left_join(plant_height %>% 
              select(species, height = height_mean)) %>%
  left_join(env) %>%
  arrange(forest) %>%
  summarise(across(c(height,cov_herbs, biomass, cov_trees,soil_ph ),
      list(
        min  = ~min(.x, na.rm = TRUE),
        mean = ~mean(.x, na.rm = TRUE),
        max  = ~max(.x, na.rm = TRUE))),.by = forest)

Joining with `by = join_by(species)`
Joining with `by = join_by(releve_nr)`

# A tibble: 4 × 16
  forest          height_min height_mean height_max cov_herbs_min cov_herbs_mean
  <chr>                <dbl>       <dbl>      <dbl>         <dbl>          <dbl>
1 1_oak forest         0.075       0.495       1.45             8           37.2
2 2_oak hornbeam…      0.055       0.497      15                7           55.6
3 3_ravine forest      0.085       0.505       1.45            45           67.8
4 4_alluvial for…      0.055       0.707       8.5              0           76.2
# ℹ 10 more variables: cov_herbs_max <dbl>, biomass_min <dbl>,
#   biomass_mean <dbl>, biomass_max <dbl>, cov_trees_min <dbl>,
#   cov_trees_mean <dbl>, cov_trees_max <dbl>, soil_ph_min <dbl>,
#   soil_ph_mean <dbl>, soil_ph_max <dbl>

env %>%
  arrange(forest) %>%
  select(forest, cover_herbs=cov_herbs, biomass, cover_trees =cov_trees, soil_ph) %>%
  tidyr::pivot_longer(
    cols = c(cover_herbs, biomass, cover_trees, soil_ph),
    names_to = "variable",
    values_to = "value")%>% 
  ggplot(aes(x = forest, y = value, fill=forest)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  theme_bw() +
  theme(legend.position = 'none')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(x = NULL) +
  labs (y = NULL)

I might be interested in species richness relationship with several factors. Below is an example of herb-layer species richness and biomass. At first I need to calculate the species richness in the spe file. And as an output I want to plot this with different colours for different forest types but keep the regression line for the whole dataset.

spe %>% 
  filter(layer %in% c(6, 7)) %>% 
  count(releve_nr, name = "herblayer_richness") %>% 
  left_join(env %>% 
              select(releve_nr, biomass, forest), by = "releve_nr") %>% 
  ggplot(aes(x = sqrt(biomass), y = herblayer_richness)) +
  geom_point(aes(colour = forest), size = 2, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE, colour = "black") +
  theme_bw() +
  labs(
    x = "sqrt Biomass",
    y = "Herb-layer species richness",
    colour = "Forest type"
  )

`geom_smooth()` using formula = 'y ~ x'

Proportions of endangered or alien species in forest type. I will first look at the numbers of these species, here on example of redlist i.e. endangered species. I just one to see average number of redlist species per forest type.

spe%>% 
  left_join(spe %>% 
              left_join(status)%>% 
              filter(!is.na(redlist))%>%
              count(releve_nr, name="richness_redlist"))%>%
  left_join(spe %>% 
              summarise(richness_all = n_distinct(species), 
                        .by = releve_nr))%>%
  distinct(releve_nr,richness_redlist,richness_all)%>%
  mutate(richness_redlist = replace_na(richness_redlist, 0))%>%
  mutate(richness_redlist_perc= richness_redlist/richness_all*100)%>%
  left_join(env%>% select(releve_nr, forest))%>%
  summarise(redlist_mean_perc = mean(richness_redlist_perc), .by=forest)%>%
  arrange(forest)

Joining with `by = join_by(species)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(releve_nr)`

# A tibble: 4 × 2
  forest                redlist_mean_perc
  <chr>                             <dbl>
1 1_oak forest                       6.25
2 2_oak hornbeam forest              6.24
3 3_ravine forest                    5.03
4 4_alluvial forest                  1.71

But I can also ask about diferences in abundances of some plant groups. What is the proportion of cover of alien species relative to total cover in different forests? I will again use the function we used for cover combination during import from Turboveg.

combine_cover <- function(x){
  while (length(x)>1){
    x[2] <- x[1]+(100-x[1])*x[2]/100
    x <- x[-1]
  }
  return(x)
}

spe%>% 
  left_join(spe %>% 
              left_join(status)%>% 
              filter(origin %in% c("arch","neo"))%>%
              summarize(cover_alien = combine_cover(cover_perc), 
                        .by=releve_nr))%>%
  left_join(spe %>% 
              summarize(cover_total = combine_cover(cover_perc), 
                        .by=releve_nr))%>%
  distinct(releve_nr, cover_alien, cover_total)%>%
  mutate(cover_alien = replace_na(cover_alien, 0))%>%
  mutate(cover_alien_perc= cover_alien/cover_total*100)%>%
  left_join(env%>% select(releve_nr, forest))%>%
  summarise(alien_cover_mean_perc = mean(cover_alien_perc), .by=forest)%>%
  arrange(forest)

Joining with `by = join_by(species)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(releve_nr)`
Joining with `by = join_by(releve_nr)`

# A tibble: 4 × 2
  forest                alien_cover_mean_perc
  <chr>                                 <dbl>
1 1_oak forest                         0.0188
2 2_oak hornbeam forest                1.24  
3 3_ravine forest                      3.11  
4 4_alluvial forest                    3.44

2.4 Community weighted means

Summarise is often used for getting so-called community means or community weighted means. For example we have traits for individual species and we want to calculate a mean for each site and compare it. We can also consider to use abundance to weight the result. This is called community weighted mean and it simply gives higher importance to those species that are more abundant and lower importance to the rare once. Here the abundance is approximated as percentage cover in the site. The example below compares mean plant height both weighted and unweighted. To have meaningful values, I looked just at the herb-layers species.

spe %>%
  left_join(plant_height) %>%
  filter(layer==6)%>%
  summarise(meanHeight= mean(height_mean, na.rm=T),
            meanHeight_weighted = weighted.mean(height_mean, cover_perc, na.rm = TRUE),
            .by=releve_nr)

Joining with `by = join_by(species)`

# A tibble: 65 × 3
   releve_nr meanHeight meanHeight_weighted
       <dbl>      <dbl>               <dbl>
 1         1      0.364               0.322
 2         2      0.459               0.504
 3         3      0.408               0.388
 4         4      0.570               0.515
 5         5      0.566               0.528
 6         6      0.577               0.573
 7         7      0.498               0.559
 8         8      0.423               0.710
 9         9      0.543               0.494
10        10      0.5                 0.527
# ℹ 55 more rows

2.5 Ellenberg indicator values

In the same way we can calculate community means across more variables. Here we will try the Ellenberg indicator values (abbreviated as EIV, measures of species demands for the particular factors, the higher value mean higher demands or affinity to habitats with higher values of these environmental factors). We will apply them just for presences of species, so no weights. Therefore we do not need to care that much for the differences among layers. However, we will keep each species only once (list of unique species for each plot), so we will remove information about layers first and use distinct. Alternative is to add #filter(Layer==6)%>% if we want to focus on herb-layer only

spe %>%
  left_join(indicator_values) %>%
  select(-c(layer,cover_perc))%>%
  distinct()%>%
  group_by(releve_nr)%>%
  summarise(across(starts_with("eiv"), ~ mean(.x, na.rm = TRUE)))

Joining with `by = join_by(species)`

# A tibble: 65 × 7
   releve_nr eiv_light eiv_temperature eiv_moisture eiv_reaction eiv_nutrients
       <dbl>     <dbl>           <dbl>        <dbl>        <dbl>         <dbl>
 1         1      5.08            5.55         4.63         6.53          4.95
 2         2      4.87            5.2          4.8          5             4.33
 3         3      4.87            5.13         4.8          5             4.4 
 4         4      5.32            5.53         4.47         5.79          4.26
 5         5      6.07            5.57         4.14         5.57          3.86
 6         6      5.69            5.56         4.44         6.56          5.25
 7         7      5.74            5.47         4.42         6.26          4.95
 8         8      5               5.38         4.67         5.43          4.62
 9         9      5.26            5.74         4.53         6             4.58
10        10      6.33            5.87         3.73         6.13          3.8 
# ℹ 55 more rows
# ℹ 1 more variable: eiv_salinity <dbl>

Boxplots of Ellenberg-type indicator values can be produced for example like this:

spe %>% 
  left_join(indicator_values) %>%
  select(-c(layer, cover_perc)) %>%
  distinct() %>%
  group_by(releve_nr) %>%
  summarise(across(starts_with("EIV"), ~ mean(.x, na.rm = TRUE))) %>%
  left_join(env %>% select(releve_nr, forest))%>%
  pivot_longer(
    cols = starts_with("EIV"),
    names_to = "EIV_variable",
    values_to = "value")%>%
  ggplot(aes(x = forest, y = value, fill = forest)) +
  geom_boxplot() +
  facet_wrap(~ EIV_variable, scales = "free_y") +
  theme_bw()+
  theme(legend.position = 'none')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  labs(x = NULL) +
  labs (y = NULL)

Joining with `by = join_by(species)`
Joining with `by = join_by(releve_nr)`