R for Geoscience, R for O&G

Paulina Wozniakowska
5 min readMay 29, 2021

Part 2: Data manipulation

This is Part 2 of the tutorial series in which I introduce geoscientific data processing workflows using R. I used data downloaded from BCOGC Data Centre to present how R can help to manipulate this dataset to get more insight.

In this tutorial dplyr, data manipulation package is used to preprocess raw dataframe introduced in Part 1. The complete code including next steps and input data can be found here.

Dplyr functions used in this tutorial include:

  • filter() — filter specific rows by values
  • arrange() — sort dataframe by column(s)
  • select() — select specific columns of dataframe
  • mutate() — create new variables using functions of existing ones
  • summarize() — collapse many values down to a single summary
  • pipelines* (%>%) —not a function but operator used to perform multiple operations at once (within single chunk of code)

All the operations will be performed on the wells dataset created in Part 1.

Let’s load dplyr, along with ggplot2 first:

library(dplyr)
library(ggplot2)

Here’s the example how filter() function can be used to select from folders associated with formation “SLAVE POINT”. We filter the dataframe and assign the output to a new dataframe:

slave_pnt <- filter(wells, Formtn_name == "SLAVE POINT")

We can confirm if the operation was successful by checking the unique values of the Formtn_name column in the new dataframe:

unique(slave_pnt$Formtn_name)
head(slave_pnt)

If we want to filter using multiple values within the column, we can use %in% operator and specify the list of values to filter on:

muskwa <- filter(wells, Formtn_name %in% c("MUSKWA", "MUSKWA-OTTER PARK"))

Alternatively, the same result can be achieved by using the logical operator | (or) and listing the conditions separately:

muskwa <- filter(wells, Formtn_name == "MUSKWA" | Formtn_name == "MUSKWA-OTTER PARK")unique(muskwa$Formtn_name)

Next function is arrange() which is used sort our dataframe according to one or more columns.

By default, the dataframe will be sorted in ascending order. Here we use Wa_num column to sort wells dataframe by Well Authorization Number:

Wa_num_sorted <- arrange(wells, Wa_num) # ascending
head(Wa_num_sorted)

Similarly, in order to sort dataframe in descending order, we need to add desc() before the column name.

Wa_num_sorted <- arrange(wells, desc(Wa_num)) # descending
head(Wa_num_sorted)

Finally, we can sort dataframe using multiple columns, here’s the example how to sort it by Well Authorization Number (Wa_num, descending) and production period (Prod_period).

Wa_num_sorted <- arrange(wells, desc(Wa_num), Prod_period) # by multiple conditions
head(Wa_num_sorted)

We can now remove the newly created dataframe from the memory using rm() since it is no longer needed:

rm(Wa_num_sorted)

The next function is select() which allows to select specific columns from dataframe.

Let’s check the column names in wells dataframe first:

names(wells)

We can use code below to select only Area_name and Formtn_name from dataframe:

If we want to select multiple adjacent columns (inclusively) we can use “:” to specify the range, for example:

head(select(wells, Gas_prod_vol_e3m3:Cond_prod_vol_m3, Area_name, Formtn_name))

Using mutate() function we can create new columns by performing operations on existing ones. Let’s use Prod_period column to generate some new columns.

head(wells$Prod_period)

Prod_period column includes values of production year and production month joined into one string, so we can use substr() function to extract year and month values separately. To do that, we need to specify first and last position of the string and save them as new columns, Prod_year and Prod_month.

One interesting feature is the possibility to use new columns to create another column, Prod_ym simply by joining columns Prod_year and Prod_month. We can do that by using paste() function and specifying separator (here I used “-”). It can be further used for the plotting for example.

Note: Here, pipeline operator (%>%) is used to perform multiple steps within one chunk of code.

Code below presents all these steps together:

wells <- wells %>%
mutate(Prod_year = substr(Prod_period, 1, 4), # first 4 characters
Prod_month = substr(Prod_period, 5, 6), # 5th and 6th character
Prod_ym = paste(Prod_year, Prod_month, sep=”-”))

Last function in we will look into in this tutorial is summarise(), which is used to generate summary statistics (e.g. total sums, averages, medians). Summarise() is frequently used to calculate statistics within some specific groups of dataframes. For our dataset, we can summarize the total number of production days for each area.

head(prod_days_by_area)

First, we need to group our dataset by area using group_by(). In the next step, we apply summarize function on respective column and specify the name of the new column (total_prod_days) as well as the function to generate the statistics. We save the result as a new, 2-column tibble.

Note: A tibble is a subtype of dataframe with some additional features, generally more efficient for data analysis. You can read more about tibbles here.

prod_days_by_area <- wells %>%
group_by(Area_name) %>%
summarize(
total_prod_days = sum(Prod_days)
)
head(prod_days_by_area)

We can quickly plot the results using ggplot.

Here I visualized only the areas with total production days > 500000 using a bar chart. Since we want the height of the bars to represent values of the statistics, we use geom_col() geometry.

prod_days_by_area %>% filter(total_prod_days > 500000) %>% 
ggplot(aes(x = Area_name, y = total_prod_days)) +
geom_col() +
ggtitle("Total Production Days for most productive areas", ) +
ylab("Production days") +
xlab("Area")

That was a brief overview of the main dplyr() functions. In the next tutorial, I will present examples of how this dataset can be visualized using several ggplot geometries.

You can also check the complete code on my github page.

--

--