Introduction
A simple pointblank tutorial
The pointblank package provides a wide range of validation rules that we can use to check our data.
Consider the data set age_dataset, which contains patients’ ages and is printed using reactable.
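The original table is not reproduced here, so the sketch below builds a small stand-in. The values and the ID column are assumptions made purely for illustration; only the column names Age and Age Invalid come from the text.

```r
library(tibble)

# Hypothetical stand-in for age_dataset; the values are invented for illustration.
age_dataset <- tibble(
  ID = 1:5,                               # hypothetical patient identifier
  Age = c(34, 51, 27, 68, 45),            # all positive
  `Age Invalid` = c(34, -51, 27, 0, 45)   # two non-positive values
)

# Print as an interactive table (requires the reactable package).
reactable::reactable(age_dataset)
```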
Check if positive
We start by checking that every value in the column Age is positive. We can use the function col_vals_gt() to do that.
In the code below, we check that every value in the column Age is greater than 0 (a positive number). When there are no issues, the function returns the data set unchanged.
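A minimal sketch of this check, run on a hypothetical stand-in for age_dataset (the values are invented for illustration):

```r
library(pointblank)

# Hypothetical stand-in for age_dataset.
age_dataset <- tibble::tibble(
  ID = 1:5,
  Age = c(34, 51, 27, 68, 45),
  `Age Invalid` = c(34, -51, 27, 0, 45)
)

# Every Age value is greater than 0, so col_vals_gt() hands the data back.
validated <- col_vals_gt(age_dataset, columns = "Age", value = 0)
validated
```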
This lets us carry on with the data pipeline, for example by renaming a column.
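Because the validated data is passed along, the check can sit in the middle of a pipeline. The sketch below assumes dplyr is available and uses a hypothetical new column name, PatientAge; the data values are also invented.

```r
library(pointblank)
library(dplyr)

# Hypothetical stand-in for age_dataset.
age_dataset <- tibble::tibble(
  ID = 1:5,
  Age = c(34, 51, 27, 68, 45),
  `Age Invalid` = c(34, -51, 27, 0, 45)
)

renamed <- age_dataset %>%
  col_vals_gt(columns = "Age", value = 0) %>%  # stops here if any row fails
  rename(PatientAge = Age)                      # only runs when the check passes

names(renamed)
```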
Here is an example in which we validate the column Age Invalid, which has at least one non-positive row. In this case an error is raised, indicating how many rows do not meet the criterion.
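The failing case can be sketched as follows. The data values are hypothetical, and the error is caught with tryCatch() only so that the script keeps running; in a real pipeline you would normally let it stop.

```r
library(pointblank)

# Hypothetical stand-in for age_dataset.
age_dataset <- tibble::tibble(
  ID = 1:5,
  Age = c(34, 51, 27, 68, 45),
  `Age Invalid` = c(34, -51, 27, 0, 45)  # -51 and 0 are not greater than 0
)

# When used directly on a table, col_vals_gt() stops as soon as one row fails.
result <- tryCatch(
  col_vals_gt(age_dataset, columns = "Age Invalid", value = 0),
  error = function(e) conditionMessage(e)  # keep the message instead of stopping
)

cat(result)
```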
Pointblank workflow (Simplified)
In practice, you may have to identify the rows that fail the validation test and send them to your collaborators for clarification. This can be done in three main steps.
We first create an agent object using the function create_agent() on age_dataset. Next, we assign the agent a validation plan using validation functions such as col_vals_gt(). Lastly, we call interrogate() so that the agent runs the validation plan and records the results.
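The three steps can be sketched as below. The data values, the tbl_name and the label are assumptions for illustration; the plan checks Age in Step 1 and Age Invalid in Step 2.

```r
library(pointblank)

# Hypothetical stand-in for age_dataset.
age_dataset <- tibble::tibble(
  ID = 1:5,
  Age = c(34, 51, 27, 68, 45),
  `Age Invalid` = c(34, -51, 27, 0, 45)
)

agent <- create_agent(                              # 1. create the agent
  tbl = age_dataset,
  tbl_name = "age_dataset",
  label = "Age checks (illustrative)"
) %>%
  col_vals_gt(columns = "Age", value = 0) %>%          # 2. Step 1 of the plan
  col_vals_gt(columns = "Age Invalid", value = 0) %>%  #    Step 2 of the plan
  interrogate()                                        # 3. run the plan

# Printing the agent shows its report in the Viewer.
agent
```

Unlike the direct call on a table, an agent does not stop at the first failure; it records the outcome of every step so the failures can be inspected afterwards.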
Once the agent has finished the validation, it can be printed in the Viewer.
You can identify the rows that fail a validation step by clicking the CSV button in the last (EXT) column of the report.
We can use the get_agent_report() function to further customise the report display.
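A short sketch of two common adjustments, again on hypothetical data: changing the title and ordering of the displayed report, and returning the report as a data frame instead of a display table. The title text is an arbitrary example.

```r
library(pointblank)

# Hypothetical stand-in for age_dataset.
age_dataset <- tibble::tibble(
  ID = 1:5,
  Age = c(34, 51, 27, 68, 45),
  `Age Invalid` = c(34, -51, 27, 0, 45)
)

agent <- create_agent(tbl = age_dataset) %>%
  col_vals_gt(columns = "Age", value = 0) %>%
  col_vals_gt(columns = "Age Invalid", value = 0) %>%
  interrogate()

# Display table with a custom title, ordered by severity rather than step number.
get_agent_report(agent, arrange_by = "severity", title = "Age validation")

# The same information as a plain data frame, one row per validation step.
report_tbl <- get_agent_report(agent, display_table = FALSE)
report_tbl
```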
Post validation workflow (Simplified)
We can extract the rows that didn’t pass a specific row-based validation step with the get_data_extracts() function. The argument i = 2 means we only look at the rows that failed Step 2 (which checks whether the values in column Age Invalid are greater than 0).
After data validation, you may want to split the data into two separate groups: the rows that pass the validation plan and the rows that fail it. This can be done using the function get_sundered_data().
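A minimal sketch of extracting the Step 2 failures, assuming the hypothetical data and two-step plan below:

```r
library(pointblank)

# Hypothetical stand-in for age_dataset.
age_dataset <- tibble::tibble(
  ID = 1:5,
  Age = c(34, 51, 27, 68, 45),
  `Age Invalid` = c(34, -51, 27, 0, 45)
)

agent <- create_agent(tbl = age_dataset) %>%
  col_vals_gt(columns = "Age", value = 0) %>%          # Step 1
  col_vals_gt(columns = "Age Invalid", value = 0) %>%  # Step 2
  interrogate()

# Rows that failed Step 2, ready to be sent on for clarification.
failed_rows <- get_data_extracts(agent, i = 2)
failed_rows
```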
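The split can be sketched as follows, using the same hypothetical data and plan. Here a row counts as passing only if it passes every step in the plan.

```r
library(pointblank)

# Hypothetical stand-in for age_dataset.
age_dataset <- tibble::tibble(
  ID = 1:5,
  Age = c(34, 51, 27, 68, 45),
  `Age Invalid` = c(34, -51, 27, 0, 45)
)

agent <- create_agent(tbl = age_dataset) %>%
  col_vals_gt(columns = "Age", value = 0) %>%
  col_vals_gt(columns = "Age Invalid", value = 0) %>%
  interrogate()

# Rows that passed every validation step.
passed <- get_sundered_data(agent, type = "pass")

# Rows that failed at least one validation step.
failed <- get_sundered_data(agent, type = "fail")
```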
This ends our introduction. Check out the other chapters to see how more complicated data validation can be done.