Process data in Protobi

Updated at April 5th, 2020

Protobi lets you write and execute programs that modifies your data prior to analysis.   This tutorial shows how to set up a data process in Protobi.

Why process data?

Even for the best designed surveys, you may need to do additional data refactoring and cleaning 

  • remove respondents
  • merge in translations
  • combine month and year into a single date
  • combine waves
  • stack patient cases, choice cards, etc.
  • create a new variable

Process data in Protobi or externally

One approach is to use SPSS, SAS, R, Excel or something else to make those changes and upload the revised data to Protobi as a data table.  This can be just fine for one-time changes. 

But -- and you might like this a lot better, you can do serious data processing in Protobi as well.  One advantage of processing data in Protobi is that all your processing code is in one place, neatly organized, with changes tracked.   So if you get new data, or later need to review the changes you made, it's all visible one place.

Protobi uses Javascript for data processing

Protobi can execute data process code written in Javascript. Javascript is a powerful first-class language comparable to C, Perl, and Python which can do many things (and possibly do them more easily than you could in SPSS or SAS).  And it is the language of the web, and runs in every browser.   

So if you're new to Javascript, welcome, and keep in mind you're learning a language that's not specific to Protobi but broadly applicable to many data visualization applications.

Processing in other languages

You can process Protobi data in other languages too. Use the Protobi REST API to download a file from Protobi into your preferred language, then upload the result.

If you prefer to work in R specifically, Protobi R library wraps the REST API to load data into or from an R data frame, with value labels and variable labels as factors.

Dynamic versus static data processes

Protobi has two types of data processes, dynamic and static:

  • Dynamic: Code in the "Pre-calculate" admin tab runs dynamically, and is executed every time you open the project.
  • Static: Code in the "Data" admin tab runs only on command, and creates a data table that's accessible to the project

Pre-calculate

The Pre-Calculate tab lets you define a function that runs every time you open the project,   It's best suited for lightweight data processing and or code that otherwise needs to run when the project opens.

The tab defines the body of a function that receives a data value rows that represents the entire "primary" data table for the project (i.e. the data table marked as "Primary" on the Data tab) and returns an array of row objects representing the final data at the end.

Minimal 

By default, Pre-calculate starts out as the the simplest possible process: it simply returns the input data table as the final data:

return rows;

Example : Combine dates

Let's say your study asks for a date, but asks it as three separate columns:

  • F4_1 month of year (1 to 12)
  • F4_2 day of month (1 to 31)
  • F4_3 year (four digit integer)

For analysis we might want to combine it into a single ISO date string for analysis.



This editor starts with the simplest possible code return rows which just returns an array containing your dataset.

In this page we can write a program that iterates over this array and creates a new data column F4_date that concatenates these into an ISO date string with format YYYY-MM-DD:

rows.forEach( function(row) { 
       row.F4_date = [row.F4_3, row.F4_1, row.F4_2].join('-')
})
return rows;


The page should now look like:


In the project you now have a new data column F4_date .  Create a new element with the same key to see it as a chart:


The data column itself is a string value.  Select "More properties..." from the context menu and set its type "Date" so that it will be interpreted as a date value, allowing you to display it in more interesting ways: 


How data is represented in Protobi

Protobi stores your data table as a CSV data file, e.g.:

country,s0,s1,s2
US,Oncology,30,Children's Hospital,...
IN,Ophthamology,Lilati Hospital,...
DE,Pulmonology,99,Universitätsklinikum Tübingen

In the browser, your raw data is parsed and stored in memory as a Javascript array variable called rows.

Each element of the array represents one row of your data. For example

[
  {   
    "country": "US", 
    "s0": "Oncology", 
    "s1": 30, 
    "s2": 
    "Children's Hospital", 
    ... 
  }, 
  {   
    "country": "IN", 
    "s0": "Ophthamology",
    "s1": 22,
    "s2": "Lilavati Hospital", 
    ... 
  },
  {   
    "country": "DE", 
    "s0": "Pulmonology",  
    "s1": 99, 
    "s2": "Universitätsklinikum Tübingen", 
     ... 
    },
  ...
]

Read the data

In Pre-Calculate, you can access rows by integer index (starting from zero). For instance, the first row is an object:

rows[0] // {   "country": "US", "s0": "Oncology", ... }

Column values within a row can be accessed by column name. If the column name is simple (starts with a character and uses only alphanumeric characters or underscores after that), you can just use dot notation. Or you can always use reference notation as below:

rows[0].s0    // dot notation, "Oncology"
rows[0]["s0"] // index notation, "Oncology"

Modify the data

You can modify the data directly. For instance, in the example below we're setting the variable "region" to "North America" for the first row. The variable region need not be defined in advance, you can just create it:

row[0]["region"] = "North America"

Iterate over rows

You can iterate over rows in several ways. One way is to pedantically type the row number:

rows[0]["s0"]  // "Oncology"
rows[1]["s0"]  // "Ophthamology"
rows[2]["s0"]  // "Pulmonology"

Easier is to iterate using a for(...) {...} loop:

for (var i=0; i<rows.length; i++) {
   var row = rows[i];
   if (row.s0 === 'Oncology') row.specialist = true;
   if (row.s1==99) row.s1 = null;  // recode `99` to missing
}

Javascript arrays have a method .forEach(fn) whose argument is a function that iterates over each row and calls a function with each row and its index as arguments.  So a more typical way of iterating over rows is as follows:

rows.forEach(function(row) { 
   if (row.s0 === 'Oncology') row.specialist = true; 
   if (row.s1 == 99) row.s1 = null;
}


Define new variables

You can define new variables as you wish. Let's say that we wish to calculate a sum of questions A4_a, A4_b, and A4_c. You can do it as follows:

rows.forEach(function(row, index) {
   row.S1_seg = row.S1 
   row.A4_sum = (+row.A4_a) + (+row.A4_b) + (+row.A4_c) 
});

Filter or remove respondents

You can exclude rows from your data by using the Javascript array method filter. For instance, if our dataset includes all respondents who took the screener, but we wish to focus analysis only on people who completed the survey, we might filter for completes as follows:

rows = rows.filter(function(row) { return row.complete == true; })

If there are respondents that you want to remove from the Protobi view, you can filter out the respondents as follows:

rows = rows.filter(function(row) 
{     
    if (row.respid == "101") return false // remove respondent with respid ="101"     
    if (row.respid == "102") return false      // remove respondent with respid ="102"     
    if (row.status != "COMPLETE") return false // remove non-complete respondents     
    return true // do not remove anyone else 
}) 
return rows;



Static data processes

Another approach is to define a data process.  This executes code and saves the result so that it can be readily access.

This is better when the code is complex, or you want to be able to download the file to CSV or SPSS with the results of the code.  A particular advantage of static data processes is the ability to operate on multiple files, for tasks like merging and stacking.

A data process is only executed when you choose to "Edit/run" it. Its result is stored as a data file which can be the primary data file for the project. 

Simple vs Asynchronous 

Static data processes come in two forms, "Simple" and "Asynchronous".   Simple is for when you want to operate only on data tables (blue icons) in the same project, and return a single modified table. 

Asynchronous is for when you need to do more sophisticated processing, including merging external data, operate on results of other processes, or data in other projects.

Minimal example

A "Simple" data process receives an variable data which is an object containing the contents of each data table as an attribute. 

If a project has a data table with  key "main" a minimal process would simply return the results that data table is:

var rows= data["main"]
return rows;

Simple example

A slightly more complicated example iterates over each row and defines three new row variables:

  • row.area is set to the value of row.Q1
  • row.radius is calculated from row.area
  • row.rituximab is either 0 or 1 depending on if the value of row.A4 matches the text fragment "uximab"
var rows = rows.main;
rows.forEach(function(row) {  
  row.area = row.Q1;
  row.radius = Math.sqrt(row.area / Math.PI)   
  

  // Use Regular Expressions to test for string matches
  row.rituximab = row.A4.test(/uximab/i) ? 1 : 0;
})

Combine data tables vertically ("stack")

You can combine multiple data tables. If a project has data tables named "wave1", "wave2", and "wave3" these can be referenced in the simple data process as data.wave1, data.wave2, and data.wave3.

The code below catenates these arrays together.

var stacked = [].concat([ data.wave1, data.wave2, data.wave3])
return stacked;

Merge data files (" join")

You can merge data files matching by one or more merge keys using the function Protobi.left_join(left, rows, keys, fields):

  • left: the first argument is an array of row objects representing the main data (the "left" table).
  • rows: the second arguments is an array of row objects to be merged in (the "left" table).
  • keys: the third argument is an array of column names to be used to identify matching rows.
  • fields: the fourth argument is an array of column names to be retained from the "right" table.

The example below merges data from table translations into table main using columns responseid and wave as join keys, and pulling in columns q19_OE and q27_OE from the second table:

var rows= data["main"]
var translations = data["translations"]

var keys = ["responseid", "wave"]
var keep = ['q19_OE','q27_OE']


merged = Protobi.left_join(rows, translations, keys, keep);
return merged;


If the column names in the right file differ from those in the left file, the merge keys can be a hash instead of an array. 

Here this merges by matching responseid in the left table to RESPID in the right table, and wave in the left table to WAVE in the right table.  It also keeps column Q19_OE from the right table and saves it as q19 in the result.

var rows= data["main"]
var OE = data["OE"]

var keys = {"responseid": "RESPID", "wave":"WAVE"}
var keep = {'q19': "Q19_OE'", "q27": "Q27_OE"}


merged = Protobi.left_join(rows, OE, keys, keep);
return merged.


Tricky point: Numeric vs string values

Protobi represents your data as a CSV text file, and on load, all data in the file are represented as strings, e.g. the string "2" rather than the number 2.   To treat numbers as numbers, you have to explicitly cast values to numbers.

In the above example we used the + operator to convert the value row.A4_a from a string to a number.

Protobi loads your data from CSV files, and CSV files represent all data as string values. The Protobi app is tolerant of number types, so 5 and "5" are both interpreted as the number five. Normally you don't need to think about it.

But when processing data files, you may need to be aware of the difference because Javascript adds strings and numbers differently.

Adding two numbers yields a number (e.g., 2 + 2 is 4) but adding two strings yields a string (e.g., "2" + "2" is "22").

We can cast a string to a number by using the + sign. E.g., (+"2") + (+"2") yields the number 4 instead of the string 22.


Asynchronous data processes

The most advanced case is for when you need to go outside the lines and combine data from multiple Protobi projects, non-Protobi projects, or just go crazy with code.   For this asynchronous data processes allow you to draw data from multiple independent sources. call a wider array of functions, and take precise control over how the data is returned..

The main difference is it is up to the process code to pull in the input data required, and explicitly call a return function callback when done.

In the Data tab under "Project settings..." chose a data process and set it to "Asynchronous" rather than "Simple":



For asynchronous processes, the code defines the body of a function that at the end calls a function callback(error, rows).  The data process is responsible for retrieving any and all information required, and need not return any data at all.

As it as asynchronous it can get data from a wide variety of sources including the project itself, other Protobi projects, and other external sources.  It need not even return any data at all.

Example

This is a minimal example that gets the results from one data table and returns it unchanged:

Protobi.get_table("main", function(err, rows) {
  return callback(null, rows)
})

Example

Asynchronous data processes can access not just data tables but can also get the elements configuration.  The following example does three things:

  • gets the elements configuration for this project,
  • gets the data from data table main from another project
  • returns that table as the data table for this process
Protobi.get_elements(function(err, tabular) {
    if (err) return callback(err)
    
    var q1 = tabular.getDimension('q1') 

    Protobi.get_tables("5cf6a9ae9468bab783ebc277","main", function(err, rows) {
        if (err) return callback(err)
        console.log(rows)

        rows.forEach(function(row) {
            row.q1 = q1.getValue(row)   // use the element q1 to calculate a value for row.q1 
        }) 
        
        return callback(null, rows)
    })
})

    

Here is an example of an asynchronous data process in context:

Utility functions 

Asynchronous data processes can access the following utility functions:

Protobi.get_table( key, function(err, rows) { ... })

Gets data from a data table or process within this project.  The callback function contains a value rows which is an array of row objects representing the table data.  If an error occurs, err will have a truthy value.

Protobi.get_table( datasetId, key, function(err, rows) { ... })

Gets data from a data table or process within project represented by datasetId.  The callback function contains a value rows which is an array of row objects representing the table data.  

Protobi.get_tables( [keys], function(err, data) { ... })

Gets data from multiple tables or process within this project.  The callback function contains a value rows which is a a hash object with keys corresponding to each data table.

Protobi.get_table( datasetId, [keys], function(err, data) { ... })

Gets data from multiple tables or process within project represented by datasetId.  The callback function contains a value rows which is a a hash object with keys corresponding to each data table.

Protobi.set_table( key, rows, filename, function(err, rows) { ... })

Saves an array of row values as a data table to this project, with a given filename.    This allows you to save data to one or more existing or new data tables.  If key is equal to the key for the process itself, this is akin to calling callback(err, rows) .

Protobi.set_table( datasetId, key, filename,  function(err, rows) { ... })

Saves an array of row values as a data table to another project specified by datasetId, with a given filename.   If datasetId and key are equal to the datasetId and key for the process itself, this is akin to calling callback(err, rows) .

Protobi.get_elements( function(err, tabular) { ... })

Returns elements configuration for this project as a collection of live element objects. See the Protobi JSON API for more details.

Protobi.get_elements( datasetId, function(err, tabular) { ... })

Returns elements configuration for a specified project as a collection of live element objects. See the Protobi JSON API for more details.

Protobi.left_join( left, right, [keys], [keeps])

Merge two arrays of row objects, left and right, using columns keys  as merge keys, and retaining specified keep columns from the right table.  This assumes keys have the same names in the left and right tables, and keeps have retain the same names in the right table and result.  This returns a new array of row objects and does change the inputs.

Protobi.left_join( left, right, {keys}, {keeps})

Merge two arrays of row objects, left and right, using columns in keys  as merge keys, and retaining specified keep columns from the right table.  The hash structure allows you to refer to different column names for merge keys in the left and right table and change column names when retaining them from the right table into the results.  This returns a new array of row objects and does change the inputs.

Global objects

Data processes can also access a number of global libraries:

  • d3 (d3js.com)     Diverse collection of data utilities
  • _ (lodash.com)   Diverse collection of common utility functions
  • moment (momentjs.com)   Advanced date processing functions


Advanced support

Protobi can handle data processing that's a lot more complicated than what's mentioned above. Rather than try to put it all in a tutorial, our support team is ready to help you with your specific goals. We can give you code that does what you need and show you how to modify it from there.

Please contact us at support@protobi.com

Was this article helpful?