Working with DataFrames

DataFrame Structure

A DataFrame is the primary structure used to work with tabular data in Rawlytics Notebook.

Whether data comes from a CSV file, a dataset, or a previous analysis step, it is typically represented as a DataFrame before being explored or transformed.

A DataFrame is organized into rows, columns, and cells.

┌────┬─────────┬─────────┐
│ id │ product │ revenue │
├────┼─────────┼─────────┤
│ 1  │ A       │ 120     │
│ 2  │ B       │ 95      │
│ 3  │ C       │ 180     │
└────┴─────────┴─────────┘

DataFrames provide a consistent way to read, update, filter, combine, and export tabular data.

All DataFrame methods are documented in the DataFrame API.

Creating a DataFrame

A DataFrame can be created from existing data, such as an array of rows whose first row is the header.

For example:

var sales = [
  [ 'id', 'product', 'revenue' ],
  [   1 ,       'A',      120  ],
  [   2 ,       'B',       95  ],
  [   3 ,       'C',      180  ],
];

var df = new DataFrame(sales);

DataFrames are also commonly produced when importing CSV files or processing existing datasets.

Rows

Rows represent individual records.

[ 1, 'A', 120 ]

Each row typically describes a single entity, event, or observation.

Header

The header is the first row of the DataFrame and holds the column labels.

[ 'id', 'product', 'revenue' ]
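
To make the relationship between header and rows concrete, the following plain JavaScript sketch separates a raw array, like the one passed to new DataFrame above, into its header and its data rows. It does not use the DataFrame API; splitTable is a hypothetical helper introduced only for illustration.

```javascript
// Plain JavaScript illustration; splitTable is a hypothetical helper,
// not part of the DataFrame API.
var sales = [
  [ 'id', 'product', 'revenue' ],
  [   1 ,       'A',      120  ],
  [   2 ,       'B',       95  ],
  [   3 ,       'C',      180  ],
];

function splitTable(table) {
  return {
    header: table[0],      // first row: column labels
    rows: table.slice(1),  // remaining rows: individual records
  };
}

var parts = splitTable(sales);

console.log(parts.header);      // [ 'id', 'product', 'revenue' ]
console.log(parts.rows.length); // 3
```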

Columns

Columns represent attributes shared by all rows.

Each column contains values of the same type or meaning.

Values of the id column:

[ 1, 2, 3 ]
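
As a rough illustration of what "the values of a column" means, the sketch below extracts one column from a raw array of rows in plain JavaScript. It is not the DataFrame API; columnValues is a hypothetical helper, and the lookup by header label is an assumption made for this example.

```javascript
// Plain JavaScript illustration; columnValues is a hypothetical helper,
// not part of the DataFrame API.
var sales = [
  [ 'id', 'product', 'revenue' ],
  [   1 ,       'A',      120  ],
  [   2 ,       'B',       95  ],
  [   3 ,       'C',      180  ],
];

function columnValues(table, name) {
  // Find the position of the column label in the header row.
  var index = table[0].indexOf(name);
  if (index === -1) {
    throw new Error('Unknown column: ' + name);
  }
  // Take the value at that position from every data row.
  return table.slice(1).map(function (row) {
    return row[index];
  });
}

var ids = columnValues(sales, 'id');

console.log(ids); // [ 1, 2, 3 ]
```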

Cells

A cell is the intersection between a row and a column.

Row 1 + Column "revenue" = 120

Cells contain the individual values stored in the DataFrame.
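
The lookup above can be sketched in plain JavaScript. This is an illustration only, not the DataFrame API: cellValue is a hypothetical helper, and it assumes, as in the example above, that row 1 is the first data row after the header.

```javascript
// Plain JavaScript illustration; cellValue is a hypothetical helper,
// not part of the DataFrame API.
var sales = [
  [ 'id', 'product', 'revenue' ],
  [   1 ,       'A',      120  ],
  [   2 ,       'B',       95  ],
  [   3 ,       'C',      180  ],
];

function cellValue(table, rowNumber, name) {
  var index = table[0].indexOf(name);
  if (index === -1) {
    throw new Error('Unknown column: ' + name);
  }
  // Index 0 of the raw array is the header, so data rows start at 1.
  if (rowNumber < 1 || rowNumber >= table.length) {
    throw new Error('Unknown row: ' + rowNumber);
  }
  return table[rowNumber][index];
}

var value = cellValue(sales, 1, 'revenue');

console.log(value); // 120
```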

Exploring a DataFrame

Once a DataFrame has been created, its contents can be inspected and queried.

Shape

Before exploring individual values, it is often useful to understand the overall size of the DataFrame.

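The shape of a DataFrame describes its dimensions: the number of rows and the number of columns. The example below reads the shape of the DataFrame stored in the sales dataset.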

var ds = data.item('sales').shape();

notebook.log(ds);

Result:

{
  columns: 3,
  rows: 3,
}
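
To show how these two numbers relate to the underlying data, here is a plain JavaScript sketch that derives the same result from a raw array of rows. It is not how shape() is implemented; tableShape is a hypothetical helper, and it assumes the header row is not counted as a data row, which matches the result above.

```javascript
// Plain JavaScript illustration; tableShape is a hypothetical helper,
// not part of the DataFrame API.
var sales = [
  [ 'id', 'product', 'revenue' ],
  [   1 ,       'A',      120  ],
  [   2 ,       'B',       95  ],
  [   3 ,       'C',      180  ],
];

function tableShape(table) {
  return {
    // Number of labels in the header row.
    columns: table.length > 0 ? table[0].length : 0,
    // Every row except the header counts as a data row.
    rows: Math.max(table.length - 1, 0),
  };
}

var shape = tableShape(sales);

console.log(shape); // { columns: 3, rows: 3 }
```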

Reading Header

Retrieve the column names:

var columnNames = df.header.get();

Reading Columns

Retrieve the values of a single column:

var column = df.column('revenue').get();

Reading Rows

Access all rows stored in the DataFrame:

var rows = df.rows.get();

Access a single row:

var row = df.row(1).get();

Reading Cells

Retrieve a specific value:

var cell = df.cell(1, 'revenue').get();

Selecting Data

DataFrames provide methods to extract subsets of data.

For example, selecting specific columns:

var columns = df.columns('product', 'revenue').get();

Selections allow analyses to focus on the data that matters.
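
Conceptually, a column selection keeps the chosen labels and drops everything else from each row. The plain JavaScript sketch below illustrates that idea on a raw array of rows. It is not the DataFrame API; selectColumns is a hypothetical helper, and returning the header together with the rows is an assumption made for this example.

```javascript
// Plain JavaScript illustration; selectColumns is a hypothetical helper,
// not part of the DataFrame API.
var sales = [
  [ 'id', 'product', 'revenue' ],
  [   1 ,       'A',      120  ],
  [   2 ,       'B',       95  ],
  [   3 ,       'C',      180  ],
];

function selectColumns(table, names) {
  var header = table[0];
  // Resolve each requested label to its position in the header.
  var indexes = names.map(function (name) {
    var index = header.indexOf(name);
    if (index === -1) {
      throw new Error('Unknown column: ' + name);
    }
    return index;
  });
  // Rebuild every row (header included) with only those positions.
  return table.map(function (row) {
    return indexes.map(function (index) {
      return row[index];
    });
  });
}

var subset = selectColumns(sales, [ 'product', 'revenue' ]);

console.log(subset[0]); // [ 'product', 'revenue' ]
console.log(subset[1]); // [ 'A', 120 ]
```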

Modifying a DataFrame

DataFrames are mutable and can be updated as analysis progresses.

Updating Cells

Modify a single value:

df.cell(1, 'revenue').set(1000);

Updating Rows

Replace the contents of a row:

df.row(1).set([ 1, 'A', 1000 ]);

Adding Rows

Append new records to the DataFrame:

df.rows.add([
  [ 4, 'B',  900 ],
  [ 5, 'D', 1200 ]
]);

Removing Rows

Delete a row from the DataFrame:

df.row(1).remove();

Updating Columns

Replace the contents of a column:

df.column('revenue').set([ 1000, 800, 1100, 950, 1050 ]);

If the column does not exist, it will be created.
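
The replace-or-create behavior can be illustrated with a plain JavaScript sketch on a raw array of rows. It is not the DataFrame API: setColumn is a hypothetical helper, appending a new column at the end is an assumption, and unlike a mutable DataFrame this sketch returns a copy instead of changing its input.

```javascript
// Plain JavaScript illustration; setColumn is a hypothetical helper,
// not part of the DataFrame API.
var sales = [
  [ 'id', 'product', 'revenue' ],
  [   1 ,       'A',      120  ],
  [   2 ,       'B',       95  ],
  [   3 ,       'C',      180  ],
];

function setColumn(table, name, values) {
  if (values.length !== table.length - 1) {
    throw new Error('Expected one value per data row');
  }
  var index = table[0].indexOf(name);
  if (index === -1) {
    // Assumption: an unknown column is appended after the last one.
    index = table[0].length;
  }
  return table.map(function (row, i) {
    var copy = row.slice();
    // Row 0 is the header and receives the label; other rows receive values.
    copy[index] = i === 0 ? name : values[i - 1];
    return copy;
  });
}

var updated  = setColumn(sales, 'revenue', [ 1000, 800, 1100 ]);
var extended = setColumn(sales, 'region',  [ 'EU', 'US', 'EU' ]);

console.log(updated[1]);  // [ 1, 'A', 1000 ]
console.log(extended[0]); // [ 'id', 'product', 'revenue', 'region' ]
```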

Removing Columns

Delete a column from the DataFrame:

df.column('revenue').remove();

Updating the Header

Replace the contents of the header:

df.header.set([ 'id', 'product', 'amount' ]);

Combining DataFrames

Merge multiple DataFrames into a single structure:

df.union(otherDataFrame);

Combining DataFrames is useful when working with data coming from multiple sources.
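
As a rough picture of what combining two tables involves, the plain JavaScript sketch below appends the rows of one raw array to another. It is not how union is implemented: unionTables is a hypothetical helper, and it assumes both tables share the same header and that duplicate rows are kept. The actual method may treat mismatched headers or duplicates differently.

```javascript
// Plain JavaScript illustration; unionTables is a hypothetical helper,
// not part of the DataFrame API.
var firstQuarter = [
  [ 'id', 'product', 'revenue' ],
  [   1 ,       'A',      120  ],
  [   2 ,       'B',       95  ],
];

var secondQuarter = [
  [ 'id', 'product', 'revenue' ],
  [   3 ,       'C',      180  ],
  [   4 ,       'B',      900  ],
];

function unionTables(first, second) {
  // Assumption: both tables must have identical headers.
  var sameHeader = JSON.stringify(first[0]) === JSON.stringify(second[0]);
  if (!sameHeader) {
    throw new Error('Headers do not match');
  }
  // Keep the first header, then all data rows from both tables.
  return first.concat(second.slice(1));
}

var combined = unionTables(firstQuarter, secondQuarter);

console.log(combined.length); // 5 (one header row plus four data rows)
```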

DataFrames and Datasets

DataFrames are often stored inside datasets managed by the Data API.

data.item('sales').set(df);

Later:

var sales = data.item('sales').get();

This allows DataFrames to persist within the notebook and be reused across multiple cells.

Final Step

You now understand the core concepts of Rawlytics Notebook.