Working with DataFrames
DataFrame Structure
A DataFrame is the primary structure used to work with tabular data in Rawlytics Notebook.
Whether data comes from a CSV file, a dataset, or a previous analysis step, it is typically represented as a DataFrame before being explored or transformed.
A DataFrame is organized into rows, columns, and cells.
┌────┬─────────┬─────────┐
│ id │ product │ revenue │
├────┼─────────┼─────────┤
│ 1  │ A       │ 120     │
│ 2  │ B       │ 95      │
│ 3  │ C       │ 180     │
└────┴─────────┴─────────┘
DataFrames provide a consistent way to read, update, filter, combine, and export tabular data.
All DataFrame methods are available in the DataFrame API.
Creating a DataFrame
A DataFrame can be created from existing data.
For example:
// The first row is the header; the remaining rows are records.
var sales = [
    [ 'id', 'product', 'revenue' ],
    [ 1, 'A', 120 ],
    [ 2, 'B', 95 ],
    [ 3, 'C', 180 ],
];
var df = new DataFrame(sales);
DataFrames are also commonly produced when importing CSV files or processing existing datasets.
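Data often arrives as a list of records rather than as rows under a header. The sketch below is plain JavaScript and not part of the Rawlytics API; the helper name toTable is hypothetical. It converts an array of objects into a header-first array of arrays, assuming every record has the same keys as the first one.

```javascript
// Hypothetical helper: build a header-first table from an array of objects.
// Assumes every record has the same keys as the first record.
function toTable(records) {
    if (records.length === 0) {
        return [];
    }
    const header = Object.keys(records[0]);
    const rows = records.map((record) => header.map((name) => record[name]));
    return [header, ...rows];
}

const records = [
    { id: 1, product: 'A', revenue: 120 },
    { id: 2, product: 'B', revenue: 95 },
    { id: 3, product: 'C', revenue: 180 },
];

// salesTable[0] is the header row; salesTable[1] is the first record.
const salesTable = toTable(records);
```

The resulting array has the same layout as the sales array above, so it could be passed to new DataFrame in the same way.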
Rows
Rows represent individual records.
[ 1, 'A', 120 ]
Each row typically describes a single entity, event, or observation.
Header
The header is the first row of the DataFrame and holds the column labels.
[ 'id', 'product', 'revenue' ]
Columns
Columns represent attributes shared by all rows.
Each column contains values of the same type or meaning.
Values of the id column:
[ 1, 2, 3 ]
Cells
A cell is the intersection between a row and a column.
Row 1 + Column "revenue" = 120
Cells contain the individual values stored in the DataFrame.
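The relationship between header, rows, columns, and cells can be illustrated with plain arrays. This is only a conceptual sketch, not a description of how Rawlytics stores a DataFrame internally. The function names columnValues and cellValue are hypothetical, and the sketch assumes rows are numbered from 1, as "Row 1" in the example suggests.

```javascript
// Plain-array model of the example table (not the Rawlytics DataFrame class).
const table = [
    ['id', 'product', 'revenue'],
    [1, 'A', 120],
    [2, 'B', 95],
    [3, 'C', 180],
];

const header = table[0];      // column labels
const rows = table.slice(1);  // records, without the header

// Find the position of a column by its label.
function columnIndex(name) {
    const index = header.indexOf(name);
    if (index === -1) {
        throw new Error('Unknown column: ' + name);
    }
    return index;
}

// Values of one column, looked up by label.
function columnValues(name) {
    const index = columnIndex(name);
    return rows.map((row) => row[index]);
}

// Value at the intersection of a row and a column.
// Assumption: row numbers start at 1 for the first record.
function cellValue(rowNumber, name) {
    return rows[rowNumber - 1][columnIndex(name)];
}

const ids = columnValues('id');               // [ 1, 2, 3 ]
const firstRevenue = cellValue(1, 'revenue'); // 120
```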
Exploring a DataFrame
Once a DataFrame has been created, its contents can be inspected and queried.
Shape
Before exploring individual values, it is often useful to understand the overall size of the DataFrame.
The shape of a DataFrame describes its dimensions: rows and columns.
var ds = data.item('sales').shape();
notebook.log(ds);
Result:
{
    columns: 3,
    rows: 3,
}
Reading Header
Retrieve the column names:
var columnNames = df.header.get();
Reading Columns
Retrieve the contents of a single column:
var column = df.column('revenue').get();
Reading Rows
Access rows stored in the DataFrame:
var rows = df.rows.get();
Access a single row:
var row = df.row(1).get();
Reading Cells
Retrieve a specific value:
var cell = df.cell(1, 'revenue').get();
Selecting Data
DataFrames provide methods to extract subsets of data.
For example, selecting specific columns:
var columns = df.columns('product', 'revenue').get();
Selections allow analyses to focus on the data that matters.
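Conceptually, selecting columns is a projection: every row keeps only the values at the chosen column positions. The sketch below shows this on a plain header-first array. The helper name selectColumns is hypothetical, and the shape of the result is an assumption, since the return format of df.columns(...).get() is not specified here.

```javascript
// Hypothetical helper: keep only the named columns of a header-first table.
function selectColumns(table, names) {
    const header = table[0];
    const indexes = names.map((name) => {
        const index = header.indexOf(name);
        if (index === -1) {
            throw new Error('Unknown column: ' + name);
        }
        return index;
    });
    // Apply the same projection to the header and to every record.
    return table.map((row) => indexes.map((index) => row[index]));
}

const salesData = [
    ['id', 'product', 'revenue'],
    [1, 'A', 120],
    [2, 'B', 95],
    [3, 'C', 180],
];

const selection = selectColumns(salesData, ['product', 'revenue']);
// selection[0] is [ 'product', 'revenue' ]; selection[1] is [ 'A', 120 ]
```

The original table is left untouched; the selection is a new array.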
Modifying a DataFrame
DataFrames are mutable and can be updated as analysis progresses.
Updating Cells
Modify a single value:
df.cell(1, 'revenue').set(1000);
Updating Rows
Replace the contents of a row:
df.row(1).set([ 1, 'A', 1000 ]);
Adding Rows
Append new records to the DataFrame:
df.rows.add([
    [ 4, 'B', 900 ],
    [ 5, 'D', 1200 ]
]);
Removing Rows
Delete a row from the DataFrame:
df.row(1).remove();
Updating Columns
Replace the contents of a column:
df.column('revenue').set([ 1000, 800, 1100, 950, 1050 ]);
If the column does not exist, it will be created.
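The replace-or-create behaviour can be sketched on a plain header-first array. This is an illustration only: the helper name setColumn is hypothetical, and how Rawlytics handles a value list whose length differs from the number of rows is not stated here, so the sketch simply throws in that case.

```javascript
// Hypothetical helper: replace a column's values, or append the column if missing.
// Assumption: one value must be supplied per record.
function setColumn(table, name, values) {
    const [header, ...rows] = table;
    if (values.length !== rows.length) {
        throw new Error('Expected ' + rows.length + ' values, got ' + values.length);
    }
    let index = header.indexOf(name);
    if (index === -1) {
        header.push(name);          // create the column label
        index = header.length - 1;
    }
    // Rows are modified in place, mirroring the mutable DataFrame.
    rows.forEach((row, i) => {
        row[index] = values[i];
    });
    return table;
}

const salesRows = [
    ['id', 'product', 'revenue'],
    [1, 'A', 120],
    [2, 'B', 95],
    [3, 'C', 180],
];

setColumn(salesRows, 'revenue', [1000, 800, 1100]); // existing column is replaced
setColumn(salesRows, 'margin', [0.2, 0.1, 0.3]);    // missing column is created
```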
Removing Columns
Delete columns from the DataFrame:
df.column('revenue').remove();
Updating the Header
Replace the contents of the header:
df.header.set([ 'id', 'product', 'amount' ]);
Combining DataFrames
Merge multiple DataFrames into a single structure:
df.union(otherDataFrame);
Combining DataFrames is useful when working with data coming from multiple sources.
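The exact semantics of union (handling of duplicates, mismatched headers, and whether the original DataFrame is modified) are not described here. As a rough sketch under stated assumptions, the plain JavaScript below treats a union as appending the records of a second table to the first when both share the same header, and returns a new table. The helper name unionTables is hypothetical.

```javascript
// Hypothetical helper: combine two header-first tables that share a header.
// Assumptions: identical headers are required and duplicate rows are kept.
function unionTables(first, second) {
    const [headerA, ...rowsA] = first;
    const [headerB, ...rowsB] = second;
    const sameHeader =
        headerA.length === headerB.length &&
        headerA.every((name, i) => name === headerB[i]);
    if (!sameHeader) {
        throw new Error('Headers do not match');
    }
    // Copy every row so the result does not share arrays with the inputs.
    const copy = (row) => row.slice();
    return [copy(headerA), ...rowsA.map(copy), ...rowsB.map(copy)];
}

const storeSales = [
    ['id', 'product', 'revenue'],
    [1, 'A', 120],
    [2, 'B', 95],
];
const onlineSales = [
    ['id', 'product', 'revenue'],
    [3, 'C', 180],
];

const combined = unionTables(storeSales, onlineSales);
// combined holds one header row followed by three records.
```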
DataFrames and Datasets
DataFrames are often stored inside datasets managed by the Data API.
data.item('sales').set(df);
Later:
var sales = data.item('sales').get();
This allows DataFrames to persist within the notebook and be reused across multiple cells.
Final Step
You now understand the core concepts of Rawlytics Notebook.