Log Analysis using Jupyter

The rock.jupyter package set generates a Jupyter configuration that provides a Ruby kernel matching the Ruby version from the package set. To add it to your workspace, add the package set to the package_sets section of your autoproj/manifest, and the rock.jupyter.osdeps metapackage to the layout section:

package_sets:
- ...
- github: rock-core/rock.jupyter-package_set

layout:
- ...
- rock.jupyter.osdeps

and run autoproj osdeps.

To start using Jupyter within your Rock workspace, either start jupyter-notebook after having loaded the workspace's env.sh, or start it with autoproj exec, e.g.

.autoproj/bin/autoproj exec jupyter-notebook

At this point, you may create a notebook that uses a Ruby kernel. If you have set the SYSKIT_LOG_DATASTORE environment variable, this kernel will use that datastore by default.
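
For instance, assuming the datastore lives under /path/to/datastore (a placeholder path, to be adapted to your setup):

export SYSKIT_LOG_DATASTORE=/path/to/datastore
.autoproj/bin/autoproj exec jupyter-notebook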

TODO: screencast

This support is aimed at relatively simple plotting and analysis. If your goal is to perform very advanced processing (such as machine learning), we recommend using the functionality described in this guide to extract the data you need, and then exporting it into a format that languages with bigger data analysis ecosystems (e.g. Python, R, …) can load. Everything that is done here can also be done in a single Ruby script.

The general workflow is the following (a minimal sketch in code follows the list):

  • select a dataset
  • restrict the processing time interval and/or specify a subsampling
  • convert the streams of interest into a daru dataframe
  • plot/analyze the extracted data or save it for further analysis in other languages
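
Put together, a minimal session could look like the sketch below. The dataset ID, the pose_estimator task and its pose_samples port are placeholders, and write_csv is Daru's CSV export; each DSL statement is detailed in the sections that follow.

dataset_select "SOME_DATASET_ID"
interval_sample_every seconds: 1

frame = to_daru_frame pose_estimator_task.pose_samples_port do |p|
    p.add("x") { |s| s.position.data[0] }
    p.add("y") { |s| s.position.data[1] }
end

frame.write_csv "pose.csv"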

In addition, the DSL supports using Syskit event data to find sections of the dataset that are of interest. This is a very powerful means to provide context to the raw data, as Syskit tasks/compositions, in the end, describe the system's intent.

Preamble

Add the following to the first cell of the notebook to load the notebook support:

require "syskit/log"
require "syskit/log/dsl"
extend Syskit::Log::DSL

Syskit::Log::DSL is designed to ease the analysis of a single dataset. Load the dataset by its ID with the following statement. We strongly recommend that this statement be the last of a cell, as it will display details about the dataset, such as the list of tasks, ports and properties it contains.

dataset_select "SOME_DATASET_ID"

After this statement, all other statements will use that dataset.

Transforming log streams into data frames

For a single stream

The to_daru_frame statement is a one-stop command to convert many log streams into a single data frame, aligning the data using the log streams' timestamps. We will first see how it works for a single stream, and move on to multiple streams afterwards.

Within the DSL, references to ports are of the form task_name_task.port_name_port when passed as arguments to to_daru_frame. For instance, assuming we had a pose estimator task called pose_estimator in the system, with a pose_samples port, one would do the following to create a frame from it:

frame = to_daru_frame pose_estimator_task.pose_samples_port do |p|
    ...
end

The list of tasks and ports is displayed by dataset_select when it is last in a cell.

The p object yielded above by to_daru_frame represents a sample generated by the port, and allows creating columns in the generated dataframe based on fields of the data type. The data type is the typelib type (i.e. the C++ type, or the intermediate type in the case of opaques), not the associated Ruby type. See this page.

In our example, assuming the pose is a RigidBodyState, to get the X, Y and Z coordinates of the position:

frame = to_daru_frame pose_estimator_task.pose_samples_port do |p|
    p.add("x") { |s| s.position.data[0] }
    p.add("y") { |s| s.position.data[1] }
    p.add("z") { |s| s.position.data[2] }
end

RigidBodyState being derived from opaque types, the data structure we are manipulating here is the one shown as "Logging type" on the type page in the Syskit IDE:

Logging type for RigidBodyState

Alternatively, within the Jupyter notebook, one can use summarize to get details about a port's data type:

summarize pose_estimator_task.pose_samples_port

Summarize output port

If you want to use some methods from the Ruby type that is used to represent a Rock type, you may use the transform method. For instance, to create a column containing a RigidBodyState heading, do

frame = to_daru_frame pose_estimator_task.pose_samples_port do |p|
    p.add("yaw") { |s| s.orientation.transform(&:yaw) }
end

RigidBodyState.orientation is first converted to its Ruby equivalent (Eigen::Quaternion), for which Rock defines a yaw method. Note that you should do this only if it is not possible to perform the computation on the column afterwards. If you wanted to scale a column from radians to degrees (for plotting later), it is significantly faster to compute the scaling afterwards:

frame = to_daru_frame trajectory_controller_task.heading_command_port do |p|
    p.add("target_angle") { |s| s.angle.rad }
end

frame["target_angle"] = frame["target_angle"] * 180 / Math::PI

Daru, the underlying dataframe library, supports most basic numerical computations on columns, as well as some more useful sliding-window operations such as mean, standard deviation, …
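
For illustration, a couple of such column operations (the column names are hypothetical, and rolling_mean is Daru's sliding-window mean):

df["velocity_kmh"] = df["velocity"] * 3.6
df["velocity_smooth"] = df["velocity"].rolling_mean(20)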

For multiple streams

When given more than one port, to_daru_frame will align them all to the first port. In practice, this means that it generates a new row in the frame whenever there is a new sample for the first port, and fills the rest of the columns with the last sample received just before on each of the other streams. The selection of the first port is therefore, rather obviously, critical.

Right now, alignment is done with the log stream's logical time, which is the time when the sample was received by the logger.

For instance:

df = to_daru_frame \
    trajectory_controller_task.heading_command_port, \
    pose_estimator_task.pose_samples_port do |cmd, pose|

    cmd.add("cmd_heading") { |s| s.orientation.transform(&:yaw) }
    pose.add("x") { |s| s.position.data[0] }
    pose.add("y") { |s| s.position.data[1] }
end

Dealing with missing data

By default, when aligning multiple streams, to_daru_frame will always use the last received sample for any given stream, regardless of how old it is. The optional timeout parameter allows you to "reset" this value after a given time (in seconds) and use the column's no-data value instead (which is nil for categorical data and NaN for numerical data).

Let's assume for instance that the pose stream we have used in the above examples has a steady 10Hz period, without skips. The heading command, however, has "holes" whenever the heading controller was not in use (for instance). The following definition will make sure that the cmd_heading column is filled with NaNs whenever this was the case:

df = to_daru_frame \
    pose_estimator_task.pose_samples_port,
    trajectory_controller_task.heading_command_port, timeout: 1 do |pose, cmd|
    pose.add("x") { |s| s.position.data[0] }
    pose.add("y") { |s| s.position.data[1] }
    cmd.add("cmd_heading") { |s| s.orientation.transform(&:yaw) }
end

Note that the alignment happens on the first stream given as parameter. If we were to change the order of the ports in the call to to_daru_frame, the sections without data in that stream would instead not appear at all.
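
For example, swapping the two streams of the previous call aligns on the heading command instead; rows then exist only where command samples do:

df = to_daru_frame \
    trajectory_controller_task.heading_command_port,
    pose_estimator_task.pose_samples_port, timeout: 1 do |cmd, pose|
    cmd.add("cmd_heading") { |s| s.orientation.transform(&:yaw) }
    pose.add("x") { |s| s.position.data[0] }
    pose.add("y") { |s| s.position.data[1] }
end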

Subsampling

To do rough analysis or plotting of long datasets, the analysis DSL allows configuring subsampling with the interval_sample_every statement. The subsampling can be given as a number of samples (e.g. interval_sample_every samples: 10) or as a time between samples (e.g. interval_sample_every seconds: 0.2).

One thing to consider when aligning streams is that subsampling happens on each stream independently (before alignment).
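
For instance, to roughly downsample a long dataset before converting it (the 0.5s period is arbitrary):

interval_sample_every seconds: 0.5
frame = to_daru_frame pose_estimator_task.pose_samples_port do |p|
    p.add("x") { |s| s.position.data[0] }
end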

Using Syskit data

Syskit execution data is also available within the notebooks, to provide context. It allows finding where things are in the timeline, a critical aspect when one has to deal with hours or even days of data in a single dataset.

The simplest form is to list the instances of a given task model (task context, ruby task context, composition) using summarize.

Given an oroGen model imu_aceinna_openimu::Task, one does

summarize roby.OroGen.imu_aceinna_openimu.Task

which lists the instances of that particular task. It works for task contexts as well as compositions and ruby tasks.

openimu instances

Then, to restrict further processing to the interval of an instance of that particular task, one does

interval_select roby.OroGen.imu_aceinna_openimu.Task.by_id(767)

Alternatively, particular events might be of more interest, for instance

summarize roby.OroGen.imu_aceinna_openimu.Task.exception_event

This gives:

openimu exception event emission

which can then be used to restrict the processing interval.

Plotting using Vega

The syskit-log Jupyter helpers have functionality geared towards plotting using vega-lite (through a thin Ruby wrapper).

The simplest helper, vega_simple_plot, does an X/Y plot based on a Daru frame, e.g.

vega_simple_plot(df, x: "t", y: "velocity")

Under the hood, daru and vega have very different data representations. The dataframe is first converted into an array of hashes and then plotted. If multiple plots are to be made based on the same data, it is best to convert the frame first with daru_to_vega:

df_vega = daru_to_vega(df)
vega_simple_plot(df_vega, x: "t", y: "velocity")

Layering simple plots can be done with

Vega.lite.data(df_vega).layer([
    vega_simple_view(x: "t", y: "velocity"),
    vega_simple_view(x: "t", y: "x_velocity"),
])

layer can be replaced by vconcat to have plots vertically stacked and hconcat to have them horizontally aligned.
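
For instance, the same two views stacked vertically:

Vega.lite.data(df_vega).vconcat([
    vega_simple_view(x: "t", y: "velocity"),
    vega_simple_view(x: "t", y: "x_velocity"),
])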

Generally speaking, the vega-rb layer maps the vega-lite specification in a rather straightforward way. Use Vega Lite's documentation as a reference.

For reference, vega_simple_plot and vega_simple_view are roughly equivalent to

Vega.lite.encoding(x: { field: "t", type: "quantitative" },
                   y: { field: "velocity", type: "quantitative" })
         .mark(type: "line").width(800)

Annotating plots with Syskit task data

The roby_vega_mark_tasks call is a very powerful way to tag vega data with the ID and name of a Syskit task. This allows, for instance, changing the color or style of a plot based on the system's currently selected behavior.

For instance, let's assume we have a df dataframe converted into a df_vega Vega data array, and we want to visualize which of the tasks contained in a current_missions array is active at any given time. One would do:

roby_vega_mark_tasks("current_mission", df_vega, *current_missions, time_field: "t")

The call will add a "current_mission" field to each sample, with a string that describes the task running at the time of the sample (contained in field "t"). Then the following:

vega_simple_plot(df_vega, x: "t", y: "rudder_angle", color: nil)
    .encoding(color: { field: "current_mission" })

would create an X/Y plot with the color of the line determined by the current task. The task list is displayed in the legend. For instance:

color-coded current task

The roby_task_timeline call also allows plotting a timeline of the given tasks (as a horizontal bar graph).
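
For instance, reusing the current_missions array from above (a sketch; we assume roby_task_timeline accepts the task list directly):

roby_task_timeline(*current_missions)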

Limitations

  • all processing into data frames is done using the sample's logical time (the time when it reached the logger) instead of the sample's own time. The long term objective is to write the latter into the former during normalization.