Specify Data Features
Data Sources containing features can be added on the Configure > Source Features page. You can use features during modeling to help enhance prediction accuracy.
The Features page lets you specify multiple feature data sources. You can preview the features contained in each feature data source. You can also commit or decommit features from the project build and modify settings for each individual feature.
NOTE: If you are not using a features data source in your project, you can skip this page and continue by verifying your data sets.
Define Your Feature Data Source Connection
This page shows what your data definition looks like before you configure it. The panel on the right changes after you configure the definition.
Click Configure > Source Features to open the Features page. The first time you access this page, the Feature Data Source pane shows no feature data set information.
You can use the Source Features page to configure your feature source data to use in your model. Like specifying targets and defining your target data set, this is a two-step process.
-
Define the database connection and data tables to use.
-
Specify feature data source dimensions by selecting fields that contain the desired feature dimensions, value dimension, date dimension, and location dimension (optional). You can specify multiple feature data tables to be used during this step.
IMPORTANT: You should have detailed knowledge of your feature data sources and of how the data in the source columns maps to the dimensions used to store that data. See the Sensible Machine Learning Data Quality Guide for information on data planning for your project.
Specify the Data Source Connection
The first part of specifying features and defining your feature data set for Sensible Machine Learning is to define the source connection for your data set.
-
In the Feature Data Sources pane, click Add to add a feature data source to your Sensible Machine Learning model. The Add Feature Data Set Connection dialog box displays.
-
In the Source Connection field, select the connection of your Feature Data Set.
-
In the Table Name field, select the names of the initial set of feature tables you created for importing into your Sensible Machine Learning project. If you imported multiple feature data sources for the first model prediction, select the import table name for each. Select the check box next to each feature table you are using for the first model prediction.
TIP: Only the first selected table name displays in the list after selecting. You can click the field to see all the selected import data files.
-
In the Data Source Name field, type a name for the feature data source you are creating.
-
Click Preview.
A default Source Connection Name, the first imported Table Name and the Data Source Name display at the top of the Add Feature Data Set Connection dialog box.
IMPORTANT: Use the information in the Preview pane to verify that data from the correct feature data source is being used. This includes the data shown in the Preview table and the feature data source tables shown in the upper-right list in the Preview pane.
If you cannot verify that the data in the Preview pane is the correct source data, or the source tables are incorrect, click Update to change the selected source connection or feature data source tables.
Once you are sure that the correct source connection and data tables are being used and you have verified the source data shown in the Preview pane, you can select the dimensions to use for the feature data source connection.
Select Feature Data Source Dimensions
Continue specifying features and defining the data set by selecting the feature dimensions, value dimension, date dimension, and (optionally) location dimension for the data set to use in your Sensible Machine Learning model. In this step, you match the dimensions used for the feature data in your Sensible Machine Learning model to the dimensions reserved when the cube for the feature data source was created.
NOTE: Only the specific dimensions reserved for Sensible Machine Learning should be selected for each of the dimension types. If a dimension type does not correlate to data in the feature data set, leave the field blank. If the location dimension is not being used, select None.
-
In the Feature Dimensions field, select the feature dimensions that have been defined to store data from your feature data sources.
The columns in the source feature data set define all the feature variables that are used for predictions. Each distinct combination of values across the feature dimensions defines a feature.
Select the check box next to each applicable dimension. For example, if the user-defined dimensions UD1, UD2, UD3 and UD4 were reserved for source data and mapped to specific data columns in the data source, select UD1, UD2, UD3, and UD4 from the list.
If a dimension selected for the feature data source has the same name as a dimension in the target data source, that dimension is used to map features to targets. For example, if UD1 is in both the feature dimensions and the target dimensions, features with a value in the UD1 dimension are mapped only to targets with that same value in the UD1 dimension.
NOTE: Setting the feature dimensions to the exact same dimensions specified for the target data set causes an error when running the job to validate the data and add the feature data set to the model.
TIP: Selecting more feature dimensions leads to a higher number of unique intersections (or features) you can use in forecasting.
-
In the Value Dimension field, select the dimension used for the value data coming from the feature data source. Typically, this dimension is used to store source data values such as sales numbers.
NOTE: Only numeric values can be used to aid in predictions. Other types of values such as text are ignored.
-
In the Date Dimension field, select the dimension reserved for date data coming from the feature data source.
-
In the Location Dimension field, select the dimension reserved for location data coming from the feature data source (optional).
Select None if your feature data source does not include location information. The Location dimension is used during modeling to automatically map features to targets that have a location that is geographically inside of or equivalent to a given feature’s location. For example, a feature with the location Michigan is mapped to a target with the location Rochester, Michigan, but is not mapped to a target with the location USA.
TIP: The location dimension can also be a feature dimension that adds uniqueness.
NOTE: A location can be selected as both a feature dimension and a location dimension. Selecting a column as a location dimension ensures that, when configuring locations for your project, all the locations within that source column are pre-populated. Selecting a location as a feature dimension adds further uniqueness and more granularity to the feature intersection.
-
Click Run after making your feature dimension selections. This adds a job to the job queue to validate the data and add the feature data set to the model. The job runs tasks to complete the data definitions. A progress bar shows task progress. You can click Cancel Task at any time while the task is running to stop running the data definitions.
-
When the task completes, click Refresh Current Page.
The Features page displays the added feature data source in the All Feature Data Sources pane. The Feature Data Source Preview pane displays below the All Feature Data Sources pane, showing information for the top 100 feature records.
NOTE: Once the Data Source Preview pane displays in the Features page, the Configure button no longer shows on the page, as it is the default view for the page.
You can also edit, delete, commit, or add a new feature set.
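The mapping rules described in the steps above can be illustrated with a minimal Python sketch. It is not the Xperiflow implementation: the dictionary layout, the PARENT location lookup, and the function names are hypothetical and exist only to show how distinct dimension values define features and how shared dimensions and the location dimension restrict which targets a feature maps to. The handling of features or targets without a location is an assumption made for the sketch.

```python
# Illustrative sketch only; not the Xperiflow implementation.
# PARENT is a hypothetical location hierarchy (location -> containing location).
PARENT = {
    "Rochester, Michigan": "Michigan",
    "Michigan": "USA",
}


def distinct_features(rows, feature_dims):
    """Each distinct combination of feature-dimension values is one feature."""
    return {tuple(row[d] for d in feature_dims) for row in rows}


def is_inside_or_equal(target_loc, feature_loc):
    """Return True if target_loc equals feature_loc or lies inside it."""
    loc = target_loc
    while loc is not None:
        if loc == feature_loc:
            return True
        loc = PARENT.get(loc)  # walk up to the containing location
    return False


def feature_maps_to_target(feature, target):
    """Apply the shared-dimension rule, then the location rule."""
    # Dimensions with the same name on both sides must hold the same value.
    shared = set(feature["dims"]) & set(target["dims"])
    if any(feature["dims"][d] != target["dims"][d] for d in shared):
        return False
    # Assumption: a feature with no location (None) is not location-restricted.
    if feature.get("location") is None:
        return True
    return is_inside_or_equal(target.get("location"), feature["location"])


# Hypothetical example data.
michigan_temp = {"dims": {"UD1": "Retail", "UD2": "Temperature"}, "location": "Michigan"}
rochester_sales = {"dims": {"UD1": "Retail"}, "location": "Rochester, Michigan"}
usa_sales = {"dims": {"UD1": "Retail"}, "location": "USA"}
wholesale_sales = {"dims": {"UD1": "Wholesale"}, "location": "Rochester, Michigan"}

print(feature_maps_to_target(michigan_temp, rochester_sales))  # True
print(feature_maps_to_target(michigan_temp, usa_sales))        # False
print(feature_maps_to_target(michigan_temp, wholesale_sales))  # False
```

In this sketch, the Michigan feature maps to the Rochester, Michigan target because the target lies inside the feature's location and both share the same UD1 value. It does not map to the USA target, and a differing UD1 value blocks the mapping regardless of location.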
Edit Feature Data Source Attributes
Once a feature data source has been added to the project, the All Feature Data Sources pane displays it at the top of the Features page. Sensible Machine Learning lets you set specific attributes for each feature in the data set.
Editing feature data source attributes is optional. Each feature's attributes have a default setting. Review the selections for each attribute. If you are satisfied with the defaults, click Cancel in the Feature Attributes dialog box without making changes, then commit the feature data source.
-
Click to select the data source whose attributes you want to edit.
TIP: Data information for the selected feature data set displays in the Data Source Preview pane.
-
Click the Edit the Selected Feature Data Source's Attributes button at the bottom of the pane. The Feature Attributes dialog box defaults to Custom view, which lists all the selected data source's feature attributes, and shows whether each attribute is selected (Yes) or not selected (No).
Each feature data set listed includes the following attributes:
Allow Feature Selection: The default value Yes allows the feature to be filtered out during the feature selection process. Select No to ensure the feature is not filtered out during the feature selection process.
If too many features for a given target are set to No, then they still go through the feature selection process. This is to prevent too many features from being fed into any one model. This limit depends on which models are being run.
Allow Feature Engineering: The default value Yes indicates the feature can be engineered. Selecting No ensures that a feature cannot be engineered, such as lagging temperature by two weeks.
Scenario Modeling Feature: Select Yes if the feature should be included when defining custom Scenarios in Utilization and the intention of the project is to run predictions on different Scenarios. Otherwise, select No.
NOTE: When you select Yes for any feature:
- The project will be considered a Scenario Modeling project by the Xperiflow Engine.
- Altering a Scenario Modeling project after the job has run requires a Restart or Manual Rebuild.
- Known In Advance automatically changes to Yes.
Known In Advance (KIA): The default value No indicates that this feature does not have data that extends past the last actual data point (such as a weather forecast for the next two weeks). Known-in-advance features cannot have any missing data through the forecast range (for example, five weeks for a five-week forecast). Select Yes for the features that you know have data that extends through the forecast range.
IMPORTANT: The prediction job cannot run if this setting is set to Yes and the feature data is not available through the forecast range when you try to run predictions.
KIA Date Range (Days): For features with Known In Advance set to Yes, this attribute allows the user to specify how many days of data will be known in advance. This attribute defaults to being blank. If a value is given to this attribute, but Known In Advance is not set to Yes, Xperiflow will automatically update the feature to Known In Advance set to Yes.
IMPORTANT: The prediction job cannot run if this setting is configured and the number of days specified is not part of the feature data source.
Aggregation Method: This attribute allows a user to specify a preferred method of aggregating the feature data. By default, this will be set to None and the following options can be selected: Sum, Mean, Median, Last, Max, Min, and Mode.
Data Cleansing Method: This attribute allows a user to specify a preferred method of cleaning missing feature data. By default, this will be set to None and the following options can be selected: Mean, Zero, Interpolate, Kalman, and Local Median.
Frequency Override: This attribute allows a user to override the frequency of the feature data. By default, this will be set to None. If None is selected, Xperiflow will automatically determine the frequency of the feature data.
ML Type: This attribute allows a user to specify the data type of the feature data. By default, this will be set to None and the following options can be selected: Binary Categorical, DateTime, Multi Categorical, Numerical, and Text.
-
In the Feature Attributes dialog box, edit the feature's attributes in one of the following ways:
Custom: Allows you to modify individual attribute values for features as desired.
-
Select a feature, then set the attribute values for that feature by clicking in each attribute selection field and selecting Yes or No as needed.
-
Click the Save button in the button bar to save your feature attribute changes.
Modify All: Allows you to apply an individual attribute value to all features in a given feature data set.
-
Select the attribute option to apply the value.
-
Select the value of the attribute to apply.
-
Click the Save button at the bottom of the Feature Attributes dialog box to save your change and apply the selected value to the selected attribute for all features.
-
The data in the Data Source Preview pane displays information on the dimensions used to run the preview, as well as the number of data intersections in the Sensible Machine Learning data sources to be used for the model.
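The attribute defaults and the automatic Known In Advance rules described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Xperiflow data model: the class, field, and function names are hypothetical, None stands in for the None or blank defaults, and the default of No for Scenario Modeling Feature is assumed because the text does not state one.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FeatureAttributes:
    """Hypothetical container mirroring the documented attribute defaults."""
    allow_feature_selection: bool = True       # default Yes
    allow_feature_engineering: bool = True     # default Yes
    scenario_modeling_feature: bool = False    # assumed default No
    known_in_advance: bool = False             # default No
    kia_date_range_days: Optional[int] = None  # blank by default
    aggregation_method: Optional[str] = None   # e.g. "Sum", "Mean", "Last"
    data_cleansing_method: Optional[str] = None
    frequency_override: Optional[str] = None   # None: frequency is detected
    ml_type: Optional[str] = None

    def normalized(self):
        """Return a copy with the automatic Known In Advance rules applied."""
        kia = (
            self.known_in_advance
            or self.scenario_modeling_feature       # Scenario Modeling forces KIA
            or self.kia_date_range_days is not None  # a KIA range forces KIA
        )
        return replace(self, known_in_advance=kia)


def can_run_prediction(attrs, last_feature_date, forecast_end):
    """A known-in-advance feature needs data through the forecast range."""
    attrs = attrs.normalized()
    if not attrs.known_in_advance:
        return True
    return last_feature_date >= forecast_end


# Hypothetical usage: a weather feature flagged for scenario modeling.
weather = FeatureAttributes(scenario_modeling_feature=True).normalized()
print(weather.known_in_advance)  # True
```

Keeping the rules in a single normalized() step means the two triggers (Scenario Modeling Feature set to Yes, or a KIA Date Range value) cannot leave Known In Advance in an inconsistent state.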
Verify Data Source Information
Review the information in the Data Source Preview pane to verify the data features are correctly defined for the model.
The list on the right side of the Data Source Preview pane shows the imported feature data files. Click it to see the full list of feature data files imported for the data source that is currently selected in the Data Source Preview pane. This is useful for verifying that the correct files were imported.
If any data in the Data Source Preview pane is not as expected, you can select the feature data source in the Selected Feature Data Source pane and do one of the following:
-
Click the Update the Selected Feature Data Source button. This opens the Update Feature Database Connection dialog box so you can reselect feature data source dimensions.
-
Select the feature data source in the Selected Feature Data Source pane and click the Delete button, then click Delete again to remove the selected feature data source from the list.
Once you verify the data in the Data Source Preview pane, you can continue by specifying data features. If features are not included in your data sources, continue by verifying your data sets in Sensible Machine Learning.
Commit or Decommit a Feature Data Source
You must commit any feature data sources to use in the Sensible Machine Learning project. You can also decommit any committed feature data source.
-
In the Selected Feature Data Source pane, select the feature data source and click the Commit button.
-
A message box informs you that the selected data source's commit status has changed. Click OK to close the message box.
-
Commit any other feature data sources as needed by repeating the previous steps.
Once you have committed your data sets, continue by processing Feature Data Sources to be used with your Sensible Machine Learning project.
NOTE: Feature data sources can only be committed for a full build and not for a partial build.
Process Feature Data Sources
After a feature data source has been committed, the user must process the feature data source. To process the data source:
-
While you complete the initial configuration steps described above, the Process Data Sources button is disabled.
-
After you commit one or more Feature Data Sources, the Process Data Sources button becomes enabled.
-
When the button is enabled, the All Feature Data Sources grid also shows the Requires Processing field as on for every committed data source. Click the Process Data Sources button to start a Feature Data Load job in the Xperiflow engine.
-
When the Feature Data Load job completes, the Requires Processing field is set to off for all committed data sources and the Process Data Sources button is disabled again.
The Feature Data Load job must be run after any change to the Feature Data Sources. The steps above describe configuring, committing, and loading a new Feature Data Source, but a Feature Data Load job is also required in the following cases:
-
A Feature Data Source that has been committed and been included in a Feature Data Load job is uncommitted.
-
A Feature Data Source that has been committed and been included in a Feature Data Load job has updates made to its Data Source Attributes.
NOTE: A user will not be able to navigate to the Pipeline Section of Model Build if any Feature Data Sources require processing. The Feature Data Load job can be run as many times as required to process all of the Feature Data Sources.
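The commit and processing behavior described in this section can be summarized as a small state model. The Python sketch below is only an illustration: the classes, fields, and method names are hypothetical, and process_data_sources() is a stand-in for the Feature Data Load job rather than a call into the Xperiflow engine.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSource:
    name: str
    committed: bool = False
    requires_processing: bool = False
    loaded: bool = False  # True once included in a completed load job


@dataclass
class FeaturesPage:
    sources: dict = field(default_factory=dict)

    def add(self, name):
        self.sources[name] = FeatureSource(name)

    def commit(self, name):
        src = self.sources[name]
        src.committed = True
        src.requires_processing = True

    def decommit(self, name):
        src = self.sources[name]
        src.committed = False
        # Only a source that was already loaded needs a new load job.
        src.requires_processing = src.loaded

    def edit_attributes(self, name):
        src = self.sources[name]
        if src.committed and src.loaded:
            src.requires_processing = True

    @property
    def process_button_enabled(self):
        return any(s.requires_processing for s in self.sources.values())

    @property
    def can_open_pipeline(self):
        # The Pipeline section is blocked while anything requires processing.
        return not self.process_button_enabled

    def process_data_sources(self):
        """Stand-in for the Feature Data Load job."""
        for s in self.sources.values():
            if s.requires_processing:
                s.loaded = s.committed
                s.requires_processing = False


page = FeaturesPage()
page.add("weather")
page.commit("weather")
print(page.process_button_enabled, page.can_open_pipeline)  # True False
page.process_data_sources()
print(page.process_button_enabled, page.can_open_pipeline)  # False True
```

Decommitting a loaded source or editing its attributes sets requires_processing again in this model, which matches the two additional cases listed above.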