Predictive Analysis









Predictive Analysis User Guide Version 3.0

















Table of Contents

1. About This Guide
   1.1. Document History
   1.2. Overview
   1.3. Target Audience
2. Introducing BizViz Predictive Analysis Tool
   2.1. Introduction to the BizViz Predictive Analysis
   2.2. Prerequisites
        2.2.1. Pre-requisites for Predictive Analysis
        2.2.2. R Server Requirements
        2.2.3. Predictive Spark Application Deployment Details
3. Getting Started with the BizViz Predictive Analysis
4. Predictive Analysis Home Page
   4.1. Tree-node Menu
   4.2. Header Menu - Options
   4.3. Tabbed Menu Strip - Options
5. Acquiring Data from a Data Source
   5.1. Acquiring Data from a CSV File
   5.2. Acquiring Data from a Data Service
   5.3. Acquiring Data from Cassandra Reader
   5.4. Removing a Data Source from the Workspace
6. Data Preparation
   6.1. Data Type Definition
   6.2. Filter
   6.3. Missing Value Replacement
   6.4. Formula
   6.5. Normalization
   6.6. Sample
   6.7. R Split Data
   6.8. Spark Split Data
   6.9. Spark Filter
   6.10. Spark Data Type Definition




















7. Data Transformation
   7.1. String Indexer
   7.2. Spark R Formula
   7.3. Spark PCA
   7.4. Spark Chi Square
   7.5. Spark Index to String
   7.6. Spark SQL Transformer
   7.7. Spark Group By
8. Algorithms
   8.1. Clustering
        8.1.1. R-K Means
        8.1.2. Spark-K-Means
   8.2. Regression Analysis
        8.2.1. R-Linear Regression
        8.2.2. R-Multiple Linear Regression
        8.2.3. R-Logistic Regression
               8.2.3.1. Spark K-Means Connected to the Pipeline Components
   8.3. Forecasting
        8.3.1. Triple Exponential Smoothing
        8.3.2. Single Exponential Smoothing
        8.3.3. Double Exponential Smoothing
        8.3.4. R-Auto ARIMA
        8.3.5. R-Auto Forecasting
        8.3.6. Result View of Forecasting Algorithms when the selected output mode is 'Trend'
   8.4. Association
        8.4.1. Market Basket Analysis
   8.5. Outliers
        8.5.1. Interquartile Range
   8.6. Classification
        8.6.1. R-CNR Tree
        8.6.2. R-Naive Bayes
        8.6.3. Spark-Naive Bayes




















        8.6.4. Spark Decision Tree
        8.6.5. Spark Random Forest
   8.7. Correlation
        8.7.1. R-Correlation
   8.8. Recommendation Engine
        8.8.1. Spark ALS
9. Apply Model
   9.1. Spark Apply Model
   9.2. R Apply Model
10. Performance
   10.1. Spark Performance
   10.2. R Performance
11. Data Writer(s)
   11.1. File Writer
        11.1.1. CSV Writer
        11.1.2. JSON Writer
   11.2. Database Writer
        11.2.1. Internal Data Writer
               11.2.1.1. Delta Load in Internal Data Writer (for MySQL connector)
        11.2.2. Cassandra Writer
12. Custom R Script
   12.1. Creating a New R Script
   12.2. Saved R-Scripts
        12.2.1. Viewing a Saved R Script
        12.2.2. Editing a Saved R Script
        12.2.3. Sharing a Saved R Script
        12.2.4. Deleting a Saved R Script
        12.2.5. Connecting Saved R Script with a Data Source
13. Custom Scala Script
   13.1. Creating a New Script
   13.2. Saved Scala Scripts
        13.2.1. Viewing a Saved Scala Script
        13.2.2. Editing a Saved Scala Script




















        13.2.3. Sharing a Saved Scala Script
        13.2.4. Deleting a Saved Scala Script
        13.2.5. Connecting Saved Scala Script with a Data Source
14. Scheduler
   14.1. New Schedule
        14.1.1. Configuring General Tab
        14.1.2. Configuring Data Source
        14.1.3. Configuring a Data Writer
        14.1.4. Scheduling a New Job
        14.1.5. Notification
   14.2. Status
15. Live Job Status
16. Saved Workflows
   16.1. Opening a Workflow
   16.2. Deleting a Workflow
        16.2.1. Delete Connection for a Workflow
   16.3. Renaming a Workflow
   16.4. Sharing a Workflow
   16.5. Deploying a Workflow
   16.6. Result of Each Component
   16.7. Stop Button on the Progress Bar
17. Saved Spark Models
   17.1. Saving a Spark Model
   17.2. Reading a Spark Model
   17.3. Renaming a Spark Model
   17.4. Deleting a Spark Model
   17.5. Sharing a Spark Model
18. Saved R Models
   18.1. Saving an R Model
   18.2. Reading an R Model
   18.3. Renaming an R Model
   18.4. Deleting an R Model
19. Signing Out





































1. About This Guide

1.1. Document History
The following table gives an overview of the most recent document updates:

Product                      Version   Date (Release date)   Description
BizViz Predictive Analysis   1.0       June 9th, 2015        First release of the document
BizViz Predictive Analysis   2.0       Feb 18th, 2016        Updated document
BizViz Predictive Analysis   2.0       May 31st, 2016        Minor changes and editing of the document
BizViz Predictive Analysis   2.5       November 9th, 2016    Updated document
BizViz Predictive Analysis   2.5.1     January 3rd, 2017     Updated document
BizViz Predictive Analysis   2.5.3     March 16th, 2017      Updated document
BizViz Predictive Analysis   3.0       August 31st, 2017     Updated document

1.2. Overview
This guide covers:
• Steps to access the BDB Predictive Analysis
• Server requirements and deployment details for the BDB Predictive Analysis
• The Designer part of the BDB Predictive Analysis
• The Result or Analysis part of the BDB Predictive Analysis

1.3. Target Audience
This guide is aimed at business professionals, data analysts, data scientists, and statisticians who use the BizViz Predictive Analysis tool to conduct various experiments with data, as in a data science lab.

2. Introducing BizViz Predictive Analysis Tool

2.1. Introduction to the BizViz Predictive Analysis
BizViz Predictive Analysis is a statistical analysis tool that empowers its users by providing predictive models. These predictive models can be used to envision future outcomes of business processes based on past data. It is a user-friendly tool that shields users from mathematical complexity and offers an interactive graphical interface for an easy, intuitive experience. It enables users to discover hidden insights and relationships in their data by applying various statistical algorithms provided by the popular R statistical language and Spark ML.


















2.2. Prerequisites

2.2.1. Pre-requisites for Predictive Analysis
1. Predictive Analysis is a web-based service, so the only requirement is a browser.
2. Predictive Analysis can be viewed only on desktops (mobile and tablet views are not supported).
3. The R server and the Predictive Spark App settings should be configured from the Administration module.
4. The user should be given all the necessary permissions to access and use the Predictive Analysis plugin from the User Management module of the BizViz Platform.
5. The user should be permitted to access the Data Management module of the BizViz Platform to use the query service and the Cassandra reader and writer for Predictive Analysis.
6. The limit on rows for data connectors needs to be configured via the Administration module.

2.2.2. R Server Requirements
1. The R server should be deployed publicly.
2. The port should be open.
3. The R server should be configured on the Administration page of the BizViz platform.
4. The following packages should be installed on the R server for the predefined algorithms:
• stringr
• forecast
• arules
• arulesViz
• rpart
• e1071
5. In the case of a Custom R Script, the script-specific packages should be installed on the R server.

2.2.3. Predictive Spark Application Deployment Details
1. Spark, Hadoop, and Cassandra should be running in the cluster. For this application, the cluster should have free resources (minimum 3 cores and 2 GB RAM per executor, according to the application properties).
2. Create a file named spark-pa.properties in Spark's configuration folder (cd $SPARK_HOME/conf) and provide the following properties:




















• spark.master #Mandatory
• spark.app.name Spark Predictive Application #Mandatory
• spark.scheduler.mode FAIR
• spark.eventLog.enabled true
• spark.eventLog.dir
• spark.serializer org.apache.spark.serializer.KryoSerializer
• spark.extraListeners org.apache.spark.ui.jobs.JobProgressListener,org.apache.spark.PASparkListener #Mandatory (custom listener for the PA app)
3. Port Configuration: Any port series is fine, provided the ports are exposed via the firewall. This applies to the nodes within the Spark cluster.
• spark.ui.port 5003
• spark.history.ui.port 20080
• spark.driver.port 20081
• spark.executor.port 20082
• spark.fileserver.port 20083
• spark.broadcast.port 20084
• spark.replClassServer.port 20085
• spark.blockManager.port 20086
4. Cassandra Configuration
• spark.cassandra.input.split.size_in_mb 16
• spark.cassandra.input.fetch.size_in_rows 1000
5. Spark PA Configuration
• spark.pa.fs.default.name hdfs://localhost:8020 #Mandatory
• spark.pa.process.queue.size 10 #Mandatory. Default is 10. Queue size for the PA app.
• spark.pa.process.pool.size 10 #Mandatory. Default is 10. Pool size for the PA app.
• spark.pa.cache.size 100 #Mandatory. Default is 100. Cache size for the PA app.
• spark.pa.cache.timeout_sec 600 #Mandatory. Default is 600 sec. Cache timeout for the PA app.
• spark.pa.hdfs.model.dir hdfs://hostname:port/directory-name #Mandatory. HDFS storage location for the models, e.g. hdfs://localhost:8020/pa/model




















• spark.pa.hdfs.tmp.dir hdfs://hostname:port/directory-name #Mandatory. Temporary HDFS location, e.g. hdfs://localhost:8020/pa/tmp
• spark.pa.model.timeout_sec 86400 #Mandatory. Default is 86400 (1 day). Time interval for deleting temporary model(s) from the temporary HDFS location.

6. Copy the shaded jar of the pa_spark bundle into the "spark/jars/" folder:
• com.bdbizviz.pa.spark-shade-2.2.0.jar
7. Create a script file named "start-pa.sh" in Spark's sbin folder to start the application. If you need to execute in Kerberos mode, you need to generate the keytab file first.

Script contents in Kerberos mode:

#!/usr/bin/env bash
dir="$(cd "`dirname "$0"`"/..; pwd)"
nohup $dir/bin/spark-submit --keytab $dir/conf/hdfs.keytab \
  --principal hdfs/ \
  --executor-memory 3G --executor-cores 4 --num-executors 1 \
  --verbose --properties-file $dir/conf/spark-pa.properties \
  --driver-class-path $dir/jars/com.bdbizviz.pa.spark-shade-2.2.0.jar \
  --class com.bdbizviz.pa.spark.executor.Executor --master yarn --deploy-mode client \
  jars/com.bdbizviz.pa.spark-shade-2.2.0.jar 18786 >> $dir/logs/spark-pa.log 2>&1 &

Please note that 18786 is a Jetty port and can be changed to suit your needs.

Script contents in normal mode:

#!/usr/bin/env bash
dir="$(cd "`dirname "$0"`"/..; pwd)"
nohup $dir/bin/spark-submit \
  --executor-memory 3G --executor-cores 4 --num-executors 1 \
  --verbose --properties-file $dir/conf/spark-pa.properties \




















  --driver-class-path $dir/jars/com.bdbizviz.pa.spark-shade-2.2.0.jar \
  --class com.bdbizviz.pa.spark.executor.Executor --master yarn --deploy-mode client \
  jars/com.bdbizviz.pa.spark-shade-2.2.0.jar 18786 >> $dir/logs/spark-pa.log 2>&1 &

Please note that 18786 is a Jetty port and can be changed to suit your needs.

Save this file as a shell script (.sh).
8. Start the application with this command: sbin/start-pa.sh
9. Confirm that the Spark PA application is running in YARN.

Note: Confirm that the application has sufficient resources by checking the highlighted columns such as "Cores" and "Memory per Node".

3. Getting Started with the BizViz Predictive Analysis
BizViz Predictive Analysis is a plugin application provided under the BizViz Platform.
i) Open the BizViz Enterprise Platform link: http://apps.bdbizviz.com/app/
ii) Enter your credentials to log in.
iii) Click 'LOGIN'.




















iv) Users will be redirected to the BizViz Platform home page.

v) Click the ‘Apps’ icon to display all the plugin applications. vi) Select ‘Predictive Analysis’ from the Apps menu.




















vii) Users will be directed to the Predictive Analysis home page.




















4. Predictive Analysis Home Page

This section describes all the options and icons provided on the Predictive Analysis home page. The Predictive Analysis home page can be described through the following Menus:

4.1. Tree-node Menu

The Tree-node menu contains all the available component connectors needed to run a predictive execution. The components are provided in hierarchical order via a tree-structure menu. All the main categories are included as tree-nodes, and sub-categories are attached as petals to their respective tree-nodes. E.g., 'Data Writer' is the main category, to which 'File Writer' is attached as a sub-category, and 'CSV Writer' is displayed at the second level of the hierarchy.

Note: a. The ‘Search’ option has been provided for the entire tree structure menu. b. Click the ‘Arrow’ provided next to the ‘Search’ box to collapse the tree structure menu from the home page.




















c. This document is organized around each petal of the tree-structure menu. All the available major and minor categories are described at length to explain a predictive process.

4.2. Header Menu - Options

1. Run: Click the 'Run' option to run the process and display the result-set view. This option can be applied to data source, algorithm, and data preparation components.
2. Reset: The 'Reset' option cleans the workspace by removing the current component connectors.
3. Refresh: The 'Refresh' option is provided on the menu row to fetch fresh data when adding a new component to a Spark workflow.
4. Clear Cache:
a. After using the 'Run' option, data is cached on the server for the next 10 minutes by default. For the latest results, users need to run the workflow again.
b. Users need to click the 'Clear Cache' option to remove the cached data before running the workflow (again).
c. If users change any component parameter that is to be applied to fetch the result, the 'Clear Cache' option must be clicked.




















If you get a message to clear the cache before executing your process, follow the below-given steps:
i) Click the 'Clear Cache' option from the header menu.
ii) A message will pop up.
iii) Click 'Ok'.

iv) Another message will pop up to confirm that the cached data has been cleared.

5. Save: Click the 'Save' option to save the created predictive workflow.
6. Save As: Click the 'Save As' option to copy a predictive workflow under a desired name.
i) Create a workflow by connecting various configured components.
ii) Click 'Save As'.
iii) A pop-up window will appear for confirmation.
iv) Click 'Ok'.


















v) The workflow will be saved under the provided name in the 'Saved Workflows' list.

4.3. Tabbed Menu Strip - Options

1. Component
The 'Component' tab displays the required configuration fields for the components dragged onto the workspace.




















Note: The Component tab may display various sub-tabs depending on the components selected on the workspace. E.g., if the dragged data source is a CSV file, the Component tab will display General and Properties fields, while for the Cassandra Reader as a data source, the Component tab will display General, Properties, and Column Selection.
2. Console
The 'Console' shows the date and recorded time for the entire process.
i) Click the 'Console' option.
ii) The below-mentioned records will be displayed:
a. Process
b. Data Reader process (starting and ending time)
c. R and Spark process (starting and ending time)

3. Summary
Click the 'Summary' tab to display the R and Spark server summary of the process.




















4. Result
Click the 'Result' tab to display a result list view based on the selected execution.

Note: The 'Result' tab will be displayed for the given data only after the data is configured and the 'Run' or 'Run Till Here' option is selected. Up to 50,000 cells can be displayed in the Result view.
5. Visualization
Click the 'Visualization' tab to display a graphical representation of the result data.




















6. Properties: Click the ‘Properties’ tab to display properties for the current workflow on the Workspace.

7. Status: Click the ‘Status’ tab to view the live job status of a running Spark job.




















8. Minimize/Maximize Buttons
The 'Minimize/Maximize' buttons have been provided on the tabbed menu strip to customize the workspace and view space as per the user's requirement. The following image represents the Predictive home page default view:

a. Click the minimize icon to minimize the view space and maximize the workspace on the Predictive Analysis home page.

b. Click the maximize icon to maximize the view space and minimize the workspace on the Predictive Analysis home page.




















5. Acquiring Data from a Data Source
Acquiring data from a data source is the initial step of Predictive Analysis. The 'Data Source' tree-node offers 3 types of data connectors:
a. CSV File
b. Query Service
c. Cassandra Reader

5.1. Acquiring Data from a CSV File

i) Select and drag the 'CSV File' component onto the workspace.
ii) Click the 'CSV File' component.




















iii) Configure the following 'CSV Properties Configuration' fields:
a. Select File: Browse to a CSV file
b. Delimiter: Mention the delimiter used in the CSV file
iv) Click 'Apply'.

v) Click ‘Run’. vi) Users will be redirected to the ‘Console’ tab.

vii) Follow the below-given steps to display the result view: a. Click the dragged data source component on the workspace. b. Click the ‘Result’ tab.




















• Rules to be Followed while Uploading a CSV File
1. The first row of the CSV file should contain the column headers.
2. The second row of the CSV file should contain data under all the headers, without any 'null' or 'NA' values.
3. CSV headers should not contain spaces. A header should be a single word, or two words concatenated by an underscore (_).
4. CSV headers should not contain any special characters, e.g. %, #, $, @, *, etc.
5. CSV headers should not contain single or double quotes, dots, brackets, or hyphens.
6. CSV headers should not consist merely of numbers. Numerals should be used with at least one letter.
7. A CSV header should not exceed 50 characters.
8. All rows in a column should have the same data type.

Note:
a. The supported file types are .csv and .tsv.
b. A 'General' tab is provided to configure the following information for any tree-node component:
i. Alias Name
ii. Description (an optional field)
(E.g., the following image displays the 'General' tab for a CSV data source.)
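For illustration only (the column names and values below are invented, not taken from this guide), a small CSV that satisfies the above rules could begin like this:

Order_ID,Order_Date,Unit_Price,Quantity
1001,2017-01-15,25.50,4
1002,2017-01-16,12.00,7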


















5.2. Acquiring Data from a Data Service

i) Select and drag the 'Data Service' connector onto the workspace.
ii) Click the 'Data Service' connector.

iii) Users will be redirected to the 'Properties' fields provided under the 'Components' tab on the tabbed menu strip.
iv) Configure the 'Data Service Properties':
a. Select Data Connector: Select a data source from the drop-down menu
b. Select Data Service: Select a query service from the drop-down menu
c. Fields: The following tables will be displayed:
• Column Header
• Data Type
v) Click 'Next'.




















vi) Users will be redirected to the 'Conditions' tab (if the selected data service contains filter values).
vii) Configure the following information:
a. Filter Type: The available filter(s) in the data service will be displayed in this space.
b. Control Type: Users are provided with the following options to pass the filter values:
• Text: By selecting this option, users can manually enter multiple filter values separated by commas.

• LOV: By selecting this option, users will be directed to select another data connector and data service available in the space.




















i. Once the user selects a data service, a list of values will be displayed for the user to select the filter values from.
ii. Users can select multiple values as filter values from the selected data service.

viii) Click 'Apply'.
ix) Click 'Run'.
x) Users will be redirected to the 'Console' tab.

xi) Follow the below-given steps to display the result view:
a. Click the dragged data source component on the workspace.
b. Click the 'Result' tab.




















• Rules to be Followed while Creating a Data Service
1. Data service headers should not contain spaces. A header should be a single word, or two words concatenated by an underscore (_).
2. Data service headers should not contain any special characters, e.g. %, #, $, @, *, etc.
3. Data service headers should not contain single or double quotes, dots, brackets, or hyphens.
4. Data service headers should not consist merely of numbers. Numerals should be used with at least one letter.
5. A data service header should not exceed 50 characters.

Note:
a. Users can develop a data service via the Data Management module of the BizViz Platform.
b. The 'Fields' option under the 'Properties' tab will appear only after selecting the appropriate query service.
c. The LOV service provided under the 'Conditions' tab can contain only one column; in case of more than one column, a warning message will appear.
d. Users can configure the following information for a data service data source via the 'General' tab:
i. Alias Name
ii. Description (an optional field)

5.3. Acquiring Data from Cassandra Reader
i) Select and drag the 'Cassandra Reader' connector onto the workspace.




















ii) Click on the ‘Cassandra Reader’ connector.

iii) Users will be redirected to the 'Properties' tab.
iv) Configure the required properties:
a. Select Data Connector: Select a data connector using the drop-down menu
b. Host Name: The data-connector-specific host name will be displayed
c. Port Number: The port number will be displayed
d. User Name: The user name will be displayed
e. Password: Enter the password
f. Cluster Name: Enter a cluster name
g. Select Key Space: Select a key space from the drop-down menu
h. Select Table: Select a table from the drop-down menu
i. Limit by Row: Select an option using the drop-down menu. Two options are provided:
• Select all Rows
• Limit By
j. Max. no. of Rows to be fetched: Enter a number to decide the maximum number of fetched rows. (This option appears only if 'Limit By' has been selected in the 'Limit by Row' field. The default value for this field is 1000.)
v) Click 'Next'.




















vi) Users will be redirected to the ‘Column Selection’ tab. vii) Select the required columns from the list. viii) Click ‘Apply’.

ix) Click ‘Run’. x) Users will be redirected to the ‘Console’ tab.




















xi) Follow the below-given steps to display the result view: a. Click the dragged data source component on the workspace. b. Click the ‘Result’ tab.

Note: The Apache Spark workflows require a 'Cassandra Reader' as the data source. The Cassandra Reader can also be used as a data source for the R workflows.

5.4. Removing a Data Source from the Workspace

i) Right-click on the Data Source connector (on the workspace).
ii) A context menu will appear.
iii) Click 'Delete'.




















iv) The selected Data Source connector will be removed from the workspace.
OR
Click the 'Reset' option to remove the connector(s) from the workspace.
Note: The same set of steps can be followed to remove a Data Service or Cassandra Reader data source from the workspace.

6. Data Preparation

Components provided under 'Data Preparation' help in preparing the raw data from the data source and making it suitable for analysis. They organize the data in order to gain accurate results from it.

6.1. Data Type Definition
The Data Type Definition option can be used to change the name and data type of a data source column. This component helps users prepare the data and make it suitable for further analysis.
i) Navigate to the Predictive home page.
ii) Click the 'Data Preparation' tree-node.
iii) A context menu will open.




















iv) Drag ‘Data Type Definition’ component and connect it with a configured data source onto the workspace. v) Click the ‘Data Type Definition’ component (on the workspace).

vi) Users will be redirected to the 'Properties' tab.
vii) Configure the following 'Data Type Mapping' details:
a. Column Name: Select the column name you want to change
b. Alias Name: Enter an alias name for the required source column
c. Primary Data Type: Select the primary data type to which you want to change the column
d. Date Format: Select a date format that you want to display (the date format is optional for the Date data type)
e. 'Add' option: Click this button to add one more row of the 'Data Type Mapping' fields
viii) Click 'Apply'.




















ix) Click ‘Run’. x) Users will be directed to the ‘Console’ tab.

xi) Follow the below-given steps to display the result view:
a. Click the dragged data preparation component on the workspace.
b. Click the 'Result' tab.

6.2. Filter
This option is used to filter the data by column or row.
i) Select and drag the 'Filter' component onto the workspace.




















ii) Connect the 'Filter' component to a configured data source component.
iii) Click the 'Filter' component.

iv) Configure the following information:
Column Filter
a. Select a column from the 'Selected Columns' context menu.
b. Click 'Apply' to configure the data.

Result View (Column Filter):
i) Click 'Run'.
ii) Users will be redirected to the 'Console' tab.

iii) Follow the below-given steps to display the result view:
a. Click the dragged Filter component on the workspace.
b. Click the 'Result' tab.
iv) The filtered data will be displayed via the 'Result' tab.




















Row Filter
i) Drag and connect the 'Filter' component onto the workspace.
ii) Connect the 'Filter' component to a configured data source.
iii) Click the 'Filter' component.
iv) The 'Column Filter' tab will be displayed (by default).
v) Select a column using the context menu.

vi) Select the 'Row Filter' tab from the 'Component' menu list.
vii) Configure the required fields:
a. Double-click on the components from the Columns, Functions, and Operators list menus
b. A formula will be entered in the given box
c. Click 'Apply'.




















Result View (Row Filter):
i) Click 'Run'.
ii) Users will be redirected to the 'Console' tab.

iii) Follow the below-given steps to display the result view: a. Click the dragged data preparation component on the workspace. b. Click the ‘Result’ tab. iv) The filtered data will be displayed via the ‘Result’ tab.
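As a hypothetical illustration (the column names are invented, and the exact syntax depends on the entries that the Columns, Functions, and Operators lists insert), a row filter expression with a Boolean output could look like:

Quantity > 10 AND Region == 'East'

Only the rows for which the expression evaluates to true will be kept in the result.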




















Note:
a. The expression should return a Boolean output.
b. Users cannot use data manipulation functions.

6.3. Missing Value Replacement
Users can replace the missing data in a specified variable with a specified value. Users are provided with a list of options that can be considered for replacement.
i) Drag a data source onto the workspace and connect it with the 'Missing Value Replacement' component.

ii) Configure the data source, run it, and check the 'Result' view.




















iii) Select and drag ‘Missing Value Replacement’ component onto the workspace. iv) Connect the ‘Missing Value Replacement’ component to a configured data source. v) Click on the ‘Missing Value Replacement’ component.

vi) Choose the replacement value by configuring the following fields:
a. Column Name: Select a column that contains some missing values, using the drop-down menu.
b. Replacement Options: Select a replacement option using the drop-down menu. The following replacement options are provided under this field:
1. Mean
2. Median
3. Mode
4. Maximum
5. Minimum
6. Remove Entire Row
7. Remove Entire Column
8. Custom Replacement



















vii) Click ‘Apply’.

viii) Click ‘Run’. ix) Users will be redirected to the ‘Console’ tab.

x) Follow the below-given steps to display the result view: a. Click the dragged data preparation component on the workspace. b. Click the ‘Result’ tab.




















6.4. Formula
Users can create a calculated column using 'Formula'. A formula can be created using the available columns, functions, and operators. (A hypothetical example is given at the end of this section.)
i) Select and drag the 'Formula' component onto the workspace.
ii) Connect the 'Formula' component to a configured data source.
iii) Click the 'Formula' component.

iv) Configure the required component fields to apply a formula:
a. 'Columns', 'Functions', and 'Operators': Double-clicking entries in these lists enters them into the formula box.
b. Formula Name: Enter a formula name in the given field.
c. Click 'Apply' to configure the formula.

v) Click ‘Run’. vi) Users will be redirected to the ‘Console’ tab.




















vii) Follow the below-given steps to display the result view: a. Click the dragged data preparation component on the workspace. b. Click the ‘Result’ tab.
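As a hypothetical example (the column names are invented for illustration), a calculated column named Total_Price could be built by double-clicking two numeric columns and the multiplication operator so that the formula box reads:

Unit_Price * Quantity

The new column will then hold the evaluated value for every row.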

6.5. Normalization
This component controls the relevant data. It attempts to convert the available data from a larger range to a smaller range.

Normalization Methods
Normalization provides 3 methods to normalize vast amounts of data:
1. Min-Max Normalization
It implements a linear transformation on the original data values and sets a new range for all the data values to fit in. The user can fix a New Maximum and a New Minimum value for the data in the new range. Consequently, each value "v" from the original interval will be mapped into a value "new_v" following the below-given formula:
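The formula image is not reproduced here; the standard min-max mapping it refers to is:

$$\mathrm{new\_v} = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max} - \mathrm{new\_min}) + \mathrm{new\_min}$$

where min_A and max_A are the minimum and maximum of the original values, and new_min and new_max are the user-defined 'New Minimum Value' and 'New Maximum Value'.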




















2. Zero-Score
This normalization, also known as 'Zero Mean Normalization', is calculated from the 'mean' and 'standard deviation' of each attribute. It determines whether a specific value is above or below the average. It also signifies the exact proportion of the variance from the average. After applying 'Zero-Score' normalization, each feature will have a mean value of zero (0). The unit of each value will be the number of (estimated) standard deviations away from the (estimated) mean. Zero-score normalization may be sensitive to small values of the standard deviation. A new value 'new_v' can be found by using the following expression:
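The expression image is not reproduced here; the standard zero-mean (z-score) mapping it refers to is:

$$\mathrm{new\_v} = \frac{v - \bar{A}}{\sigma_A}$$

where $\bar{A}$ is the mean and $\sigma_A$ is the standard deviation of the attribute's original values.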

3. Decimal-Scaling
The decimal point of the value of each element is moved in accordance with its maximum absolute value. A modified value 'new_v' can be obtained using the following formula:
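The formula image is not reproduced here; the standard decimal-scaling mapping it refers to is:

$$\mathrm{new\_v} = \frac{v}{10^{c}}$$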

Note: In the decimal-scaling expression, 'c' is the smallest integer such that max(|new_v|) < 1.
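For instance, if the largest absolute value in a column is 917, then c = 3 and the value 917 is scaled to 0.917.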



Applying Normalization
1. Min-Max
i) Select and drag the 'Normalization' component onto the workspace.
ii) Connect the 'Normalization' component to a configured data source.
iii) Click the 'Normalization' component.

iv) Configure the following component fields:
Properties


















a. Column Selection
i. Select a Column: Select a column using the drop-down menu (only a numerical column can be selected)
b. Behavior
i. Normalization Type: Select the 'Min-Max' normalization type from the drop-down menu
ii. New Maximum Value: Set a new maximum value (the default value for this field is 1)
iii. New Minimum Value: Set a new minimum value (the default value for this field is 0)
v) Click 'Apply'.

vi) Click ‘Run’. vii) Users will be directed to the ‘Console’ tab.

viii) Follow the below-given steps to display the result view:
a. Click the dragged Normalization component on the workspace.
b. Click the 'Result' tab.




















2. Zero Score
i) Select and drag the 'Normalization' component onto the workspace.
ii) Connect the 'Normalization' component to a configured data source.
iii) Click the 'Normalization' component.
iv) Configure the required component fields:
Properties
a. Column Selection
i. Select a Column: Select a column using the drop-down menu (only a numerical column can be selected).
b. Behavior
i. Normalization Type: Select the 'Zero-Score' normalization type from the drop-down menu.
v) Click 'Apply' to configure the fields.



















vi) Click 'Run'.
vii) Users will be directed to the 'Console' tab.

viii) Follow the below-given steps to display the result view:
a. Click the dragged Normalization component on the workspace.
b. Click the 'Result' tab.

3. Decimal Scaling
i) Select and drag the 'Normalization' component onto the workspace.
ii) Connect the 'Normalization' component to a configured data source.
iii) Click the 'Normalization' component.
iv) Configure the required component fields:
Properties
a. Column Selection
i. Select a Column: Select a column using the drop-down menu (only a numerical column can be selected).
b. Behavior


















i. Normalization Type: Select the 'Decimal Scaling' normalization type from the drop-down menu.
v) Click 'Apply' to configure the fields.

vi) Click 'Run'.
vii) Users will be directed to the 'Console' tab.

viii) Follow the below-given steps to display the result view:
a. Click the dragged Normalization component on the workspace.
b. Click the 'Result' tab.

Note:




















1. Normalization displays only columns containing numerical data.
2. 'New Maximum Value' must be greater than 'New Minimum Value'.

6.6. Sample
This component can be used to select a subsection of data from a large dataset. The following sample types are supported by the Sample component:



Sampling Methods
1. First N: It selects the first N records from the data source. E.g., if the selected value for "N" is 10, it selects the first 10 records from the data.
2. Last N: It selects the last N records from the data source. E.g., if the selected value for "N" is 5, it selects the last 5 records from the data.
3. Every Nth: It selects every Nth record from the data source, wherein "N" indicates an interval. E.g., if N=3, the 3rd, 6th, and 9th records are selected from the data.
4. Simple Random: It selects records randomly, as per the value of "N" or the percentage mentioned for "N". E.g., if the selected value for "N" is 4, it randomly selects any 4 records from the data source. If the selected value for "N" is 4%, it selects 4% of the records from the data source.
5. Systematic Random: It selects data based on the bucket size. E.g., if the selected value for the bucket is 2, it selects the 1st, 3rd, 5th records or the 2nd, 4th, 6th records from the data source.



Applying Sampling
i) Select and drag the 'Sample' component onto the workspace.
ii) Connect the 'Sample' component to a configured data source.
iii) Click the 'Sample' component.




















iv) Configure the required component fields:
Properties
a. Sampling Information
i. Sampling Type: Select an option from the drop-down menu
ii. Limit Rows by: Select an option from the drop-down menu. This field offers two options, as described below:
1. Number of Rows: Selecting this option displays a new field, 'Number of Rows'.
2. Percentage of Rows: Selecting this option displays a new field, 'Percentage of Rows'.
b. Sample Size Limit
i. Maximum Rows: The maximum number of rows that can be viewed in the 'Result' tab (an optional field).
v) Click 'Apply'.
vi) Click 'Run'.
vii) Users will be redirected to the 'Console' tab.

viii) On accessing the 'Result' tab, users will see a result view based on the selected sampling type.
• Check the following properties tab(s) and result list view(s) for the various sampling options:
1. First N (where 'N' is 1 row)




















2. Last N (where 'N' is 5% and the maximum number of rows is 6)




















3. Every Nth (the interval is 3 and the maximum number of rows is 7)

4. Simple Random (the number of rows selected is 3). Any 3 randomly selected rows will be displayed.




















5. Systematic Random (Bucket Size is 3).


















6.7. R Split Data

The R Split Data component is used to split a dataset into training and testing data sets per the given percentage and method. Once the most suitable model is determined from the training data, users can pass the test data to validate the model.

R Split Data appears as a leaf node under the Data Preparation tree-node. The R Split Data component consists of two connector nodes: the upper node for the training data set and the lower node for the testing data set.

i) Select the 'R Split Data' component and connect it with a valid data source (in this case, select Cassandra reader).
ii) Click the 'R Split Data' component on the workspace.
iii) Users will be directed to the Properties fields provided under the 'Components' tab.
iv) Configure the following Properties:
a. Relative (Train): Enter a value to decide the ratio of training data out of the dataset (Type: Decimal, Range: 0-1; the sum of train and test should be 1).
b. Relative (Test): Enter a value to decide the ratio of test data out of the


















dataset (Type: Decimal, Range: 0-1; the sum of train and test should be 1).
v) Click 'Apply'.

vi) Click ‘Run’. vii) Users will be directed to the ‘Console’ tab.

viii) Follow the below-given steps to display the result view:
a. Click the dragged R Split Data component on the workspace.
b. Click the 'Result' tab. The Result tab will contain the two data sets separated by sub-tabs, as shown in the below-given images:




















a. Select the ‘Split 1’ tab to see one set of data (the training dataset).

b. Select the ‘Split 2’ tab to see another set of data (the testing data set).
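As a quick check of the Relative values: entering, for example, 0.8 for Relative (Train) and 0.2 for Relative (Test) satisfies the sum-to-1 requirement and yields an 80/20 train/test split.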

Note: The current document covers the steps for a CSV file data set for all the R Data Preparation components. Similar steps can be followed for a Data Service data set.


















6.8. Spark Split Data

The Spark Split Data component is used to split a dataset into training and testing data sets. Once the most suitable model is determined from the training data, users can pass the test data to that model. Spark Split Data appears as a leaf node under the Data Preparation tree-node.

The Spark Split Data component consists of two connector nodes: the upper node for the training data set and the lower node for the testing data set.

i) Select the 'Spark Split Data' component and connect it to a valid data source (in this case, select Cassandra reader).
ii) Click the 'Spark Split Data' component on the workspace.
iii) Users will be directed to the Properties fields provided under the 'Components' tab.
iv) Configure the following Properties:
a. Relative (Train): Enter a value to decide the ratio of training data out of the dataset (Type: Decimal, Range: 0-1; the sum of train and test should be 1).
b. Relative (Test): Enter a value to decide the ratio of test data out of the dataset (Type: Decimal, Range: 0-1; the sum of train and test should be 1).
c. Seeds: Enter a numerical value (default value: 10; an optional field). This sets the seed of Spark's random number generator, which is useful for creating simulations or random objects that can be




















reproduced. The random numbers are the same, and they would continue to be the same irrespective of how far into the sequence the users go. Use the seed when running simulations to ensure that all results, figures, etc. are reproducible.
v) Click 'Apply'.

vi) Click 'Run'.
vii) A message will pop up to confirm whether users want to enable logging.
viii) Click 'No'.

ix) Users will be directed to the ‘Console’ tab.




















x) Follow the below-given steps to display the result view:
a. Click the dragged Spark Split Data component on the workspace.
b. Click the 'Result' tab.
xi) The Result tab will contain the two data sets separated by sub-tabs, as shown in the below-given images:
a. Select the 'Split 1' tab to see one set of data (the training data set).




















b. Select the ‘Split 2’ tab to see another set of data (the testing data set).

Note:
• Users need to click the Spark component and then click the 'Result' tab to display the result view for any Spark component.
• Only the Cassandra reader is supported as a data source.

6.9. Spark Filter

The Spark Filter has been added as a leaf node to the Data Preparation tree-node. Users can provide a filter condition, appended to "@", to filter out data. Users should make sure that the given condition returns only true or false. (A hypothetical example condition is shown at the end of this section.)
i) Drag and configure the data source (in this case, select Cassandra reader).
ii) Click 'Run' and check the 'Result' for the data source.




















iii) Drag the ‘Spark Filter’ component onto the workspace. iv) Connect it with the configured data source.

v) Right-click on the Spark Filter component.
vi) Provide a condition for the 'Row Filter'.
vii) Click 'Next'.

viii) Users will be directed to configure a condition for the ‘Column Filter’. ix) Click ‘Apply’ after configuration.




















x) Click 'Run'.
xi) A message will pop up to confirm whether users want to enable logging.
xii) Click 'No'.

xiii) Users will be directed to the ‘Console’ tab.

xiv) Follow the below-given steps to display the result view:
a. Click the dragged Spark Filter component on the workspace.
b. Click the 'Result' tab.
xv) The filtered result data will be displayed.
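As a hypothetical illustration (the column name is invented, and the exact expression syntax may differ in your deployment), a Spark Filter row condition could look like:

@ sales_amount > 1000

The condition attached to "@" must evaluate to true or false for each row; only the rows for which it returns true are kept.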




















6.10. Spark Data Type Definition
This component can be used to typecast data into another form. Users can change the data type of a column or change the alias name of the column using this component. Spark Data Type Definition appears as a leaf node under the Data Preparation tree-node.
i) Select the 'Spark Data Type Definition' component and connect it with a valid data source (in this case, select Cassandra Reader as the data source).

ii) Configure the Properties fields for the Spark Data Type Definition component.
iii) Configure the following ‘Data Type Transformation’ details:
   a. Column Name: Select a column name that you want to change.
   b. Alias Name: Enter an alias name for the required source column.
   c. Primary Data Type: Select the primary data type to which you want to change the column.
   d. ‘Add’ option: Click this button to add more columns to be transformed.
iv) Click ‘Apply’.
v) Click ‘Run’.
vi) A message will pop up to confirm whether users want to enable logging.
vii) Click ‘No’.

viii) Users will be directed to the ‘Console’ tab.

ix) Follow the below-given steps to display the result view:
   a. Click the data preparation component on the workspace.
   b. Click the ‘Result’ tab.

Note:
a. Users cannot typecast the advanced column types (e.g. map, list, UDT), UUID, and timestamp.
b. Only the Integer, Double, and String data types are supported by the Spark Data Type Definition.
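Conceptually, the component maps onto Spark’s cast and alias operations. A minimal PySpark sketch; the column names and types are illustrative assumptions:

```python
from pyspark.sql.functions import col

# Illustrative only; 'df' is the DataFrame produced by the configured data source.
# Cast each column to a supported primary data type and rename it via an alias.
typed_df = df.select(
    col("sepal_length").cast("double").alias("SepalLengthNum"),  # String -> Double
    col("petal_count").cast("int").alias("PetalCount"),          # Double -> Integer
)
typed_df.printSchema()
```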

7. Data Transformation

The Data Transformation components are pipeline components. Users need to connect an ‘Apply Model’ component with these components to complete a workflow and get the results.

7.1. String Indexer

String Indexer converts a string column of labels to a column of label indices. The indices lie in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. If the input column is numeric, it is cast to string before its values are indexed. When pipeline components such as an Estimator or Transformer make use of this string-indexed label, users must set the input column of the component to this string-indexed column name. Users can set the input column with setInputCol.
i) Users need to select the String Indexer component and connect it with a configured data source.

ii) Configure the required component fields for the String Indexer:
   a. The Properties tab for the String Indexer contains an option to select the ‘Label Column’ from the previous component’s headers, on which the new column will be created.
   b. Users can rename the created label column using the ‘Label Column Name’ field.
   c. When applied to a dataset, the String Indexer handles unseen labels using either of the two methods provided under the ‘Advanced’ tab:

      i. Error: The unseen labels will be thrown as an exception (default).
      ii. Skip: The rows containing the unseen labels will be skipped.
iii) Click ‘Apply’.

iv) Click ‘Run’.
v) A message will pop up to confirm whether users want to enable logging.
vi) Click ‘No’.

vii) Users will be directed to the ‘Console’ tab.
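The described behavior matches Spark ML’s StringIndexer. A minimal PySpark sketch; the column names are illustrative assumptions, and handleInvalid corresponds to the Error/Skip choice under the ‘Advanced’ tab:

```python
from pyspark.ml.feature import StringIndexer

# Illustrative only; 'df' is the DataFrame produced by the configured data source.
indexer = StringIndexer(
    inputCol="species",     # the selected 'Label Column'
    outputCol="Label",      # the 'Label Column Name'
    handleInvalid="error",  # 'error' throws on unseen labels; 'skip' drops those rows
)
indexed_df = indexer.fit(df).transform(df)
```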

7.2. Spark R Formula

The Spark R Formula can be used to produce a vector column of features and a double column of labels.
i) Users need to select the Spark R Formula component and connect it with a configured data source.
ii) Select the Spark R Formula component and configure the following fields under the component tab:
   a. Column Selection: Select the desired Features and Labels from the column headers provided under the Properties tab.
   b. Enable Formula: Enable this option to get a formula (by selecting this option, the ‘Apply’ button will change into ‘Next’).
   c. New Column Information: Provide names for the newly created Feature and Label columns.
iii) Click ‘Next’.

iv) Users will be directed to the next page to enter a formula.
v) Enter a formula in the given box by double-clicking the available values.
vi) Click ‘Apply’.
vii) Click ‘Run’.
viii) A message will pop up to confirm whether users want to enable logging.
ix) Click ‘No’.

x) Users will be directed to the ‘Console’ tab.

Note:

a. The Spark R Formula can also connect to the Data Preparation components with the prefix ‘Spark’, such as the Spark Split Data and Spark Data Type Definition.
b. Users can change the column names by changing the New Column Information values.
c. Since the Spark R Formula is a pipeline component, the results can be viewed only after running the R Formula with an ‘Apply Model’ or another pipeline component.
d. The ‘Data Preparation’ components cannot be added in between pipeline components in a workflow.
e. The end of the pipeline should be an ‘Apply Model’ component.
f. A model can be saved from the context menu of an ‘Apply Model’ component.
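The description matches Spark ML’s RFormula transformer. A minimal PySpark sketch; the formula and column names are illustrative assumptions:

```python
from pyspark.ml.feature import RFormula

# Illustrative only; 'df' is the DataFrame produced by the configured data source.
rf = RFormula(
    formula="species ~ sepal_length + petal_length",  # label ~ features
    featuresCol="features",  # newly created Feature column (vector)
    labelCol="label",        # newly created Label column (double)
)
prepared_df = rf.fit(df).transform(df)
```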

7.3. Spark PCA

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (PCs). The PCA class trains a model to project vectors to a low-dimensional space.
i) Users need to select the Spark PCA component and connect it with a configured data source.


ii) Configure the following component fields for the Spark PCA:
   a. Input Column
      i. Features: Select the required features from the drop-down menu.
      ii. K Value: Enter the number of principal components.
   b. Output Column
      i. Predicted Column Name: Enter a column header for the predicted column.
iii) Click ‘Apply’.

iv) Click ‘Run’.
v) A message will pop up to confirm whether users want to enable logging.
vi) Click ‘No’.


vii) Users will be directed to the ‘Console’ tab.
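The component corresponds to Spark ML’s PCA estimator, which expects an assembled features vector. A minimal PySpark sketch; the column names and k value are illustrative assumptions:

```python
from pyspark.ml.feature import PCA, VectorAssembler

# Illustrative only; 'df' is the DataFrame produced by the configured data source.
assembler = VectorAssembler(
    inputCols=["sepal_length", "petal_length"], outputCol="features"
)
assembled_df = assembler.transform(df)

# K Value = 2 principal components.
pca_model = PCA(k=2, inputCol="features", outputCol="pca_features").fit(assembled_df)
pca_df = pca_model.transform(assembled_df)
```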

7.4. Spark Chi Square

In probability theory and statistics, the chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. It is a special case of the gamma distribution and is one of the most widely used probability distributions in inferential statistics, e.g. in hypothesis testing or in the construction of confidence intervals. When it is being distinguished from the more general noncentral chi-squared distribution, this distribution is sometimes called the central chi-squared distribution.
i) Users need to select the Spark Chi Square component and connect it with a configured data source.

ii) Configure the following component fields for the Spark Chi Square:
   a. Input Column
      i. Features: Select the required features from the drop-down menu.
      ii. K Value: Enter the number of principal components.
   b. Output Column
      i. Predicted Column Name: Enter a column header for the predicted column.
iii) Click ‘Apply’.


iv) Click ‘Run’.
v) A message will pop up to confirm whether users want to enable logging.
vi) Click ‘No’.

vii) Users will be directed to the ‘Console’ tab.
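For reference, Spark ML exposes a chi-squared independence test through ChiSquareTest. The sketch below shows that API under assumed column names; the tool’s own field set may differ:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import ChiSquareTest

# Illustrative only; 'df' must contain numeric feature columns and a label column.
assembled_df = VectorAssembler(
    inputCols=["sepal_length", "petal_length"], outputCol="features"
).transform(df)

result = ChiSquareTest.test(assembled_df, featuresCol="features", labelCol="label")
result.select("pValues", "degreesOfFreedom", "statistics").show(truncate=False)
```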

7.5. Spark Index to String

The Spark Index to String component can be used to convert an index label column into a String column, so that it can be applied to certain algorithms that require an index column as the Label Column. This component consists of an option to select the label column from the previous component’s headers. After selecting a label column, users can change the column header of the newly stringed column, which is called ‘Label’ by default.
i) Users need to select and drag a configured data source onto the workspace.
ii) Connect the Spark String Indexer component with the data source and configure it (Ref. section 7.1).

iii) Connect the Spark Index to String component with the Spark String Indexer component on the workspace.
iv) Configure the following component fields for the Spark Index to String:

   a. Column Selection
      i. Label Column: Select a column using the drop-down menu. Make sure that you select the same column that was selected while configuring the String Indexer component (in this case, it is ‘PetalLength’).
   b. New Column Information
      i. Label Column Name: By default, the column name appears as ‘Labels’; users can change the column header/name using this field.
v) Click ‘Apply’.
vi) Click ‘Run’.
vii) A message will pop up to confirm whether users want to enable logging.
viii) Click ‘No’.
ix) Users will be directed to the ‘Console’ tab.


Note: Users need to first connect the data source with the ‘String Indexer’ component and then the combination can be connected to the ‘Index to String’ component.
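The component corresponds to Spark ML’s IndexToString transformer, which reverses a StringIndexer. A minimal PySpark sketch; column names are illustrative assumptions:

```python
from pyspark.ml.feature import IndexToString

# Illustrative only; 'indexed_df' is the output of the String Indexer step above.
converter = IndexToString(
    inputCol="Label",    # the index column created by the String Indexer
    outputCol="Labels",  # the new String column ('Labels' by default)
)
converted_df = converter.transform(indexed_df)
```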

7.6. Spark SQL Transformer

The Spark SQL Transformer implements the transformations defined by an SQL statement. Currently, only SQL syntax like "SELECT ... FROM __THIS__ ..." is supported, where "__THIS__" represents the underlying table of the input data set. The select clause specifies the fields, constants, and expressions to display in the output; any select clause supported by Spark SQL can be used. Users can also use Spark SQL built-in functions and UDFs.
i) Select the Spark SQL Transformer component and connect it with a configured data source.

ii) Configure the required component fields for the Spark SQL Transformer:
   a. SQL Statement: Provide an SQL statement.
   b. Fields: All the available fields under the selected data source will be listed.
iii) Click ‘Apply’.


iv) Click ‘Run’.
v) A message will pop up to confirm whether users want to enable logging.
vi) Click ‘No’.

vii) Users will be directed to the ‘Console’ tab.
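A minimal PySpark sketch using Spark ML’s SQLTransformer, which the description mirrors; the statement and column names are illustrative assumptions:

```python
from pyspark.ml.feature import SQLTransformer

# Illustrative only; '__THIS__' stands for the input data set.
sql = SQLTransformer(
    statement="SELECT *, sepal_length * petal_length AS area FROM __THIS__"
)
transformed_df = sql.transform(df)
```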

7.7. Spark Group By

Spark Group By is a transformation operation. Users can apply the ‘Spark Group By’ transformation on the data frame of the last node’s output; the columns on which aggregation is performed are added to the output with the given alias names.
i) Select the Spark Group By component and connect it with a configured data source.
ii) Configure the required component fields for the Spark Group By:
   a. Aggregation Columns
      i. Column Name: Select a column from the drop-down menu.
      ii. Alias Name: Enter an alias name for the selected column.
      iii. Aggregation Type: Select an aggregation type from the drop-down menu.
      iv. Click the ‘Add’ icon to add a new series to configure another aggregation column.
   b. Select the required columns from the ‘Group By Columns’ and move them to the ‘Selected Columns’.
   c. Use ‘Up’ and ‘Down’ to change the order of the selected columns.
iii) Click ‘Apply’.


iv) Click ‘Run’.
v) A message will pop up to confirm whether users want to enable logging.
vi) Click ‘No’.

vii) Users will be directed to the ‘Console’ tab.
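Conceptually, this is Spark’s groupBy followed by agg, with aliases applied to the aggregated columns. A minimal PySpark sketch; the column names and aggregation types are illustrative assumptions:

```python
from pyspark.sql import functions as F

# Illustrative only; 'df' is the DataFrame produced by the configured data source.
grouped_df = (
    df.groupBy("species")  # 'Group By Columns'
      .agg(
          F.avg("sepal_length").alias("AvgSepalLength"),  # Aggregation Type + Alias Name
          F.max("petal_length").alias("MaxPetalLength"),
      )
)
```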

Note:
a. The Data Transformation components can be connected to the Data Preparation components with the prefix ‘Spark’.
b. A ‘Data Preparation’ component cannot be added in between the ‘Data Transformation’ and ‘Apply Model’ components in a workflow.
c. All the ‘Data Transformation’ components are pipeline components. Results can be viewed only after connecting them to an ‘Apply Model’ component.
d. The end of the pipeline should be an ‘Apply Model’ component.
e. A model can be saved from the context menu of an ‘Apply Model’ component.

8. Algorithms

Algorithms are statistical sets of rules that help users analyze large quantities of numerical data and extract appropriate information out of it. BizViz Predictive Analysis allows users to apply more than one algorithm to manage a vast amount of data.
• Applying an Algorithm to a Data Source:

i) Click the ‘Algorithms’ tree-node on the Predictive Analysis home page.
ii) Click the Algorithm Category tree-node to display the available algorithm categories.
iii) Select and drag an algorithm component onto the workspace.
iv) Connect the algorithm component to a configured data source.
v) Click on the algorithm component.
vi) Configure the following ‘Components’ fields for the dragged algorithm component.
vii) Click ‘Apply’ to save the information.
viii) Click ‘Run’.
ix) Users will be directed to the ‘Console’ tab.

x) Click the algorithm component on the workspace and click the ‘Result’ tab.
xi) The result view will be displayed.
xii) Click the ‘Visualization’ tab to see a graphical representation of the result data.

xiii) Click the ‘Delete’ or ‘Reset’ option to remove the selected algorithm component from the workspace.

Note:
a. Users can follow the above-mentioned steps to configure all the available R-algorithms.
b. Users can configure an alias name for the algorithm component via the ‘General’ tab.
c. Basic configuration for all the algorithms is done through the ‘Properties’ tab. Users are required to manually configure this tab while applying an algorithm component.
d. Users can avail all the default values under the ‘Advanced’ tab. Users need to manually set the ‘Advanced’ tab only if advanced-level configuration is required.
e. After execution, users can click on the respective component to get data. A pipeline component will not have any result set; only a summary will be available. Users need to connect the pipeline components with an ‘Apply Model’ component and a test data set to view the result.

8.1. Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

8.1.1. R-K Means

K-means clustering is one of the most commonly used clustering methods. It clusters data points into a predefined number of clusters. It first clusters observations into ‘K’ groups, wherein ‘K’ is an input parameter. The algorithm then assigns each observation to a cluster based on the proximity of the observation.
• Applying R-K Means to a Data Source
Users will be redirected to the ‘Component’ tabs when applying the ‘R-K Means’ algorithm component to a configured data source.
i) Drag the R-K Means component to the workspace and connect it to a configured data source.

ii) The Component tabs will be displayed on the Viewspace.
iii) Configure the following fields in the ‘Properties’ tab:
   a. Output Information
      i. Number of Clusters: Enter the number of groups for clustering. The default value for this field is 5. The range should be between 1 and the total number of clusters.
   b. Column Selection
      i. Feature: Select the input columns with which you want to perform the analysis.
   c. New Column Information
      i. Cluster Name: Enter a name for the new column displaying the cluster number.

• Rules for Naming a New Column
i) Do not use spaces in the name of a new column. It should be a single word, or two words connected by an underscore (_), e.g. SampleData or Sample_Data.

ii) Do not use any special symbol, alone or with any character, as the name of a new column; e.g. %, #, $, @, * or Sample# are not acceptable.
iii) Do not use single or double quotes, dots, or brackets to name a new column.
iv) Do not use numbers alone to name a new column. Numbers can be used with at least one alphabetic character, and the name should not begin with a numeral.
v) The name given to a new column should not exceed 50 characters.
Note: Click the information icon provided next to the ‘New Column Information’ tab to display the list of rules for naming a new column.
iv) Click the ‘Advanced’ tab.
   a. Configure the required ‘Behavior’ fields:
      i. Maximum Iterations: Enter the number of iterations allowed for discovering clusters (The default value for this field is 100).
      ii. Number of Initial Centroids: Enter the number of random initial centroid sets for clustering (The default value for this field is 1).
      iii. Algorithm Type: Select an algorithm type from the drop-down menu.
      iv. Initial Cluster Center Seed: Enter a number indicating the initial cluster center seed (The default value for this field is 10).

v) Click ‘Apply’.
vi) Click ‘Run’.

vii) Users will be redirected to the ‘Console’ tab.
viii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
ix) Users will be redirected to the ‘Result’ tab.
x) A new column ‘Cluster Number’ will be displayed in the result view.

xi) Click the ‘Visualization’ tab.
xii) The result data will be displayed via the Scatter Plot Matrix chart.
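For readers who want to reproduce the configuration outside the tool, the ‘Properties’ and ‘Advanced’ fields map naturally onto scikit-learn’s KMeans. This is a hedged Python sketch, not the tool’s R implementation; the file and column names are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("iris.csv")                       # illustrative data source
features = df[["sepal_length", "petal_length"]]    # 'Feature' columns

km = KMeans(
    n_clusters=5,     # Number of Clusters (default 5)
    max_iter=100,     # Maximum Iterations (default 100)
    n_init=1,         # Number of Initial Centroids (default 1)
    random_state=10,  # Initial Cluster Center Seed (default 10)
)
df["Cluster_Number"] = km.fit_predict(features)    # the new cluster column
```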

8.1.2. Spark-K-Means

The Spark K-Means algorithm is provided as an option under the clustering algorithm category. The spark.ml implementation includes a parallelized variant of the k-means++ method called k-means||.
• Applying Spark-K-Means to a Data Source
i) Drag the Spark-K-Means component to the workspace and connect it to a configured data source.

ii) Configure the following fields in the ‘Properties’ tab:
   a. Output Information
      i. Number of Clusters: Enter the number of groups for clustering. The default value for this field is 5. The range should be between 1 and the total number of clusters.
   b. Column Selections
      i. Feature: Select the input columns with which you want to perform the analysis.
   c. New Column Information
      i. Cluster Name: Enter a name for the new column displaying the cluster number.


iii) Select the ‘Advanced’ tab.
   a. Configure the following ‘Behavior’ fields:
      i. Maximum Iterations: Enter the number of iterations allowed for discovering clusters (The default value for this field is 20).
      ii. Initialization Mode: Select one option for the beginning of the algorithm: ‘Random’ or ‘k-means||’ (default).
      iii. Initialization Steps: Set the number of steps for the initialization mode (The default value for this field is 5).
      iv. Convergence Tolerance: Set the tolerance level to include clusters. The default value for this field is given in exponential form: 1.0e-4.
      v. Initial Cluster Center Seed: Enter a number indicating the initial cluster center seed (The default value for this field is 10).


iv) Click ‘Apply’.
v) Click ‘Run’ to start the execution.
vi) A message will pop up to confirm whether users want to enable logging.
vii) Click ‘No’.
viii) Users will be directed to the ‘Console’ tab.

ix) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
x) A new column ‘ClusterNumber’ will be added to the displayed result data.


xi) Click the ‘Visualization’ tab.
xii) The result data will be displayed via the Scatter Plot Matrix chart.

Note: Users can click the ‘Summary’ tab to display a summary of the model. E.g. the following image is a sample demonstrating how a summary can be displayed for the Spark-K-Means algorithm component.
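The configuration corresponds closely to Spark ML’s KMeans parameters. A minimal PySpark sketch; the column names are illustrative assumptions:

```python
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Illustrative only; 'df' is the DataFrame produced by the configured data source.
assembled_df = VectorAssembler(
    inputCols=["sepal_length", "petal_length"], outputCol="features"
).transform(df)

kmeans = KMeans(
    k=5,                   # Number of Clusters
    maxIter=20,            # Maximum Iterations (default 20)
    initMode="k-means||",  # Initialization Mode ('random' or 'k-means||')
    initSteps=5,           # Initialization Steps (default 5)
    tol=1e-4,              # Convergence Tolerance (default 1.0e-4)
    seed=10,               # Initial Cluster Center Seed (default 10)
    predictionCol="ClusterNumber",
)
clustered_df = kmeans.fit(assembled_df).transform(assembled_df)
```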

8.2. Regression Analysis

This algorithm is used to determine how an individual variable influences another variable using an exponential function. It finds a trend in the data set by applying univariate regression analysis. There are three sub-types provided under ‘Regression Analysis’:

8.2.1. R-Linear Regression

i) Drag the R-Linear Regression component to the workspace and connect it with a configured data source.

ii) Configure the following fields in the ‘Properties’ tab:
   a. Column Selection
      i. Dependent Column: Select the target column on which the regression analysis will be applied.
      ii. Independent Column: Select the required input columns against which the regression analysis will be applied to the target column.
   b. New Column Information
      i. Predicted Column Name: Enter a name for the new column containing the predicted values.


iii) Click the ‘Advanced’ tab and configure if required:
   a. Input Data Handling
      i. Missing Values: Select a method to deal with missing values from the drop-down menu.
         1. Ignore: Selecting this option will skip the records containing missing values from the dependent and independent columns.
         2. Keep: Selecting this option will retain the records containing missing values while performing the calculation.
         3. Stop: Selecting this option will stop application of the algorithm if a value is missing in any column.
   b. Behavior
      i. Allow Singular Fit: Select an option providing a value for the Boolean column.
         1. True: Selecting this option will ignore aliased coefficients from the coefficient covariance matrix.
         2. False: Selecting this option will show an error in a model containing aliased coefficients.
      ii. Contrasts: Selecting this option will display a list of contrast items that can be used for some variables in the model.
      iii. Confidence Level: Enter a value specifying the accuracy (confidence level) of predictions for the algorithm. This field takes 0.95 as the default value.
Note: A model containing aliased coefficients signifies that the square matrix x*x is singular.
iv) Click ‘Apply’.
v) Click ‘Run’.
vi) Users will be redirected to the ‘Console’ tab.
vii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
viii) A new column ‘Predicted Values1’ will be added to the result data displaying the predicted values.


ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the time series chart.
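As a point of reference, the behavior described above (predictions plus a confidence level) can be sketched in Python with statsmodels. This is a hedged, illustrative example with made-up data, not the tool’s own R code:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 7.8, 10.1]})
X = sm.add_constant(df[["x"]])             # Independent Column(s) plus intercept
model = sm.OLS(df["y"], X).fit()           # Dependent Column: y
df["PredictedValues1"] = model.predict(X)  # the new predicted column
print(model.conf_int(alpha=0.05))          # 0.95 confidence level for the coefficients
```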

Note: The ‘Behavior’ fields provided under the ‘Advanced’ section differ as per the algorithm sub-type. ‘Input Data Handling’ remains the same for all the provided regression types; hence, only the ‘Advanced’ tab is explained for the remaining sub-algorithms provided under ‘Regression’.

8.2.2. R-Multiple Linear Regression

i) Drag the R-Multiple Linear Regression component to the workspace and connect it with a configured data source.


ii) Configure the ‘Properties’ tab.
iii) Click the ‘Advanced’ tab and configure if required:
   a. Input Data Handling
      i. Missing Values: Select a method to deal with missing values (via the drop-down menu).
         1. Ignore: Selecting this option will skip the records containing missing values from the dependent and independent columns.
         2. Keep: Selecting this option will retain the records containing missing values while performing the calculation.
         3. Stop: Selecting this option will stop application of the algorithm if a value is missing in any column.
   b. Behavior
      i. Confidence Level: Enter a value specifying the accuracy (confidence level) of predictions for the algorithm. This field takes 0.95 as the default value.

iv) Click ‘Apply’.
v) Click ‘Run’.
vi) Users will be redirected to the ‘Console’ tab.


vii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
viii) A new column ‘PredictedValues1’ will be added to the result data.
ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the time series chart.

8.2.3. R-Logistic Regression

i) Drag the R-Logistic Regression component to the workspace and connect it with a configured data source.
ii) Configure the ‘Properties’ tab.
iii) Click the ‘Advanced’ tab and configure if required:
   a. Behavior
      i. Family: Select an option from the drop-down list:
         1. Binomial
         2. Poisson
         3. Gaussian
         4. Gamma
         5. Quasi
         6. Quasipoisson
         7. Quasibinomial
      ii. Maximum No. of Iterations: Enter a valid integer value allowed to calculate the algorithm coefficient. The default value for this field is 25.

iv) Click ‘Apply’.
v) Click ‘Run’.
vi) Users will be redirected to the ‘Console’ tab.


vii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
viii) A new column ‘PredictedValues1’ will be added to the result data.
ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the scatter plot with regression line chart.
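The ‘Family’ and ‘Maximum No. of Iterations’ options mirror R’s glm interface. A comparable, illustrative Python sketch with statsmodels; the data is hypothetical:

```python
import numpy as np
import statsmodels.api as sm

X = sm.add_constant(np.array([[1.0], [2.0], [3.0], [4.0]]))
y = np.array([0, 0, 1, 1])

# Family: Binomial gives logistic regression; maxiter maps to 'Maximum No. of Iterations'.
model = sm.GLM(y, X, family=sm.families.Binomial()).fit(maxiter=25)
print(model.predict(X))  # predicted probabilities
```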

8.2.3.1. Spark K-Means Connected to the Pipeline Components

i) Connect the Spark algorithm component with a pipeline component as shown in the following image:

ii) Configure the required component fields and click ‘Run’.
iii) Users will be redirected to the ‘Console’ tab.

iv) Follow the below-given steps to display the result view:
   a. Click the data preparation component on the workspace.
   b. Click the ‘Result’ tab.
v) Click the ‘Visualization’ tab to see the result data via the Scatter Plot Matrix chart.

8.3. Forecasting

Forecasting is the process of making predictions of the future based on past and present data and the analysis of trends. It uses smoothing as a statistical technique to spot trends in disorderly data. It can also compare trends between two or more variable time series. There are five sub-types provided under ‘Forecasting’:


8.3.1. Triple Exponential Smoothing

i) Drag the Triple Exponential Smoothing component to the workspace and connect it to a configured data source.

ii) Configure the following fields in the ‘Properties’ tab:
   a. Output Information
      i. Output Mode: Select a mode in which you want to display the output data.
         1. Trend: Selecting this option will display the source data along with predicted values for the given data set. A new column ‘Predicted Values’ will be added in the result view when the ‘Trend’ output mode has been selected.
         2. Forecast: Selecting this option will display forecasted values for the given time period. Results will be appended to the target column when the ‘Forecast’ output mode has been selected.
      ii. Period to Forecast: Enter a period to forecast. This field appears only when the selected ‘Output Mode’ option is ‘Forecast’.
      iii. Select Output Columns: Select the columns that you want to display in the output (select at least one column using a tick mark).
   b. Column Selection
      i. Target Variable: Select the target variable on which you want to apply forecasting analysis (the first option gets selected by default; only numerical columns are accepted).
   c. Input Data Handling
      i. Period: Select the period of forecasting by choosing one option from the drop-down menu.
      ii. Period Per Year: This field appears only when the selected ‘Period’ option is ‘Custom’.
      iii. Start Period: Enter a value between 1 and the value specified for the selected ‘Period’ option.
      iv. Start Year: Enter the year from which you want the data entries to be considered. Enter a four-digit value for selecting a year (e.g. 2000).
   d. New Column Information
      i. Predicted Column Name: Enter a name for the column containing the predicted values (this field is predefined and displayed only if the selected Output Mode is ‘Trend’).
      ii. Year Values: Enter a name for the column containing the year values (this field is predefined, but users can change the value if needed).
      iii. Period Values: Enter a name for the column containing the period values (this field is predefined, but users can change the value if needed).

In this case, the selected ‘Period’ option is ‘Custom’; hence, the ‘Period Values’ field is displayed under the ‘New Column Information’.

Note:
a. The ‘New Column Information’ about the selected periods varies as per the selected ‘Period’ option from the ‘Input Data Handling’. The below-mentioned column names are displayed for the Period Value columns based on the selected ‘Period’ option:

   Selected ‘Period’ option    Displayed Period Value field under ‘New Column Information’
   Quarter                     Quarter Values
   Month                       Month Values
   Custom                      Period Values

b. The ‘Period Per Year’ field under the ‘Input Data Handling’ section is displayed only when ‘Custom’ is selected as an option for the ‘Period’ field.
iv) Click the ‘Advanced’ tab and configure if required:
   a. Configure the following ‘Behavior’ fields:
      i. Alpha: Enter a valid double value in the given field for smoothing observations (Alpha Range: 0 to 1).

v) Click ‘Apply’.
vi) Click ‘Run’.
vii) Users will be directed to the ‘Console’ tab.
viii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab (in this case, the selected output mode is ‘Forecast’).


ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the Time Series Chart.
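Triple exponential smoothing is the Holt-Winters method. An illustrative Python sketch with statsmodels showing both output modes; the sample values and period settings are assumptions:

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Quarterly series starting in 2000 (Period = 'Quarter', Start Year = 2000).
y = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.period_range("2000Q1", periods=12, freq="Q"),
)
model = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=4).fit()
print(model.fittedvalues)  # 'Trend' mode: predicted values alongside the source data
print(model.forecast(4))   # 'Forecast' mode: Period to Forecast = 4
```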

Note:
a. The ‘Properties’ and ‘General’ sections remain the same for all the Forecasting sub-algorithms.
b. The ‘Advanced’ tab displays different fields as per the Forecasting sub-types; hence, the ‘Advanced’ fields are explained for each sub-type.
c. Predicted values will be appended to the target column in the result view for all the ‘Forecasting’ algorithms.

8.3.2. Single Exponential Smoothing

i) Drag the Single Exponential Smoothing component to the workspace and connect it to a configured data source.
ii) Configure the ‘Properties’ tab.
iii) Click the ‘Advanced’ tab and configure if required:
   a. Configure the following ‘Behavior’ fields:
      i. Alpha: Enter a valid double value in the given field for smoothing observations (Alpha Range: 0 to 1).
iv) Click ‘Apply’.
v) Click ‘Run’.
vi) Users will be directed to the ‘Console’ tab.


vii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
viii) Predicted values will be appended to the target column in the result data (the selected output mode is ‘Forecast’).
ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the Time Series Chart.

8.3.3. Double Exponential Smoothing

i) Drag the Double Exponential Smoothing component to the workspace and connect it to a configured data source.
ii) Configure the ‘Properties’ tab.
iii) Click the ‘Advanced’ tab and configure if required:
   a. Configure the following ‘Behavior’ fields:
      i. Alpha: Enter a valid double value in the given field for smoothing observations (Alpha Range: 0 to 1).

      ii. Trend: Enter the initial value for finding trend parameters (It is an optional field).
      iii. Optimizer Inputs: Enter the initial values of alpha and beta required for the optimizer (It is an optional field).
iv) Click ‘Apply’.
v) Click ‘Run’.
vi) Users will be directed to the ‘Console’ tab.
vii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
viii) Predicted values will be appended to the target column in the result data (the selected output mode is ‘Forecast’).


ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the time series chart.

8.3.4. R-Auto ARIMA

i) Drag the R-Auto ARIMA component to the workspace and connect it to a configured data source.

ii) Configure the ‘Properties’ tab.
iii) Click ‘Apply’ to save the configured details.
iv) Click ‘Run’.
v) Users will be directed to the ‘Console’ tab.
vi) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
vii) Predicted values will be appended to the target column in the result data (the selected output mode is ‘Forecast’).

viii) Click the ‘Visualization’ tab.
ix) The result data will be displayed via the time series chart.

Note: The ‘R-Auto ARIMA’ does not contain the ‘Advanced’ tab.

8.3.5. R-Auto Forecasting

i) Drag the R-Auto Forecasting component to the workspace and connect it to a configured data source.
ii) Configure the ‘Properties’ tab.
iii) Click the ‘Advanced’ tab and configure if required:
   a. Configure the following ‘Behavior’ fields:
      i. Seasonal: Select a smoothing algorithm type from the drop-down menu (Holt-Winters’ Exponential Smoothing algorithm).
      ii. No. of Periodic Observations: Enter the number of periodic observations required to start the calculation (The default value for this field is 2).
   b. Configure the following ‘Initial Values’ fields:
      i. Level: Enter the initial value for the level (It is an optional field).
      ii. Trend: Enter the initial value for finding trend parameters (It is an optional field).
      iii. Season: Enter initial values for finding seasonal parameters; these depend on the selected column (It is an optional field).

      iv. Optimizer Inputs: Enter the initial values of alpha and beta required for the optimizer (It is an optional field).

iv) Click ‘Apply’.
v) Click ‘Run’.
vi) Users will be redirected to the ‘Console’ tab.
vii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
viii) Predicted values will be appended to the target column in the result data (the selected output mode is ‘Forecast’).


ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the time series chart.

8.3.6. Result View of Forecasting Algorithms when the Selected Output Mode is ‘Trend’

A new column ‘Predicted Values’ will be added to the result view when ‘Trend’ is selected as the output mode.

1. Triple Exponential Smoothing
   i) Select the ‘Trend’ option from the ‘Output Mode’ drop-down menu.
   ii) Fill in the required fields.
   iii) Click ‘Apply’.
   iv) Click ‘Run’.


   v) Users will be redirected to the ‘Console’ tab.
   vi) Follow the below-given steps to display the result view:
      a. Click the dragged algorithm component on the workspace.
      b. Click the ‘Result’ tab.
   vii) A new column ‘PredictedValues1’ will be added to the result data.
   viii) Click the ‘Visualization’ tab.
   ix) The result data will be displayed via the time series chart.

2. Single Exponential Smoothing
   i) Select the ‘Trend’ option from the ‘Output Mode’ drop-down menu.
   ii) Fill in the required fields.
   iii) Click ‘Apply’.
   iv) Click ‘Run’.
   v) Users will be redirected to the ‘Console’ tab.

   vi) Follow the below-given steps to display the result view:
      a. Click the dragged algorithm component on the workspace.
      b. Click the ‘Result’ tab.
   vii) A new column ‘PredictedValues’ will be added to the result data.


   viii) Click the ‘Visualization’ tab.
   ix) The result data will be displayed via the time series chart.

3. Double Exponential Smoothing
   i) Select the ‘Trend’ option from the ‘Output Mode’ drop-down menu.
   ii) Fill in the other required fields.
   iii) Click ‘Apply’.
   iv) Click ‘Run’.
   v) Users will be redirected to the ‘Console’ tab.

   vi) Follow the below-given steps to display the result view:
      a. Click the dragged algorithm component on the workspace.
      b. Click the ‘Result’ tab.
   vii) A new column ‘PredictedValues’ will be added to the result data.

   viii) Click the ‘Visualization’ tab.
   ix) The result data will be displayed via the time series chart.

4. R-Auto ARIMA
   i) Select the ‘Trend’ option from the ‘Output Mode’ drop-down menu.
   ii) Fill in the required fields.
   iii) Click ‘Apply’.
   iv) Click ‘Run’.
   v) Users will be redirected to the ‘Console’ tab.

   vi) Follow the below-given steps to display the result view:
      a. Click the dragged algorithm component on the workspace.
      b. Click the ‘Result’ tab.
   vii) A new column ‘PredictedValues’ will be added to the result data.

   viii) Click the ‘Visualization’ tab.
   ix) The result data will be displayed via the time series chart.

5. R-Auto Forecasting
   i) Select the ‘Trend’ option from the ‘Output Mode’ drop-down menu.
   ii) Fill in the required component fields.
   iii) Click ‘Apply’.
   iv) Click ‘Run’.
   v) Users will be redirected to the ‘Console’ tab.

   vi) Follow the below-given steps to display the result view:
      a. Click the dragged algorithm component on the workspace.
      b. Click the ‘Result’ tab.
   vii) A new column ‘PredictedValues’ will be added to the result data.

   viii) Click the ‘Visualization’ tab.
   ix) The result data will be displayed via the time series chart.

8.4. Association

This algorithm generates association rules, discovering recurrent patterns in large transactional data sets. It tries to understand the future trends of customers based on their previous purchases and assists vendors in associating items or services together.

8.4.1. Market Basket Analysis

i) Drag the Market Basket Analysis component to the workspace and connect it with a configured data source.

ii) Configure the following fields in the ‘Properties’ tab:
   a. Output Information
      i. Output Mode: Select a mode of display for the output data.
         1. Selecting ‘Rules’ will display rules for the selected data set.
         2. Selecting ‘Transaction’ will display the transaction IDs for the selected data set.
   b. Input Data Information
      i. Input Data Format: Select the format of the input data from the drop-down menu (out of the following choices):
         1. Tabular
         2. Transactions
         As per the selected ‘Input Data Format’, the result view will be of two types.
      ii. Item Columns: Select the item columns on which you want to apply the association rules/analysis. Choose at least one option from the drop-down menu. This field displays only numerical and string columns; it cannot display date columns.
      iii. Transaction Id Column: Select the column containing the Transaction Ids to which you can apply the algorithm.
      Note: The ‘Transaction Id Column’ field appears only when the ‘Transactions’ option has been selected from the ‘Input Data Format’ drop-down menu.
   c. Behavior
      i. Support: Enter a value for the minimum support of an item (The default value for this field is 0.1).
      ii. Confidence: Select a value for the minimum confidence of the association (The default value for this field is 0.8).

iii) Click the ‘Advanced’ tab and configure if required:
   a. Output Appearance
      i. Lhs Item(s): Enter item tags, separated by commas, which should display on the left-hand side of rules or item sets.
      ii. Rhs Item(s): Enter item tags, separated by commas, which should display on the right-hand side of rules or item sets.
      iii. Both Item(s): Enter item tags, separated by commas, which should display on both sides of rules or item sets.
      iv. None Item(s): Enter item tags, separated by commas, which need not display in the rules or item sets.
      v. Default Appearance: Select the default appearance of the items out of the above-given choices using the drop-down menu.
      vi. Min Length: Set the minimum length value (The default value for this field is 1).
      vii. Max Length: Set the maximum length value (The default value for this field is 10).
   b. Performance
      i. Sort Type: Select a sort type using the drop-down menu for sorting items based on their frequency.

      ii. Filter Criteria: Enter a numerical value for filtering unused items from transactions (The default value for this field is 0.1).
      iii. Use Tree Structure: Selecting the ‘True’ option from the drop-down menu will organize transactions as a prefix tree.
      iv. Use Heapsort: Selecting the ‘True’ option from the drop-down menu will use heap sort instead of quick sort for sorting transactions.
      v. Optimize Memory: Selecting the ‘True’ option from the drop-down menu will minimize memory usage instead of maximizing speed.
      vi. Load Transaction into Memory: Selecting ‘True’ from the drop-down menu will load transactions into memory.

iv) Click ‘Apply’.
v) Click ‘Run’.
vi) Users will be directed to the ‘Console’ tab.

vii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
viii) The result view will be of two types:
   a. ‘Rules’ will be displayed as the first column in the result data (when the selected ‘Output Mode’ option is ‘Rules’).
   b. ‘Transaction_Id’ will be displayed as the second column in the result data (when the selected ‘Output Mode’ option is ‘Transaction’).

      i. The matching rules for the selected items will be displayed through the ‘Matching_Rules’ column.
ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the word tag chart.
   a. Result view for the ‘Rules’ output mode.
   b. Result view for the ‘Transaction’ output mode.
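Association-rule mining of this kind is the classic Apriori algorithm. An illustrative Python sketch with the mlxtend library; the transactions are hypothetical, and ‘Support’/‘Confidence’ match the Behavior fields above:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions (rows = transactions, columns = items).
baskets = pd.DataFrame(
    {"bread": [1, 1, 0, 1], "butter": [1, 1, 0, 0], "jam": [0, 1, 1, 1]},
    dtype=bool,
)
frequent = apriori(baskets, min_support=0.1, use_colnames=True)               # Support = 0.1
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)  # Confidence = 0.8
print(rules[["antecedents", "consequents", "support", "confidence"]])
```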

8.5. Outliers

This algorithm is used to discover patterns in a data set that do not follow the expected behavior. It lists the outlying values based on the statistical distribution between the first and third quartiles. Interquartile Range is provided as the sub-algorithm type.

8.5.1. Interquartile Range

i) Drag the Interquartile Range component to the workspace and connect it with a configured data source.

ii) Configure the following fields in the ‘Properties’ tab:
   a. Output Information
      i. Output Mode: Select a mode of display for the output data.
         1. Show Outlier: Selecting this option will add a Boolean column to the input data identifying whether the resultant value is an outlier.
         2. Remove Outlier: Selecting this option will remove outlying values from the input data.
   b. Column Selection
      i. Feature: Select an input column that can be used to perform the analysis.
   c. Behavior
      i. Fence Coefficient: Enter the permissible deviation limit for values from the interquartile range (The default value for this field is 1.5).
   d. New Column Information
      i. New Column Name: Enter a name for the new column containing the predicted values (This field appears only when ‘Show Outlier’ is selected as the Output Mode).

iii) Click the ‘Advanced’ tab and configure if required:
   a. Input Data Handling
      i. Missing Values: Select a method to deal with missing values from the drop-down menu.
         1. Ignore: Selecting this option will skip the records containing missing values in the columns.
         2. Stop: Selecting this option will stop application of the algorithm if a value is missing in any column.

iv) Click ‘Apply’.
v) Click ‘Run’.
vi) Users will be redirected to the ‘Console’ tab.

vii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
viii) The ‘OutliersDetected’ column will be displayed in the result data (if the ‘Show Outlier’ option has been selected).

ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the box plot chart.

OR the outliers column will not be displayed in the result data (if the ‘Remove Outlier’ option has been selected). Click the ‘Visualization’ tab to see the result data via the box plot chart.
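The fence logic is the standard Tukey interquartile-range rule. An illustrative pandas sketch with made-up sample values:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 95, 12, 11])  # feature column with one outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
fence = 1.5                                   # Fence Coefficient (default 1.5)
is_outlier = (s < q1 - fence * iqr) | (s > q3 + fence * iqr)
print(is_outlier)       # 'Show Outlier' mode: Boolean column flagging outliers
print(s[~is_outlier])   # 'Remove Outlier' mode: outlying values dropped
```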

8.6. Classification

This algorithm categorizes a new observation on the basis of a trained set of data that contains observations from the known category. It compares each new observation to previous observations using means of similarity or distance.

There are two sub-types provided under ‘Classification’:

8.6.1. R-CNR Tree

The R-CNR Tree can be configured using two algorithm types from the ‘Properties’ tab. Check the below-given description for the configuration details:

§ Classification as Algorithm Type
i) Drag the R-CNR Tree component to the workspace and connect it with a configured data source.
ii) Configure the following fields in the ‘Properties’ tab:

   a. Output Information
      i. Algorithm Type: Select an algorithm type from the drop-down menu.
         1. Classification: Select this option if users want to pass the dependent column as categorical values.
         2. Regression: Select this option if users want to pass the dependent column as numerical values.
      ii. Show Probability: Select an option from the drop-down menu to create a new column indicating the chance factor involved in the probability.
         1. True: Selecting this option will display a new column in the output data with the probability values.
         2. False: Selecting this option will not display any probability value in the output data.
   b. Column Selection
      i. Features: Select input columns from the drop-down list to which the target column can be compared to perform the analysis.
      ii. Target Variable: Select the target column for which the analysis is performed.
   c. New Column Information
      i. Predicted Column Name: Enter a name for the new column containing the predicted values.
      ii. Probability Column Name: Enter a name for the new column containing the probability values.
   d. Validation: Enable validation by a check mark in the given box.

Note: The ‘Show Probability’ field will appear only if the ‘Classification’ option is selected via the ‘Algorithm Type’ drop-down menu.
iii) Click the ‘Advanced’ tab and configure if required:
• Advanced Tab when ‘Validation’ is disabled:
   a. Input Data Handling
      i. Missing Values: Select a method to deal with missing values from the drop-down list.
         1. Rpart: Selecting this option will try to estimate the missing values for the dependent column based on the independent columns.
         2. Ignore: Selecting this option will skip the records containing missing values in the columns.
         3. Keep: Selecting this option will retain the records containing missing values while performing the calculation.
         4. Stop: Selecting this option will stop application of the algorithm if a value is missing in any column.
   b. Tree Pruning
      i. Minimum Split: It indicates the minimum number of observations within a single node for a split to be attempted (The default value for this field is 10).

      ii. Complexity Parameter: This parameter is primarily used to save computing time by pruning off splits that are not worthwhile. Any split which does not improve the fit by a factor of the complexity parameter is pruned off while performing cross-validation, and hence the program will not pursue it (The default value for this field is 0.05).
      iii. Maximum Depth: It sets the maximum depth of any node of the final tree, keeping the depth count for the root node at 0. It is an optional field (It is recommended to set the Maximum Depth value to less than 30 rpart for 32-bit machines).
   c. Behavior
      i. Split Criteria: It is an optional field that depends on the algorithm type selected from the ‘Properties’ (This field appears only when the selected algorithm type is ‘Classification’). The splitting index can be:
         1. Gini: Select this option to measure inequality among the values of randomly chosen elements from a set.
         2. Information: Select this option to get information about the variables used in the algorithm.
      ii. Cross Validation: It indicates the number of cross-validations performed to check the accuracy of the analysis method.
      iii. Prior Probability: It is an optional field. This field is dependent on the prior data values mentioned in the selected dataset (This field appears only when the selected algorithm type is ‘Classification’).
   d. Surrogate Information
      i. Use Surrogate: Select one option from the drop-down menu.
         1. Display Only: Selecting this option will only display the observation, but not split it further.
         2. Use Surrogate: Selecting this option will search a surrogate value for the missing values in order to split the observation. Two fields will be displayed:
            a. Surrogate Style: Select a style using the drop-down menu.
            b. Maximum Surrogate: Set the maximum surrogate value.

         3. Stop if missing: Selecting this option will choose an action based on the nature of the majority of observations. If values are missing for all the observations, it will stop splitting further.
• Advanced Tab when ‘Validation’ is enabled:
   a. Tree Pruning
      i. Complexity Parameter: This parameter is primarily used to save computing time by pruning off splits that are not worthwhile. Any split which does not improve the fit by a factor of the complexity parameter is pruned off while performing cross-validation, and hence the program will not pursue it (The default value for this field is 0.05).

iv) Click the ‘Validation’ tab and configure the required fields.
   a. Model Selection Method: Select a method using the drop-down menu. Users need to configure the other fields based on the selected model selection method.
      i. Cross Validation: Users need to configure the ‘Number of folds’ if the selected model selection method is ‘Cross Validation’.
      ii. Bootstrap: Users need to configure the ‘Number of resamples’ (The default value for this field is 5) if the selected model selection method is ‘Bootstrap’.
      iii. Repeated Cross Validation: Users need to configure the ‘Number of repeats’ and ‘Number of folds’ if the selected method is ‘Repeated Cross Validation’.

      iv. Leave One Out Cross Validation: Users will not get any other field to configure if the selected model selection method is ‘Leave One Out Cross Validation’.
v) Click ‘Apply’.
vi) Click ‘Run’.
vii) Users will be redirected to the ‘Console’ tab.
viii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
      i. Result view when ‘Validation’ is disabled.

      ii. Result view when ‘Validation’ is enabled.
Note: The Probability column will be displayed in the Array format when Validation is enabled.
ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the tree chart.

§ Regression as Algorithm Type
i) Drag the R-CNR Tree component to the workspace and connect it with a configured data source.
ii) Configure the following fields in the ‘Properties’ tab:
   a. Output Information
      i. Algorithm Type: Select an algorithm type from the drop-down menu.
         1. Classification: Select this option if users want to pass the dependent column as categorical values.
         2. Regression: Select this option if users want to pass the dependent column as numerical values.
   b. Column Selection
      i. Features: Select input columns from the drop-down list to which the target column can be compared to perform the analysis.
      ii. Target Variable: Select the target column for which the analysis is performed.

   c. New Column Information
      i. Predicted Column Name: Enter a name for the new column containing the predicted values.
      ii. Probability Column Name: Enter a name for the new column containing the probability values.
   d. Enable Validation: Enable validation by a check mark in the given box.

iii) Click the ‘Advanced’ tab and configure if required:
• Advanced Tab when ‘Validation’ is disabled:
   a. Input Data Handling
      i. Missing Values: Select a method to deal with missing values from the drop-down list.
         1. Rpart: Selecting this option will try to estimate the missing values for the dependent column based on the independent columns.
         2. Ignore: Selecting this option will skip the records containing missing values in the columns.
         3. Keep: Selecting this option will retain the records containing missing values while performing the calculation.
         4. Stop: Selecting this option will stop application of the algorithm if a value is missing in any column.
   b. Tree Pruning
      i. Minimum Split: It indicates the minimum number of observations within a single node for a split to be attempted (The default value for this field is 10).
      ii. Complexity Parameter: This parameter is primarily used to save computing time by pruning off splits that are not worthwhile. Any split which does not improve the fit by a factor of the complexity parameter is pruned off while performing cross-validation, and hence the program will not pursue it (The default value for this field is 0.05).
      iii. Maximum Depth: It sets the maximum depth of any node of the final tree, keeping the depth count for the root node at 0. It is an optional field (It is recommended to set the Maximum Depth value to less than 30 rpart for 32-bit machines).
   c. Behavior
      i. Split Criteria: It is an optional field that depends on the algorithm type selected from the ‘Properties’ (This field appears only when the selected algorithm type is ‘Classification’). The splitting index can be:
         1. Gini: Select this option to measure inequality among the values of randomly chosen elements from a set.
         2. Information: Select this option to get information about the variables used in the algorithm.
      ii. Cross Validation: It indicates the number of cross-validations performed to check the accuracy of the analysis method.
      iii. Prior Probability: It is an optional field. This field is dependent on the prior data values mentioned in the selected dataset (This field appears only when the selected algorithm type is ‘Classification’).
   d. Surrogate Information
      i. Use Surrogate: Select one option from the drop-down menu.
         1. Display Only: Selecting this option will only display the observation, but not split it further.
         2. Use Surrogate: Selecting this option will search a surrogate value for the missing values in order to split the observation. Two fields will be displayed:

            a. Surrogate Style: Select a style using the drop-down menu.
            b. Maximum Surrogate: Set the maximum surrogate value.
         3. Stop if missing: Selecting this option will choose an action based on the nature of the majority of observations. If values are missing for all the observations, it will stop splitting further.
• Advanced Tab when ‘Validation’ is enabled:
   a. Tree Pruning
      i. Complexity Parameter: This parameter is primarily used to save computing time by pruning off splits that are not worthwhile. Any split which does not improve the fit by a factor of the complexity parameter is pruned off while performing cross-validation, and hence the program will not pursue it (The default value for this field is 0.05).

iv) Click the ‘Validation’ tab and configure the required fields.
   a. Model Selection Method: Select a method using the drop-down menu. Users need to configure the other fields based on the selected model selection method.
      i. Cross Validation: Users need to configure the ‘Number of folds’ if the selected model selection method is ‘Cross Validation’.
      ii. Bootstrap: Users need to configure the ‘Number of resamples’ (The default value for this field is 5) if the selected model selection method is ‘Bootstrap’.
      iii. Repeated Cross Validation: Users need to configure the ‘Number of repeats’ and ‘Number of folds’ if the selected method is ‘Repeated Cross Validation’.

      iv. Leave One Out Cross Validation: Users will not get any other field to configure if the selected model selection method is ‘Leave One Out Cross Validation’.
v) Click ‘Apply’.
vi) Click ‘Run’.
vii) Users will be redirected to the ‘Console’ tab.
viii) Follow the below-given steps to display the result view:
   a. Click the dragged algorithm component on the workspace.
   b. Click the ‘Result’ tab.
      i. Result view when ‘Validation’ is disabled.

      ii. Result view when ‘Validation’ is enabled.
Note: The Probability column will be displayed in the Array format when Validation is enabled.
ix) Click the ‘Visualization’ tab.
x) The result data will be displayed via the tree chart.

8.6.2. R-Naive Bayes
i) Drag the R-Naive Bayes component to the workspace and connect it with a configured data source.

ii) Configure the following fields in the 'Properties' tab:
 a. Column Selection
  i. Feature: Select input columns from the drop-down menu to which the target variable can be compared for performing the analysis.
  ii. Target Variable: Select the target column for which the analysis is performed.
 b. New Column Information
  i. Predicted Column Name: Enter a name for the new column containing the predicted values.
 c. Validation: Enable validation by putting a check mark in the given box.

iii) Click the 'Validation' tab and configure it.
 a. Model Selection
  i. Model Selection Method: Select a modeling method using the drop-down menu:
   1. Cross Validation
   2. Bootstrap
   3. Repeated Cross Validation
   4. Leave One Out Cross Validation
  ii. Number of folds: Enter a numerical value for the number of folds.

iv) Click the 'Advanced' tab and configure it if required.
• Advanced Tab when 'Validation' is disabled:
 a. Input Data Handling

  i. Missing Values: Select a method to deal with missing values from the drop-down menu.
   1. Ignore: Selecting this option will skip the records containing missing values in the columns.
   2. Keep: Selecting this option will retain the records containing missing values while performing the calculation.
  ii. Laplace Smoothing: Enter the smoothing constant for smoothing observations. The smoothing constant must be a double value greater than 0. Entering 0 will disable Laplace smoothing.
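Note: The 'Missing Values' and 'Laplace Smoothing' fields behave like the na.action and laplace arguments of the naiveBayes() function in R's e1071 package; the package mapping is an assumption, and 'df' and 'target' are hypothetical placeholders.

  library(e1071)
  model <- naiveBayes(
    target ~ .,           # Target Variable ~ Feature columns
    data      = df,
    laplace   = 1,        # Laplace Smoothing: a double > 0 enables it, 0 disables it
    na.action = na.omit   # 'Ignore' = na.omit; 'Keep' = na.pass
  )
  predicted <- predict(model, df)   # values for the new predicted column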

• Advanced Tab when 'Validation' is enabled:
 a. Input Data Handling
  i. Laplace Smoothing: Enter the smoothing constant for smoothing observations. The smoothing constant must be a double value greater than 0. Entering 0 will disable Laplace smoothing.
  ii. Kernel: Select an option using the drop-down menu.
   1. True
   2. False
  iii. Band Width: Enter a bandwidth value (the default value for this field is 0.1).
v) Click 'Apply'.
vi) Click 'Run'.
vii) Users will be redirected to the 'Console' tab.

viii) Follow the below-given steps to display the result view:
 a. Click the dragged algorithm component on the workspace.
 b. Click the 'Result' tab.

Note:
 a. The 'Visualization' tab does not display any graphical representation for the R-Naive Bayes result data.
 b. The 'Validation' tab provides multiple options under the 'Model Selection Method' drop-down menu. All the model selection methods are described below:
  i. Cross Validation: Users need to configure the 'Number of folds' if the selected method is 'Cross Validation'.

  ii. Bootstrap: Users need to configure the 'Number of resamples' (the default value for this field is 5) if the selected method is 'Bootstrap'.

  iii. Repeated Cross Validation: Users need to configure the 'Number of repeats' and 'Number of folds' if the selected method is 'Repeated Cross Validation'.

  iv. Leave One Out Cross Validation: Users will not get any other field to configure if the selected method is 'Leave One Out Cross Validation'.

8.6.3. Spark Naive Bayes
Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features. It can be trained very efficiently. Users can set a threshold for each class; the algorithm will then classify values as per the set thresholds. Spark Naive Bayes consists of two types of model selection methods:
1. Multinomial – if the data set is numerical
2. Bernoulli – if the data set contains 0 and 1

i) Drag the Spark Naive Bayes component to the workspace and connect it with a configured data source.
ii) Connect and configure the Spark Apply Model component to the combination of a data source and the Spark Naive Bayes component (to display the results).

iii) Configure the following fields in the 'Properties' tab:
 a. Feature: Select column(s) from the drop-down menu.
 b. Label: Select column(s) from the drop-down menu.
 c. Enable Validation: Put a check mark in the box to enable validation (it is an optional field).
iv) Click 'Next' (by enabling 'Validation', the 'Apply' option changes into 'Next').
v) Users will be redirected to the 'Validation' tab. There are two types of validation methods:
 a. Train Validation – Train validation begins by splitting the data set into two parts, a training and a testing data set, as per the training ratio. It also iterates through paramMaps; for each combination of parameters, the algorithm iterates over it and selects the best model based on the evaluation metric.
 b. Cross Validation – Cross validation begins by splitting the data set into a set of folds which are used as separate training and test data sets. E.g., with k=3 folds, the Cross Validator will generate 3 (training, test) data set pairs, each of which uses 2/3 of the data for training and 1/3 for testing. It also iterates through paramMaps; the algorithm iterates over each combination of parameters and folds to determine the best model using the average of the k folds.
vi) Configure the following 'Validation' information:
 a. Model Selection Method: Select any one validation method using the drop-down menu:
  i. Train Validation
  ii. Cross Validation
 b. Evaluator: Select any one option using the drop-down menu to define the evaluator. The evaluator is of two types:
  i. Multi Class Classification – if the data set has multiple classes in the label column
  ii. Binary Class Classification – if the data set has two classes in the label column
 c. Train Ratio: This field will be displayed if 'Train Validation' has been selected using the 'Model Selection Method' field.

OR, if 'Cross Validation' is selected, users will be provided with a 'Number of folds' field that determines how the input data is split into training data for the cross validation. (Spark Naive Bayes supports only string data when cross validation is selected.)

vii) Configure the following 'Advanced' information:
 a. Model Type: Select an option from the drop-down list. Spark Naive Bayes consists of two model types:
  i. Multinomial – if the data set is numerical
  ii. Bernoulli – if the data set contains 0 and 1
 b. Thresholds: Enter multiple values separated by commas. The number of values entered as thresholds must equal the number of classes in the label column, and the sum of the values must be equal to 1. Enter at least two comma-separated values in this field.
 c. Parameter Grid: Enter a valid double value between 0 and 1 (1 included). Users can enter a single value or comma-separated valid double values.
viii) Click 'Apply'.

Note: If validation is enabled, users can enter multiple comma-separated values in the Parameter Grid in the Advanced tab, and they will be taken as paramMaps.

ix) Click 'Run'.
x) A message will pop up to confirm whether users want to enable logging.
xi) Click 'No'.

xii) Users will be directed to the ‘Console’ tab.

xiii) Follow the below-given steps to display the result view:
 a. Click the dragged algorithm component on the workspace.
 b. Click the 'Result' tab.

Note: Users can click the 'Summary' tab to view the model summary after connecting to a Spark Apply Model component. The summary will be displayed only if the 'Apply Model' component contains a summary to show.

8.6.4. Spark Decision Tree
Decision trees and their ensembles are popular methods for machine learning tasks such as classification and regression. Decision trees are widely used because they are easy to interpret and do not require feature scaling. They can handle categorical features and extend to the multiclass classification setting. The decision tree is a greedy algorithm that performs a recursive binary partitioning of the feature space and captures non-linearities and feature interactions. The tree predicts the same label for each bottommost (leaf) partition. Each partition is chosen greedily by selecting the best split from a set of possible splits, to maximize the information gain at a tree node.

BizViz Predictive Analysis provides Spark Decision Tree under the Classification algorithm in the tree-node menu.
i) Drag the Spark Decision Tree component to the workspace and connect it to a configured data source to create a basic workflow.

ii) Connect the Spark Decision Tree basic workflow with a configured 'Spark Apply Model' component to get the result view.

iii) Configure the following fields in the 'Properties' tab (for the algorithm component):
 a. Column Selection
  i. Feature: Select column(s) from the drop-down menu.
  ii. Label: Select column(s) from the drop-down menu.
  iii. Algorithm Type: Select an algorithm type from the drop-down menu.
   1. Classification: Select this option to pass the dependent column as categorical values (default option).
   2. Regression: Select this option to pass the dependent column as numerical values.
  iv. Seeds: Enter a numerical value to randomize the data.
  v. Enable Validation: Put a check mark in the box to enable validation (it is an optional field).
iv) Click 'Next' (the 'Apply' option turns into 'Next' if 'Validation' has been enabled).

Based on the selected 'Algorithm Type', the 'Advanced' tab fields will change. Please check the following details:

§ Classification as Algorithm Type
i) Users need to configure the following information (if 'Validation' is enabled):
 a. 'Validation' tab (if validation is enabled)
  i. Model Selection Method: Select any one validation method using the drop-down menu:
   1. Train Validation: By selecting this method, the 'Train Ratio' field will be displayed to configure.
   2. Cross Validation: By selecting this method, the 'Number of folds' field will be displayed to configure.
  ii. Evaluator: Select any one option using the drop-down menu to define the evaluator. The evaluator is of three types:
   1. Multi Class Classification – if the data set has multiple classes in the label column
   2. Binary Class Classification – if the data set has two classes in the label column

   3. Regression Class Classification – if the 'Label' column is continuous.
  iii. Train Ratio: This field will be displayed if 'Train Validation' has been selected using the 'Model Selection Method' field.
  iv. Click 'Next' (the 'Apply' option turns into 'Next' when 'Validation' is enabled).

 b. Configure the required 'Advanced' information:
  i. Column Selection
   1. Maximum Depth: Maximum depth of the tree (>= 0). E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (Type: integer only. Default value: 5.)
   2. Maximum Bins: Maximum number of bins for discretizing continuous features. The value must be >= 2 and >= the number of categories for any categorical feature. (Type: integer only. Default value: 32.)
   3. Minimum Instances Per Node: Minimum number of instances each child must have after the split. If a split causes the left or right child to have fewer than the minimum instances per node, the split will be discarded as invalid. The value should be >= 1. (Type: integer only. Default value: 1.)
   4. Minimum Info Gain: Enter the minimum information gain for a split to be considered at a tree node. (Type: double only. Default value: 0.0.)
   5. Thresholds: Thresholds in multiclass classification to adjust the probability of predicting each class. The array must have a length equal to the number of classes, with values >= 0. The class with the largest value p/t is predicted, where 'p' is the original probability of that class and 't' is the class's threshold. (Type: comma-separated double values. Thresholds will be displayed only in case of the Classification algorithm type.)
   6. Impurity: Select an option from the drop-down menu. The 'Impurity' field is a measure of the homogeneity of the labels at the node. The current implementation of the algorithm provides two impurity measures for classification:
    a. Gini
    b. Entropy
  ii. Click 'Apply'.
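Note: The p/t threshold rule can be illustrated with a short sketch in plain R (the numbers are hypothetical): for class probabilities p and per-class thresholds t, the predicted class is the one that maximizes p/t.

  p <- c(0.50, 0.30, 0.20)   # class probabilities produced by the model
  t <- c(0.60, 0.30, 0.10)   # thresholds entered in the 'Thresholds' field
  which.max(p / t)           # p/t = 0.83, 1.00, 2.00, so class 3 is predicted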

ii) Click 'Run'.
iii) A message will pop up to confirm whether users want to enable logging.
iv) Click 'No'.

v) Users will be directed to the 'Console' tab.
vi) Follow the below-given steps to display the result view:
 a. Click the dragged algorithm component on the workspace.
 b. Click the 'Result' tab.

§ Regression as Algorithm Type
i) If the selected algorithm type is 'Regression':

ii) Users need to configure the following information:
 a. 'Validation' tab (if validation is enabled)
  i. Model Selection Method: Select any one validation method using the drop-down menu:
   1. Train Validation: By selecting this method, the 'Train Ratio' field will be displayed to configure.
   2. Cross Validation: By selecting this method, the 'Number of folds' field will be displayed to configure.
  ii. Evaluator: Select any one option using the drop-down menu to define the evaluator. The evaluator is of three types:
   1. Multi Class Classification – if the data set has multiple classes in the label column
   2. Binary Class Classification – if the data set has two classes in the label column
   3. Regression Class Classification – if the 'Label' column is continuous.
  iii. Train Ratio: This field will be displayed if 'Train Validation' has been selected using the 'Model Selection Method' field.
  iv. Click 'Next'.

 b. Configure the required 'Advanced' information:
  i. Column Selection
   1. Maximum Depth: Maximum depth of the tree (>= 0). E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (Type: integer only. Default value: 5.)
   2. Maximum Bins: Maximum number of bins for discretizing continuous features. The value must be >= 2 and >= the number of categories for any categorical feature. (Type: integer only. Default value: 32.)
   3. Minimum Instances Per Node: Minimum number of instances each child must have after the split. If a split causes the left or right child to have fewer than the minimum instances per node, the split will be discarded as invalid. The value should be >= 1. (Type: integer only. Default value: 1.)
   4. Minimum Info Gain: Enter the minimum information gain for a split to be considered at a tree node. (Type: double only. Default value: 0.0.)
  ii. Click 'Apply'.

iii) Click 'Run'.
iv) A message will pop up to confirm whether users want to enable logging.
v) Click 'No'.

vi) Users will be directed to the ‘Console’ tab.

vii) Follow the below-given steps to display the result view:
 a. Click the dragged algorithm component on the workspace.
 b. Click the 'Result' tab.

Note: Users can click the 'Summary' tab to view the model summary after connecting to a Spark Apply Model component. The summary will be displayed only if the 'Apply Model' component contains a summary to show.

8.6.5. Spark Random Forest
Random forest is a top-performing tree-ensemble algorithm for classification and regression tasks. The algorithm builds multiple decision trees based on different subsets of the features in the data. Outcomes are then predicted by running observations through all the trees and averaging the individual predictions.

i) Drag the Spark Random Forest component to the workspace and connect it to a configured data source.

ii) Connect the Spark Random Forest basic workflow with a configured 'Spark Apply Model' component to get the result view.

iii) Configure the following fields in the 'Properties' tab:
 a. Column Selection
  i. Feature: Select feature columns from the drop-down menu.
  ii. Label: Select a binary column as the label from the drop-down menu.
  iii. Algorithm Type: Select an algorithm type from the drop-down menu.
   1. Classification: Select this option to pass the dependent column as categorical values (default option).
   2. Regression: Select this option to pass the dependent column as numerical values.
  iv. Seeds: Enter a numerical value to randomize the data (integer value only).
  v. Enable Validation: Enable validation by check-marking the box.
iv) Click 'Next'.

Based on the selected ‘Algorithm Type’ the configuration fields are described below:

§ Classification as Algorithm Type
Configure the following information:
i) Validation tab (if 'Validation' is enabled):
 a. Model Selection Method: Select any one validation method using the drop-down menu:
  i. Train Validation: By selecting this method, the 'Train Ratio' field will be displayed to configure.
  ii. Cross Validation: By selecting this method, the 'Number of folds' field will be displayed to configure.

 b. Evaluator: Select any one option using the drop-down menu to define the evaluator. The evaluator is of three types:
  i. Multi Class Classification – if the data set has multiple classes in the label column
  ii. Binary Class Classification – if the data set has two classes in the label column
  iii. Regression Class Classification – if the 'Label' column is continuous.
 c. Train Ratio: This field will be displayed if 'Train Validation' has been selected using the 'Model Selection Method' field.
ii) Click 'Next'.
iii) Configure the required 'Advanced' information:
 a. The 'Advanced' tab when 'Validation' is enabled:
  i. Column Selection
   1. Feature Subset Strategy: Select an option from the drop-down menu. The number of features to consider for splits at each tree node (supported options: auto, all, n, one-third, sqrt, log2).
   2. Maximum Depth: Maximum depth of the tree (>= 0). E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (Type: integer only. Default value: 5.)
   3. Maximum Bins: Maximum number of bins for discretizing continuous features. The value must be >= 2 and >= the number of categories for any categorical feature. (Type: integer only. Default value: 32.)
   4. Minimum Instances Per Node: Minimum number of instances each child must have after the split. If a split causes the left or right child to have fewer than the minimum instances per node, the split will be discarded as invalid. The value should be >= 1. (Type: integer only. Default value: 1.)
   5. Minimum Info Gain: Enter the minimum information gain for a split to be considered at a tree node. (Type: double only. Default value: 0.0.)
   6. Number of Trees: Enter the number of trees to train (>= 1).

   7. Thresholds: Thresholds in multiclass classification to adjust the probability of predicting each class. The array must have a length equal to the number of classes, with values >= 0. The class with the largest value p/t is predicted, where 'p' is the original probability of that class and 't' is the class's threshold. (Type: comma-separated double values. Thresholds will be displayed only in case of the Classification algorithm type.)

   8. Impurity: Select an option from the drop-down menu. The 'Impurity' field is a measure of the homogeneity of the labels at the node. The current implementation of the algorithm provides two impurity measures for classification:
    a. Gini
    b. Entropy
   9. Sub Sampling Rate: Set the subsampling rate (the default value is 1).
iv) Click 'Apply'.


v) Click 'Run'.
vi) A message will pop up to confirm whether users want to enable logging.
vii) Click 'No'.

viii) Users will be directed to the ‘Console’ tab.

ix) Follow the below-given steps to display the result view:
 a. Click the dragged algorithm component on the workspace.
 b. Click the 'Result' tab.

§ Regression as Algorithm Type
i) Configure the following 'Validation' information:
 a. Model Selection Method: Select any one validation method using the drop-down menu:
  i. Train Validation
  ii. Cross Validation
 b. Evaluator: Select any one option using the drop-down menu to define the evaluator. The evaluator is of three types:
  i. Multi Class Classification – if the data set has multiple classes in the label column
  ii. Binary Class Classification – if the data set has two classes in the label column

  iii. Regression Class Classification – if the 'Label' column is continuous.
 c. Train Ratio: This field will be displayed if 'Train Validation' has been selected using the 'Model Selection Method' field.
ii) Click 'Next'.

iii) Configure the required 'Advanced' information:
 a. Column Selection
  i. Feature Subset Strategy: Select an option from the drop-down menu. The number of features to consider for splits at each tree node (supported options: auto, all, n, one-third, sqrt, log2).
  ii. Maximum Depth: Maximum depth of the tree (>= 0). E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (Type: integer only. Default value: 5.)
  iii. Maximum Bins: Maximum number of bins for discretizing continuous features. The value must be >= 2 and >= the number of categories for any categorical feature. (Type: integer only. Default value: 32.)
  iv. Minimum Instances Per Node: Minimum number of instances each child must have after the split. If a split causes the left or right child to have fewer than the minimum instances per node, the split will be discarded as invalid. The value should be >= 1. (Type: integer only. Default value: 1.)


  v. Minimum Info Gain: Enter the minimum information gain for a split to be considered at a tree node. (Type: double only. Default value: 0.0.)
  vi. Number of Trees: Enter the number of trees to train (>= 1).
  vii. Thresholds: Thresholds in multiclass classification to adjust the probability of predicting each class. The array must have a length equal to the number of classes, with values >= 0. The class with the largest value p/t is predicted, where 'p' is the original probability of that class and 't' is the class's threshold. (Type: comma-separated double values. Thresholds will be displayed only in case of the Classification algorithm type.)
  viii. Impurity: Select an option from the drop-down menu. The 'Impurity' field is a measure of the homogeneity of the labels at the node. The current implementation of the algorithm provides two impurity measures for classification:
   1. Gini
   2. Entropy
  ix. Sub Sampling Rate: Set the subsampling rate (the default value is 1).
v) Click 'Apply'.

vi) Click 'Run'.
vii) A message will pop up to confirm whether users want to enable logging.
viii) Click 'No'.

ix) Users will be directed to the ‘Console’ tab.

x) Follow the below-given steps to display the result view:
 a. Click the dragged algorithm component on the workspace.
 b. Click the 'Result' tab.

Note: Users can click the 'Summary' tab to view the model summary after connecting to a Spark Apply Model component. The summary will be displayed only if the 'Apply Model' component contains a summary to show.

8.7. Correlation

The Correlation algorithm provides a method for measuring the statistical relationship between selected columns, indicating how strongly and in which direction they vary together.

8.7.1. R-Correlation
i) Drag the R-Correlation component to the workspace and connect it to a configured data source.
ii) Configure the following fields in the 'Properties' tab:
 a. Input Columns: Select any two columns using the drop-down menu.
 b. Method: Select a method using the drop-down menu. The available methods are:

  i. Pearson
  ii. Kendall
  iii. Spearman
 c. Missing Value Method: Select the required option using the drop-down menu. The available methods for handling missing values are:
  i. Everything
  ii. All.obs
  iii. Complete.obs
  iv. Na.or.complete
  v. Pairwise.complete.obs
iii) Click 'Apply'.
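Note: These properties map directly onto the method and use arguments of base R's cor() function; the mapping is an assumption based on the option names. The sketch below uses R's built-in faithful data set, which supplies the 'eruptions' and 'waiting' columns mentioned in the next step.

  cor(faithful$eruptions, faithful$waiting,
      method = "pearson",                # or "kendall", "spearman"
      use    = "pairwise.complete.obs")  # or "everything", "all.obs",
                                         #    "complete.obs", "na.or.complete"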

iv) Click ‘Run’. v) Users will be redirected to the ‘Console’ tab.

vi) Follow the below-given steps to display the result view:
 a. Click the dragged algorithm component on the workspace.
 b. Click the 'Result' tab.
vii) Columns displaying the 'Eruption' and 'Waiting' probable values will be added in the result data.

viii) Click the ‘Visualization’ tab. ix) The probable values of the selected columns will be displayed via the correlogram chart.

8.8. Recommendation Engine

The Recommendation Engine algorithm helps build a prediction model. It treats the known user-item associations as training data, which is then used to make predictions for the unknown entries in the test data.

8.8.1. Spark ALS
Spark ALS (Alternating Least Squares) can be used for basic recommendation. This feature uses collaborative filtering techniques, filling in the missing entries of a user-item association matrix. Spark currently supports model-based collaborative filtering, in which users and

products are described by a small set of latent factors that can be used to predict missing entries. Users can use this component in a Spark pipeline to predict what people might like and to uncover relationships between items to aid the discovery process.

i) Drag the Spark ALS component to the workspace and connect it to a configured data source and the other required pipeline components as shown below:

ii) Configure the following fields in the 'Properties' tab:
 a. Column Selection
  i. User: Select a user column from the drop-down menu.
  ii. Item: Select an item column from the drop-down menu.
  iii. Rating: Select a rating column from the drop-down menu.
iii) Click 'Apply' (if you do not need to configure the 'Advanced' tab; otherwise, configure the 'Advanced' tab).

iv) Configure the required 'Advanced' information:
 a. Input Data Handling
  i. Number of Item Blocks: Items will be partitioned as per the entered number of item blocks to parallelize computation (the default value is 10).
  ii. Number of User Blocks: Users will be partitioned as per the entered number of user blocks to parallelize computation (the default value is 10).
  iii. Rank: This refers to the number of factors in the ALS model, that is, the number of hidden features in the low-rank approximation matrices. Generally, the greater the number of factors, the better, but this has a direct impact on memory usage, both for computation and for storing models for serving, particularly for a large number of users or items. Hence, this is often a trade-off in real-world use cases. A rank in the range of 10 to 200 is usually reasonable (the default value is 10).
  iv. Max Iteration: This refers to the number of iterations to run. Each iteration in ALS is guaranteed to decrease the reconstruction error of the rating matrix. ALS models converge to a reasonably good solution after relatively few iterations, so users do not need to run many iterations in most cases (the default value is 10).
  v. Reg. Param: This parameter controls regularization and overfitting of the ALS model. The regularization value depends on the size, nature, and sparsity of the underlying data. The 'Reg. Param' should be tuned using sample test data and a cross-validation approach.
  vi. Alpha: Alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (the default value is 1.0).
  vii. Seed: Enter a seed to replicate the randomization of the data.

  viii. Implicit: ImplicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (the default value is 'false', which means the explicit feedback variant is used).
  ix. Non-Negative: Select 'Non-Negative' to use non-negative constraints for least squares (the default value is 'false').
v) Click 'Apply'.

vi) Configure all the required components to create a workflow and click 'Run'.
vii) A message will pop up to confirm whether users want to enable logging.
viii) Click 'No'.

ix) Users will be directed to the ‘Console’ tab.

x) Follow the below-given steps to display the result view:
 a. Click the dragged algorithm component on the workspace.
 b. Click the 'Result' tab.
xi) A new column entitled 'prediction5' will be added in the 'Result' view.

Note:
 a. Users need to connect the ALS component with a Spark Apply Model component to get the result view.
 b. Users can click the 'Summary' tab to view the model summary after connecting to a Spark Apply Model component. The summary will be displayed only if the 'Apply Model' component contains a summary to show.

9. Apply Model

9.1. Spark Apply Model

This component is provided to generate predictions based on a Spark-trained classification model. Users can view the predicted column values and the probability of each label class by using the classification model.

Users can create a model in the following ways:
• Generate a model using an algorithm
• Generate a model using the saved models
The Spark Apply Model component consists of 2 input nodes and 1 output node.
• Input Nodes
 o Upper node – Model/Training data
 o Lower node – Testing data
• Output Node
 o Node – Result data
i) Click the 'Apply Model' tree-node.
ii) The 'Spark Apply Model' leaf-node will be displayed.

iii) Drag the Spark Apply Model component onto the workspace and connect it with a valid combination of data source and algorithm (configure the data source and algorithm components).
iv) Click the 'Spark Apply Model' component.

v) Basic component details will be displayed. vi) Click ‘Apply’.

vii) Click 'Run'.
viii) A message will pop up to confirm whether users want to enable logging.
ix) Click 'No'.

x) Users will be redirected to the 'Console' tab.

xi) Follow the below-given steps to display the result view:
 a. Click the dragged Spark Apply Model component on the workspace.
 b. Click the 'Result' tab.

xii) Click the 'Properties' tab to view the properties details (this Properties tab displays workflow properties).

Note:
 a. The result data set of the model can be written to a database using the Cassandra Writer.
 b. The column headers and data types of the feature columns for both the saved model and the testing data should match. If they do not match, an alert message will be displayed.
 c. It is not mandatory for the testing data set to contain a label column.

9.2. R Apply Model

This component is provided to generate predictions based on an R-trained classification model. Users can view the predicted column values and the probability of each label class by using the classification model. Users can create a model in the following ways:
• Generate a model using an algorithm
• Generate a model using the saved models

The R Apply Model component consists of 2 input nodes and 1 output node.
• Input Nodes
 o Upper node – Model/Training data
 o Lower node – Testing data
• Output Node
 o Node – Result data
i) Click the 'Apply Model' tree-node.
ii) The 'R Apply Model' leaf-node will be displayed.

iii) Drag the R Apply Model component onto the workspace and connect it with a valid combination of data source and algorithm (configure the data source and algorithm components).
iv) Click the 'R Apply Model' component.

v) Basic component details will be displayed. vi) Click ‘Apply’.

vii) Click ‘Run’. viii) Users will be redirected to the ‘Console’ tab.

ix) Follow the below-given steps to display the result view:
 a. Click the dragged R Apply Model component on the workspace.
 b. Click the 'Result' tab.
x) The columns displaying the predicted values and probability will be added in the result view.

xi) Click the ‘Summary’ tab to view the model summary.

Note:
 a. The result data set of the model can be written to a database using a Data Writer.
 b. The column headers and data types of the feature columns for both the saved model and the testing data should match. If they do not match, an alert message will be displayed.
 c. It is not mandatory for the testing data set to contain a label column.

10. Performance

Users can evaluate model performance through a list of parameters. The performance component can be attached to classification or regression algorithms.

10.1. Spark Performance
The Spark Performance component is provided as a leaf-node under the Performance tree-node. It contains 3 input nodes that can be used to compare up to 3 models. Each node has a static name like model_0, model_1, and model_2. Based on the node connections, the model summaries can be viewed under the respective names.

Spark Performance components can be of the following formats:
1. Binary Classification: Used when the label has two classes.
2. Multi Class Classification: Used when the label has three or more classes.
3. Regression Evaluator Metrics: Used when the label is continuous.
In the case of multiple models, all the model statistics will appear in the performance summary (up to 3 models can be compared).
Connecting the Spark Performance component to a model:
i) Drag the Spark Performance component to the workspace and connect it to a valid workflow (in this example, a workflow created with the Spark Decision Tree algorithm has been used).

ii) Configure the 'Properties' tab:
 a. Performance Type: Select one of the following options:
  i. Binary Classification Metrics
  ii. Multiclass Classification Metrics (default option)
  iii. Regression Evaluator Metrics
 b. Beta Value: Enter a numerical value.
iii) Click 'Apply'.

Users will get different outcomes based on the selected Performance Type, as described below:

i. When the selected Performance Type is 'Multiclass Classification Metrics'

1. Click 'Apply'.
2. Click 'Run'.
3. A message will pop up to confirm whether users want to enable logging.
4. Click 'No'.

5. Users will be redirected to the ‘Console’ tab.

6. After the console process gets completed, users can click on the ‘Summary’ tab to view Summary of Multiclass Metrics.

ii. When the selected Performance Type is 'Binary Classification Metrics'

1. Click 'Apply'.
2. Click 'Run'.
3. A message will pop up to confirm whether users want to enable logging.
4. Click 'No'.

5. Users will be redirected to the 'Console' tab.
6. Users can follow the below-given steps to display the result view if the selected performance type is Binary:
 a. Click the dragged performance component on the workspace.
 b. Click the 'Result' tab.

7. Click the 'Visualization' tab.
8. The resulting view will be presented via the PR Curve or the ROC Curve.
 a. Result data displayed via the PR Curve

b. Result data displayed via the ROC Curve

iii. When the selected Performance Type is 'Regression Evaluator Metrics' (the Beta Value field will not appear for this Performance Type)

1. Click 'Apply'.
2. Click 'Run'.
3. A message will pop up to confirm whether users want to enable logging.
4. Click 'No'.

5. Users will be redirected to the 'Console' tab.
6. View the summary by following the steps given below:
 a. Click the performance component on the workspace.
 b. Click the 'Summary' tab.

10.2. R Performance
The R Performance component is provided as a leaf-node under the Performance tree-node. It contains 3 input nodes that can be used to compare up to 3 models. Each node has a static name like model_0, model_1, and model_2. Based on the node connections, the model summaries can be viewed under the respective names.
Connecting the Performance component to a model:
i) Drag the R Performance component to the workspace and connect it to a valid workflow.

ii) Configure the 'Properties' tab:
 a. Performance Type: Select an option using the drop-down menu.
  i. Binary Classification: To be used when the label has two classes.
  ii. Multiclass Classification (default option): To be used when the label has three or more classes.

iii) Click ‘Apply’.

a. When the selected Performance Type is ‘Multiclass Classification Metrics’.

1. Click ‘Apply’. 2. Click ‘Run’. 3. Users will be redirected to the ‘Console’ tab.

4. Users can view the summary by clicking the 'Summary' tab (first click the performance component, and then click the 'Summary' tab). The following details will be displayed under the 'Summary' tab:
 a. Confusion Matrix and Statistics
  i. Displays the confusion matrix of each model.
  ii. Columns contain the actual labels and rows contain the predicted labels.
 b. Overall Statistics
  i. The overall statistics of each model can be viewed in a tabular format.
  ii. Each model forms a row, and the following statistics form the columns:
   1. Accuracy
   2. 95% CI
   3. No Information Rate
   4. P-Value
   5. Kappa
   6. Mcnemar's Test P-Value
 c. Statistics by Class
  i. The following statistics can be shown label-wise:
   1. Sensitivity
   2. Specificity
   3. Pos Pred Value
   4. Neg Pred Value
   5. Prevalence
   6. Detection Rate
   7. Detection Prevalence
   8. Balanced Accuracy
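Note: The statistics listed above match the output of the confusionMatrix() function from R's caret package; the package mapping is an assumption, and the labels below are hypothetical.

  library(caret)
  actual    <- factor(c("a", "b", "c", "a", "b", "c", "a", "b"))  # reference labels
  predicted <- factor(c("a", "b", "b", "a", "c", "c", "a", "b"))  # model output
  cm <- confusionMatrix(data = predicted, reference = actual)
  cm          # confusion matrix plus Accuracy, 95% CI, Kappa, Mcnemar's Test P-Value
  cm$byClass  # per-class Sensitivity, Specificity, Pos/Neg Pred Value, Balanced Accuracy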

b. When the selected Performance Type is ‘Binary Classification Metrics’.

1. Click ‘Apply’. 2. Click ‘Run’. 3. Users will be redirected to the ‘Console’ tab.

4. Click the ‘Visualization’ tab to see the graphical representation of the result data.

Note:
 a. In the case of multiple models, all the model statistics will be displayed in the 'Summary' tab of the performance component (up to 3 models can be compared).
 b. No data will be displayed under the 'Result' tab for R Performance (Binary Classification).

11. Data Writer(s)

Data Writers are provided to store the results of the predictive analysis in flat files or databases for further in-depth analysis.

11.1. File Writer
Users can write output data to flat files like CSV, TEXT, and DAT files using the File Writer.

11.1.1. CSV Writer

i) Click 'TreeNode' provided next to the 'Data Writer' option.
ii) Select the 'File Writer' option.
iii) Select and drag the 'CSV Writer' component to the workspace.

iv) Connect the 'CSV Writer' to a configured data source.
v) Click the CSV Writer component to access the component properties.
vi) Enter 'File Name' in the displayed field.
vii) Click 'Apply'.

viii) Click ‘Run’. ix) A pop-up message will appear with a link to download the CSV file.

x) Click the link to download the CSV file.

11.1.2. JSON Writer
i) Click 'TreeNode' provided next to the 'Data Writer' option.
ii) Select the 'File Writer' option.
iii) Select and drag the 'JsonWriter' component to the workspace.
iv) Connect the 'JsonWriter' to a configured data source.
v) Click the 'JsonWriter' component to access the component properties.
vi) Enter 'File Name' in the displayed field.
vii) Click 'Apply'.

viii) Click the 'Run' or 'Run Till Here' option.
ix) A pop-up message will appear with a link to download the JSON file.

x) Click the link to download the JSON file.

11.2. Database Writer

11.2.1. Internal Data Writer
This data writer will store the data into databases like MySQL, MSSQL, and Oracle.
i) Click 'TreeNode' provided next to the 'Data Writer' option.
ii) Select the 'Database Writer' option.
iii) Select and drag the 'Internal Data Writer' component to the workspace.

iv) Connect the 'Internal Data Writer' component to a configured data source.
v) Click the 'Internal Data Writer' component to access the component properties. Users will have different properties fields based on the selected table choice, as described below:
 a. Selecting 'Create a New Table' as Table Operation:
  i. Data Connector Name: All the data connectors available under the particular user id will be listed. Select a data connector from the drop-down menu.

  ii. Type: This field will be preselected based on the selected data connector.
  iii. Number of Rows in a batch: Enter a number to limit the entries of rows for one batch.
  iv. Database Name: Select a database name from the drop-down menu.
  v. Password: Enter the database password.
  vi. Table Name: Select the 'Create New Table' option from the list.
  vii. Create New Table: It is an optional field. It appears only when the user selects the 'Create New Table' option from the 'Table Name' drop-down menu.
  viii. Column Selected from model: Select the columns that need to be written into the selected database.

 b. Selecting an Existing Table as Table Operation:
  i. Data Connector Name: Select a data connector from the drop-down menu.
  ii. Type: Displays a type based on the selected data connector.
  iii. Number of Rows in a batch: Enter a number to limit the entries of rows for one batch.
  iv. Database Name: Select a database name from the drop-down menu.
  v. Password: Enter the database password.
  vi. Table Name: Select an existing table name from the drop-down menu.
  vii. Table Operation: Select an option using the drop-down menu. The following choices are provided:
   1. Append Table
   2. Overwrite Table
  viii. Column Selected from model: Select the columns that need to be written into the selected database.
  ix. Details of the Selected table: Displays the column headers from the selected table.

vi) Click ‘Apply’. vii) Click ‘Run’. viii) Users will be directed to the ‘Console’ tab. ix) The data will be saved in the selected database.

11.2.1.1. Delta Load in Internal Data Writer (for the MySQL connector)
The internal data writer can extract only new or changed records while loading data from the MySQL database. The Schema View has been added to the internal database writer to extract data using the delta data load type.

i) Click 'TreeNode' provided next to the 'Data Writer' option.

ii) Select ‘Database Writer’ option. iii) Select and drag ‘Internal Data Writer’ component to the workspace.

iv) Connect the 'Internal Data Writer' component to a configured data source.
v) Click the 'Internal Data Writer' component.
vi) Users will be directed to the components tab.
vii) Configure the following fields:
Properties: Users will have different properties fields based on the selected table choice, as described below:
 a. Selecting 'Create a New Table' as Table Operation:
  i. Data Connector Name: All the data connectors available under the particular user id will be listed. Select a data connector from the drop-down menu.
  ii. Type: This field will be preselected based on the selected data connector.
  iii. Number of Rows in a batch: Enter a number to limit the entries of rows for one batch.
  iv. Database Name: Select a database name from the drop-down menu.

  v. Password: Enter the database password.
  vi. Table Name: Select the 'Create New Table' option from the list.
  vii. Table Operation: Select an option using the drop-down menu. The following choices are provided:
   1. Append: Rows can be appended to the table.
   2. Overwrite: Delete the existing information and write the new data.
   3. Upsert: Insert rows to the table if they do not exist, or update them if they do.
  viii. Create New Table: Enter a table name using this field (this field appears only when the user selects the 'Create New Table' option using the 'Table Name' field).
  ix. Auto Increment: Users can enable or disable 'Auto Increment' by selecting 'Enable' or 'Disable'.
  x. Auto Increment Label: Enter a label for the auto-increment column (this field will be displayed only if the user has enabled the 'Auto Increment' option).
  xi. Column Selected from model: Select the columns from the model that are to be written into the selected database.
  xii. Click 'Next'.

Note: The Schema Viewer tab will be displayed only after configuring the ‘Table Name’ field.

viii) Users will be directed to the 'Schema Viewer' tab.
ix) Define the primary keys by using the 'Select Primary Keys' field.
x) Click 'Apply'.

 b. Selecting an Existing Table as Table Operation:
  i. Data Connector Name: Select a data connector from the drop-down menu.
  ii. Type: Displays a type based on the selected data connector.
  iii. Number of Rows in a batch: Enter a number to limit the entries of rows for one batch.
  iv. Database Name: Select a database name from the drop-down menu.
  v. Password: Enter the database password.
  vi. Table Name: Select an existing table name from the drop-down menu.
  vii. Table Operation: Select an option using the drop-down menu. The following choices are provided:
   1. Append
   2. Overwrite
   3. Upsert
  viii. Column Selected from model: Select the columns that are to be written into the selected database.
  ix. Details of the Selected table: Displays the column headers from the selected table.
xi) Click 'Next'.

xii) Users will be directed to the ‘Schema Viewer’ tab. xiii) The defined/selected primary keys will be displayed. xiv) Click ‘Apply’.

xv) Click ‘Run’. xvi) Users will be directed to the console tab.

xvii) Users will be directed to the result tab.

Note: The Result tab appears only when the data source is connected with an algorithm component.

11.2.2. Cassandra Writer
The Cassandra Writer can be used to store the results of predictive executions.
i) Click 'TreeNode' provided next to the 'Data Writer' option.
ii) Select 'Database Writer'.
iii) Select and drag the 'Cassandra Writer' component to the workspace.

iv) Connect the 'Cassandra Writer' to a configured data source.
v) Click the 'Cassandra Writer' component to access its properties:
 a. Selecting 'Create a New Table' as Table Operation:
  i. Select Data Connector: Select a data connector using the drop-down menu.
  ii. Host Name: Based on the selected data connector, a host name will be displayed (users cannot edit this field).
  iii. Port Name: The server port number will be displayed (users cannot edit this field).
  iv. Username: The username of the selected connection appears by default (users cannot edit this field).
  v. Password: Enter the database password.
  vi. No. of rows in a batch: Enter a number to limit the entries of rows for one batch.
  vii. Select Key Space: Select a key space using the drop-down menu.
  viii. Replication Factor: The replication factor mentioned in the selected 'Key Space' will be displayed (users cannot edit this field).
  ix. Select Table: Select the 'Create a New Table' option from the drop-down menu.
  x. Select Columns: Select the columns that you want to write.
  xi. Consistency: Select an option from the drop-down menu.
  xii. New Table: Provide a name for the newly created table.
  xiii. New time uuid column name: Enter a UUID column name.
vi) Click 'Next'.

vii) Users will be redirected to the 'Key Specification' tab.
viii) Configure the following information:
 i. Headers: All the columns from the data set will be listed.
 ii. Partition Key (Name): The Partition Key determines which node stores the data. It is responsible for data distribution across the nodes.
  • The UUID column name will be displayed under the 'Partition Key' window.
  • Users can select and move any column from 'Header' (Select Column) to the 'Partition Key' space.
  • The sequence of the columns listed under the Partition Key can be arranged by using the 'Up' or 'Down' options.
 iii. Clustering Key: The Clustering Key is a storage engine process that sorts data within the partition. It determines per-partition clustering.

  • The items listed under the Clustering Key box can be arranged by using the 'Up' or 'Down' options.
  • Users can select any column from 'Headers' (Select Column) to the 'Clustering Key' space.

ix) Click 'Apply'.
x) Click 'Run'.
xi) A message will pop up to confirm whether users want to enable logging.
xii) Click 'No'.

xiii) Users will be redirected to the ‘Console’ tab.

Note: Users will be provided with a defined consistency level while designing the Key Space, which can be overridden based on the selected replica nodes. The following consistency options are provided:
§ One
§ Two
§ Three
§ Quorum

 b. Selecting an Existing Table as Table Operation:
  i. Select Data Connector: Select a data connector from the drop-down menu.
  ii. Host Name: Enter the database server details (from where the user wants to fetch data).
  iii. Port Name: The server port number.
  iv. Username: The username of the selected connection appears by default (users cannot edit this field).
  v. Password: Enter the database password.
  vi. No. of rows in a batch: Enter a number to limit the entries of rows for one batch.
  vii. Select Key Space: Select a key space using the drop-down menu.
  viii. Replication Factor: The replication factor of the selected 'Key Space' will be displayed (users cannot edit this field).
  ix. Select Table: Select a table from the drop-down menu.
  x. Select Columns: Select the columns from the drop-down menu that should be written by the data writer.
  xi. Consistency: Select an option using the drop-down menu.
  xii. Settings: Select an option using the drop-down menu. The following choices are provided:
   1. Append Table
   2. Overwrite Table

xiv) Click ‘Apply’. xv) Click ‘Run’.

xvi) The list of column headers existing in the table will be displayed once users select a table.

12. Custom R Script

Users can create and add customized algorithm components by using the 'Custom R-Script' component. The created scripts will be stored under the 'Saved Scripts' option.

12.1. Creating a New R Script

i) Click the 'Custom R Script' tree-node on the Predictive Analysis home page.
ii) Click 'Create New Script'.
iii) Users will be directed to the 'Component' tab.
iv) Configure the following fields in the 'General' tab:
 a. Basic
  i. Component Name: Enter a name or title under which the R script should be saved.
  ii. Component Type: The default component type will be displayed in this field.
  iii. Description: Describe the component (it is an optional field).
v) Click 'Next'.

vi) Users will be directed to the 'Script' tab.
vii) Provide the following information as required:
 a. Script Editor
  i. Paste the R script in the given space under 'Script Editor'.
  ii. Click the 'Validate' option.
  iii. Use 'Primary Function Details' to embed the customized R script into the function.
  iv. Set the function details as shown below:

   1. Primary Function Name: Select the name of the created function from the drop-down menu.
   2. Input Data Frame: Select a dataset (that has been used above) from the drop-down menu.
   3. Output Data Frame: Enter an option to which the data will be passed.
   4. Model Variable Name: Enter the output model variable (this field will appear only when the model summary has been enabled).
  v. If you need a visualization chart for the ensuing data, tick the 'Show Visualization' checkbox.
  vi. If you need to show the summary, tick the 'Show Summary' checkbox.
viii) Click 'Next'.

ix) Users will be directed to the ‘Settings’ tab. x) Configure the following fields:

 a. Output Table Definition: This option configures the number of output columns, column headers, and data types.
  i. Consider all columns from the previous component: To display all columns from the previous component.
  ii. Consider None: To display no columns from the previous component.
  iii. Data Type: Select a data type for the newly created column using the drop-down list.
  iv. New Predicted Column Name: Enter an appropriate name for the new predicted column.
  v. [Remove icon]: To remove the added row containing 'Data Type' and 'New Predicted Column Name'.
  vi. [Add icon]: To add a new row containing 'Data Type' and 'New Predicted Column Name'.
 b. Property View Definition
  i. Function Parameters: Actual names of the parameters configured in the script.
  ii. Property Display Name: The parameter name to be displayed while configuring the saved R script as a component.
  iii. Control Type: Users can select from the following options:
   1. Text box
   2. Drop-down menu
   3. Column Selector (single)
   4. Column Selector (multiple)
  iv. Settings option: To set the display for mandatory fields and validate the data type for an input column. This field is associated with the function parameters.
xi) Click 'Apply'.

xii) The newly created R Script will be saved in the ‘Saved Scripts’ list.

Guidelines to be followed while writing an R script:
1. The R script needs to be written inside a valid R function, i.e., the entire code body should be inside the curly braces of the function.
2. The R script should have at least one main function. Multiple functions are acceptable, and one function can call another function, but it should be written above the calling function body (if the called function is an outer function) or above the calling statement (if the called function is an inner function).

3. Any extra packages required to run the R script must be installed on the R server, and each should be loaded using a library('library_name') statement before calling the associated function in the script.
4. The R script should return data in the form of a list only, containing the data frame and the model (if used).
5. In the return statement, only a data frame can be assigned to the variable 'out'. This data frame supports all structures like list, string, vector, matrix, and table.
6. If the 'Show Visualization' field is marked 'yes' during the creation of the component, then a plot should be created in the R script; if the 'Show Summary' field is marked 'yes', then the returned list should contain the 'model' variable.
7. Empty cells, (NULL), (null), NULL, null, /N, NA, and N/A are considered unwanted values and are replaced by "NaN" in the case of double, long, short, float, byte, and integer columns, and by "NA" in the case of boolean and string columns; so instead of using these values in R code, use "NaN" or "NA" according to the data type of the input data.
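A minimal skeleton that follows these guidelines is sketched below; the function name, column names, and input data frame are hypothetical.

  myScore <- function(dataset) {
    library(rpart)                          # load any extra package before use (guideline 3)
    fit <- rpart(target ~ ., data = dataset, method = "class")
    dataset$predicted <- predict(fit, dataset, type = "class")
    plot(fit); text(fit)                    # create a plot if 'Show Visualization' is ticked (guideline 6)
    out <- dataset                          # only a data frame may be assigned to 'out' (guideline 5)
    return(list(out = out, model = fit))    # return a list; 'model' supports 'Show Summary' (guidelines 4 and 6)
  }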

Note:
 a. Click the 'Information' button to get the above-mentioned list of rules for the R script.
 b. 'Model Variable Name' can be enabled only after selecting the 'Show Summary' option.
 c. Select the 'Show Summary' and 'Show Visualization' options only if the R script carries both items.
 d. All the supported date data types are listed in the date formats in the data type definition; all other date formats are considered string data types.
 e. MSSQL data types are considered string data types.

12.2. Saved R-Scripts

12.2.1. Viewing a Saved R Script
i) Select an R script from the list of 'Saved R-Scripts'.
ii) Right-click on the selected R script.
iii) A context menu will open.
iv) Select 'View'.
v) Users will be redirected to the 'Component' tab.

12.2.2. Editing a Saved R Script
i) Select an R script from the list of 'Saved R-Scripts'.
ii) Right-click on the selected R script.
iii) A context menu will open.
iv) Select 'Edit'.
v) Users will be redirected to the 'Component' tab.
vi) Users can edit the required fields provided under the General, Script, and Settings tabs.

12.2.3. Sharing a Saved R Script
This feature gives users the ability to share a custom R script with other users and groups. The following options are available to share a custom R script:
1. Share With: This option allows the user to share a custom R script with selected users or user groups. Any changes made to the custom R script will be transferred to all the users with whom the custom R script has been shared.
 i) Right-click on a saved R script from the list of 'Saved Scripts'.
 ii) Select 'Share Custom R Script' from the context menu.
 iii) The 'Share With' option will be displayed (by default).
 iv) Select either 'Group' or 'Users'.
  a. By selecting a group, all group members inside the group will be listed. Users can be excluded by not selecting them from the group.
  b. Users can be excluded by not selecting a user name from the list when the 'User' option has been selected.
 v) Select a specific user or group from the list by check-marking the box.
 vi) Click 'Apply'.

vii) The selected saved R script will be shared with the chosen user(s)/group(s). 2. Copy To: This option creates a copy and shares the copy of the custom R script with the selected users and user groups. Any changes to the original custom R script after sharing will not show up for the users that received the shared file via the ‘Copy To’ option. i) Right, click on a saved R script from the list of ‘Saved Scripts’. ii) Select ‘Share Custom R Script’ from the context menu. iii) Select ‘Copy To’. iv) The copied custom R script name will be displayed in a box. v) Select either the ‘Group’ or ‘Users’ tab. a. By selecting a group all group members inside the group will be listed. Users can be excluded by not selecting them from the group.

www.bdbizviz.com



Page | 219



















b. Users can be excluded by not selecting a user name from the list

when ‘User’ option has been selected. vi) Select a specific group or user from the list by check marking the box. vii) Click ‘Apply’.

viii) The copied saved R script will be shared with the selected user(s)/group(s). 12.2.4. Deleting a Saved R Script i) Select an R Script from the list of ‘Saved R-Script’. ii) Right click on the selected R Script. iii) A context menu will open. iv) Select ‘Delete.

v) A pop-up window will appear to assure the deletion. vi) Click ‘Ok’.

vii) The selected R script will be deleted.

12.2.5. Connecting a Saved R Script with a Data Source
i) Click the ‘Custom R Script’ tree node.
ii) Select and drag a saved R script to the workspace.
iii) Connect the R script to a configured data source component.
iv) Click the ‘R Script’ component.
v) Configure the required component fields.
vi) Click ‘Apply’.

vii) Click ‘Run’ or ‘Run Till Here’.
viii) The ‘Result’ view will be displayed.
ix) Click the ‘Visualization’ tab.
x) The result data will be displayed through graphics.

Note: The above-given process is displayed for a CSV data source. A similar set of steps can be followed for other data source types.

13. Custom Scala Script
Users can create and add customized algorithm components using the ‘Custom Scala Script’ component. The created scripts will be stored under the ‘Saved Scripts’ option. The ‘Custom Scala Script’ component runs only on Spark.

13.1. Creating a New Script
i) Click the ‘Custom Scala Script’ tree node on the Predictive Analysis home page.
ii) Click ‘Create New Script’.
iii) Users will be directed to the ‘Component’ tab.
iv) Configure the following fields in the ‘General’ tab:
a. Basic
i. Component Name: Enter the name or title under which the Scala script will be saved.
ii. Component Type: The default component type will be displayed in this field.
iii. Description: Describe the component (optional field).
v) Click ‘Next’.

vi) Users will be directed to the ‘Script’ tab.
vii) Provide the following information:
a. Script Editor
i. Write the Scala script in the given space under ‘Script Editor’.
ii. Click the ‘Validate’ option.
iii. Configure the required ‘Primary Function Details’ to embed the customized Scala script into a function.
1. Primary Function Name: Select the name of the created function from the drop-down menu.
2. Input Data Frame: Select a dataset (one used above) from the drop-down menu.
viii) Click ‘Next’. (Users can click ‘Previous’ if they wish to open the previous page.)

ix) Users will be directed to the ‘Settings’ tab.
x) Configure the following fields:
a. Output Table Definition: This option configures the number of output columns, their column headers, and data types. Select any one of the following options:
i. Consider all columns from the previous component: To display all columns from the previous component.
ii. Consider None: To display no columns from the previous component.
b. Define Predicted Columns
i. New Predicted Column Name: Enter an appropriate name for the new predicted column.
ii. The remove icon: To remove an added row containing ‘Data Type’ and ‘New Predicted Column Name’.
iii. The add icon: To add a new row containing ‘Data Type’ and ‘New Predicted Column Name’.

c. Property View Definition
i. Function Parameters: The actual names of the parameters configured in the script.
ii. Property Display Name: The parameter name to be displayed while configuring the saved Scala script as a component.
iii. Control Type: Users can select one of the following options: 1. Text box, 2. Drop-down menu, 3. Column Selector (single), 4. Column Selector (multiple).
iv. Settings option: To set the display for mandatory fields and validate the data type of an input column. This field is associated with the function parameters.
xi) Click ‘Apply’.
xii) The newly created Scala script will be saved in the ‘Saved Scripts’ list.

Guidelines to be followed while Writing Scala Script
1. The first argument of the function should be a data frame.
2. The Scala script needs to be written inside a valid Scala function, i.e., the entire code body should be inside the curly braces of the function.
3. The Scala script should have at least one main function. Multiple functions are acceptable, and one function can call another function, but the called function should be written above the calling function body (if the called function is an outer function) or above the calling statement (if the called function is an inner function).
4. All the packages used in the function need to be imported explicitly before writing the function, e.g., import org.apache.spark.sql.{Dataset, Row}.
5. The Scala script should return data in the form of a Dataset only, and the return type should be defined while writing the function.
6. The column names should remain the same while creating new columns in the Output Table Definition.
7. If users need to define a Column Selector (Multiple), then ‘: List[String]’ should be used in the definition, and the body of the function should convert it to an Array.
8. If users need to define a Column Selector (Single), then ‘String’ has to be used in the definition.
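For illustration, here is a minimal sketch of a custom Scala script that follows the above rules. The function names, the ‘predicted_value’ column, and the constant multiplier are assumptions for the example, not part of the product:

// Rule 4: packages used in the function are imported explicitly first.
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions.col

// Rules 1, 2, and 5: the first argument is a data frame, the entire body
// sits inside the curly braces, and the declared return type is a Dataset.
def addPredictedColumn(inputData: DataFrame, sourceColumn: String): Dataset[Row] = {
  // The new column's name must match the one declared in the
  // Output Table Definition (rule 6). The 'String' parameter maps to a
  // Column Selector (Single), per rule 8.
  inputData.withColumn("predicted_value", col(sourceColumn) * 1.1)
}

// Rule 7: a Column Selector (Multiple) parameter arrives as List[String];
// the body converts it to an array/varargs form where one is required.
def selectColumns(inputData: DataFrame, columns: List[String]): Dataset[Row] = {
  inputData.select(columns.map(col).toArray: _*)
}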

Note:
a. Click the ‘Information’ button to get the above-mentioned rules for writing a Scala script.
b. All the supported date data types are listed under date formats in the data type definition; all other date formats are considered as string data type.
c. MS SQL data types are considered as string data type.

13.2. Saved Scala Scripts

13.2.1. Viewing a Saved Scala Script
i) Select a Scala script from the ‘Saved Scripts’ list.
ii) Right-click on the selected Scala script.
iii) A context menu will open.
iv) Select ‘View’.
v) Users will be redirected to the ‘Component’ tab.

13.2.2. Editing a Saved Scala Script
i) Select a Scala script from the ‘Saved Scripts’ list.
ii) Right-click on the selected Scala script.
iii) A context menu will open.
iv) Select ‘Edit’.
v) Users will be redirected to the ‘Component’ tab.
vi) Users can edit the required fields provided under the General, Script, and Settings tabs.

13.2.3. Sharing a Saved Scala Script
This feature gives users the ability to share a custom Scala script with other users and groups.

The following options are available to share a custom Scala script:
1. Share With: This option allows the user to share a custom Scala script with selected users or user groups. Any changes made to the custom Scala script will be transferred to all the users with whom the custom Scala script has been shared.
i) Select a Scala script from the list of ‘Saved Scripts’.
ii) Right-click on the selected Scala script.
iii) Select ‘Share’ from the context menu.
iv) The ‘Share With’ option will be displayed (by default).
v) Select either ‘Group’ or ‘Users’.
a. By selecting a group, all the members of that group will be listed. Users can be excluded by not selecting them from the group.
b. Users can be excluded by not selecting a user name from the list when the ‘Users’ option has been selected.
vi) Select a specific user or group from the list by check-marking the box.
vii) Click ‘Apply’.
viii) The selected saved Scala script will be shared with the chosen user(s)/group(s).
2. Copy To: This option creates a copy and shares the copy of the custom Scala script with the selected users and user groups. Any changes to the original custom Scala script after sharing will not show up for the users that received the shared file via the ‘Copy To’ option.

i) Select a Scala script from the list of ‘Saved Scripts’.
ii) Right-click on the selected Scala script.
iii) Select ‘Share’ from the context menu.
iv) Select ‘Copy To’.
v) The copied custom Scala script name will be displayed in a box.
vi) Select either the ‘Group’ or ‘Users’ tab.
a. By selecting a group, all the members of that group will be listed. Users can be excluded by not selecting them from the group.
b. Users can be excluded by not selecting a user name from the list when the ‘Users’ option has been selected.
vii) Select a specific group or user from the list by check-marking the box.
viii) Click ‘Apply’.
ix) The copied saved Scala script will be shared with the selected user(s)/group(s).

13.2.4. Deleting a Saved Scala Script
i) Select a Scala script from the ‘Saved Scripts’ list.
ii) Right-click on the selected Scala script.
iii) A context menu will open.
iv) Select ‘Delete’.

v) A pop-up window will appear to confirm the deletion.
vi) Click ‘Ok’.
vii) The selected Scala script will be deleted.

13.2.5. Connecting a Saved Scala Script with a Data Source
i) Click the ‘Custom Scala Script’ tree node.
ii) Select and drag a saved Scala script to the workspace.
iii) Connect the Scala script to a configured data source. (Here, the workflow used has String Indexer and Spark Apply Model components connected with the Scala script component.)
iv) Click the dragged ‘Scala Script’ component.

v) Configure the required fields in the ‘Custom Group’ tab.
vi) Click ‘Apply’.
vii) Click ‘Run’.
viii) A message will pop up to confirm whether users want to enable logging.
ix) Select ‘No’.
x) Users will be redirected to the ‘Console’ tab.

xi) Follow the below-given steps to display the result view:
a. Click the dragged Spark Apply Model component on the workspace.
b. Click the ‘Result’ tab.

14. Scheduler
The Scheduler helps to schedule a Predictive Workflow as per the requirement.

14.1. New Schedule
This section explains the steps to schedule a new job. Scheduling a new job is a continuous, step-by-step process as described below:

i) Navigate to the Predictive home page.
ii) Click the ‘Scheduler’ tree node.
iii) Two options will be displayed:
a. New Schedule
b. Status
iv) Select ‘New Schedule’.
v) Users will be redirected to the ‘General’ tab.

14.1.1. Configuring the General Tab
i) The ‘General’ tab will open (by default).
ii) Fill in the required information:
a. Model Name: Select a model name using the drop-down menu.
b. Job Name: Enter a job name.
c. Description: Describe the job (optional field).
d. Use Existing Data Connector: Use the radio buttons to select an option.
i. Select ‘Yes’ to use an existing data connector.
ii. Select ‘No’ to not use an existing data connector.
e. Use Existing Data Writer: Use the radio buttons to select an option.
i. Select ‘Yes’ to use an existing data writer.
ii. Select ‘No’ to not use an existing data writer.
iii) Click ‘Next’.

iv) Users will be redirected to the ‘Data Source’ tab.

14.1.2. Configuring the Data Source
Provide the required information to configure a data source:
i) The ‘General’ fields will be displayed by default.
ii) Users can fill in the required fields:
a. Component Name: A default name provided for the component.
b. Alias Name: Users can enter a name for the component.
c. Description: Users can describe the component (optional).
iii) Click ‘Next’.

iv) Users will be redirected to the ‘Properties’ fields.
v) Configure the following fields (to configure a new data source):
a. Select Data Connector: Select a data connector from the drop-down menu.
b. Select Data Service: Select a data service from the drop-down menu.
c. Based on the selected data service, the below-given columns will be displayed:
i. Column Header
ii. Data Type
vi) Click ‘Next’.
vii) Users will be redirected to the ‘Conditions’ tab (if conditions are available; otherwise, the data source configuration will end at the previous step).
viii) Configure the required ‘Conditions’ fields.
ix) Click ‘Next’.

x) Users will be redirected to the ‘Mapping’ tab.
xi) Configure the column header information from the data service that will be used for the selected model columns.
xii) Click ‘Next’.
xiii) Users will be redirected to the ‘Data Writer’ tab.

Note: The ‘Data Source’ tab will be enabled only if users select ‘No’ for the ‘Use Existing Data Connector’ option while configuring the ‘General’ tab for a new schedule.

14.1.3. Configuring a Data Writer
Based on the selected data writer type, the configuration fields will be displayed. The ‘Data Writer’ tab provides the following data writer types to complete the scheduling process:
1. Data Writer
2. Elastic Search Writer

Users need to configure the ‘Data Writer’ tab in either of the ways described below:
1. Data Writer Type - Data Writer
i) Fill in the required details to configure a data writer.
ii) Click ‘Next’.

iii) Users will be redirected to the ‘Schedule’ tab.
2. Data Writer Type - Elastic Search Writer
i) Users will be directed to create a Hierarchy Definition.
ii) Drag and drop the required dimensions to define a hierarchical drill.
iii) Click ‘Next’.
iv) Users will be redirected to the ‘Schedule’ tab.

Note: The ‘Data Writer’ tab will be enabled only if users select ‘No’ for ‘Use Existing Data Writer’ while configuring the ‘General’ tab for a new schedule.

14.1.4. Scheduling a New Job
Users can select a time to schedule a new job using this section. A refresh interval option will be provided as per the selected scheduling time.
i) Start Date: Select a start date and time for the scheduled job (it should be greater than the current system date and time).
ii) Select a Job Refresh Interval option. E.g., when the selected time range is ‘Hourly’, the interval options are as described below:
Every_hour: Selecting this option will refresh the scheduled job after every selected interval.
OR
At: Selecting this option will refresh the scheduled job at the selected hour.
iii) End Date: Select an end date and time for the scheduled job (it should be greater than the start date and the current system date and time).
iv) Run Now: Select this option to run the scheduled job on applying.
v) Click ‘Next’.
vi) Users will be redirected to the ‘Notification’ tab.

Describing the Various Job Refresh Intervals
• Hourly: By selecting this option, users can schedule the job on an hourly basis.
Job Refresh Interval Details
1. Select a specific hour by using the below-given options:
Every_hour: Selecting this option will refresh the scheduled job after the selected hourly interval.
OR

At: Selecting this option will refresh the scheduled job at the selected hour.

• Daily: By selecting this option, users can schedule the job on a daily basis.
Job Refresh Interval Details
1. Select a specific day by using the below-given options:
Every_ Days: The scheduled job will be refreshed after every selected number of days.
OR
Every Week Day: The scheduled job will be refreshed daily until the end date.
2. Select a start time.

• Weekly: By selecting this option, users can schedule the job on a weekly basis.
Job Refresh Interval Details
1. Select the day or days of the week on which the scheduled job should be refreshed.
2. Select a start time.

• Monthly: By selecting this option, users can schedule the job on a monthly basis. This time range is for more than one month.
Job Refresh Interval Details
1. Select a specific day of the month by using the below-given options:
E.g., the 1st day of the 1st month
OR
E.g., the first Monday of the 1st month
2. Select a start time.

• Yearly: By selecting this option, users can schedule the job on a yearly basis. This time range is for more than one year.
Job Refresh Interval Details
1. Select a specific day of the month by using the below-given options:
E.g., select every 1st day of January
OR
E.g., select the first Monday of January
2. Select a start time.

Note: By selecting the ‘Use Existing Data Connector’ and ‘Use Existing Data Writer’ options, the ‘Schedule’ tab will be displayed immediately after the ‘General’ tab.

14.1.5. Notification
i) Configure the below-given fields:
a. Enable Email Notification: Check mark the box to enable email notification.
b. Email Address: Enter the email address(es) to be notified.
c. Send Mail when R Server is not running: Check mark the box to enable this option. By enabling this option, users will get an email when the R server is not running.
d. Send Mail when Process is Completed Successfully: Check mark the box to enable this option. By enabling this option, users will get an email after the process is successfully completed.
e. Send Mail when the Process is a Failure: Check mark the box to enable this option. By enabling this option, users will get an email when the process fails.
ii) Click ‘Apply’ to save the details.

iii) A success message will pop up to confirm that the job/process has been scheduled.
iv) The scheduled job/process will be added to the list provided under the ‘Status’ tab.

Note:

a. The PDF summary will be sent through email for the scheduled workflows.
b. Multiple email addresses can be entered as comma-separated values.
c. At present, Spark workflows are not supported by the Scheduler.

14.2. Status
This section displays detailed information for all the scheduled jobs.
i) Click the ‘Scheduler’ tree node.
ii) Select ‘Status’.
iii) Users will be redirected to the ‘Component’ tab.
iv) A list containing all the scheduled jobs will be displayed.

Click ‘View Logs’ to see the logs of the selected workflow under the ‘Component’ tab.

Related Actions for a Scheduled Job:

Name      Description
Edit      To edit/update the scheduled job details
Stop      To stop the scheduled job
Remove    To remove the scheduled job from the list
Start     To start the scheduled job

Note:
a. The ‘Edit’ option will allow the user to update/edit all the tabs for the selected job.
b. Users can click the ‘Start’ button to restart the scheduler for a scheduled job until it reaches the end date.
c. Users can enable the ‘Edit’ and ‘Remove’ actions only after stopping the scheduled job.

15. Live Job Status
Users can monitor Spark processes using the ‘Live Job Status’ feature. The ‘Live Job Status’ option appears as a new tree node on the existing tree structure, with Spark as a leaf node under it. Users need to enable logging to view the log in the live job status in Spark after running a workflow.
i) Create a workflow in Spark.
ii) Click ‘Run’.
iii) A window will pop up asking for confirmation to enable or disable the log.
iv) Click ‘Yes’ to enable logging. (Selecting ‘No’ will not display the log in the live job status.)

v) Click the ‘Live Job Status’ tree node from the tree structure.
vi) Click the ‘Spark’ leaf node.
vii) Users will be redirected to the ‘Status’ tab.

b. View Log: The log of a completed workflow can be viewed under the ‘Console’ tab by clicking the ‘View Log’ icon.

c. Live Job Status: If the workflow execution is still in progress, users can view the live action by clicking the ‘Live Job Status’ icon. Live jobs will be displayed under the ‘Console’ tab.
d. Summary: Click the ‘Summary’ icon to view a consolidated summary of all the components in a workflow. It will be displayed under the ‘Summary’ tab.
e. Actions
i. Stop: Users can stop an ongoing execution at any time by clicking the ‘Stop’ button. The status of the process will change to ‘Cancelled’ if the execution has been stopped.

ii. Delete: Click the ‘Delete’ icon to remove an execution.

The selected workflow will be deleted from the ‘Live Job Status’ table and a warning message will be displayed to convey the same.

Note:
a. Click the ‘Refresh’ option to refresh the table for viewing a live job.
b. Click the ‘Remove all jobs’ option to delete all the jobs from the table.

16. Saved Workflows

Users can save a workflow by clicking the ‘Save’ button provided on the workspace menu row. All the saved workflows will be displayed under the ‘Saved Workflow’ tree node. This section explains the various options assigned to a saved workflow.
i) Navigate to the Predictive home page.
ii) Click the ‘Saved Workflow’ tree node.
iii) A list of all the saved workflows will be displayed.
iv) Right-click on a workflow from the list of ‘Saved Workflows’.
v) A context menu will open with various options (as shown below):

16.1. Opening a Workflow
i) Right-click on a workflow from the list of ‘Saved Workflows’.
ii) Select ‘Open’ from the context menu.
iii) The selected workflow will be displayed in the right pane of the screen.

Note: The workflow name will be displayed on the left side of the workspace menu row while opening a workflow.



16.2. Deleting a Workflow
i) Right-click on a workflow from the list of ‘Saved Workflows’.
ii) Select ‘Delete’ from the context menu.
iii) A message window will pop up to confirm the deletion.
iv) Click ‘Ok’.
v) The selected workflow will be deleted from the list.



















16.2.1. Deleting a Connection in a Workflow
A right-click on an inter-node connection in a workflow will display the ‘Delete Connection’ option. Click the ‘Delete Connection’ option to delete the connection.

16.3. Renaming a Workflow
i) Right-click on a workflow from the list of ‘Saved Workflows’.
ii) Select ‘Rename’ from the context menu.
iii) A pop-up window will appear.
iv) Enter a new/modified name for the workflow.
v) Click ‘Yes’.
vi) The selected workflow will be renamed.

16.4. Sharing a Workflow
This feature gives users the ability to share saved workflows with other users and groups.
The following options are available to share a selected workflow:
1. Share With: This option allows the user to share a file with the selected users or user groups. Any changes made to the file will be transferred to all the users with whom the file has been shared.
i) Right-click on a workflow from the list of ‘Saved Workflows’.
ii) Select ‘Share Workflow’ from the context menu.
iii) The ‘Share With’ option will be displayed (by default).
iv) Select either ‘Group’ or ‘Users’.
a. By selecting a group, all the members of that group will be listed. Users can be excluded by not selecting them from the group.
b. Users can be excluded by not selecting a user name from the list when the ‘Users’ option has been selected.
v) Select a specific group or user from the list by check-marking the box.
vi) Click ‘Apply’.
vii) The selected workflow will be shared with the chosen user(s)/group(s).
2. Copy To: This option creates a copy and shares the copy with the selected users and user groups. Any changes to the original file after sharing will not show up for the users that received the shared file via the ‘Copy To’ method.

i) Right-click on a workflow from the list of ‘Saved Workflows’.
ii) Select ‘Share Workflow’ from the context menu.
iii) Select ‘Copy To’.
iv) The copied workflow name will be displayed.
v) Select either ‘Group’ or ‘Users’.
a. By selecting a group, all the members of that group will be listed. Users can be excluded by not selecting them from the group.
b. Users can be excluded by not selecting a user name from the list when the ‘Users’ option has been selected.
vi) Select a specific group or user from the list by check-marking the box.
vii) Click ‘Apply’.
viii) The copied workflow will be shared with the chosen user(s)/group(s).

16.5. Deploying a Workflow
The Predictive Workflows can be deployed to the BizViz Dashboard Designer.
i) Right-click on a workflow from the list of ‘Saved Workflows’.
ii) Select ‘Deploy Workflow’ from the context menu.
iii) Users will be redirected to select an Apply Model component from the workflow.
iv) Select an Apply Model component and click ‘Yes’.
v) A success message will pop up to confirm that the workflow has been published.

vi) Navigate to the Dashboard Designer home page.
vii) Click ‘New’.
viii) Click ‘Dashboard’.
ix) Users will be directed to the Dashboard canvas.
x) Click the ‘Data Source’ icon to display all the available data sources.

xi) Click the ‘Create New Connection’ option provided next to the ‘Predictive Service’ data source.
xii) A new connection will be created and added below.
xiii) Click on the connection to display the connection-specific details.
xiv) Select the deployed Predictive workflow as a data source via the drop-down menu.
xv) Configure the other subsequent details:
a. Load At Start: Enable this option to get the updated data.
b. Timely Refresh: Enable this option to refresh data.
c. Refresh Interval: Select the time interval at which to refresh the data.
Once the data connection is established, the selected Predictive workflow can be used as a data source for the Dashboard Designer.

Recommendations
§ R Workflows: The result set located before a data writer component within a deployed R workflow will be considered as the data set by the Dashboard Designer.
§ Spark Workflows:
• The result set from the ‘Apply Model’ component within a deployed Spark workflow will be considered as the data set by the Dashboard Designer (a result set after the ‘Apply Model’ component will not be considered).
• A Spark workflow must contain one Apply Model component, a read model (Saved Model) component, and a Spark filter (optional) component to be deployed.

16.6. Result of Each Component
Users can view the result of each component in a Spark workflow.
i) Select a component from the Spark workflow after the execution is completed.
ii) Click the ‘Result’ tab.
iii) The result data of the selected component will be displayed.

16.7. Stop Button on the Progress Bar
Users can stop an ongoing Spark workflow execution by clicking the ‘Stop’ button on the progress bar.

17. Saved Spark Models
A model is a reusable component created by training an algorithm on historical data and saving the instance. The ‘Saved Models’ tree node contains a list of all the saved predictive models.

17.1. Saving a Spark Model
i) Open a Spark workflow.
ii) Connect the ‘Apply Model’ component with the workflow (as shown below).
iii) Right-click on the ‘Apply Model’ component.
iv) A context menu will open.
v) Select ‘Save Model’.
vi) A pop-up window will appear.
vii) Enter a name for the model that you wish to save.
viii) Click ‘Ok’.

ix) The created Predictive Model will be saved under the ‘Saved Spark Models’ list.

17.2. Reading a Spark Model
Users can drag a saved model to the workspace and reuse it on test data. A saved model can be connected only to an Apply Model component and a new test data source.
i) Select and drag a saved model onto the workspace.
ii) Connect the saved model with a configured data source and an Apply Model component (as shown in the following image).
iii) Click on the dragged Saved Model component.

iv) Users will be redirected to the ‘Component’ tab.
v) Configure the following fields in ‘General’.
vi) Click the ‘Summary’ tab.
vii) Click ‘Run’.
viii) Users will be redirected to the ‘Console’ tab.
ix) Follow the below-given steps to display the result:
a. Click the Apply Model component.

b. Click the ‘Result’ tab.

x) Click the ‘Properties’ tab to display the model properties.

Note:
a. To run a workflow with a ‘Saved Model’ component, the column headers and data types of the test data source must match those of the selected saved model. Users will encounter an error if this validation fails while running the workflow.
b. Users can connect a data writer to the ‘Apply Model’ component in a workflow that contains a saved model.
c. Currently, only Spark-trained workflows can be saved under the ‘Saved Models’ tree node.

17.3. Renaming a Spark Model
i) Select a model from the ‘Saved Models’ list.
ii) Right-click on the selected model.

iii) A context menu will open.
iv) Select ‘Rename’.
v) A pop-up window will appear to rename the model.
vi) Enter a new ‘Model Title’ or modify the existing model title in the given field (if desired).
vii) Click ‘Yes’.
viii) The selected Spark Predictive Model will be renamed.

17.4. Deleting a Spark Model
i) Select a model from the ‘Saved Models’ list.
ii) Right-click on the selected model.
iii) A context menu will open.
iv) Select ‘Delete’.



















v) A pop-up window will appear to confirm the deletion.
vi) Click ‘Ok’.
vii) The selected predictive model will be deleted and removed from the ‘Saved Spark Models’ list.

17.5. Sharing a Spark Model
Users can share a saved model with other users or user groups. There are two options to share a selected model:
1. Share With: This option allows the user to share a file with the selected users or user groups. Any changes made to the file will be transferred to all the users with whom the file has been shared.
i) Right-click on a model from the list of ‘Saved Models’.
ii) Select ‘Share Model’ from the context menu.
iii) The ‘Share With’ option will be displayed (by default).
iv) Select either ‘Group’ or ‘Users’.
a. By selecting a group, all the members of that group will be listed. Users can be excluded by not selecting them from the group.
b. Users can be excluded by not selecting a user name from the list when the ‘Users’ option has been selected.
v) Select a specific group or user from the list by check-marking the box.
vi) Click ‘Apply’.
2. Copy To: This option creates a copy and shares the copy with the selected users and user groups. Any changes to the original file after sharing will not show up for the users that received the shared file via the ‘Copy To’ method.
i) Right-click on a model from the list of ‘Saved Models’.
ii) Select ‘Share Model’ from the context menu.
iii) Select the ‘Copy To’ option.
iv) The copied model name will be displayed.
v) Select either the ‘Group’ or ‘Users’ option.
a. By selecting a group, all the members of that group will be listed. Users can be excluded by not selecting them from the group.
b. Users can be excluded by not selecting a user name from the list when the ‘Users’ option has been selected.
vi) Select a specific group or user from the list by check-marking the box.
vii) Click ‘Apply’.


A copy of the model will be shared with the selected user or group.

18. Saved R Models
The R Apply Model is a component used to generate predictions based on a trained classification or regression model. Users can either split the data set into training and testing sets, create a model with the training data, and apply it over the testing data; or save the model and apply it over a new test data set. Users can save an R model after a successful execution. The saved R models will be listed under the ‘Saved R Models’ tree node. Users can select a saved R model from the list and use it to create a new workflow. R Apply Model appears as a leaf node under the Apply Model tree node. The R Apply Model component consists of two nodes: one for reading data from a data source and another for giving the result.

18.1. Saving an R Model
i) Open an R workflow.
ii) Connect the ‘Apply Model’ component with the workflow (as shown below).
iii) Right-click on the ‘Apply Model’ component.
iv) A context menu will open.
v) Select ‘Save Model’.
vi) A new window will pop up.
vii) Enter a name for the model that you wish to save.
viii) Click ‘Ok’.

ix) The created predictive model will be saved under the ‘Saved R Models’ list.

18.2. Reading an R Model
Users can drag a saved model to the workspace and reuse it on test data. A saved R model can be connected only to an Apply Model component and a new test data source.
i) Select and drag a saved R model component onto the workspace.
ii) Connect the dragged model with a configured data source and an Apply Model component (as shown in the following image).
iii) Click on the dragged Saved Model component.
iv) Users will be able to view the following ‘Component’ tabs:
a. General

b. Click the ‘Summary’ tab to display the model summary.

v) Click ‘Apply’ on the Apply Model component.

vi) Click ‘Run’.
vii) Users will be redirected to the ‘Console’ tab.
viii) After the process is completed under the ‘Console’ tab, click the ‘Result’ tab to see the result view of the data.

Note:
a. A mandatory condition for running a workflow with a ‘Saved R Model’ component is that the column headers and data types of the test data source must match those of the selected saved model. Users will encounter an error if this validation fails while running the workflow.
b. Users can connect a data writer to the ‘Apply Model’ component in a workflow containing a saved model.

18.3. Renaming an R Model
i) Select a model from the ‘Saved R Models’ list.
ii) Right-click on the selected model.
iii) A context menu will open.
iv) Select ‘Rename’.
v) A pop-up window will appear to rename the model.
vi) Enter a new ‘Model Title’ or modify the existing model title in the given field (if desired).
vii) Click ‘Yes’.
viii) The selected R Predictive Model will be renamed.

18.4. Deleting an R Model
i) Select a model from the ‘Saved R Models’ list.
ii) Right-click on the selected model.

iii) A context menu will open.
iv) Select ‘Delete’.
v) A pop-up window will appear to confirm the deletion.
vi) Click ‘Ok’.
vii) The selected predictive model will be deleted and removed from the ‘Saved R Models’ list.

Note: After renaming or deleting a saved R model, workflows that use that model will not work.

19. Signing Out
Follow the below-given steps to log out from the BizViz Platform.
i) Click the ‘User’ icon on the Platform home page.
ii) A menu appears with the logged-in user details.
iii) Click ‘Sign Out’.
iv) Users will be successfully logged out from the BizViz Platform.

Note: Clicking ‘Sign Out’ will redirect the user back to the ‘Login’ page of the BizViz Platform.
