Apply machine learning predictions to your data without any programming knowledge.

Our program lets you apply machine learning predictions to your data without any programming knowledge. You just need to create an Excel (.xlsx) or CSV (.csv) file with columns of “predictor” variables and one column of “target” variable. (Example data files called example_input_train.csv and example_input_live.csv are available for download.) Our program will learn from your predictor-target pairs and make predictions on unseen live data that you supply. We will also show you the performance metrics of the model, e.g. how accurate the predictions are. The program can predict either discrete target variables, such as the sign of returns, or continuous variables, such as the returns themselves.

NOTE: The program accepts both .csv and .xlsx files. However, when modifying or creating a .csv file via Excel, some characters may not be correctly converted. We encourage Excel users to save their files as .xlsx rather than .csv!
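For readers who script their data preparation, a minimal sketch of building a valid input file with pandas follows. The predictor column names and the output file name here are hypothetical; only the target name ‘futreturn’ is borrowed from the example file.

```python
import pandas as pd

# Hypothetical predictors plus ONE target column, mirroring the layout
# described above (any single target column name works).
df = pd.DataFrame({
    "Date": pd.date_range("2020-01-02", periods=5, freq="B"),
    "rsi_14": [55.2, 48.7, 61.3, 39.9, 70.1],            # predictor
    "mom_10": [0.012, -0.004, 0.021, -0.015, 0.030],     # predictor
    "futreturn": [0.004, -0.010, 0.015, -0.002, 0.008],  # target
})

# Save without whitespace in the file name; .xlsx is recommended when the
# file will later be edited in Excel.
df.to_csv("my_train_data.csv", index=False)
```
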

Steps To Do

  1. Login / Register to the website.
    Note that the password is never stored in plaintext on the server; only an encrypted version is kept.
  2. Choose between the Train option and the Live option, and click on Submit.


  • “Training” means you supply historical data with known outcome (target) for our program to learn how to make predictions.
  • Specify the necessary parameters to perform data preprocessing, split between train and test sets, run hyperparameter optimization and feature selection, and train the model. Many parameters have default values that are already specified. (Details in the Hyperparameters for Training section below.)
  • Upload the data file on which to perform the analysis. Note that there must be a column to be used as the target variable; this target variable is specified by the user.
  • The program will automatically process this target (also called “label”) column by internally assigning 0 if the target is negative or zero, and 1 if it is strictly positive. For example, in a metalabeling application, the target can be returns. In other classification tasks, the labels must have values -1 or 1.
  • Note other details regarding the data specification in the Data preprocessing section below.
  • Outputs from training will be available for download. See Outputs from training section below.
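The label rule above can be sketched in a few lines. This is an illustration of the described behavior, not the service’s actual code: rows with NaN targets (no trade, in a metalabeling setup) are dropped, strictly positive values become class 1, and zero or negative values become class 0.

```python
import numpy as np
import pandas as pd

def encode_target(y: pd.Series) -> pd.Series:
    """Sketch of the described label rule: drop NaN rows (no trade),
    then map strictly positive -> 1, zero or negative -> 0."""
    y = y.dropna()
    return (y > 0).astype(int)

y = pd.Series([0.012, -0.003, 0.0, np.nan, 0.045])
print(encode_target(y).tolist())  # [1, 0, 0, 1]
```
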


  • “Live” means you use your previously trained model to make new (live) predictions.
  • Upload a data file made of predictors (also called “features”). Note: if there is a target column, it will be ignored.
  • The data provided will be appended to the training dataset used for fitting the ML model, in order to apply the same data preprocessing. (If the input is a time series, there must be no gap between the last date in the train data and the first date of the live data.)
  • The prediction csv files (with probabilities and predictions) can be downloaded via download links.
Hyperparameters for Training

  1. target : column name of the target variable. This column should contain real numbers. For a Classification task, the program will convert them into 2 classes based on their signs, as described above, or will automatically detect the number of classes if the labels are already in class form (e.g. 1, 2, 3, etc.). For a Regression task, the values are kept as they are. IMPORTANT! In a metalabel application, if a trade wasn’t made for a certain sample, the target should be NaN or Null instead of 0.
  2. timeseries (yes/no) : whether the data is in timeseries format. Default: no.
    • If yes, the program will automatically search for a ‘date/Date/DATE’ column. If it cannot find one, it will default to “no”.
    • If no, the program will perform default indexing of the data (0, 1, 2, ...).
  3. target type/task (classification/regression) : whether to perform regression or classification.
    • Classification: can be used in 2 cases.

    • If the labels are continuous but the objective is to predict the sign of the label, the program will convert the labels into two classes, -1 and 1, and run classification on them.
    • If the labels are already in class form with multiple classes, e.g. 1, 2, 3, etc. (multiclass), the program will detect the number of classes and run classification on them.

    • Regression: if you want to predict the value of the label. The labels need to be continuous values, and the output will also be continuous.

  4. feature selection (shap/Cluster MDA/none): perform feature selection. This will display a ranking of all your features based on feature importance scores, and select those features which contribute most to the prediction of the target. Some features in your data can decrease the accuracy of the models and make your model learn from irrelevant/unhelpful features.
    • If shap, then perform SHAP feature selection. The SHAP Tree Explainer [1] feature selection algorithm is an explanation method for ensemble learning that computes optimal local explanations using concepts from game theory. More details can be found in the paper by Ernest P. Chan and Xin Man on the SHAP feature selection method.
    • If Cluster MDA, then clustered feature importance is used for feature selection: features that are similar are clustered together and receive the same importance rankings. This promises to be a great way to remove the substitution effect. In our paper Man and Chan 2021b, we applied a hierarchical clustering methodology and compared it with the MDA feature selection method.
    • If none, then perform no feature selection. Default: ‘shap’.
  5. hyperparameter optimization analysis (small/none): modify the level of hyperparameter optimization. Default: ‘small’. The parameter grid used for the randomized search is the following:
    • ‘num_leaves’: [10, 20, 30, 60, 80, 100, 120, 150, 200, 300, 400, 500, 600, 700, 1000]
    • ‘n_estimators’: [int(x) for x in np.linspace(start=50, stop=4250, num=200)]
    • ‘bagging_fraction’: [0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 0.9]
    • ‘feature_fraction’: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
    • ‘learning_rate’: [0.03, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    • ‘max_depth’: [int(x) for x in np.linspace(10, 510, num=24)]
    • ‘reg_alpha’: [0, 0.02, 0.2, 0.5, 0.6, 0.8]
    • ‘reg_lambda’: [0, 0.02, 0.4, 0.5, 0.6, 0.8]
    • ‘drop_rate’: [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
    • ‘max_drop’: [-1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 140]
    • ‘xgboost_dart_mode’: [True, False]

    If you don’t want hyperparameter optimization, the model will be trained using the following default parameters:

    • ‘num_leaves’: 100
    • ‘n_estimators’: 100
    • ‘feature_fraction’: 1
    • ‘bagging_fraction’: 1
    • ‘learning_rate’: 0.1
    • ‘reg_alpha’: 0
    • ‘reg_lambda’: 0
    • ‘xgboost_dart_mode’: False
  6. testsize: amount of data to be used for the test set. This parameter is only used when mode = ‘train’. The testsize value is used for splitting the feature matrix into train and test sets, along with the corresponding labels. The test set is assumed to follow the train set. If 0 <= testsize < 1, we treat it as a fraction of the total number of rows. If testsize > 1, we treat it as the exact number of test rows. If testsize == 1, only 1 row of data will be used as the test set. Note that testsize is ignored when mode = ‘live’, since all data provided is used for live predictions.
  7. boost (gbdt/dart): Perform ensemble learning boosting. Available options: gradient boosted decision trees (GBDT) or DART technique. Default: gbdt.
  8. Weights (yes/no) : whether sample weights are used in the model (LGBMClassifier). Default: no.
  9. prob_calib (yes/no): whether to perform probability calibration. Default: no.
  10. Exploratory data analysis (EDA) (yes/no): computes basic statistics about the input data, such as mean, standard deviation, number of Null values, etc. It will warn the user if there are columns with too many null values, listing the column names and how many nulls each column has. Currently, the threshold for raising this warning is 0.1, i.e. columns in which more than 10% of elements are NaNs. It will also warn the user if the dataset has too few rows; the current minimum number of rows before a warning is raised is 100. Default: no.
  11. suffix (string): file suffix used for renaming the output files.

(This section contains technical details, and can be skipped on first reading.)
Our current backend program achieves the following pipeline:

  1. Data preprocessing
    • Check whether the dataset file is .xlsx or .csv; otherwise, an error will be raised.
    • If mode = ‘train’, remove rows that have no label. If mode = ‘live’, any label will be ignored; the user does not need to provide a labels column in live mode.
    • ‘Null’, ‘inf’, ‘N/A’, ‘nan’ (in any combination of cases) will be interpreted as NaNs (null values).
    • Remove special characters such as comma (,), dollar ($), euro, yen, and yuan signs from cells.
    • After removing special characters found in non-numerical columns, the program will try to convert all elements of those columns to a numeric datatype. If this fails, all elements of that column will be interpreted as strings (and, implicitly, the column will be treated as categorical).

    The program looks for a ‘date/Date/DATE/Time/TIME/time’ column in the dataset. If it finds such a column, it automatically sets it as the index of the dataset, and the column will not be used as a feature. All (Python/Pandas) date formats are acceptable, including dates and times in the same column. For example, ‘2013/02/01 12:00 AM’ is an acceptable format.

  2. EDA analysis – saves the output of the pandas describe() method in ‘filename_describe().csv’ and the pandas profile report in ‘profile_report_filename.html’, where filename is the name of the input file.
  3. Perform one-hot encoding of features that are categorical. Note that categorical features must not contain pure numbers: all cells in the dataframe that contain numbers only will be treated as numerical.
  4. Perform fractional differentiation (if user indicates numerical features are time series data). If the numerical features are not under timeseries format, then the concept of stationary features is inapplicable.
  5. Perform feature selection (SHAP) w.r.t. Nancy’s rank averaging score. We use purged CV whenever timeseries = ‘yes’.
  6. Perform hyperparameter optimization using cross-validation, with accuracy or F1 score as the objective. If the ratio of (number of class-0 elements)/(number of class-1 elements) is in [0.6, 1.6], we use the accuracy objective for RandomizedSearchCV; the intuition is that in this case the classes are ‘balanced’ and accuracy is appropriate. Otherwise, i.e. when the classes are ‘imbalanced’, the F1 score (combining precision and recall) is necessary. If the user chooses ‘None’, no hyperparameter optimization is performed.
  7. Compute AUC/Accuracy/F1 score on both CV and test set and output indication of predictability (whether >0.5)
  8. Compute probabilities of class 1 on both CV sets (test folds) and test set.
  9. Predict labels on both CV sets (test folds) and test set, assuming probability threshold of 0.5.
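The accuracy-vs-F1 rule in step 6 amounts to a one-line check on the class counts; a sketch of the described rule (not the service’s actual code):

```python
def choose_objective(n_class0: int, n_class1: int) -> str:
    """Sketch of the scoring rule in step 6: use accuracy when the class
    ratio is roughly balanced (in [0.6, 1.6]), otherwise use F1."""
    ratio = n_class0 / n_class1
    return "accuracy" if 0.6 <= ratio <= 1.6 else "f1"

print(choose_objective(100, 90))  # balanced ratio 1.11 -> accuracy
print(choose_objective(100, 20))  # imbalanced ratio 5.0 -> f1
```
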

After training is done, website will display the performance metrics, for both CV and test sets, such as accuracy, F1 score, AUC score. Other output can be downloaded via links provided in the Results page.

Output files:
performance_metrics_<>.txt = same as the metrics displayed on the Results page
predicted_prob_cv_<>.csv = predicted probability of class 1 (i.e. positive target) for CV (test folds)
predicted_targets_cv_<>.csv = predicted classes (based on probability threshold = 0.5) for CV (test folds)
predicted_prob_test_<>.csv = predicted probability of class 1 for the test set
predicted_labels_test_<>.csv = predicted classes (based on probability threshold = 0.5) for the test set
summary_plot_all_test_folds_<>.png = plot of the most relevant features across all folds using SHAP feature selection
saved_model_<>.pkl = ML model saved into a pickle file using the joblib dump function
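The saved .pkl file is a standard joblib pickle, so in principle it can also be reused outside the website. A sketch of loading such a file and reproducing the 0.5-threshold predictions; the toy model, file names, and data here are illustrative, not the service’s actual model:

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# For illustration we train and save a toy model; on the website this file
# would be produced by Train mode (e.g. 'saved_model_example.pkl').
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)
joblib.dump(LogisticRegression().fit(X_train, y_train), "saved_model_demo.pkl")

# Live prediction: load the pickled model, compute class-1 probabilities,
# and apply the 0.5 threshold, mirroring the predicted_prob/predicted_targets
# outputs listed above.
model = joblib.load("saved_model_demo.pkl")
X_live = rng.normal(size=(5, 3))
prob_class1 = model.predict_proba(X_live)[:, 1]
predicted = (prob_class1 >= 0.5).astype(int)
pd.DataFrame({"prob": prob_class1, "pred": predicted}).to_csv(
    "predicted_targets_live_demo.csv", index=False)
```
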

  1. To download this instruction manual, click on ‘Get Datasets’ at the top of the page.
  2. To download example_input_train.csv, click on ‘Finance Train Data’. To download example_input_live.csv, click on ‘Finance live data’ at the top of the page.
    The example_input_train.csv file contains a feature matrix with several technical indicators, along with the 10-day (buy and hold) SPY future return as the target variable (also known as the label). Note that the window lengths for the technical indicators are arbitrarily chosen, since they are used for illustrative purposes only. We would like to train an ML model based on this feature matrix and target variable.
  3. Registration: click on Try for Free
  4. In the form, input the username, password, and retyped password, and click Register. You will be redirected to a Paypal window. You will need to sign the subscription agreement; otherwise, the sign-up is incomplete and you will not be able to log in!
    Figure 1: Registration form
  5. If you want to Login, click the Login button on the top right page. Input username and password and click on Sign In.

    Figure 2: Login form
  6. User dashboard (Train/Live mode)

    Figure 3: User Dashboard (choose between Train/Live mode)
    The user has a choice between Train mode and Live mode. Recall that Train mode involves splitting the data between train and test, building the model, and then outputting results for the CV and test sets, where the test set size is specified as a parameter. Live mode is used for live prediction only, using a pre-built machine learning model. Since we do not yet have a machine learning model, choose Train and click on Submit to open the range of options used to build such a model.
  7. Assuming you have downloaded example_input_train.csv from the ‘Download Train example data’ link at the top of the page, you need to specify the data file on which the model is trained.
  8. Please upload your dataset by clicking on the box where it says ‘Drop files here to upload’ or drag & drop it there. Please wait until the file has been uploaded. You will notice the green progress bar that will show your progress. From the folder where example_input_train.csv has been saved, select it, and click Open. See Figure 4a). After file has been successfully uploaded, click on Next.
    Figure 4a) Upload file request

    Figure 4b) File successfully uploaded
  9. After clicking Next, you need to specify the parameters that will be used in order to train your machine learning model. See Figure 4c.
    If you open this file in Excel, you will notice that the first column is called ‘Date’, whereas the last column is called ‘futreturn’. Set the name of the target variable to ‘futreturn’, denoting the 10-day future return of the (buy and hold) SPY index. In this case, the data is in time series format, since the rows are consecutive in time, so we click ‘yes’ at the question ‘Is your data file under timeseries format?’
    Figure 4c) Training parameters form
  10. We need to decide whether the target variable is under class format or continuous format. If it is under class format, one needs to select ‘Classification’ at the question ‘Is the target variable under class format (Classification) or is it continuous (Regression)?’. For this particular example, we choose Classification because we want to predict whether the returns will be positive or negative. If one wants to predict actual values, one needs to choose ‘Regression’. A suitable example for regression would be when one would like to predict house prices, actual returns, etc.
  11. We keep the level of machine learning optimization at ‘Small’ for this example, to make computations faster. Note that a higher level of optimization could lead to better performance metrics. If you don’t want hyperparameter optimization, just choose ‘None’ – the model will be trained using its default parameters as defined in Hyperparameters for Training, paragraph e).
  12. We choose 200 as the size of the test set. Recall that since this value is greater than 1, the last 200 rows are selected for the test set. If the test set size is < 1, we treat it as a fraction of the total number of rows.
  13. We select ensemble learning to be GBDT (the default value), and leave sample weights and prob calibration set to ‘no’. Set EDA analysis to ‘yes’.
    After clicking the Submit button, you are asked to review your inputs; if you are satisfied with them, please click on Run Model and wait for the ML calculation to take place. You will be informed about the status of the computations via a percentage progress bar, along with some extra information. It is advised to leave the web browser OPEN. If the computation completes successfully, you will receive a result link in the browser (also sent via email) which you can access. If at any point an error occurs, you will be informed about the nature of the error, and no CPU time will be charged to your account.

    Figure 5: Train mode Results
  14. In the large yellow box, we see the performance metrics for both the CV and test sets, along with their date ranges. Recall that if the data has a ‘Date/date/Time/time’ column, it will automatically be set as the date index. If not, default indexing 0, 1, 2, … is used.
  15. The following download links follow the Outputs from training section. Note that if EDA=yes, one can download basic statistics about the data inputted (using ‘Download your dataset describe().csv’), along with more complicated analysis, in a .html file (using ‘Download your dataset profile report’). For our example, when downloaded, those files will have names: ‘example_input_train_describe().csv’ and ‘profile_report_example_input_train.html’.
    WARNING: If you would like to later perform Live mode, don’t forget to click on ‘Download model file’, because this will be used later. In our example, this file is called ‘saved_model_example.pkl’. Also, please do not modify the name of the model file, because in Live mode, this name will be used to link it with the data used for training.
  16. Now we can go back to our dashboard by clicking on Home, either at the top of the page or via the link at the bottom of the page.
  17. Select Live mode and click on Submit.

    Figure 6: Live mode
  18. Assuming you have downloaded example_input_live.csv, and an available model, such as ‘saved_model_example.pkl’, please upload these files and click on Predict.

    Figure 7- Live mode Results
  19. Using ‘Download PREDICTED TARGETS LIVE .csv’, one can download the predicted targets for the feature matrix provided. Using ‘Download PREDICTED PROBABILITIES LIVE .csv’, one can download the predicted probabilities. We can go back to the (Home) dashboard by clicking on Home, either at the top of the page or via the link at the bottom of the page.
[1] Lundberg, S.M., Erion, G., Chen, H. et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2, 56–67 (2020).

  1. What type of file should I provide?
    You can provide an Excel file (.xlsx) or a .csv file. Please do not use whitespace in the file name.
  2. How should my dataset look?
    Your dataset should have rows and columns, but no charts. It is advised to use a dataset with at least 100 rows. You should also name your columns. Please avoid whitespace in column names.
  3. What is Train mode and Live mode?
    These are the 2 modes used for HAL machine learning. In Train mode, you train a model by splitting your dataset in train and test sets, construct a machine learning model using train set data, make predictions and performance metrics based on both train and test set data. In Live mode, you perform predictions on out-of-sample data. Note that you cannot use Live mode if you have not used Train mode previously.
  4. What is a target (a.k.a label)?
    A target is a column that contains real numbers, and the program will convert them into 2 classes based on their signs, as described above. If the target is positive, the program will convert this to class 1. If the target is 0 or negative, the program will convert this to class 0.
  5. How many target/label columns can I have?
    Note that only ONE label (column) should be present in the dataset.
  6. I’m using HAL for metalabelling. My strategy does not trade frequently. What should I do?
    In a metalabel application, if a trade wasn’t made for a certain sample, the target should be NaN or Null instead of 0.
  7. I get an error when I attach a .csv file. What should I do?
    Most likely, when you open a .csv file in Excel and save it, Excel pops up a warning that says “some characters may not be well converted”. We strongly encourage you to upload a .xlsx file in this case. The moral of the story is: if you are constructing your dataset via Excel, stick to .xlsx, not .csv.
  8. What are ‘Classification’ and ‘Regression’ parameters?
    If you want to predict actual numbers/values, you need to choose Regression. If you want to predict whether your target variable belongs to a certain class, such as 0 or 1, one needs to choose ‘Classification’.
  9. What size for test set should I provide?
    You may choose any value between 0.01 and 0.999 to represent the fraction of data used as the test set. We recommend using 0.3, i.e. 70% of the data will be used for training, whereas the remaining 30% will be used for testing. If you input an integer greater than 1, we treat it as the exact number of test rows. If you input 1 as the size of the test set, then only 1 row of data will be used as the test set.
  10. What exactly is the ‘suffix’?
    It is just a name for your model file used to keep track of machine learning logic internally.
  11. Can I have 2 dates column?
    No. The dataset must have only ONE date column. Make sure this is named as ‘Date/date/DATE/Time/TIME/time’.
  12. Is it necessary to always provide a date column?
    It is not mandatory. But note that any strings/non-integer values will be used for one-hot encoding. We recommend, at least for the prototype version, using a date column for simplicity.
  13. Can I have column names on different rows?
    No. Column names can only occur on the horizontal axis. Look at the example xlsx/csv files from the download links for comparison.
  14. How much is the waiting time to get results?
    The waiting time varies depending on the size of your dataset. For comparison, a dataset with approx. 6000 rows and 90 columns takes around 10 minutes. Smaller datasets will need much less time.
  15. Now I can view the Results page. What should I do?
    You can look at the performance metrics of your trained machine learning model. You can also download via download links information about predicted probabilities (of achieving a class 1), as well as predicted targets (either 1 or 0), for both train (via CV) and test data.
  16. There is a DOWNLOAD MODEL link at the bottom of the Results page. What should I do?
    It is mandatory to download this model if you intend to do live predictions. Please do not rename this file.
  17. I can’t seem to get the Live mode working. It outputs ‘Internal server error’. What should I do?
    Make sure you have the same column names in your dataset (check for typos as well!). Note that it is not necessary to include the target/label column for Live mode. Also, please do not modify the name of the model file that you will upload when being asked in the Live mode, because in Live mode, this name will be used to link it with the data used for training.
  18. I keep getting failure for cMDA option, but works for SHAP option. What’s wrong?
    Since cMDA is null-sensitive, try to minimize the null values in your dataset. Our program does take care of null values with cMDA, but if ‘Failure’ persists, try minimizing the number of null values.
  19. I obtained Results for Live mode. What are these?
    You can download predicted probabilities as well as predicted targets for your live data that you uploaded.