Running using Databricks Connect #582#583
Conversation
|
|
||
| public String getMsg2(double prediction, double score); | ||
|
|
||
| public void displayRecords(ZFrame<D, R, C> records, String preMessage, String postMessage); |
There was a problem hiding this comment.
What happens if:
we have two interfaces here. TrainingDataModel and LabelDataViewHelper. The data model has methods for reading and writing training pairs, getting scores etc. The view has messages.
TrainingDataModel should extend from ZinggBase and automatically gets pipeutil and other context stuff. ZinggBase already has the methods to get stats etc..and other methods can be moved there. You can use TDM in labeller and labelupdater just like we use the trainer and matcher in trainmatcher.
TDM and LabelDataViewHelper are returned from Client methods and used in python.
There was a problem hiding this comment.
1st draft available in commit 48b2134 please review
|
|
||
| public void updateLabellerStat(int selected_option, int increment); | ||
|
|
||
| public void printMarkedRecordsStat(); |
There was a problem hiding this comment.
kept update in model and print in view , commit 48b2134, please review
| options = ClientOptions([ClientOptions.PHASE,inpPhase]) | ||
|
|
||
| #Zingg execution for the given phase | ||
| zingg = Zingg(args, options) |
There was a problem hiding this comment.
the labeler should get automatically kicked off in execute based on the phase. User should not have to program anything here.
There was a problem hiding this comment.
zingg usage should be zingg.sh --run pyprog .
| @@ -0,0 +1,64 @@ | |||
| from zingg.client import * | |||
There was a problem hiding this comment.
Can we create on single file defining the data schema etc and use that for both databricks and local? Only the locations of the zinggDir etc will change in the Databricks specific file.
|
|
||
| # Running on Databricks | ||
|
|
||
| The cloud environment does not have the system console for the labeler to work. Zingg is run as a Spark Submit Job along with a python notebook-based labeler specially created to run within the Databricks cloud. |
There was a problem hiding this comment.
arent we giving a labeler for the user on the client machine?
There was a problem hiding this comment.
done in commit a3ddd46 with shell script changes
| } | ||
|
|
||
| public void processRecordsCli(ZFrame<D,R,C> lines) throws ZinggClientException { | ||
| public ZFrame<D,R,C> processRecordsCli(ZFrame<D,R,C> lines) throws ZinggClientException { |
There was a problem hiding this comment.
why do we need to return a zframe here?
There was a problem hiding this comment.
This is done so that writing of labelled output happens in a a separate method. This is needed for python api to work.
| processRecordsCli(unmarkedRecords); | ||
| ZFrame<D,R,C> updatedLabelledRecords = processRecordsCli(unmarkedRecords); | ||
| if (updatedLabelledRecords != null) { | ||
| getTrainingHelper().writeLabelledOutput(updatedLabelledRecords,args); |
There was a problem hiding this comment.
move null check to the method writeLabelledOutput
| } | ||
| } | ||
|
|
||
| public ZFrame<D,R,C> getUnmarkedRecords() { |
There was a problem hiding this comment.
arent hese methods are already defined in zinggbase/trainingdatahelper?
| printMarkedRecordsStat(); | ||
| getTrainingHelper().updateLabellerStat(selected_option, 1); | ||
| getTrainingHelper().printMarkedRecordsStat(); | ||
| if (selected_option == 9) { |
There was a problem hiding this comment.
make 9 as a constant in the view
| selected_option = displayRecordsAndGetUserInput(getDSUtil().select(currentPair, displayCols), msg1, msg2); | ||
| updateLabellerStat(selected_option, 1); | ||
| printMarkedRecordsStat(); | ||
| getTrainingHelper().updateLabellerStat(selected_option, 1); |
There was a problem hiding this comment.
not sure whats 1 here. please check.
There was a problem hiding this comment.
constant INCREMENT = 1 defined in commit ea7e8f4
| global _spark_ctxt | ||
| global _sqlContext | ||
| global _spark | ||
| jar_path = os.getenv('ZINGG_HOME')+'/zingg-0.3.5-SNAPSHOT.jar' |
There was a problem hiding this comment.
move name of jar to a global constant up in the code
| _sqlContext = SQLContext(_spark_ctxt) | ||
| return 1 | ||
|
|
||
| def initClient(): |
There was a problem hiding this comment.
we can edit the zingg script to have a new option --run-databricks so that user doesnt have to set the env . It is more explicit and gives user the ability to run locally or remote within the same env
| _spark_ctxt = SparkContext.getOrCreate() | ||
| _sqlContext = SQLContext(_spark_ctxt) | ||
| _spark = SparkSession.builder.getOrCreate() | ||
| return 1 |
There was a problem hiding this comment.
to signal that all is done without error, if calling code wants to check
0.3.5 sync
changes to be able to run using Databricks Connect i.e. still invoke the python script from user's machine but the actual would be run / submitted to data bricks