After identifying the best parameters using a pipeline and GridSearchCV, how do I pickle/joblib this process to re-use later? I see how to do this when it's a single classifier...

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(clf, 'filename.pkl')

But how do I save this overall pipeline with the best parameters after performing and completing a gridsearch?

I tried:

  • joblib.dump(grid, 'output.pkl') - But that dumped every gridsearch attempt (many files)
  • joblib.dump(pipeline, 'output.pkl') - But I don't think that contains the best parameters

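from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# df is the asker's DataFrame of keywords and their ad groups
X_train = df['Keyword']
y_train = df['Ad Group']

pipeline = Pipeline([
  ('tfidf', TfidfVectorizer()),
  ('sgd', SGDClassifier())
])

parameters = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
              'tfidf__max_features': [10, 50, 100, 250, 500, 1000, None],
              'tfidf__stop_words': ('english', None),
              'tfidf__smooth_idf': (True, False),
              'tfidf__norm': ('l1', 'l2', None)}

grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)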

# This was the best combination of tuning parameters discovered
##best_params = {'tfidf__max_features': None, 'tfidf__use_idf': False,
##               'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 2),
##               'tfidf__max_df': 1.0, 'tfidf__stop_words': 'english',
##               'tfidf__norm': 'l2'}

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(grid.best_estimator_, 'filename.pkl')

If you want to dump the object into a single file, pass the compress argument:

joblib.dump(grid.best_estimator_, 'filename.pkl', compress=1)
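
For anyone who wants to check the round trip end to end, here is a minimal sketch of saving and reloading the best pipeline. The toy keyword data, the reduced parameter grid, the `random_state` value, and the temporary file path are all assumptions made up for illustration; they stand in for the asker's `df` and the full grid.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for df['Keyword'] and df['Ad Group'].
X_train = ["red running shoes", "blue running shoes", "trail running shoes",
           "cheap flights paris", "cheap flights rome", "last minute flights"]
y_train = ["shoes", "shoes", "shoes", "flights", "flights", "flights"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(random_state=0)),
])
# Deliberately tiny grid so the sketch runs quickly.
parameters = {"tfidf__ngram_range": [(1, 1), (1, 2)]}

grid = GridSearchCV(pipeline, parameters, cv=2)
grid.fit(X_train, y_train)

# best_estimator_ is the whole fitted pipeline (vectorizer + classifier)
# configured with the winning parameters, not the search object itself.
model_path = os.path.join(tempfile.mkdtemp(), "best_pipeline.pkl")
joblib.dump(grid.best_estimator_, model_path, compress=1)

# Later, possibly in another process: reload and predict on raw text.
restored = joblib.load(model_path)
new_keywords = ["running shoes sale", "flights to rome"]
predictions = restored.predict(new_keywords)
```

Because the saved object is the pipeline rather than only the classifier, the reloaded model accepts raw strings and applies the same TF-IDF transformation before predicting.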
    • As a best practice, once the best model has been selected, it should be retrained on the entire dataset. To do that, should one refit the same pipeline object on the entire dataset (so the same data processing is applied) and then deploy that very object? Or should one create a new model?
    • @Odisseo - My opinion is that you should retrain a new model from scratch. You can still use a pipeline, but replace the grid search step with your final classifier (say, a random forest). Add that classifier to the pipeline, retrain on all the data, and save the resulting model. The end result is that your entire dataset is trained inside the full pipeline you want. This may lead to slightly different preprocessing, for instance, but it should be more robust. In practice, this means you call pipeline.fit() and save the pipeline.
    • @Odisseo I'm a little bit late, but... GridSearchCV automatically retrains the model on the entire dataset unless you explicitly ask it not to (refit=False). So once the GridSearchCV object has been fitted, the model it uses for predicting (in other words, best_estimator_) has already been retrained on the whole dataset passed to fit().
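
The refit behaviour discussed in the comments above can be checked with a small sketch. It compares the automatically refitted `best_estimator_` with a fresh copy of the pipeline that is configured with `best_params_` and fitted by hand on all the data. The toy data, the reduced grid, and the `random_state` value are assumptions made up for illustration.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; any labelled text corpus would do.
X = ["red running shoes", "blue running shoes", "trail running shoes",
     "cheap flights paris", "cheap flights rome", "last minute flights"]
y = ["shoes", "shoes", "shoes", "flights", "flights", "flights"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("sgd", SGDClassifier(random_state=0)),
])
parameters = {"tfidf__ngram_range": [(1, 1), (1, 2)]}

# refit=True is the default: after the search, the best configuration
# is fitted once more on all of X, y and stored as best_estimator_.
grid = GridSearchCV(pipeline, parameters, cv=2, refit=True)
grid.fit(X, y)

# Manual equivalent: unfitted copy + best parameters + fit on all data.
manual = clone(pipeline).set_params(**grid.best_params_)
manual.fit(X, y)

# Both vectorizers saw every document, so their vocabularies match.
auto_vocab = grid.best_estimator_.named_steps["tfidf"].vocabulary_
manual_vocab = manual.named_steps["tfidf"].vocabulary_

# With refit=False the search only records scores and best_params_;
# no best_estimator_ attribute is created.
grid_no_refit = GridSearchCV(pipeline, parameters, cv=2, refit=False)
grid_no_refit.fit(X, y)
has_best_estimator = hasattr(grid_no_refit, "best_estimator_")
```

If the final model should differ from what was searched (for example a different classifier), the manual route is the one to adapt: build the new pipeline, call `fit()` on all the data, and save that object.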