ODU Research Computing Forum

How can I run a Jupyter notebook that requires more than 24 hours to complete?

I have been developing a complex analysis notebook in Jupyter and testing it with small datasets. Now it’s time to do my big calculation, but it will take over 24 hours to complete! How do I get around the 24-hour maximum session time?

Open OnDemand Jupyter sessions are meant for interactive development and analysis; they are not meant for long-running computations. That is what batch computing, the batch scheduler, and HPC are for! Here are some ways to proceed:

  1. Convert the Jupyter notebook to a Python script and submit it via the SLURM job scheduler; a minimal batch script is sketched after this list. See https://wiki.hpc.odu.edu/Software/Python-beyond-Jupyter for a more complete guide.

  2. Use the jupyter nbconvert command to “run” the notebook non-interactively. Please see this web page: http://tritemio.github.io/smbits/2016/01/02/execute-notebooks/ . This requires a batch script that invokes the jupyter nbconvert command to execute your notebook with the big dataset(s); a sketch follows this list. Watch out for a few caveats:

    • This is a non-interactive computation: the code in the notebook must read the input(s), perform all the necessary calculations, and produce the output(s) without intervention. There is no opportunity to check intermediate values and decide what to do next along the way. (The same is true of the conversion to a Python script in suggestion 1.)

    • The final notebook is not available until the calculation is finished. This can be a problem when errors occur: the plain jupyter nbconvert command-line program will not produce any output if an error (an “exception” in Python) is raised. Please see the Error Handling section of the referenced web page for one solution: write a thin Python wrapper that uses nbconvert as a library to execute the notebook, with error handling that saves the partially executed notebook when an error occurs. A sketch of such a wrapper appears after this list.

    • As a follow-up to the last point, we recommend adding some print() calls and saving key intermediate results so that you can check the validity of the computation. For example, intermediate Pandas tables can be saved with the df.to_csv() method, and matplotlib figures can be saved as image files on disk; see the snippet after this list.

  3. If your analysis can be broken up into distinct stages, then that one long notebook can be split into several shorter notebooks, each corresponding to one stage. Save the output of one stage to disk, then launch the next notebook to process it, and so on; see the example after this list. This lets you keep using the OnDemand Jupyter interface despite the 24-hour time limit. An added advantage of this approach is that it forces you to check and validate intermediate results, which is a good thing.
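
For suggestion 1, first convert the notebook with jupyter nbconvert --to script my_analysis.ipynb, then submit a batch script along these lines. This is a minimal sketch: the job name, time limit, module line, and file names are placeholders to adapt to your environment and your cluster’s batch partition limits.

    #!/bin/bash
    #SBATCH --job-name=big-analysis
    #SBATCH --ntasks=1
    #SBATCH --time=72:00:00            # batch jobs may run past 24 hours; check your partition's limit
    #SBATCH --output=big-analysis.%j.out

    module load python                 # placeholder: load or activate your own Python environment
    python my_analysis.py

Submit it with sbatch, and the job runs unattended under the scheduler.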
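
For suggestion 2, the batch script instead runs the notebook itself through jupyter nbconvert. Again a sketch with placeholder names; adapt the time limit and module line to your setup.

    #!/bin/bash
    #SBATCH --job-name=nb-execute
    #SBATCH --ntasks=1
    #SBATCH --time=48:00:00
    #SBATCH --output=nb-execute.%j.out

    module load python                 # placeholder: load or activate your own Python environment

    # --execute runs every cell; --to notebook writes a new, fully executed
    # notebook. Setting the per-cell timeout to -1 disables it, so that
    # long-running cells are not killed by nbconvert itself.
    jupyter nbconvert --to notebook --execute \
        --ExecutePreprocessor.timeout=-1 \
        --output my_analysis-executed.ipynb \
        my_analysis.ipynb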
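
The error-handling wrapper described in the second caveat can be as small as the sketch below, which follows the pattern in the nbconvert documentation. The file names and working directory are assumptions; the key point is the finally: block, which writes the notebook out whether or not execution succeeded, so the cells that did run (and any traceback) are preserved for inspection.

    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor, CellExecutionError

    notebook_in  = "my_analysis.ipynb"             # placeholder input notebook
    notebook_out = "my_analysis-executed.ipynb"    # where results (or partial results) go

    with open(notebook_in) as f:
        nb = nbformat.read(f, as_version=4)

    # timeout=None disables the per-cell time limit for long computations
    ep = ExecutePreprocessor(timeout=None, kernel_name="python3")

    try:
        ep.preprocess(nb, {"metadata": {"path": "."}})   # "." = run in the current directory
    except CellExecutionError:
        print(f"Execution failed; partial results saved to {notebook_out}")
        raise
    finally:
        with open(notebook_out, "w", encoding="utf-8") as f:
            nbformat.write(nb, f)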
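
For the third caveat, checkpointing can be as simple as the snippet below. The DataFrame and column name here are stand-ins for your real data.

    import pandas as pd
    import matplotlib
    matplotlib.use("Agg")                    # non-interactive backend for batch jobs
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"value": [1.0, 2.5, 3.2]})   # stand-in for a real intermediate result

    print(f"stage 1 done: {len(df)} rows")   # appears in the SLURM job log
    df.to_csv("stage1_checkpoint.csv")       # intermediate table saved to disk

    fig, ax = plt.subplots()
    df["value"].plot(ax=ax)                  # quick diagnostic plot
    fig.savefig("stage1_diagnostic.png")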
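
For suggestion 3, the hand-off between stages is just a save at the end of one notebook and a load at the start of the next. A sketch, with hypothetical file and column names:

    # Last cell of the stage-1 notebook: persist this stage's result.
    import pandas as pd
    df = pd.DataFrame({"value": [1.0, 2.5, 3.2]})   # stand-in for the real stage-1 result
    df.to_csv("stage1_output.csv", index=False)

    # First cell of the stage-2 notebook: check stage1_output.csv looks
    # right, then pick up where stage 1 left off.
    df = pd.read_csv("stage1_output.csv")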

Look for Parallelism!

Regardless of which course you take, now is a good time to look for opportunities for parallelization. If your analysis consists of many inputs and/or independent computations, consider breaking it up into those independent pieces and launching them individually; a SLURM job array (sketched below) is one convenient way to do this. As always, please contact us if you need help with your specific use case!
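
For example, if each input file can be processed independently, a job array launches one task per input. A sketch, assuming a converted script (see suggestion 1) that takes an --input argument, and ten hypothetical input files named input_0.csv through input_9.csv:

    #!/bin/bash
    #SBATCH --job-name=analysis-array
    #SBATCH --ntasks=1
    #SBATCH --time=12:00:00            # each piece is now well under 24 hours
    #SBATCH --array=0-9                # ten independent array tasks
    #SBATCH --output=analysis.%A_%a.out

    module load python                 # placeholder: load or activate your own Python environment

    # Each array task receives a distinct SLURM_ARRAY_TASK_ID (0..9)
    # and processes the corresponding input file.
    python my_analysis.py --input input_${SLURM_ARRAY_TASK_ID}.csv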
