This page contains common questions from users of JupyterHub, broken down by their roles within organizations.
Yes! JupyterHub has been used at scale for large pools of users, as well as for complex and high-performance computing. For example, UC Berkeley uses JupyterHub for its Data Science Education Program courses (serving over 3,000 students). The Pangeo project uses JupyterHub to provide access to scalable cloud computing with Dask. JupyterHub is stable and customizable to the use cases of large organizations.
Here is a quick breakdown of these three tools:
The Jupyter Notebook is a document specification (the .ipynb file) that interweaves narrative text with code cells and their outputs. It is also a graphical interface that allows users to edit these documents. There are also several other graphical interfaces that allow users to edit the .ipynb format (nteract, JupyterLab, Google Colab, Kaggle, etc.).
JupyterLab is a flexible and extensible user interface for interactive computing. It has several extensions that are tailored for using Jupyter Notebooks, as well as extensions for other parts of the data science stack.
JupyterHub is an application that manages interactive computing sessions for multiple users. It also connects them with infrastructure those users wish to access. It can provide remote access to Jupyter Notebooks and JupyterLab for many people.
JupyterHub provides a shared platform for data science and collaboration. It allows users to utilize familiar data science workflows (such as the scientific python stack, the R tidyverse, and Jupyter Notebooks) on institutional infrastructure. It also allows administrators some control over access to resources, security, environments, and authentication.
Yes - the core JupyterHub application recently reached 1.0 status, and is considered stable and performant for most institutions. JupyterHub has also been deployed (along with other tools) to work on scalable infrastructure, large datasets, and high-performance computing.
JupyterHub is used at a variety of institutions in academia, industry, and government research labs. It is most commonly used by two kinds of groups:
Small teams (e.g., data science teams, research labs, or collaborative projects) to provide a shared resource for interactive computing, collaboration, and analytics.
Large teams (e.g., a department, a large class, or a large group of remote users) to provide access to organizational hardware, data, and analytics environments at scale.
Here is a sample of organizations that use JupyterHub:
Universities and colleges: UC Berkeley, UC San Diego, Cal Poly SLO, Harvard University, University of Chicago, University of Oslo, University of Sheffield, Université Paris Sud, University of Versailles
Research laboratories: NASA, NCAR, NOAA, the Large Synoptic Survey Telescope, Brookhaven National Lab, Minnesota Supercomputing Institute, ALCF, CERN, Lawrence Livermore National Laboratory
Online communities: Pangeo, Quantopian, mybinder.org, MathHub, Open Humans
Computing infrastructure providers: NERSC, San Diego Supercomputing Center, Compute Canada
Companies: Capital One, SANDVIK code, Globus
See the Gallery of JupyterHub deployments for a more complete list of JupyterHub deployments at institutions.
JupyterHub puts you in control of your data, infrastructure, and coding environment. In addition, it is vendor neutral, which reduces lock-in to a particular vendor or service. JupyterHub provides access to interactive computing environments in the cloud (similar to each of these services). Compared with the tools above, it is more flexible, more customizable, free, and gives administrators more control over their setup and hardware.
Because JupyterHub is an open-source, community-driven tool, it can be extended and modified to fit an institution’s needs. It plays nicely with the open source data science stack, and can serve a variety of computing environments, user interfaces, and computational hardware. It can also be deployed anywhere - on enterprise cloud infrastructure, on High-Performance-Computing machines, on local hardware, or even on a single laptop, which is not possible with most other tools for shared interactive computing.
That depends on what kind of hardware you’ve got. JupyterHub is flexible enough to be deployed on a variety of hardware, including in-room hardware, on-prem clusters, cloud infrastructure, etc.
The most common way to set up a JupyterHub is to use a JupyterHub distribution: a pre-configured, opinionated way to set up a JupyterHub on a particular kind of infrastructure. The two distributions that we currently suggest are:
Zero to JupyterHub for Kubernetes is a scalable JupyterHub deployment and guide that runs on Kubernetes. Better for larger or dynamic user groups (50-10,000) or more complex compute/data needs.
The Littlest JupyterHub is a lightweight JupyterHub that runs on a single machine (in the cloud or under your desk). Better for smaller user groups (4-80) or more lightweight computational resources.
Yes - most deployments of JupyterHub are run via cloud infrastructure and on a variety of cloud providers. Depending on the distribution of JupyterHub that you’d like to use, you can also connect your JupyterHub deployment with a number of other cloud-native services so that users have access to other resources from their interactive computing sessions.
For example, if you use the Zero to JupyterHub for Kubernetes distribution, you’ll be able to utilize container-based workflows of other technologies such as the dask-kubernetes project for distributed computing.
The Z2JH Helm Chart also has some functionality built in for auto-scaling your cluster up and down as more resources are needed - allowing you to utilize the benefits of a flexible cloud-based deployment.
The short answer: yes. JupyterHub as a standalone application has been battle-tested at an institutional level for several years, and makes a number of “default” security decisions that are reasonable for most users.
For security considerations in the base JupyterHub application, see the JupyterHub security page.
For security considerations when deploying JupyterHub on Kubernetes, see the JupyterHub on Kubernetes security page.
The longer answer: it depends on your deployment. Because JupyterHub is very flexible, it can be used in a variety of deployment setups. This often entails connecting your JupyterHub to other infrastructure (such as a Dask Gateway service). There are many security decisions to be made in these cases, and the security of your JupyterHub deployment will often depend on these decisions.
If you are worried about security, don’t hesitate to reach out to the JupyterHub community in the Jupyter Community Forum. This community of practice has many individuals with experience running secure JupyterHub deployments.
No - JupyterHub manages user sessions and can control computing infrastructure, but it does not provide these things itself. You are expected to run JupyterHub on your own infrastructure (local or in the cloud). Moreover, JupyterHub has no internal concept of “data”, but is designed to be able to communicate with data repositories (again, either locally or remotely) for use within interactive computing sessions.
JupyterHub offers a few options for managing your users. When setting up a JupyterHub, you can choose the kind of authentication you’d like to use. For example, you can have users sign up with an institutional email address, choose a username / password when they first log in, or offload authentication onto another service such as an organization’s OAuth provider.
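As a minimal sketch, the authenticator is selected in the jupyterhub_config.py file. The PAM option below is JupyterHub’s default; the GitHub example assumes the separate oauthenticator package is installed, and the usernames and callback URL are placeholders:

```python
# jupyterhub_config.py -- a minimal sketch of choosing an authenticator.
# The GitHub example assumes the oauthenticator package is installed;
# usernames and URLs below are placeholders.

# Option 1: local system (PAM) username/password (the default)
c.JupyterHub.authenticator_class = "jupyterhub.auth.PAMAuthenticator"

# Option 2: offload authentication to GitHub OAuth
# c.JupyterHub.authenticator_class = "oauthenticator.GitHubOAuthenticator"
# c.GitHubOAuthenticator.oauth_callback_url = (
#     "https://hub.example.org/hub/oauth_callback"
# )

# Restrict access to a known set of users, and name the admins
c.Authenticator.allowed_users = {"ada", "grace"}
c.Authenticator.admin_users = {"ada"}
```

Whichever authenticator you choose, the `allowed_users` and `admin_users` settings apply on top of it, so you can swap authentication services without changing how user access is managed.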
The users of a JupyterHub are stored locally, and can be modified manually by an administrator of the JupyterHub. Moreover, the active users on a JupyterHub can be found on the administrator’s page. This page gives you the ability to stop or restart kernels, inspect user filesystems, and even take over user sessions to assist them with debugging.
A key benefit of JupyterHub is the ability for an administrator to define the environment(s) that users have access to. There are many ways to do this, depending on what kind of infrastructure you’re using for your JupyterHub.
For example, The Littlest JupyterHub runs on a single VM. In this case, the administrator defines an environment by installing packages to a shared folder that exists on the path of all users. The JupyterHub for Kubernetes deployment uses Docker images to define environments. You can create your own list of Docker images that users can select from, and can also control things like the amount of RAM available to users, or the types of machines that their sessions will use in the cloud.
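For instance, on The Littlest JupyterHub an administrator can install packages into the shared user environment from a terminal; this sketch assumes TLJH’s standard install location under /opt/tljh, and the package names are only examples:

```shell
# Install packages into TLJH's shared user environment so that all
# users see them. Assumes the standard TLJH layout under /opt/tljh.
sudo -E /opt/tljh/user/bin/pip install numpy pandas

# Conda packages can be installed into the same shared environment:
sudo -E /opt/tljh/user/bin/conda install -c conda-forge gdal
```

Because every user’s PATH includes this shared environment, packages installed this way are immediately available in all user sessions.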
For interactive computing sessions, JupyterHub controls computational resources via a spawner. Spawners define how a new user session is created, and are customized for particular kinds of infrastructure. For example, the KubeSpawner knows how to control a Kubernetes deployment to create new pods when users log in.
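A sketch of how a spawner is selected, again via jupyterhub_config.py; the KubeSpawner lines assume the jupyterhub-kubespawner package and a Kubernetes cluster, and the image name is a placeholder:

```python
# jupyterhub_config.py -- choosing a spawner (a sketch; the KubeSpawner
# option requires the jupyterhub-kubespawner package and a running
# Kubernetes cluster, and the image name is a placeholder).

# Default: spawn single-user servers as local system processes
c.JupyterHub.spawner_class = "jupyterhub.spawner.LocalProcessSpawner"

# On Kubernetes: spawn each user session as its own pod
# c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"
# c.KubeSpawner.image = "jupyter/scipy-notebook:latest"
```

Swapping the spawner class is the main way a JupyterHub is adapted to new infrastructure: the rest of the hub’s configuration stays largely the same.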
For more sophisticated computational resources (like distributed computing), JupyterHub can connect with other infrastructure tools (like Dask or Spark). This allows users to control scalable or high-performance resources from within their JupyterHub sessions. The logic of how those resources are controlled is taken care of by the non-JupyterHub application.
Yes - JupyterHub can provide access to many kinds of computing infrastructure. Especially when combined with other open-source schedulers such as Dask, you can manage fairly complex computing infrastructure from the interactive sessions of a JupyterHub. For example see the Dask HPC page.
This is highly configurable by the administrator. If you wish for your users to have simple data analytics environments for prototyping and light data exploring, you can restrict their memory and CPU based on the resources that you have available. If you’d like your JupyterHub to serve as a gateway to high-performance compute or data resources, you may increase the resources available on user machines, or connect them with computing infrastructure elsewhere.
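The resource restrictions described above are exposed as spawner options; as a sketch (whether the limits are actually enforced depends on the spawner in use, e.g. KubeSpawner honors them while the default local process spawner does not):

```python
# jupyterhub_config.py -- per-user resource limits (a sketch; whether
# these are enforced depends on the spawner, e.g. KubeSpawner honors
# them, while the default LocalProcessSpawner does not).
c.Spawner.mem_limit = "2G"      # cap each user's memory
c.Spawner.cpu_limit = 2.0       # cap each user's CPU usage (cores)
c.Spawner.mem_guarantee = "1G"  # reserve memory, where supported
```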
JupyterHub provides some customization of the graphics displayed to users. The most common modification is to add custom branding to the JupyterHub login page, loading pages, and various elements that persist across all pages (such as headers).
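These customizations are configured through a few hub options; a sketch, where the file paths are placeholders for your own assets and templates:

```python
# jupyterhub_config.py -- simple branding hooks (a sketch; the file
# paths below are placeholders for your own assets).
c.JupyterHub.logo_file = "/srv/jupyterhub/static/my-logo.png"

# Override or extend the default Jinja2 page templates:
c.JupyterHub.template_paths = ["/srv/jupyterhub/templates"]

# Extra variables made available to those templates:
c.JupyterHub.template_vars = {"announcement": "Welcome!"}
```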
Depending on the complexity of your setup, you’ll have different experiences with “out of the box” distributions of JupyterHub. If all of the resources you need will fit on a single VM, then The Littlest JupyterHub should get you up-and-running within a half day or so. More complex setups, such as scalable Kubernetes clusters or access to high-performance computing and data, will require more time and expertise with the technologies your JupyterHub will use (e.g., dev-ops knowledge with cloud computing).
In general, the bottleneck for setup is not the base JupyterHub deployment, but connecting your JupyterHub with the various services and tools that you wish to provide to your users.
JupyterHub works well at both a small scale (e.g., a single VM or machine) as well as a high scale (e.g., a scalable Kubernetes cluster). It can be used for teams as small as 2, and for user bases as large as 10,000. The scalability of JupyterHub largely depends on the infrastructure on which it is deployed. JupyterHub has been designed to be lightweight and flexible, so you can tailor your JupyterHub deployment to your needs.
For JupyterHubs that are deployed in a containerized environment (e.g., Kubernetes), it is possible to configure the JupyterHub to be fairly resistant to failures in the system. For example, if JupyterHub fails, then user sessions will not be affected (though new users will not be able to log in). When a JupyterHub process is restarted, it should seamlessly connect with the user database and the system will return to normal. Again, the details of your JupyterHub deployment (e.g., whether it’s deployed on a scalable cluster) will affect the resiliency of the deployment.
Out of the box, JupyterHub supports a variety of popular data science interfaces for user sessions, such as JupyterLab, Jupyter Notebooks, and RStudio. Any interface that can be served via a web address can be served with a JupyterHub (with the right setup).
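The interface users land in is typically set through the spawner’s default URL; a sketch, where the RStudio line assumes jupyter-server-proxy and RStudio are installed in the user environment:

```python
# jupyterhub_config.py -- picking the default interface for user
# sessions (a sketch; the '/rstudio' option assumes jupyter-server-proxy
# and RStudio are installed in the user environment).
c.Spawner.default_url = "/lab"        # JupyterLab
# c.Spawner.default_url = "/tree"     # classic Notebook file view
# c.Spawner.default_url = "/rstudio"  # RStudio via jupyter-server-proxy
```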
JupyterHub provides a standardized environment and access to shared resources for your teams. This greatly reduces the cost associated with sharing analyses and content with other team members, and makes it easier to collaborate and build off of one another’s ideas. Combined with access to high-performance computing and data, JupyterHub provides a common resource to amplify your team’s ability to prototype their analyses, scale them to larger data, and then share their results with one another.
JupyterHub also provides a computational framework to share computational narratives between different levels of an organization. For example, data scientists can share Jupyter Notebooks rendered as voila dashboards with those who are not familiar with programming, or create publicly-available interactive analyses to allow others to interact with their work.
Yes, Jupyter is a polyglot project, and there are over 40 community-provided kernels for a variety of languages (the most common being Python, Julia, and R). You can also use a JupyterHub to provide access to other interfaces, such as RStudio, that provide their own access to a language kernel.