Introduction

If you are moving data to or from the cloud (which includes Box and Google Drive), the best way to do it from SCG is with rclone. rclone supports many kinds of “remotes” (that is, remote storage, which in this case means cloud storage).

To use rclone, run module load rclone in your Terminal (or job script).

NOTE: rclone used to be installed on SCG as a native program. We had to stop doing this, as cloud technologies were moving too fast for the normal OS update process. So, we switched to providing rclone as a software module.

The first time you want to use a remote, you must configure it. Do this by running rclone config, choosing to add a new remote, and then following the instructions. Each kind of remote will have a different setup procedure.
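After running rclone config, it can help to confirm that the new remote was actually saved. The following is a minimal sketch, assuming rclone has already been loaded with module load rclone. The helper name list_remotes_or_hint is hypothetical and not part of rclone; it relies only on the rclone listremotes sub-command, which prints one configured remote name per line.

```shell
# Print the names of all configured remotes, or a hint if none exist yet.
# (list_remotes_or_hint is an illustrative name, not an rclone command.)
list_remotes_or_hint() {
    remotes="$(rclone listremotes)"
    if [ -z "$remotes" ]; then
        echo "No remotes configured yet; run: rclone config"
    else
        echo "$remotes"
    fi
}
```

Calling list_remotes_or_hint after setup should show your new remote's name followed by a colon, which is the form used in later commands.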

Here are notes for certain cloud services commonly used with SCG:

Medicine Box

Medicine Box is the default Box instance used within the School of Medicine. It allows High Risk data and PHI. Because of this, rclone and SCG do not work with Medicine Box. If you need to move data from SCG to Medicine Box, the only option is to transfer the data to a macOS or Windows machine first, and then move it to Box.

Stanford Box

Stanford Box is the default Box instance used outside of the School of Medicine. It allows High Risk data, but no PHI. rclone and SCG will work fine with Stanford Box, but remember that SCG does not allow High Risk data.

Stanford University is in the process of reducing use of Stanford Box (not Medicine Box, just Stanford Box), due to impending price increases. This is not unique to Stanford. If you are still interested in migrating from Stanford Box to Medicine Box, first read more about Medicine Box and the migration request form. You can also read the University IT announcement, and stay tuned to the UIT web site for updates.

NOTE: Box has a file size limit of 15 GiB. For larger files, you should choose a different storage/transfer method.
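To find out ahead of time whether a directory contains files over the Box limit, something like the sketch below may help. It assumes the 15 GiB limit stated above (15 × 1024³ = 16106127360 bytes). The function name files_over_limit and its optional second argument are hypothetical, added only for illustration.

```shell
# List regular files under a directory that are larger than a size limit.
# The limit defaults to 15 GiB (16106127360 bytes), the Box limit noted above.
# (files_over_limit is an illustrative name, not an rclone command.)
files_over_limit() {
    dir="$1"
    limit_bytes="${2:-16106127360}"
    # "-size +Nc" matches files whose size is strictly greater than N bytes.
    find "$dir" -type f -size +"${limit_bytes}c"
}
```

For example, files_over_limit /path/to/results prints every file that is too large for Box; no output means everything fits. rclone's --max-size filter flag (for example, --max-size 15G) is one way to skip such files during a copy.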

WARNING: Box remotes must be used at least once every 60 days, or Box will expire the credentials used by the remote. When that happens, you will get the error “Invalid refresh token” the next time you use the Box remote. To recover, delete and re-create the Box remote. If you give the re-created remote the same name, your scripts should not be affected.
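One way to notice an expired remote early is to run a cheap command against it and look for that error message. The sketch below is only an illustration: it assumes the remote is named “box” unless another name is given, and that a simple directory listing counts as use of the remote. The function name box_keepalive is hypothetical.

```shell
# List the top level of a Box remote and report whether its credentials work.
# Returns 0 if usable, 1 if the refresh token has expired, 2 on any other error.
# (box_keepalive is an illustrative name, not an rclone command.)
box_keepalive() {
    remote="${1:-box}"
    if output="$(rclone lsd "${remote}:" 2>&1)"; then
        echo "ok: ${remote} is usable"
    elif printf '%s\n' "$output" | grep -qi "invalid refresh token"; then
        echo "expired: delete and re-create the ${remote} remote with rclone config"
        return 1
    else
        # Some other failure; pass rclone's own message through unchanged.
        printf '%s\n' "$output" >&2
        return 2
    fi
}
```

Running box_keepalive occasionally (well within the 60-day window) may be enough to keep the credentials from expiring.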

The rclone documentation page for Box is https://rclone.org/box/. When adding Box as a new rclone remote, here are some notes:

In order for rclone to connect to Box, you will need to visit the Box web site, and log in through Stanford Login. This happens once, when you are first configuring rclone to work with Box. You have two options:

  1. You can use a Login Desktop session through SCG OnDemand. This way, when rclone needs you to visit the Box web site, it will automatically launch Firefox.

  2. You can download rclone to your own computer. When the time comes for you to visit the Box web site, rclone on SCG will tell you to run a special rclone command on your own computer. rclone on your computer will then open your web browser. At the end, rclone on your computer will ask you to copy/paste a code into rclone on SCG.

Once you have decided which method to use, here is how to configure Box for rclone on SCG:

Open a Terminal window, and run the command rclone config. Choose to make a new remote, and select box as the service name. You will then be asked to answer a series of configuration questions.

For Box, the following configuration items are not required, and should be left blank:

  • Client ID
  • Client Secret
  • config.json file
  • Primary Access Token

You should tell rclone to act as a user (instead of a service account or “enterprise”).

What to do next depends on whether you are running rclone config through an SCG OnDemand Login Desktop.

  • If you are running rclone config through an SCG OnDemand Login Desktop, when you are asked to use auto config, choose “Yes”. A web browser will launch; you should proceed to log in to Box with your Stanford email address. You will be sent through Stanford Login. The Box web site also might ask if you are sure you want to allow rclone access to your Box account. Once you have logged in and granted authorization, you will see a message telling you to go back to rclone.

    Once the web browser says you can return to rclone, close the web browser and go back to the terminal window. rclone will be asking you if you want to save the new configuration; you should say yes. At this point, the Box remote is ready; you may now exit rclone config.

  • If you are running rclone config through a normal SSH connection, and you have rclone downloaded to your computer, when you are asked to use auto config, choose “No”. You will be told to run a command on your computer. Open a Terminal on your computer, and run the command. Remember, in this terminal window, you will not be connected to SCG. rclone will be running on your computer, opening a web browser on your computer.

    rclone on your computer will launch a web browser (again, on your computer), and will take you through the process of logging in to Box. At the end, you will be given a code. Copy the code, go back to your SCG connection (where rclone config is waiting), and paste the code.

Configuration should now be complete.

Regular rclone commands (like ls, copy, and sync) work fine on Box.
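As an illustration, the sketch below wraps two common transfers. It assumes the remote is named “box”; the function names and the example paths are hypothetical. --progress and --dry-run are standard rclone flags. Keep in mind that sync makes the destination match the source, which includes deleting destination files that are not in the source, so previewing first is a sensible habit.

```shell
# Copy a local directory into a folder on the Box remote.
# Files that already exist unchanged on Box are skipped.
upload_to_box() {
    src="$1"
    dest_folder="$2"
    rclone copy --progress "$src" "box:${dest_folder}"
}

# Show what a sync to Box would change, without transferring
# or deleting anything.
preview_sync_to_box() {
    src="$1"
    dest_folder="$2"
    rclone sync --dry-run "$src" "box:${dest_folder}"
}
```

For example, upload_to_box /path/to/run1 results/run1 would copy the contents of /path/to/run1 into the results/run1 folder on Box.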

To see the list of folders shared with you, use the lsd sub-command, like so:

rclone lsd box:

(The above command assumes you named your Box remote “box”.)

Google Drive

Google Drive has two components:

  • Each Stanford user has their own, personal Google Drive space.

  • Google Shared Drives (formerly known as “Team Drive”) provides a common space for access to files, both for people within a Google Group, and for other users, including Google Service Accounts.

NOTE: Google Drive has a number of limitations:

  • Shared Drives may contain a maximum of 750,000 inodes. Each file and directory consumes one “inode” (just like with Oak storage).

  • Each user and Service Account may only upload 750 GiB of data per day. This is a hard limit, and cannot be increased.

  • There are limits on how many operations a user or Service Account may perform within Google Drive. These limits are not clearly documented. Accessing Google Drive too frequently may result in errors from the Google Drive service. rclone will recognize the situation and attempt to slow down, but may end up returning errors to the user. When that happens, waiting one day and re-trying the operation normally resolves the problem. Using your own OAuth client_id might also help, as that allows you to request API quota increases.
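Before starting a large upload, the first two limits can be checked with a little arithmetic. The sketch below assumes the figures given above (one inode per file or directory, and 750 GiB per day); the function names are hypothetical.

```shell
# Count the files and directories under a path. Each one would use
# one "inode" in a Shared Drive. The top-level directory is not counted.
count_inodes() {
    find "$1" -mindepth 1 | wc -l | tr -d ' '
}

# Number of whole days needed to upload the given number of GiB
# at 750 GiB per day, rounded up.
upload_days() {
    total_gib="$1"
    echo $(( (total_gib + 749) / 750 ))
}
```

For example, upload_days 2000 prints 3, since 2000 GiB does not fit in two days at 750 GiB per day. For uploads that span several days, rclone's --max-transfer flag (for example, --max-transfer 750G) is one way to stop a run before it reaches the daily limit.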

If you would like to use rclone to access your personal Google Drive and one or more Shared Drives, you will need to configure a separate remote for each.

The rclone documentation page for Google Drive is https://rclone.org/drive/. Before setting up a Google Drive remote with rclone, please review this entire section.

Before you begin configuring the remote, you should decide if you want to use your own OAuth client_id, and if you want to use a Service Account. Google Drive authenticates both the user performing an action, and the software being used to perform the action.

  • The OAuth client_id is used to authenticate the client (in this case, rclone) to Google Drive. rclone ships with an OAuth client_id that is shared by all users of rclone across the world. If you plan on using rclone regularly with Google Drive, you should consider getting your own OAuth client_id. Doing so means the actions of others will not affect your API quota, and gives you access to request API quota increases from Google.

  • A Service Account decouples the Google Drive access from your Google account. When you authenticate to Google as yourself, actions performed using rclone are performed as you. If your Google account becomes disabled, or you upload too much data in a day (either through rclone or elsewhere), the rclone remote would stop working. If you place your rclone configuration in a shared location, then others could use the rclone configuration to perform actions in Google Drive as you. This can be avoided by creating a Google Service account, which acts as a separate user for access to Google Drive. However, a Service Account may only be used with Shared Drives.

To use a custom OAuth client_id or Service Account, you will need a GCP (Google Cloud) project. This means that a PTA is required, but Google Drive is not a charged service, so you should not see any charges from your GCP project (unless, of course, you start using it for something else).

If you are a user of SCG, you can set up a Google Cloud project through SCG. Otherwise, you can set up a Google Cloud project through University IT. Afterwards, you can use the following pages to set up an OAuth client_id and a Service Account:

You should now be prepared to create the Google Drive remote(s) in rclone. Remember that each Drive (personal or shared) needs to be created as a separate remote.

When configuring a Google Drive remote, you will first be asked for a Client ID and Client Secret. If you chose to create an OAuth client_id, enter the ID and secret here.

For scope of access, you should either choose “drive” (for read-write access) or “drive.readonly” (for read-only access). Read-only access may be useful for workflows that only download data from Drive.

The root folder ID should be left blank.

If you chose to use a Service Account, you will get a JSON file containing Service Account credentials. This is the time to provide the path to the JSON file. If you are not using a Service Account, the file path should be left blank.

You should choose to not edit advanced config.

If you chose to not use a Service Account, you will now be asked to auto config. This is where you authenticate to Google, and give rclone access to either your Drive or a Shared Drive. In general, you should choose “No”. You will be given a (long) URL to enter into a web browser, which will log you in to Google and give you a (shorter) code to give to rclone.

You may now be asked to configure the remote as a team drive. If you used a Service Account, you should choose “Yes”. If you authenticated as yourself, you can choose “Yes” to connect to a Shared Drive, or “No” to connect to your personal Google Drive. When you choose “Yes”, you will see a list of Team Drives you can access. If you are using a Service Account and cannot see the Shared Drive you want, make sure the Service Account’s email address has access to the Shared Drive.
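To find the Service Account's email address, you can read it from the JSON credentials file. Service Account key files downloaded from Google Cloud include a client_email field. The sketch below extracts that field with standard tools; the function name service_account_email is hypothetical, and the approach assumes the field sits on a single line, as it does in the files Google generates.

```shell
# Print the Service Account's email address from its JSON credentials file.
# (service_account_email is an illustrative name, not an rclone command.)
service_account_email() {
    # Match the "client_email": "..." pair, then keep the quoted value.
    grep -o '"client_email"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" | cut -d '"' -f 4
}
```

The address this prints is the one that must be added as a member of the Shared Drive.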

At this point, Google Drive remote configuration is complete! To see the list of folders in your Drive, use the lsd sub-command, like so:

rclone lsd my-drive:

(The above command assumes you named your Drive remote “my-drive”.)

Other services

To see the full list of cloud services rclone supports, see the rclone documentation.

When using rclone, if it asks you to use auto-config, you should normally say “No”. SCG is a remote/headless machine, and choosing auto-config will normally launch a web browser. If you have X11 configured or are using a virtual desktop you can say “Yes”, but most of the time you should say “No”, and then follow the instructions.
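If you are not sure whether your session can open a web browser, checking the DISPLAY environment variable is a rough heuristic: it is normally set when X11 forwarding or a virtual desktop is active, and unset in a plain SSH session. The sketch below is only that heuristic, not something rclone itself provides; the function name suggest_auto_config is hypothetical.

```shell
# Suggest an answer to rclone's auto-config question, based on whether
# a graphical display appears to be available in this session.
# (suggest_auto_config is an illustrative name, not an rclone command.)
suggest_auto_config() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "Yes"
    else
        echo "No"
    fi
}
```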

Here are quick links to remote-specific instructions that will be useful to SCG users:

Once your remote has been configured, you should check the rclone docs to see what commands are available. In the Subcommands section, we suggest reading up on the lsd, ls, copy, and sync subcommands.