Git for Digital Humanities

Version control is not only for keeping track of research data and files; it is also a potential safeguard against loss of information and a vital component in making research reproducible and, if done well, understandable.

Here are some tips I have developed for myself over the years:

  1. Every DH research project should employ the distributed version control system Git for code and, if possible, data.
    • Concerns about the size of the dataset and its content (e.g., whether it contains protected, private, sensitive, or proprietary information) must dictate how the dataset is tracked.
    • Use Git LFS for tracking large files (audio, video, large datasets, etc.).
  2. A .gitignore file should be used judiciously to prevent sensitive, protected, or proprietary information from being tracked by Git and pushed to remote repositories. A useful collection of ready-made .gitignore files can be found at https://github.com/github/gitignore.
  3. Every Git repository should have a README.md file that provides basic information about the project.
  4. Every Git repository should have a license clearly indicating whether the content can be shared, reused, repurposed, etc. The site https://choosealicense.com/ is a good resource for determining which license is appropriate.
  5. Code repositories should include files (e.g., requirements.txt, environment.yml) that enable other researchers to use the same software versions that were used in the original project.
  6. Commit messages should be thought of as a research journal, written in clear and consistent language. The Conventional Commits standard is worth following. Not to toot my own horn, but I wrote an article on this subject. Don't follow this example: https://xkcd.com/1296/ 😉.
  7. Milestones in the project should have annotated tags, and those tags should be pushed to the remote repository.
  8. Major milestones should be made citable with a service such as Zenodo.
  9. Each project should have a remote repository on a platform such as GitHub, GitLab, or, ideally, an institutional repository.
  10. Remote repositories should be set to private if they contain any information that should not be publicly available.
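
Several of the tips above (2 through 6) can be checked mechanically. What follows is only a minimal Python sketch under stated assumptions, not a definitive tool: the function names `missing_items` and `is_conventional` are hypothetical, the accepted file names are my own guesses at common choices, and the list of commit types is a widespread convention rather than something the Conventional Commits specification requires (it only defines `feat` and `fix`). The header pattern is also a simplification of the full specification.

```python
import re
from pathlib import Path

# Files suggested by tips 2-5. The exact names accepted here are assumptions;
# adjust them to the conventions of your own project.
REQUIRED_FILES = {
    "README": ("README.md",),
    "license": ("LICENSE", "LICENSE.md", "LICENSE.txt"),
    "ignore rules": (".gitignore",),
    "environment": ("requirements.txt", "environment.yml"),
}

# Simplified header shape: type(optional-scope)!: description
# Only "feat" and "fix" come from the specification itself; the other
# types are a commonly used convention and are an assumption here.
COMMIT_HEADER = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([^()\s]+\))?!?: \S.*$"
)


def missing_items(repo_dir):
    """Return the labels of checklist items not found in repo_dir."""
    root = Path(repo_dir)
    return [
        label
        for label, names in REQUIRED_FILES.items()
        if not any((root / name).is_file() for name in names)
    ]


def is_conventional(message):
    """Check whether the first line of a commit message fits the pattern."""
    lines = message.splitlines()
    return bool(lines) and COMMIT_HEADER.match(lines[0]) is not None


if __name__ == "__main__":
    # Audit the current directory and try two sample commit messages.
    print("Missing:", missing_items("."))
    print(is_conventional("docs: describe corpus sources"))
    print(is_conventional("fixed stuff"))
```

A check like this only looks at the top level of the repository and says nothing about whether the README or license is actually adequate, so it complements the tips rather than replacing judgment.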

Photo by Gabriel Heinzer on Unsplash