
Git for Digital Humanities
Posted: July 30, 2025 in Git
Version control is not only for keeping track of research data and files; it is also a potential safeguard against loss of information and a vital component in making research reproducible and, if done well, understandable.
Here are some tips I have developed for myself over the years:
- Every DH research project should employ the distributed version control system Git for code and, if possible, data.
- Concerns about the size of the dataset and its content (e.g., whether it contains protected, private, sensitive, or proprietary information) must dictate how the dataset is tracked.
- Use Git LFS for tracking large files (audio, video, large datasets, etc.).
- A `.gitignore` file should be used judiciously to prevent sensitive, protected, or proprietary knowledge from being tracked by Git and pushed to remote repositories. A useful collection of ready-made `.gitignore` files can be found at https://github.com/github/gitignore.
- Every Git repository should have a `README.md` file that provides basic information about the project.
- Every Git repository should have a license clearly indicating whether the content can be shared, reused, repurposed, etc. The site https://choosealicense.com/ is a good resource for determining which license is appropriate.
- Code repositories should include files (e.g., `requirements.txt`, `environment.yml`) that enable other researchers to use the same software versions that were used in the original project.
- Commit messages should be thought of as a research journal, containing consistent and clear language. The Conventional Commits standard is worth following. Not to toot my own horn, but I wrote an article on this subject. Don't follow this example: https://xkcd.com/1296/ 😉.
- Milestones in the project should have annotated tags, and those tags should be pushed to the remote repository.
- Major milestones should be made citable with a service such as Zenodo.
- Each project should have a remote repository on a platform such as GitHub, GitLab, or, ideally, an institutional repository.
- Remote repositories should be set to private if they contain any information that should not be publicly available.
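
To make the commit-message tip a little more concrete, here is a minimal Python sketch that checks whether the first line of a commit message follows the basic Conventional Commits shape (`type(scope): description`). Treat it as an illustration under stated assumptions rather than a reference implementation: the function name `check_subject` and the list of allowed types are my own choices. Only `feat` and `fix` are named by the specification itself; the other types, including the project-specific `data`, are hypothetical conventions a team might agree on.

```python
import re

# Assumption: this list is a project convention, not part of the standard.
# "feat" and "fix" come from the Conventional Commits specification; the
# remaining types (including the hypothetical "data") are illustrative.
ALLOWED_TYPES = ("feat", "fix", "docs", "data", "refactor", "test", "chore")

# First line of a commit: type, optional "(scope)", optional "!" for a
# breaking change, then a colon, a single space, and a description.
SUBJECT_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def check_subject(subject: str) -> bool:
    """Return True if the subject line matches the pattern and uses an allowed type."""
    match = SUBJECT_PATTERN.match(subject)
    return bool(match) and match.group("type") in ALLOWED_TYPES


if __name__ == "__main__":
    examples = [
        "feat(corpus): add 1850s newspaper transcriptions",
        "fix: correct date parsing in TEI headers",
        "more changes",
    ]
    for subject in examples:
        status = "ok" if check_subject(subject) else "needs work"
        print(f"{status}: {subject}")
```

A check like this could be run by hand before committing or wired into a commit hook; either way, the point is simply that a consistent format makes the project history easier to read as a research journal.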