Solving the lonely data scientist problem

By Monica Gerber in Code review

December 17, 2024

There is a well-documented problem among statisticians, data scientists, and data analysts: whether they are the sole data scientist in an organization or embedded on a team of statisticians, they tend to do their work alone. Even if your team is using git and GitHub, that doesn’t mean people are reading each other’s code.

The longer I worked as a statistician on research teams, the more convinced I became that code review should be a key component of the job and would solve the problem of working in isolation. How else would I grow if I wasn’t looking at others’ work and having someone review mine? If you are a writer or editor, one of the key ways you get better at your craft is by reading widely and sharing drafts. So I started advocating for a code review system in my group. However, each time I tried to implement the practice, it never really stuck.

I couldn’t understand why code review in the lab was such a difficult practice to establish. Even though it’s not a typical practice in the research environments where I worked, the benefits felt apparent. However, it was hard to justify incorporating code review into our work patterns when others viewed it as an additional hurdle to getting things done. I realized later that my request failed because I was actually asking the group to adopt a different, team-oriented mindset for how they approached their work and to develop different patterns of interacting with each other.

The mindset shift

Code review is hard to implement because we tend to view it as a technical change for teams (learning git, setting up an organizational GitHub, etc.) rather than a shift in how we interact with each other. Mentoring research teams through Openscapes helped me understand this failure of implementation more clearly. Openscapes has been pivotal in fostering a conversation about how teams must adopt new patterns of interacting with each other to solve pressing scientific issues like climate change. Critically, as Julia Lowndes, Openscapes co-founder, points out, teams must implement new social infrastructure alongside technical infrastructure to reshape how we do science.

The Openscapes mindset teaches us different ways of working in teams to increase the speed at which we solve big problems. This is a much loftier goal than my own personal goal of decreasing isolation in the workplace and improving team morale! But Openscapes showed me that part of my frustration with working in isolation was connected to the larger picture of moving from old ways of working to an Open Science culture.

Below, I dig into the friction I’ve observed when data science teams¹ try to adopt code review², and how a socio-technical approach can move a team from a group of individuals struggling to move forward together to a group that practices team science.

Friction #1: None of our training has taught us how to do this

First, it’s important to acknowledge that most of our educational training did not cover code review, why it might be important, or how to do it. In my experience, there is more of a mindset that code is either ‘right’ or ‘wrong’, but not that writing code is a practice to develop or get better at. Additionally, many programs do not teach the tools needed, such as version control, that would help enable code review (historically, at least – this is changing).

To move forward, team leaders must foster a learning culture. This means not only setting aside the time to learn and take courses, but also practicing and modeling how to be a learner. Prioritize learning over knowing, and acknowledge that people at all levels of experience have something to learn. Consider scheduling a regular collaborative learning meeting where people have the space to practice, get feedback, and learn together (see Openscapes’ Seaside Chats).

Friction #2: We’re afraid of being wrong and embarrassed to make errors

There is a subset of statistical programmers working on clinical trials whose job is to “double program”: two statisticians independently write scripts and then compare the data outputs of those scripts for equality. This practice is an example of how, in the biomedical realm, we focus on getting the output just right while the work that goes into creating it stays hidden, sometimes intentionally. The programmer becomes the only safeguard against errors. Because we’re used to checking only the output, and we haven’t developed the practice of others reading our code and giving feedback, it can feel vulnerable to suddenly have someone peeking in.
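
To make the contrast concrete, here is a minimal Python sketch of an output-only comparison, assuming two hypothetical scripts have each written a summary table to CSV (the file and column names are made up). Only the results are compared; how each script produced them stays invisible.

```python
# A minimal sketch of the "double programming" check described above:
# two statisticians independently derive the same summary table, and
# only the outputs are compared. File and column names are hypothetical.
import pandas as pd
from pandas.testing import assert_frame_equal

# Outputs produced independently by programmer A and programmer B
table_a = pd.read_csv("summary_programmer_a.csv")
table_b = pd.read_csv("summary_programmer_b.csv")

# The comparison checks the results for equality, but says nothing about
# how either script arrived at them -- the code itself stays hidden.
assert_frame_equal(
    table_a.sort_values("subject_id").reset_index(drop=True),
    table_b.sort_values("subject_id").reset_index(drop=True),
    check_like=True,  # ignore column order
)
print("Outputs match.")
```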

The reality is that when we start looking at more code, a lot of it will be less than ideal, because many of us are self-taught or have developed our own practices in isolation. In addition to learning version control and code collaboration tools, there is a real need to learn better programming practices.

While we upskill by learning how to modularize our code, write functions, create unit tests, and eliminate hard-coded variables, it’s crucial to develop psychological safety on teams and an environment where people feel comfortable discussing mistakes (a “blameless post-mortem” culture). In Opinionated Analysis Development, Hilary Parker discusses shifting our paradigm from viewing an analysis error as an individual’s fault to viewing it as a symptom of a failed system. This is in contrast to the “double programming” model, where any errors that arise are attributed to a mistake made by the programmer. Managers should also expect that some team members will have anxiety related to code review and may need more support (see the Code Review Anxiety Workbook).
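
As a small illustration of that upskilling, here is a hedged sketch: a repeated, hard-coded calculation is pulled into a named function and covered by a pytest-style unit test. The function, data, and threshold are invented for the example, not taken from any particular project.

```python
# Pull a repeated, hard-coded calculation into a named function and
# cover it with a unit test that a reviewer (or CI) can run.
import pandas as pd


def flag_outliers(values: pd.Series, z_threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask marking values more than `z_threshold`
    standard deviations from the mean."""
    z_scores = (values - values.mean()) / values.std()
    return z_scores.abs() > z_threshold


# A pytest-style unit test documenting the expected behavior.
def test_flag_outliers_marks_extreme_value():
    values = pd.Series([1.0, 1.1, 0.9, 1.0, 25.0])
    assert flag_outliers(values, z_threshold=1.5).tolist() == [
        False, False, False, False, True
    ]
```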

Friction #3: Our collaborators value the output, not the code development process

In multi-disciplinary scientific teams, we collaborate with other domain experts who may only be interested in seeing the output of code, for example, a figure or a table. The code used to produce outputs is often undervalued or invisible. This is especially true in fields where data is sensitive or restricted (e.g. human subjects research) and open-source tools or practices are harder to implement. If collaborators never see the code, and the code and/or data is not published along with the findings, statisticians are less likely to put the time into improving the code base for the research they are conducting.

This creates persistent, burdensome technical debt on data science teams: speed of delivery is prioritized over making things easier for our future selves or co-workers. Even if we can’t publish our code and data in a public repository, and not all of our collaborators will value a beautiful code base, investing in the code base pays off in time saved and improved team morale.

In addition to improving our programming skills, there are many tools we can adopt to improve our code bases and reduce technical debt: style guides and linters, templates and research compendiums, and common packages and workflows. However, the team must work together to agree on the shared practices and norms that these tools are meant to support. Teams must care for their code base, develop an opinionated approach to development, and set aside time to work on shared infrastructure, as in the sketch below.
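
One lightweight way a team might encode those shared norms is a check script that everyone agrees to run before requesting review. The sketch below is hypothetical, and the tool choices (flake8 for the style guide, pytest for the tests) are examples rather than requirements.

```python
# A hypothetical shared "check" script a team might agree to run before
# requesting review, so style and test norms are applied uniformly.
import subprocess
import sys

CHECKS = [
    ["flake8", "src", "analysis"],   # shared style guide, enforced by a linter
    ["pytest", "tests"],             # shared unit tests
]


def main() -> int:
    for command in CHECKS:
        print("Running:", " ".join(command))
        result = subprocess.run(command)
        if result.returncode != 0:
            return result.returncode
    return 0


if __name__ == "__main__":
    sys.exit(main())
```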

Friction #4: There’s not enough time to read code, and when we do, it’s hard!

When teams first set up a code review process, it often feels overwhelming. It makes accumulated technical debt more visible, and it puts pressure on people’s time not only to learn but also to read code. So while teams are getting started with version control, using package managers, and setting up reproducible environments so that others can run their code, a shift in how the team spends its time may be necessary.

Kudrjavets and Rastogi’s research on code velocity among software engineers produced suggestions for improving the speed of code review. Their recommendations include prioritizing follow-up changes (what is critical vs. suggested), being mindful of the size of the code review, and setting aside time for code review. They found that developers spend almost 20% of the work week reviewing code. I suspect that if this time is not recognized and accounted for on statistical teams, reviewing code is likely to be devalued.

The overwhelm that reviewing code can bring about is also related to the cognitive load needed to read unfamiliar code. Felienne Hermans’ book, The Programmer’s Brain, is an excellent resource for statisticians who would like to get better at reading code and for onboarding new team members to a new code base. I like this book because it recognizes that reading code and learning new programming languages is a skill that takes practice and that we can develop.

| Technical Infrastructure | Social Infrastructure |
| --- | --- |
| Version control (git) | Investment in learning and trust |
| GitHub and Git Flow | Team psychological safety; learning how to give and receive feedback; time set aside for code review |
| Better programming practices (modularizing code, learning to write functions, using a linter, etc.) | Shared guidelines (e.g. a style guide); blameless post-mortem culture |
| Research compendium model, shared templates and workflows | Developing a practice of contributing to a shared code base |
| Reproducibility tools (package management, reproducible computing environments, etc.) | Cultural shift from valuing only the outputs of code to valuing the code base as something that needs to be reproduced and maintained over time |

Table 1. Technical and social practices that enable code review

In conclusion

Changing the way we work can feel overwhelming due to the uncertainty it creates and the pressures it can place on our time. However, it should also feel exciting and invigorating. That’s why I love Julia Lowndes and Erin Robinson’s analogy of embarking on a journey and creating a trail system:

The Open science trail systems, as in real life, don’t just happen overnight. They become worn in through use or they are built with intentional investment.

Arriving at the trailhead brings a rush of energy, but it can also bring fear of the unknown or doubts about bandwidth; those feelings can be ameliorated by traveling “safely together, intentionally.” As teams set off on their journey, I hope they pack their social infrastructure (their first-aid training and campfire songs) along with their map and compass.


  1. I use “data science teams” and “statistical teams” interchangeably here. ↩︎

  2. By “code review” I am referring broadly to activities involving reading and reviewing code, from reading and reproducing an analysis notebook, to reviewing a pull request on a shared code repository or software package. ↩︎