Lessons Learned in Data Science

I have recently moved to Seattle to join the newly founded DIRAC Institute, which means that my three years at NYU’s Center for Data Science (CDS) are coming to an end. I thought I’d take this opportunity for a little (actually quite long, sorry) retrospective and summarize those three years in this post.
Coming to CDS was an experiment. Indeed, the CDS itself is an experiment, and its scientific mission is currently funded by a joint grant provided by the Moore and Sloan foundations, along with two other institutes at UC Berkeley and UW Seattle. I came to CDS with the goal of learning as much as I could about how other fields deal with data. I did that, too, occasionally, but the learning experiences I value most have nothing at all to do with that. They’re in no particular order of importance, and credit for some of this post goes to David Hogg for some illuminating discussions along the way.
1) What’s a data scientist, anyway? After three years at the Center for Data Science, I’m still no closer to understanding what a “data scientist” actually is. But that’s okay, I’m not sure anyone really does (yet?). I sort of call myself that occasionally, in an awkward attempt to try and explain what I do and why I think it’s useful (despite the occasional heckling from other scientists about how it’s all hype and not really science, anyway).
2) The person with whom you have the least scientific overlap is the person you can learn most from. It sounds obvious when spelled out, doesn’t it? Coming from an astronomy background, I didn’t get to talk to people from the social sciences very often before I came to CDS. But then I got to talk to Brittany Fiore-Gartland at Astro Hack Week 2014, and was really excited by her approach to studying communities, which was entirely new to me. And during my three years, Laura Norén was undoubtedly the person who has had the greatest impact on how I see academia, my professional life in it, and how to critically look at the communities I belong to (also, if you haven’t yet, you should definitely sign up to her excellent data science newsletter). As an astronomer, I focus a lot of my time on the technical aspects of my job. At CDS, I learned that the social aspects are just as important, from thinking about ethics for data science to how to build an inclusive community.
3) Building Communities. Astro Hack Week is in its fifth year, and I’ve been more or less involved in all five of them. Because of the workshop’s experimental nature, we worry a lot about how to make it useful and welcoming, and how to evaluate it. I’d never thought about conferences and communities in that much detail before, and from organizing these workshops (and listening to people who study these things for a living) I learned that (1) communities don’t just happen, they need to be actively built and (2) building a community requires a lot of thoughtful and critical evaluation of the status quo. Especially in data science, which is new and kind of experimental itself, having a shared sense of cooperation is important. My best experiences in CDS were with the people deeply invested in making it a success (which, to be fair, is pretty much all of them!), with those who were ready to invest the time required to learn how to talk to people from other fields. That doesn’t work without having a recognizable community that people are invested in and care about making a success. So, for building a successful data science collaboration, build a community first. The rest will follow almost automatically.
4) Ask all the (right) questions. Coming out of a PhD, I thought I knew how to ask questions. I was wrong. I knew how to ask questions of astronomers, but these weren’t the questions I wanted to ask at CDS. In order to learn from the people with knowledge most dissimilar to what I was used to, I had to learn how to talk to them first. That’s much harder than I had expected, because there’s little shared language between fields like computer science, astronomy and neuroscience. And I never realized how much time I spend in academia trying not to seem stupid. In order to really learn things I wanted to know about, I had to overcome that reflex and just ask the simple, basic stuff (note: basic to people in other fields).
To be fair, in return I occasionally answered questions that would seem fairly basic to an astronomer. Overcoming that reflex of “that’s probably a stupid question” is still something I have to actively work on three years later. On the plus side, I learned a lot about how to communicate about astronomy with non-astronomers (hint: nobody knows what a “light curve” is!), which has been valuable occasionally for translating between fields.
5) Impostor Syndrome sucks. Seriously. Because I split my time at CDS between the data scientists and the astronomers, I got to feel inadequate in two directions: with the data scientists, because I will never know as much about machine learning as someone with a PhD in it, and with the astronomers, because I now no longer know as much about astronomy as they do. Realistically, I know that there is value in being at the interface, but that doesn’t always stop the gut reaction of “oh, crap, I suck at everything!”
6) How to take scientific risks (and live with them). I’m a pretty risk-averse person by nature. I took the job at CDS because it was what I really, really wanted to do, and also by patently ignoring the potential failure on the academic job market for doing something that’s “not really science” (senior astronomy professor, private communication, 2017). I also promised myself that I would write at least one paper at CDS that isn’t on astronomy, because why not jump straight into the deep end, right? In the end, I wrote two, but neither was what I expected: I co-authored one paper about participant selection for conferences, and one about hack weeks. I wrote some astronomy papers, too, but most of them are still quite methods-focused. Does it pay off? Has it doomed my career? Not quite yet; I have a job for the next few years, so perhaps ask me again when I’m braving the job market once more. The one thing I can say is that writing those papers was immensely fun, and I had the privilege of working on them with some fantastic researchers I wouldn’t have had the opportunity to meet without the existence of the Moore-Sloan Data Science Environment. As a bonus, I learned a whole lot about academic writing and about how other fields do things. However, we still don’t know whether traditional journals will actually publish either of the two papers, so stay tuned (and please cross your fingers for favourable reviewers).
7) The most important outcomes are very hard to measure. I probably don’t even know all the big and small ways in which these three years impacted me both professionally and personally, but many of those I do know about aren’t easily measured by publications, citation networks or lines of code written. For example, the confidence to start an open-source project and do my science all out in the open, because there were lots of great role models at CDS. Or the confidence to just go and search the statistics literature or read a machine learning paper, which I wouldn’t have had three years ago. Or the potential future (planned) collaboration with someone in computational chemistry that we haven’t gotten around to yet. Or discussing American politics with someone who studies it for a living. Or having lunch with someone from another field only to learn an entirely new perspective on things I thought I knew. Or learning from a researcher in neuroscience how they do time series analysis. Or having Brian McFee tell me that 90% of all computational problems seemingly boil down to using graphs (I really need to learn more about those!). Or discussing with David Hogg and Phil Marshall whether tutorials at Astro Hack Week should be mandatory or not. Or having the freedom to work on a data journalism project with Meredith Broussard instead of my usual astronomy projects for a few days. Or being able to go into the neighboring office when it was occupied by Andreas Müller and ask “Help! Andy, how does cross validation work again?”. Or the many, many data science lunch talks I went to and made notes like “need to chase this up” or “this could be useful for astronomy.”
Many of those might actually never lead to collaborations or outputs in easily measurable ways during my time at NYU itself, but they helped to shape my understanding of science and the world in so many ways that they will impact my research for many years to come.
So, this isn’t everything, far from it, but this is already longer than anyone wants to read. As perhaps has become clear from this post, and maybe unsurprisingly, it’s the people at CDS who have made the most impact, and I can only hope that I’ve given back a fraction of what I’ve learned at CDS. I am looking forward with excitement to my new job at the University of Washington, where I hope to apply everything I’ve experienced at CDS, and I’ll do my best to keep learning.
