S02 E05

Democratizing Data Tools at Upsolver

Santona Tuli

Santona Tuli is Head of Data at Upsolver, where she is shaping strategy to make data tools more user-friendly. She loves turning complex data problems into simple solutions! She even led a team of physicists at CERN studying nuclear physics, spotting rare particles. Santona’s all about making data science accessible and easy for everyone.

Episode Summary

The role of data at Upsolver

Data isn’t just a tool at Upsolver—it’s the product. Tuli and her team use Upsolver’s own platform to tackle everything from high-scale ingestion to querying smaller data sets, making data accessible for informed decisions while keeping complexity low.

Company culture and Tuli’s top tips to lead a data team

At Upsolver, it’s all about agility and empowerment. With a “try it and see” attitude, everyone is encouraged to experiment and pivot. Tuli advises data leaders to embrace small data challenges, balance intuition with insights, and always focus on delivering real value without over-engineering.

The data footprint at Upsolver

With a lean data stack, Upsolver runs primarily on its own platform, utilizing AWS S3 for storage, Athena for low-level querying, and Snowflake for analytics. The focus is on simplicity and purpose, with tailored dashboards to support impactful insights.

The biggest data wins at Upsolver

One major achievement is the development of the ‘Chill Data Summit,’ a learning hub for Iceberg, where data professionals and newcomers alike dive into this open-source lakehouse format. Additionally, Upsolver’s strong in-house analytics keeps the team focused on core product and customer-driven needs.

What’s next for Tuli and the team?

With Iceberg adoption and innovative data solutions, Tuli’s team will continue refining how data empowers Upsolver’s roadmap. Focused on both customer feedback and the evolving data landscape, they’re poised to expand Upsolver’s value while keeping it agile and user-centered.

Transcript

Tarush Aggarwal (00:00)

Welcome to another episode of People of Data, where we get to highlight the wonderful people of data. These are data leaders from various industries, geographies, and disciplines; we discover how companies are using data, their achievements, and the challenges they face. Today on the show, we have Santona Tuli, the head of data at Upsolver, where she is shaping strategy to make data tools more user-friendly. She loves turning complex data problems into simple solutions.

She even led a team of physicists at CERN studying nuclear physics, spotting rare particles. Santona is all about making data science accessible and easy for everyone. Welcome to the show, Santona.

Santona Tuli (00:37)

Thank you so much for having me. This is going to be a blast.

Tarush Aggarwal (00:40)

Are you ready? Let's do it. So first question, what does Upsolver do?

Santona Tuli (00:41)

Yep.

So Upsolver is a managed Iceberg offering. We have been working with the lakehouse for many years, eight-plus years. We just had it as a proprietary lakehouse, and now we've expanded to Iceberg, which is open source. And to be fair, we were supporting Hive-based lakehouses before; we just had sort of our own optimizations and management, a sort of organization style that was different. Now we've taken our many years of experience with high-scale, high-volume data management and paired that up with the natural strengths of Iceberg as an open table format.

And sort of as the best of both worlds, that's our offering. And then part of the offering is high-scale ingestion. So we're very good at ingesting data from high-volume sources, like, you know, message queues, Kafka queues, et cetera, and also change data capture, which is replicating production databases. So we ingest from those systems, very easy to set up, and then your data is in an optimized Iceberg lakehouse that you can query from basically wherever you want. Anything that supports Iceberg, you can query from.
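As a toy illustration of the change data capture idea mentioned here (not Upsolver's actual implementation, and with a purely hypothetical event shape), CDC replication boils down to replaying a stream of insert, update, and delete events from the source database against a replica keyed by primary key:

```python
# Toy sketch of change data capture (CDC) replication: replay a stream
# of insert/update/delete events onto an in-memory "table" keyed by
# primary key. The event dict shape here is hypothetical.

def apply_cdc_events(table: dict, events: list) -> dict:
    """Replay CDC events onto a replica table keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            table[key] = event["row"]   # upsert the new row image
        elif op == "delete":
            table.pop(key, None)        # drop the row if present
    return table

# Example stream: two inserts, an update to row 1, then a delete of row 2.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "active"}},
    {"op": "delete", "key": 2},
]
replica = apply_cdc_events({}, events)
print(replica)  # {1: {'id': 1, 'status': 'active'}}
```

In a real pipeline the replayed rows would land as commits to an Iceberg table rather than a Python dict, but the replay logic is the same shape.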

Tarush Aggarwal (01:55)

Yeah, that's awesome. What's super interesting is you are a company where data is your product, where you are building and selling a data platform. But what role does data play in helping Upsolver internally?

Santona Tuli (02:10)

Yeah, I mean, it would be silly not to use this platform that we've built for our internal data work. So of course we do. I mean, we try to incorporate it in sort of every aspect of the company where it makes sense. I will caveat that with: Upsolver is a smallish startup, right? And I mean, I specialize in small startups. Let's say I like to work at small startups. And so this challenge of how do we

make small data meaningful and try to gain insights from the data and influence decisions, but not over-index on the small data because it's not so reliable. This is a challenge that I've been working on since before Upsolver as well. So I was at Astronomer before, sort of the same thing, small-ish, a little bit larger than Upsolver, but a small-ish startup. And then before that as well, I was at another startup, although that

was different because the product was actually the data; it was an ML product. And so the corpus of data was actually much bigger as far as what we were providing to end users, but the internal analytics was more around our customers, because it was B2B. You know, we might have Airbnb as a customer and they have a massive, you know,

document set that we were working with, but what if we wanted to make predictions about Airbnb as a customer, right? Then it's like one of a few hundred or whatever. So it's this interesting problem of we want to use data to its maximum potential. We want to use data to inform our product decisions, et cetera. But at the end of the day,

our data set is not that big. So to me, it's a fun challenge. It's a frustrating challenge sometimes because, as you mentioned, I come from more of a big data background where we're working with massive, massive data sets from particle collisions at CERN. So statistical significance is still a huge deal, of course. And I don't want to mislead people by saying it's easy to find the statistical significance because, of course, we're looking for very rare signals.

But, you know, we start with a massive data set, and the fun part there is to maximize that signal-to-noise ratio from this massive data set with a rare signal and get to that significance. Whereas with the smaller data sets in startup analytics work, it's more around: I start with a small data set, I don't want to throw anything away if I don't have to,

and I want to get something meaningful from this. And then I want to decide how meaningful it is. It's like coming back to the point of confidence intervals and uncertainty bands. How do I not only say, okay, this is my data point, I'm X percent positive that this is going to happen,

but within this band of uncertainty? It's almost election time here in the US, and it's like, well, you've got these polls and you know that some fraction of them are gonna be proven wrong and so on and so forth. So just because a poll says it's 98% confident, or there's a 98% chance of candidate A winning, doesn't mean that they're winning with 98% of the vote or 98% of the time, right? There's so much more to it. Anyway, sorry, I'm talking at length here, but yes. So I think for me, working at Upsolver and at other small startups, the interesting data challenge is how do we say meaningful things with the smaller data? And also, I think the second part of that is how not to over-engineer. So at larger companies,

Tarush Aggarwal (05:35)

Yeah.

Santona Tuli (05:42)

or larger teams in general, sometimes you have to build out really complex engineering workflows to work with the data. And I can, again, go back to my particle physics example or experience where the code bases are massive. It's year over year over year maintenance of this very specialized pipeline that pulls in the data and does these processing steps.

And we need that. We need the thousands of physicists and engineers working on that. But if you're trying to make decisions at a startup, you should move fast. And again, the data is not that massive and you have to be able to like really switch out components, change things. So how do we keep it simple? How do we not build a solution that's like really rigid and really big and complex? How do we sort of move fast?
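The polling aside earlier comes down to putting an uncertainty band around a small-sample estimate. A minimal sketch, using the Wilson score interval as one reasonable choice (an assumption on my part; the conversation doesn't name a method):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial proportion.
    Unlike the naive interval, it behaves sensibly for small samples."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Same observed rate, very different certainty at different sample sizes:
print(wilson_interval(8, 10))     # roughly (0.49, 0.94): wide, inconclusive
print(wilson_interval(800, 1000)) # roughly (0.77, 0.82): much tighter
```

This is the "four points on a graph" problem in miniature: an 80% rate from ten observations is consistent with almost anything, while the same rate from a thousand observations actually supports a decision.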

Tarush Aggarwal (06:32)

Talking about moving fast and being a startup, what's the overall culture like at Upsolver? You mentioned the company's been around for eight years and it's had many different lives with earlier working on Hive and sort of proprietary formats and now moving towards what's on everyone's mind with Iceberg. What do you feel is the internal culture?

Santona Tuli (06:56)

The internal culture is very autonomy-focused. I think it's one of the few times, I would say it's more rare than common, that I've felt really empowered to do my job, and to go beyond my job: to come up with ideas, then see if they're good, try them, and also fail.

But I do feel like our culture is solidly grounded on: just try it, do it, it's fine. And your ideas will be shot down, and sometimes it's frustrating when you've spent two weeks building something out and then it's too much, but that's work anywhere, right?

So that's, I think, what stands out the most. I think everyone's empowered to do what they need to do. For example, when we decided to go all in on Iceberg: right before that, we were sort of really focused on ingestion into Snowflake, because again, we were very good at ingesting high-volume data.

And, I mean, of course we still have that feature, but we realized that Iceberg was a bigger deal. So it was sort of like, let's go do this. Let's really commit and hone in on our Iceberg offerings. And then the whole team was like, yeah, let's rally around this and work on the optimization. We're also, I guess, agile, is one way of putting it. So yeah.

Tarush Aggarwal (08:13)

And this might be a somewhat obvious question, but what does the data stack look like at Upsolver?

Santona Tuli (08:20)

Yeah, it is largely Upsolver. So our production database is Mongo, and we pull in data using change data capture into our Upsolver platform, Upsolver Lake. That's backed by AWS S3; we've got the data in S3 as well as state in S3. That's sort of how we built out our platform, so that it's really easy to recover state. It's all hanging out there and then you can query.

So we use a few different query engines depending on the team and the needs. For most of our data, the catalog is a Glue catalog, and we're using Athena to query it. But we also use Snowflake, for the more analytics-focused tables. So more modeled data, data that's in what we call marts, or metrics tables, is accessible in Snowflake and we can query from there. For the lower-level stuff, we usually go into Athena.

We don't have a ton of dashboards built out. There's Grafana and stuff for sort of tracking the product features, but we're not building out a ton of reports and dashboards. I mean, who would look at them? There wouldn't be anyone to look at them. We've got the subject matter experts that you can just talk to, ping, and you know. So our marketing team pretty heavily uses the marketing CRM tools for answering a lot of questions. And then we can augment that, of course, with any insights that we have from product data, which is our bread and butter. So of course we do a good job with the product data. We don't use a ton of SaaS offerings. Of course, we have a CRM tool and we have a ticketing tool, Zendesk. But other than that, we're not pulling in from hundreds of different sources, or a bunch of ETL, reverse ETL. We keep it simple, and we really focus on our product data to figure out what we need to build, what we need to do better, and what our users are saying as well.

Tarush Aggarwal (10:29)

Yeah, do you have any particular BI layer for your analytics stack, or is it mainly queries?

Santona Tuli (10:36)

Not really.

It's mainly query-focused. We do have some pre-built dashboards in QuickSight, which isn't great, but it's sufficient. And I've worked with various data visualization tools before. And for me, the...

Tarush Aggarwal (10:51)

Yeah.

Santona Tuli (11:03)

effort or the value isn't there from switching to, you know, a Sigma or something like that at this time. As I grow the team, as we grow the team, we might feel the need to do that. But basically, between the engineers that are monitoring our production data in Grafana, and the data or analytics folks who are jumping in, running Athena queries and looking at QuickSight dashboards, that's sufficient for us today. So again, we have a fairly rich metrics compendium, but we're not necessarily creating dashboards or reporting on those, because we just don't have the audience for that. If the CEO has a question and wants to know about something, I expect him to come talk to us data or analytics folks, and we can figure that out together rather than having to look at a dashboard every morning.

Tarush Aggarwal (12:05)

Makes a ton of sense. How big is the entire company?

Santona Tuli (12:08)

So we're about, I think, around 40 people right now, 40 something.

Tarush Aggarwal (12:13)

And are there any other folks on the data team?

Santona Tuli (12:17)

So I have one report.

Tarush Aggarwal (12:18)

Nice. So 2 out of 40. Actually, fun fact: typically, engineering teams are about 20% of a company at the later stages. Obviously, I know at Upsolver it's going to be a lot more. And then data teams are about 20% of the size of engineering teams. So that would basically indicate 4% of 40, which is about 1.6, so you're just on the money.
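That rule of thumb works out as a one-line calculation; the 20% figures are the heuristics quoted in the conversation, not measured data:

```python
def expected_data_team_size(company_size: int,
                            eng_fraction: float = 0.20,
                            data_fraction_of_eng: float = 0.20) -> float:
    """Rule of thumb from the conversation: engineering is ~20% of the
    company, and data is ~20% of engineering, so data is ~4% of headcount."""
    return company_size * eng_fraction * data_fraction_of_eng

print(expected_data_team_size(40))  # 1.6
```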

Santona Tuli (12:45)

All right, awesome. That's really good to hear.

Now, I have been on teams before where we had an outsized data team. I think it's a common trope to fall into when you're first establishing a data team: this is gonna be the thing, it's gonna save us, it's gonna completely turn our startup around.

And we're all in on the data story. But you come in, and if you have 20 people for a 200-person team or something, what are all these people going to do? It's not even that there isn't enough work. You might be able to create the work, but then the deliverables: are they really valuable? Are they really driving the sort of impact you think they're driving? It's really easy to over-engineer. I've seen it before, I've been part of teams that have done that before. And it's something that is now a cautionary tale for me.

Tarush Aggarwal (13:47)

Yeah, especially for a lot of companies which are just at product-market fit, or just around that point. Is that really the best use of limited resources, to use that word?

Santona Tuli (13:53)

Great.

Right. And product-market fit is so tricky too. It's not a one-time thing, right? It's not like you hit product-market fit and then, you know, you're... yeah, it's not binary. And then you sort of have to do the product-market fit proof, you know, a few times to really get there. So you can't just be like, yeah, we have product-market fit and now it's time to onboard, you know, the 10% of my workforce that's going to be data. No, that's not how that works.

Tarush Aggarwal (14:06)

Yeah, it's not binary.

What's one achievement which has had an impact on the company and is something you've been really proud of?

Santona Tuli (14:31)

Yeah, so it's actually interesting at Upsolver. I think it's been more my outward-facing work. I've always been sort of product-inclined, and I'm just an outgoing person. So I have traditionally inserted myself, I mean, not without invitation, but sort of inserted myself into customer calls to really understand what their use case was. And this is even before Upsolver, to make sure that

we're sort of hitting it on the nail with the product that we're building. Having worked as a machine learning engineer before where I was using various vendor tools, it's not always easy to get the feedback through to the vendors and then get any changes that they make in a sufficiently short time scale that it actually matters anymore. Things move so fast. So I think...

when I switched from being product-focused, like an engineer or scientist building on a product, to doing data for analytics and also for building a data tool. So I sort of think of that as two phases for me. And honestly, I think of it as two types of jobs that you can have: you're either building data stuff for a product, or you're building data stuff because you have a tool that other folks are going to use. And of course, I mean, this is not all jobs; this is sort of the data science landscape. Anyways, when I made that switch, I really wanted, as a data person or as a data team,

at a data tooling company, it's super important to have that sort of feedback loop: as a subject matter expert, give the feedback to the product team. But also, I haven't encountered all possible use cases. I've never worked on a CV product, like a computer vision product, right? So I have seen my job as being the person that is going to bring in that knowledge for product to leverage, even if I haven't done that personally. So just leveraging my friends, my network. I like to say, having a beer with one of my friends that works at XYZ company, and really understanding what his or her struggles are, and then sort of seeing what

we can do as a company to address those use cases. So it's interesting: it's almost like not data work, but I don't know who else could do this work other than, you know, someone on the data team. So that has been something very impactful that I'm proud of, and again, extending beyond Upsolver, at other places as well. And then, sort of as a corollary to that perhaps,

with the Iceberg move at Upsolver, with the Iceberg focus, we decided to start hosting our own sort of mini-conference kind of thing, which is the Chill Data Summit. You know, I don't know if you like the name, but I came up with it, so either way, I'm to blame. The Chill Data Summit is sort of a learning day for Apache Iceberg. And, you know, I've been pretty involved in that,

along with the marketing team and the events coordination folks: what are we going to talk about? What's the agenda going to be? Who can we pull in, of all the experts in Iceberg? And how do we make it as impactful and as informative as possible for data scientists, data engineers, and ML folks to spend a day learning Iceberg? And I think that has been really impactful for Upsolver. I think folks at Upsolver appreciate that I'm doing that work and helping put together those sort of learning days.

Tarush Aggarwal (18:24)

That's awesome. What's one thing which has been challenging? Something which you're either working through right now or something which you have already resolved.

Santona Tuli (18:33)

So I think for me, it's going back to the challenge that I mentioned in the beginning: working with small data and getting meaningful insights, but not overly relying on signals that may or may not mean something, has been a challenge.

When we go and decide on, let's say, a go-to-market strategy, right? Again, it's a small team and we're all wearing multiple hats and being agile. I'm usually in that room, trying to bring the voice of data

into what our go-to-market should be for now. Not large long-term strategies per se; I'm talking more like, what is the motion right now? What do we talk about? So, bringing in: this is what is on the psyche of the data industry right now.

And this is where we fit into that story. Here's the data to back that up. And here are the parts that are intuition, heuristics, and gut feeling. And how do we make this decision that we're going to focus on this right now, but not think that it's going to solve all our problems or have an outsized impact on what we're trying to do? So yeah, I think it's...

hard sometimes to be that voice of reason, and at the same time be like, hey, let's look at the data, let's not ignore the data. You're nodding; I think perhaps you've dealt with this trade-off before. It's like: let's look at the data, always, definitely. Also, let's understand that this is four points on a graph and there are many different ways to fit those four points.

So yeah.
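The closing point, that four points on a graph admit many different fits, is easy to demonstrate with hypothetical numbers: a least-squares line and the exact interpolating cubic both "explain" the same four points, yet disagree sharply once you extrapolate.

```python
# Four observed points, e.g. a weekly metric (hypothetical numbers).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]

# Fit 1: ordinary least-squares straight line.
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

def line(x):
    return slope * x + intercept

# Fit 2: Lagrange interpolation, the unique cubic that passes exactly
# through all four points (zero residual, i.e. maximal overfitting).
def cubic(x):
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Both fits are consistent with the data, but extrapolation diverges wildly.
print(round(line(6.0), 2))   # 6.3
print(round(cubic(6.0), 2))  # -23.0
```

With only four points, the data alone cannot tell these models apart, which is exactly why she pairs the numbers with intuition and domain judgment.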

Tarush Aggarwal (20:22)

Santona, thank you so much for being on the show.

Santona Tuli (20:25)

Thank you so much.
