~/$ mbaurin's blog

my first open source contribution was terrifying

i wanted to contribute to open source but had no idea where to start.

most projects felt too intimidating, too complex, or too far from what i actually understood. then i found a GitHub issue in StarRocks asking for a format_bytes() function. simple request: convert raw byte counts to a human-readable format like "4.00 KB" instead of "4096".

i understood the problem. i could write the solution. so i did.

but hitting "create pull request" still meant my code would be public, reviewable by strangers, potentially criticized by people much smarter than me.

and then i remembered... we are all just apes typing on keyboards on a floating rock. i clicked submit.

why this particular contribution

honestly? there was no dramatic story behind it. i had never used StarRocks before. i was not solving a critical production issue or scratching my own itch.

i just wanted to contribute to something, and this seemed achievable. a database function that formats bytes. straightforward logic, clear requirements, bounded scope. perfect for someone who had never submitted a patch to a major open source project.

sometimes the best reason to do something is that you can do it.

diving into someone else's codebase

contributing to StarRocks meant understanding how a distributed analytical database works under the hood. not just the user-facing SQL, but the C++ backend, the Java frontend, the function registration system, the testing framework.

the codebase was intimidating. thousands of files, complex build systems, coding standards i had never seen. where do you even start when you want to add a simple function?

turns out, you start by reading a lot of existing code. how do other string functions work? where are they registered? what does the testing look like? i spent more time reading than writing, which felt inefficient but was probably the most valuable part.

the implementation decisions

the actual function was straightforward: take a BIGINT, convert it to a human-readable string with the appropriate unit. but the devil was in the details.

should i use 1000 or 1024 as the base? most users expect "KB", but strictly speaking KB means 1000 bytes and the 1024-byte unit is KiB. i went with 1024-based calculations but displayed KB/MB/GB instead of KiB/MiB/GiB because that is what people actually expect to see.

what about edge cases? negative numbers should return null. zero should return "0 B". null input should return null. these seem obvious now but took actual thought at the time.

the C++ implementation was surprisingly satisfying:

// Find appropriate unit
int unit_index = 0;
for (int i = 6; i >= 0; --i) {
    if (bytes >= thresholds[i]) {
        unit_index = i;
        break;
    }
}

simple loop, clear logic, handles everything from bytes to exabytes. sometimes the straightforward solution is the right solution.
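
for the curious, here is a minimal standalone sketch of how the pieces described above could fit together: the 1024-based thresholds, the largest-unit-first loop, and the edge cases. it is not the code that was merged into StarRocks. the name format_bytes_sketch, the use of std::optional to stand in for SQL NULL, and the two-decimal formatting (inferred from the "4.00 KB" example) are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>

// Hypothetical standalone sketch, NOT the StarRocks implementation.
// std::nullopt stands in for SQL NULL (an assumption for this example).
std::optional<std::string> format_bytes_sketch(std::optional<int64_t> input) {
    // NULL input and negative values both yield NULL.
    if (!input.has_value() || *input < 0) {
        return std::nullopt;
    }
    const int64_t bytes = *input;

    // Units are displayed as KB/MB/GB but computed with a base of 1024.
    static const char* const units[7] = {"B", "KB", "MB", "GB", "TB", "PB", "EB"};

    // thresholds[i] == 1024^i; 1024^6 == 2^60 still fits in int64_t.
    int64_t thresholds[7];
    thresholds[0] = 1;
    for (int i = 1; i < 7; ++i) {
        thresholds[i] = thresholds[i - 1] * 1024;
    }

    // Find appropriate unit: scan from the largest unit down.
    // Zero matches no threshold and falls through to unit_index 0.
    int unit_index = 0;
    for (int i = 6; i >= 0; --i) {
        if (bytes >= thresholds[i]) {
            unit_index = i;
            break;
        }
    }

    char buf[32];
    if (unit_index == 0) {
        // Plain byte counts are printed without decimals, e.g. "0 B".
        std::snprintf(buf, sizeof(buf), "%lld B", static_cast<long long>(bytes));
    } else {
        // Two decimals is an assumption based on the "4.00 KB" example.
        const double value =
            static_cast<double>(bytes) / static_cast<double>(thresholds[unit_index]);
        std::snprintf(buf, sizeof(buf), "%.2f %s", value, units[unit_index]);
    }
    return std::string(buf);
}
```

with this sketch, format_bytes_sketch(4096) gives "4.00 KB", zero gives "0 B", and a negative or missing input gives an empty optional. it needs C++17 for std::optional.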

what i learned

first, reading existing code is not procrastination, it is research. understanding the patterns and conventions saved me multiple review cycles.

second, edge cases matter more in open source. when you are building for everyone, you need to handle everything.

third, the open source community is surprisingly welcoming. i expected gatekeeping and elitism. instead i found people who genuinely wanted to help improve the software.

my format_bytes() function is now part of StarRocks. somewhere, someone is using it to make sense of their data, which feels pretty good.

the terrifying part was not the technical complexity. it was realizing that contributing to open source means your code becomes permanent, public, and used by people you will never meet.

but that is exactly why it matters.