Thursday, October 5, 2017

What I Find Hard In Doing Data Science Work - Survival Guide (Part 1.1)

Data science is broad, and it turns out that the hardest part of doing data science work is "data mining".

As you scrape and collect data from different sources, you need to determine which one holds the "source of truth", so that the basis of your hypothesis is easy to identify. While this sounds like common sense in decision making, it turns out it is not always straightforward in data science.

What I realized while collecting data is that every source is just another point of failure. The bigger your scope, the more sources you have, the more discrepancies you'll find -- and the more cross-checking you need to perform.
To make data actionable, it needs to be accessible, accurate and standardized.
In seeking the correct values, you need to figure out the "why" when inputs and outputs are shown right next to each other. People rely on human intellect in making judgments, where bias and error run high -- but in data science, you can't afford to be wrong. On the other hand, you can't afford not to know. That's why the data being collected should be reliable.
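To make that cross-checking concrete, here is a minimal sketch using standard Unix tools. The file names and records are made up for illustration; `comm` simply reports the rows that two sorted sources disagree on.

```shell
# Two hypothetical exports of the same user list, from two sources
printf 'alice\nbob\ncarol\n' > source_a.txt
printf 'alice\ncarol\ndave\n' > source_b.txt

# comm requires sorted input
sort -o source_a.txt source_a.txt
sort -o source_b.txt source_b.txt

# -3 suppresses lines common to both, leaving only the discrepancies
comm -3 source_a.txt source_b.txt
```

The rows unique to either file are exactly the ones that need cross-checking against a third source (or a human).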

Types of data:
Data is everywhere. However, if we categorize data by its form, it all boils down to two types: the data you can extract, and the data you have to generate.

While you might think the one you should be paying attention to is the "data you need", think again...

Some scenarios and cases don't give you the ability to nail down the data you need, so you're left with no option but to create and generate it.

Extracting data is easy, generating data is complex.
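Even so, the shell can get you a stand-in dataset quickly when no source can hand you the rows you need. A minimal sketch -- the `id,score` schema here is invented purely for illustration:

```shell
# Generate five hypothetical rows of id,score -- a stand-in dataset
# for when the real source doesn't exist yet
seq 1 5 | awk '{ printf "%d,%d\n", $1, $1 * 10 }' > generated.csv
cat generated.csv
```

Generating genuinely representative data is of course much harder than this; the point is only that a working dataset is one pipeline away.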

Personal Experience:
While generating data is very rewarding, the story doesn't end there. The most common problem with data is "sorting", particularly "parsing". I don't have any deep knowledge of Excel sheets and other tools. Luckily, my bash skills can address most of what I need.

If you have experience in sed, awk and regex -- you should be good.
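As a hedged sketch of that kind of parsing (the CSV and its columns are invented for illustration), awk alone can filter and aggregate:

```shell
# A small hypothetical request log: endpoint,status,latency_ms
cat > requests.csv <<'EOF'
endpoint,status,latency_ms
/login,200,120
/login,500,340
/search,200,85
EOF

# Skip the header (NR > 1), keep only HTTP 200 rows, average their latency
awk -F',' 'NR > 1 && $2 == 200 { sum += $3; n++ } END { print sum / n }' requests.csv
# prints 102.5
```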
Science is limited by data, Data is limited by Engineering
All tools are welcome; however, the main concern in executing the task will always be efficiency. Don't feel bad if you don't know how to do things other ways (i.e. parsing data in Excel sheets); instead, stick to what you know best and what works.

Sunday, October 1, 2017

A Devops Engineer Inside The World of Data Science - Survival Guide (Part 1)

Recently, I was tasked by the CTO to help the Operations department be more efficient and self-sustaining in their day-to-day routine. It was said that the only way to dig into what and where the problem is, is by checking the messy gold mines (aka data) where the details are disclosed and kept.

For someone with no background in data science and big data, I wasn't sure if I could deliver on time and accurately. But at the back of my mind there was this voice saying "take it and explore". So the troll face in me said "challenge accepted".
What is fascinating about a startup is, you can be anyone! Wearing many hats is a privilege, and it's always good to have a taste of everything...
Before jumping into the waters of data science, the main thing I wanted was to know its fundamentals. I am not only after the formulation but also the logic of how "factors" and "components" affect that formulation. The foundation I am trying to build comes from the question: "When does data make sense?"

My strategy for this role would be:
  • Research - about the tools, practices and know-hows
  • Design Thinking - conceptualization, formulation and composition
  • Delivery - reporting, analysis and dashboards
NOTE: The catch about data science is that you're solving a problem that was either asked or never thought to exist. It's always the underlying message that you're after.
The task was given to me on a Friday, just before the end of the shift. Not wanting to waste any time, I made sure the weekend was well spent and my Monday shift was all set.

This writeup isn't a complete, comprehensive guide to becoming a data scientist; rather, it gives you a good kickstart for taking your baby steps toward becoming one.

The main catch I was able to grasp is "visualization". Structured data is nonsense when it doesn't tell you a story at first glance. That's why people create and construct great dashboards that explain it all.
When your work talks for itself, don't interrupt.
Since the early web, people have loved doing reporting with graphs and images to represent a body of information. As we modernize, reporting adapts to the innovation. Dashboards play a great role in reporting nowadays, not only for analytics but also for user experience.

As I deep dive into the topic of "dashboards", here are the pointers I noted from different articles and podcasts I've gone through.

Organizing Dashboard:
There are studies showing that some dashboards are not as cool as they look. Smart dashboards are what we are after; thus, knowing what makes a bad dashboard is vital as we go along in our research.

This is my personal structure of what a good dashboard looks like, labelled based on "emphasis" and on how people will look at it.

In creating dashboards that people will love to look at, using the right color scheme should be observed. You need to be aware that not everyone who will be looking at your graph sheets and data details is in a 100% visual state.

Choosing the right colors, fonts and highlights will spice up the dashboard. It makes the "important" things easier to see and mark.

Numerical Representation:
When numbers are involved in your dashboard, you might want to consider placing identifiers on every numerical value. This way, just by looking at the dashboard, users already know the message.

Like the image below, which do you think speaks more: the one on the left or the one on the right? Which makes more sense?

NOTE: A bare number added to your dashboard doesn't represent anything on its own. It purely states the "value" but doesn't tell you anything more. Adding a "symbol" (i.e. an arrow) tells you what that number means (be it good or bad).
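A quick sketch of that idea outside any dashboard tool (the metrics file is made up): prefix each value with an arrow showing its direction versus last week. Whether that direction is good or bad still depends on the metric -- rising errors get the same up arrow as rising signups.

```shell
# Hypothetical weekly metrics: name,this_week,last_week
cat > metrics.csv <<'EOF'
signups,420,390
errors,35,28
EOF

# Prefix each value with an arrow so the number carries a direction at a glance
awk -F',' '{ printf "%s %s %s\n", ($2 >= $3 ? "▲" : "▼"), $1, $2 }' metrics.csv
```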

Other common mistakes people think make their dashboard cool include:
  • Placing useless decorations
  • Implementing crosstabs
  • Using scrollbars 

As for the tools, there are now lots of choices to pick from -- from open source to enterprise grade. In my case, the company is subscribed to Tableau.

Here is a list of software that can help you set your course.


Part 2 of this writeup will tackle Tableau usage and certain topics about data sorting, modelling and structuring. I wish you all the best in your data science career.

PS: I really appreciate that the company gives me this kind of opportunity.