Getting and Cleaning Data – A Data Scientist’s Perspective

Getting and cleaning data is easier than it used to be, thanks to the tools and technology now available for sharing datasets. Data is commonly distributed through git or svn repositories, zip archives, and similar channels. The catch is that the corresponding tool has to be installed on your machine before you can read the data.

Most data sources today are broadly compatible with common tools, but if the dataset is large or sensitive, you will usually have to authenticate before you can access it. Once you have the data, check whether a data dictionary is available or, for a CSV file, whether the columns are labeled with a header row. In some cases every item carries its own field label. That makes each record easy to read in a quick linear scan, because you know what each value is without looking at a header. The cost comes later: you have to strip those labels out before you can work with the values. For example, take this line:
name: bob pet_type: cat age: 3

You first have to remove the labels to get at the values. In some cases you also have to transform the cleaned values, for example converting the age from text to a number, before they are usable in later stages of the analysis.
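As a rough illustration, the labeled line above could be cleaned with a few lines of Python. This is only a sketch under some assumptions: the function name parse_labeled_line is made up for this example, values are assumed to contain no spaces, and age is assumed to be the only numeric field.

```python
import re

# Matches "label: value" pairs. Assumes labels are made of word characters
# and values contain no spaces (true for the sample line, not for all data).
PAIR = re.compile(r"(\w+):\s*(\S+)")

def parse_labeled_line(line):
    """Strip the inline labels and return a dict of field name -> value."""
    record = dict(PAIR.findall(line))
    # Transform step: convert fields assumed to be numeric (here only "age").
    if "age" in record:
        record["age"] = int(record["age"])
    return record

row = parse_labeled_line("name: bob pet_type: cat age: 3")
print(row)
```

Real data will likely need extra handling for values with spaces, missing fields, or inconsistent label names.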

If you write your own functions to process data, prefer low-level tools and functions, which typically carry less overhead and can greatly reduce processing time. It is also better to keep a separate script or tool dedicated to cleaning and to run it before the analysis, so that you do not repeat the cleaning each time. After that, you can create another intermediate dataset for the visualization step.
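One possible shape for such a dedicated cleaning step is sketched below. The column names, the sample rows, and the function names clean_rows and write_intermediate are all assumptions made for this illustration. The sketch writes to an in-memory buffer so it runs as is; in practice the output would more likely be a file that the analysis and visualization scripts read.

```python
import csv
import io

FIELDS = ["name", "pet_type", "age"]  # assumed column layout

def clean_rows(raw_lines):
    """Yield cleaned rows as dicts, skipping blank lines."""
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        name, pet_type, age = line.split(",")
        yield {
            "name": name.strip().lower(),
            "pet_type": pet_type.strip().lower(),
            "age": int(age),
        }

def write_intermediate(rows, out_file):
    """Write cleaned rows as CSV with a header row for later steps."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Made-up raw input with inconsistent spacing and casing.
raw = ["Bob , Cat, 3\n", "\n", "alice,DOG,5\n"]
buffer = io.StringIO()
write_intermediate(clean_rows(raw), buffer)
print(buffer.getvalue())
```

Keeping the cleaned output as its own file means the analysis can be rerun many times without repeating the cleaning.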

At the end of your analysis, you should be able to present your findings in a form your audience can access, such as a PDF report or a presentation. When writing the report, assume the audience is not technical and needs an easy-to-understand format, and include enough data for readers to see the same insight that you want to communicate.