Tales of Science and Data

A project by Martina Pugliese.

This book is a collection of notes on Data Science, from Statistics to Machine Learning, passing through all sorts of related areas.

I've decided to give form to a rather disorderly collection of notes I had about data science and all sorts of related areas, which is how this project came about. You can read more about the hows and whys in the Meta page.

Contents

Meta and resources

This section explains how and why this whole thing started, what it is and how it's done, plus some awesome resources found on the web.

Probability, statistics & data analysis

A collection of notes on topics regarding Probability and Statistics and the way to use them to analyse data and draw conclusions.

Machine learning: concepts and procedures

How do we do Machine Learning? This chapter offers a high-level overview of the techniques and methodologies.

Machine learning: fundamental algorithms

This chapter has roughly one page per algorithm in "shallow learning", that is, everything that is not "deep". Neural networks, even when shallow, are not presented here, as they have a dedicated chapter that also dives into deep learning. Pages here are grouped by the main learning paradigms.

Machine learning: model assessment

This part deals with how to assess the quality of a model and diagnose problems.

Artificial neural networks

Digging into the world of Artificial Neural Networks, a fascinating area of Machine Learning particularly on the rise these days. This deserved its own chapter.

Natural language processing

Natural Language Processing (NLP) is the field (a part of Machine Learning) which deals with text, an unstructured data source. What NLP tries to do is put text into numerical representations and extract information from it.

Computer vision

Images, seen by the machine. This section deals with using computers to extract and use information from visual data. We will illustrate a whole set of methods, which may or may not involve Neural Networks.

The computer science appendix

Some (non-comprehensive) notes on Computer Science fundamentals.

The mathematics appendix

Some (non-comprehensive) notes on mathematics, used everywhere in data work. Useful little bits.

Toolbox

(Some) software tools used in Data Science, high-level overviews.

About the code parts

Several pages contain snippets of code. I've been using Python (3), and each of those pages links to the relevant Jupyter notebook in the GitHub repo corresponding to this book, in case you want to play around. The overall repo is reachable on GitHub, and you can also view the notebooks prettified via the Jupyter notebook viewer.

The libraries used in the notebooks are usually (unless specified) those of the Python data stack (NumPy, SciPy, scikit-learn, pandas, ...). The plots presented here have been customised; the repo contains all the styling files.

Notify me of mistakes

Mistakes happen. Inaccuracies and oversights as well, from the content point of view to the rendering/graphics one (e.g., a TeX formula that doesn't render). You are more than welcome, encouraged in fact, to submit issues to the repo for these things.

License

(C) 2017-2021 Martina Pugliese

This book is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).