The meta on all this

What exactly is this thing? Why did I even bother writing it? Why did I choose this format?

About this project

What's this

This "book" is a personal collection of material on Data Science that I have put together myself. The title refers to Poe's Tales of Mystery and Imagination, without any intention of comparison: it's just an inspiration. Among the general public, science has a reputation for being cryptic and not easily accessible. However, the enormous and relatively recent rise in interest in data makes science a grand playground for the imagination, so I thought of a title that would capture these two sides of the coin.

Data Science is a very broad, interdisciplinary and somewhat ill-defined field, and there is tons of good material on the web. I cannot read it all because I'm only human and my time is limited; there are also countless books I'd like to study and countless MOOCs I'd like to take. Because of this, collecting my own notes in a single place helps me a lot, both in my own learning and in having one reference point to go to when I need information. This is also why this project is meant to be neither exhaustive nor finalised and set in stone: it is an ongoing effort, a perennial work in progress.

The other big advantage of doing this is the fact that I keep track of all material explored when dealing with the subject at hand: it is present in the form of references at the bottom of each page. It's a bit like building a taxonomy to navigate the immense archive of knowledge in the field.

I should definitely point out that none of this is a rigorous or exhaustive exposition of the topics. Remember, it's primarily notes. There's way better material around on the mathematical side of what is treated, and I've always made the effort to list some in the references if you want to dig deeper or read the original sources of an idea.

I just thought it might be useful to others as well, so here goes.

A bit on the genesis of this

This project has existed in several forms over the years. The first was a bunch of hand-written notes spread over several notebooks. This form lived long enough but eventually proved rather impractical to update - editing, adding to and expanding something hand-written isn't the most practical thing to do, nor very scalable (and note that I do love writing things by hand!). Then I considered GitBook, but at the time it didn't offer the features I wanted for this. So I resorted to keeping this as a GitHub repo of Jupyter notebooks, served with nbviewer. It worked pretty well for a while. Then one day I sat down with the intention of refactoring the whole thing, reconsidered GitBook, which in the meantime had expanded a lot, and here we are now. The original notebooks repo still exists as the code support for this book (you will see links to the notebooks in each page).

How to use

Use it as you please. I just want to stress that this book spans a large variety of topics in the space of "data science", and you would not necessarily see all of them in a typical "data scientist" position - do not feel like you have to have expertise in all the areas outlined here (I certainly do not). Throughout my studies and career as a researcher and data scientist, I have had the opportunity (and luck) to deal with a large variety of topics and areas. I'm not an expert in any of them, though; I just learn along the way. Data science is a very broad field - these notes just present various sub-areas and flavours of what it can mean.

Structure of the book

I gave the material the organisation that makes the most sense to me. This does not mean that it is the best organisation ever, nor that it is an objective way of classifying things. In fact, some of the topics treated could well be categorised differently across folders.

Anyway, each topic lives in its own page, so that it can be read in isolation from the rest. When extensive code and plots are present in a page, a link to a Jupyter notebook is also provided (as per above).

Draft pages

Some pages are marked as DRAFT. This means I'm still largely working on them (usually there's also a GitHub issue open) and the content in there is partial or not really great.

About references

References are indicated at the end of each page, rather than at the global level. This is because I find it much more effective to have a list of further material from within the topic I'm looking at rather than as a separate thing. Furthermore, this allows for an orderly structure.

For journal papers, a link to the PDF is always provided when it is accessible. Otherwise, when a paper is not open-access and has not been made accessible elsewhere by the author(s), the link will just point to the journal source.

On top of the topic-related references, the page Beautiful Web contains an updated list of great comprehensive resources on Data Science areas in general, which are amazing per se and are highly recommended for use and peruse.

Because these are notes, some material contains reinterpretations of existing work - credit is always given.

About data and images

Data

I typically use some free data for illustrative purposes: usually classic, old datasets which these days are utilised for pedagogical purposes. Links are always provided. Should I use something else, it may be data I've generated myself or that I've been given permission to use and show.

Images

Images are of various types:

  • I create them in code with Matplotlib or something else (this is for graphs)

  • I take them from the Internet (only when reuse is allowed, and always with attribution)

  • I draw them by hand, because I find hand-drawing to always be the most satisfying and clear means of illustrating concepts; these ones may look terrible but I've decided to trade beauty for speed - after all, life is about compromises

  • I make them as vector graphics, usually using Inkscape for the job