
Other articles

  1. Data wrangling in Pandas and Spark - Time series of energy production

    Data wrangling is an important part of a data scientist's everyday work. Input data must be accessed, retrieved, understood, and transformed before machine learning can be applied to create predictive models. Most of the talk these days is about deep learning, but this less glamorous topic is arguably more important in real-life situations.


  2. Installing RStudio on Azure HDInsight

    Running R on Spark brings the advantages of distributed in-memory computing and access to data stored in HDFS. In this context, it is convenient to have access to RStudio hosted from the R Server. In this post I will show a few ways to install RStudio on the edge node of an HDInsight Spark cluster.


  3. Create Azure HDInsight clusters using templates

    Microsoft Azure provides big data infrastructure through the HDInsight product. However, you are billed per hour whether or not you use the computing resources. One option is to destroy clusters when they are not needed, but to bring them back up with minimal effort it is best to use predefined scripts or templates. This post shows a few ways to accomplish this.


  4. Building a data lake 1: Weather and time

    I am currently building a data lake that will be used to improve operations at an energy company using machine learning. Among the many interesting topics, the following are prioritized: Can we predict the energy production of hydroelectric, solar, and wind power plants? Can we predict the energy consumption using weather reports? After all, homeowners need to heat their homes more on a cold winter day than on a sunny day in October.


  5. Using Folium to show geographic data

    This post demonstrates how to

    • Display a map from OpenStreetMap using Folium
    • Add custom shapefiles to define regions of interest
    • Color the regions of interest based on data in a pandas DataFrame
    • (New) Brief introduction to GeoPandas.
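    The first three steps above could be sketched roughly as follows. This is a minimal illustration, not the code from the linked notebook: the helper names `square_feature` and `build_map`, the region names, the values, and the map center are all hypothetical placeholders, and a hand-built GeoJSON square stands in for a real shapefile. It assumes a recent Folium version that provides `folium.Choropleth`.

```python
def square_feature(name, lon, lat, half_width=0.1):
    """Return a GeoJSON Feature for a small square around (lon, lat).

    Hypothetical stand-in for a region that would normally come from a shapefile.
    """
    ring = [
        [lon - half_width, lat - half_width],
        [lon + half_width, lat - half_width],
        [lon + half_width, lat + half_width],
        [lon - half_width, lat + half_width],
        [lon - half_width, lat - half_width],  # a GeoJSON ring must be closed
    ]
    return {
        "type": "Feature",
        "properties": {"name": name},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


def build_map(features, values):
    """Draw the regions on an OpenStreetMap base layer, colored by `values`.

    `values` maps region name -> number (made-up data for illustration).
    """
    # Imported here so the GeoJSON helper above works without these packages.
    import folium
    import pandas as pd

    regions = {"type": "FeatureCollection", "features": features}
    df = pd.DataFrame({"name": list(values.keys()), "value": list(values.values())})

    # The map center is an arbitrary placeholder location.
    m = folium.Map(location=[46.8, 8.2], zoom_start=8, tiles="OpenStreetMap")
    folium.Choropleth(
        geo_data=regions,
        data=df,
        columns=["name", "value"],          # key column, then value column
        key_on="feature.properties.name",   # joins the GeoJSON to the DataFrame
        fill_color="YlGn",
        legend_name="value",
    ).add_to(m)
    return m
```

    Calling `build_map([square_feature("A", 8.5, 47.4), square_feature("B", 7.4, 46.9)], {"A": 1.0, "B": 3.0})` returns a map object that can be displayed in a notebook or written to disk with `.save("map.html")`.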

    The code of this post is available at https://github.com/rsandstroem/IPythonNotebooks/blob/master/GeoMapsFoliumDemo/GeoMapsFoliumDemo.ipynb .

    August 1, 2017: This post is almost a year old, but I decided to make some technical improvements to it for a better online viewing experience. (It is the Swiss National Day after all!) During this renovation I stumbled upon GeoPandas, which is a great package worth mentioning in this context.


  6. Simple MongoDB demo

    Introduction

    This blog post is a tutorial that introduces the basics of MongoDB and how to use it within a Python environment. To accomplish this we will use pymongo to connect Python with MongoDB.

    MongoDB is a convenient NoSQL database that uses JSON-like syntax to store and query documents. I frequently use MongoDB when dealing with streaming data from sensors or APIs. In big data scenarios I find MongoDB powerful since it allows for replication and sharding, thus enabling distributed computing and high availability, but that is a topic for another tutorial!
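    To give a flavor of what storing and querying documents through pymongo looks like, here is a minimal sketch. It is not taken from the tutorial itself: the connection URI, the database and collection names, the field names, and the helpers `make_reading`, `readings_since`, and `demo` are all assumptions for illustration, and `demo` needs a MongoDB server running locally.

```python
from datetime import datetime, timezone


def make_reading(sensor_id, value, timestamp=None):
    """Build a sensor reading as a plain dict; MongoDB stores it as a document."""
    return {
        "sensor_id": sensor_id,
        "value": value,
        "timestamp": timestamp or datetime.now(timezone.utc),
    }


def readings_since(sensor_id, since):
    """Build a query filter: readings of one sensor at or after `since`."""
    return {"sensor_id": sensor_id, "timestamp": {"$gte": since}}


def demo(uri="mongodb://localhost:27017"):
    """Insert one reading and query it back (requires a running MongoDB server)."""
    # Imported here so the dict helpers above work without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    # Hypothetical database and collection names; both are created on first write.
    collection = client["demo_db"]["readings"]
    collection.insert_one(make_reading("s1", 21.5))

    start = datetime(2017, 1, 1, tzinfo=timezone.utc)
    cursor = collection.find(readings_since("s1", start)).sort("timestamp", 1)
    return list(cursor)
```

    Both the document and the query filter are ordinary Python dictionaries, which is what makes the JSON-like syntax convenient to work with from Python.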

