Other articles

  1. Data wrangling in Pandas and Spark - Time series of energy production

    An important part of a data scientist's everyday work is data wrangling. Input data must be accessed, retrieved, understood, and transformed before machine learning can be applied to create predictive models. While most of the talk these days is about deep learning, this less sexy topic is arguably more important in real-life situations.

    read more

    comments

  2. Which is the most popular language for data science in 2017?

    While performing a market analysis for a business case I looked at some trends in the field, and made a quick query to answer the above question "for fun". Since the result was both striking and rather unexpected I decided to share it with you in this blog post. (Plus, it gave me an excuse to add support for Bokeh read more

    comments

  3. Building a chat bot - hosting

    In conjunction with a public referendum I was given the task to create an intelligent chat bot. The chat bot could understand what the user wanted to know and provide the answers our subject matter experts had prepared. This post describes how you can give users access to the bot by hosting it as a web service. It also describes how to add more developers to the project. The post assumes that you have already read the first post and second post on this topic.

    read more

    comments

  4. Building a chat bot - programming

    In conjunction with a public referendum I was given the task to create an intelligent chat bot. The chat bot could understand what the user wanted to know and provide the answers our subject matter experts had prepared. This post describes how to program the bot to capture user intents and take action. The post assumes that you have already read the first post on this topic.

    read more

    comments

  5. Discovery of the Higgs Boson

    This week I wanted to draw attention to the book that my former colleagues and I wrote, and which was recently published. It is great to finally have it in my hand! The book is about the discovery of the Higgs boson at the Large Hadron Collider at CERN. read more

    comments

  6. Installing R-studio on Azure HDInsight

    Running R on Spark brings the advantage of distributed in-memory computing and access to data stored in HDFS. In this context, it is convenient to have access to R-studio hosted from the R-server. In this post I will show a few ways to install R-studio on the edge node of an HDInsight Spark cluster. read more

    comments

  7. Create Azure HDInsight clusters using templates

    Microsoft Azure provides big data infrastructure through the HDInsight product. However, you are billed per hour, whether or not you use the computing resources. One option is to destroy clusters when they are not needed, but to get the clusters back up with minimum effort it is best to use predefined scripts or templates. This post shows you a few ways to accomplish this. read more

    comments

  8. Can you trust your data?

    Business is the process of managing under conditions of uncertainty. To ensure an optimal outcome, business leaders must base their decisions on information of varying reliability. This is not new; humans have always needed to make decisions with uncertain outcomes. “Is he a friend or foe?” “Should I trade my food for a hammer?” “Can I reach safety if I turn and run, or should I fight the lion?” read more

    comments

  9. Building a data lake 1: Weather and time

    I am currently building a data lake which will be used to improve operations at an energy company using machine learning. Among the many interesting topics, the following are prioritized: Can we predict the energy production of hydroelectric, solar and wind power plants? Can we predict the energy consumption using weather reports? After all, homeowners need to heat their homes more on a cold winter day than on a sunny day in October. read more

    comments

  10. Using Folium to show geographic data

    This post demonstrates how to

    • Display a map from OpenStreetMap using Folium
    • Add custom shape files to define regions of interest
    • Color the regions of interest based on data in a pandas dataframe
    • (New) Brief introduction to GeoPandas.

    The code of this post is available at https://github.com/rsandstroem/IPythonNotebooks/blob/master/GeoMapsFoliumDemo/GeoMapsFoliumDemo.ipynb .
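    As a rough idea of how the first three steps fit together, here is a minimal sketch that is independent of the linked notebook. The region shapes, the region names, the values, and the map centre are all made up for illustration; a real project would load regions from a shape file instead of generating squares. It assumes `folium` and `pandas` are installed.

```python
def square_region(name, lon, lat, size=0.2):
    """Return a GeoJSON Feature: a square of side `size` degrees centred on (lon, lat).

    Hypothetical helper standing in for regions read from a real shape file.
    """
    h = size / 2
    ring = [
        [lon - h, lat - h],
        [lon + h, lat - h],
        [lon + h, lat + h],
        [lon - h, lat + h],
        [lon - h, lat - h],  # GeoJSON rings repeat the first point to close the polygon
    ]
    return {
        "type": "Feature",
        "properties": {"name": name},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


def build_map(regions, values):
    """Colour each region by its value. `values` maps region name to a number."""
    # Third-party imports are done here so square_region works without them.
    import folium
    import pandas as pd

    df = pd.DataFrame({"name": list(values), "value": list(values.values())})

    # Arbitrary map centre and zoom level, chosen only for this example.
    m = folium.Map(location=[46.8, 8.2], zoom_start=8, tiles="OpenStreetMap")

    folium.Choropleth(
        geo_data={"type": "FeatureCollection", "features": regions},
        data=df,
        columns=["name", "value"],         # key column, then value column
        key_on="feature.properties.name",  # where the key lives in each GeoJSON feature
        fill_color="YlGn",
    ).add_to(m)
    return m
```

    Calling `build_map([square_region("A", 8.5, 47.4), square_region("B", 7.4, 46.9)], {"A": 10, "B": 25}).save("map.html")` should write an HTML file that can be opened in a browser.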

    August 1, 2017: This post is almost a year old, but I decided to make some technical changes to improve the online viewing experience. (It is the Swiss National Day after all!) During this renovation, I stumbled upon GeoPandas, which is a great package worth mentioning in this context. read more

    comments

  11. Simple MongoDB demo

    Introduction

    This blog post is a tutorial that introduces the basics of MongoDB and how to use it within a Python environment. To accomplish this we will use pymongo to connect Python with MongoDB.

    MongoDB is a convenient NoSQL database which uses JSON-like syntax to store and query documents. I frequently use MongoDB when dealing with streaming data from sensors or APIs. In big data scenarios I find MongoDB powerful since it supports replication and sharding, thus enabling distributed computing and high availability, but that is a topic for another tutorial!
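    To give a feel for what JSON-style documents and queries look like from Python, here is a minimal sketch. It is not the code from the full post: the database name, collection name, field names, and connection URI are made up for this example, and `store_and_query` assumes that pymongo is installed and a MongoDB server is running locally.

```python
import json
from datetime import datetime, timezone


def make_reading(sensor_id, temperature_c):
    """Build one sensor reading as a plain dict, which pymongo stores as a document."""
    return {
        "sensor_id": sensor_id,
        "temperature_c": temperature_c,
        "timestamp": datetime.now(timezone.utc),
    }


def warm_readings_filter(threshold_c):
    """Query filter matching documents whose temperature_c is greater than threshold_c."""
    return {"temperature_c": {"$gt": threshold_c}}


def store_and_query(uri="mongodb://localhost:27017"):
    """Insert two readings and return those above 20 C. Needs a running MongoDB server."""
    # Imported here so the helpers above work even without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    collection = client["demo_db"]["readings"]  # hypothetical database and collection
    collection.insert_many([make_reading("s1", 18.5), make_reading("s2", 23.1)])
    return list(collection.find(warm_readings_filter(20.0)))


if __name__ == "__main__":
    # The filter is ordinary JSON-style data and can be printed as such.
    print(json.dumps(warm_readings_filter(20.0)))
```

    Both the stored document and the query are ordinary Python dicts, which is why moving between JSON data from an API and MongoDB takes so little glue code.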

    read more

    comments
