Introduction

This tutorial introduces the basics of MongoDB and shows how to use it from Python. We will use pymongo to connect Python to MongoDB.

MongoDB is a convenient NoSQL database that stores documents in a JSON-like format and is queried with a JSON-like syntax. I frequently use MongoDB when dealing with streaming data from sensors or APIs. In big data scenarios I find MongoDB powerful because it supports replication and sharding, which make it possible to distribute data across machines and provide high availability, but that is a topic for another tutorial!

For more information on MongoDB see https://www.mongodb.com/ .

This post was generated with Jupyter Notebook and is accessible at https://github.com/rsandstroem/IPythonNotebooks/blob/master/MongoDBDemo/MongoDB_demo.ipynb .

Prerequisites

For this tutorial we need MongoDB version 3.0 or later, and pymongo to connect to MongoDB from Python. NB: If you are upgrading from an older release that is used in production, be careful to avoid data loss!

If all is fine, running "mongo --version" in a terminal should report version 3.0 or later, and "import pymongo" at a Python prompt should not raise an error.
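The check above can also be scripted. The following is a minimal sketch and not part of the original notebook: the helper names parse_version, meets_minimum and pymongo_available are hypothetical, and the sketch assumes only that the version appears somewhere in the text as dot-separated numbers, since the exact wording of the "mongo --version" output varies between releases.

```python
import importlib
import re


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no version number found in: {!r}".format(text))
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(text, minimum=(3, 0)):
    """True if the version found in text is at least `minimum`."""
    return parse_version(text) >= minimum


def pymongo_available():
    """True if "import pymongo" would succeed in this environment."""
    try:
        importlib.import_module("pymongo")
    except ImportError:
        return False
    return True
```

Paste the text printed by "mongo --version" into meets_minimum() to check it, for example meets_minimum("3.3.0") is True while meets_minimum("2.6.12") is False.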

Create a MongoClient

First, import the things we will need. Then use pymongo to connect to the "test" database and select the collection "people" in it.

In [1]:

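import os
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import pymongo
from pymongo import MongoClient

# Note: this is the version of the pymongo driver, not of the MongoDB server
print('pymongo version ' + pymongo.__version__)
# Connect to a MongoDB server running locally on the default port
client = MongoClient('localhost', 27017)
db = client.test          # the "test" database
collection = db.people    # the "people" collection

pymongo version 3.3.0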

Import data into the database

Import data from a JSON file into the collection "people" in the MongoDB database "test". We could do this with pymongo's insert methods, but for simplicity we run "mongoimport" in a shell instead. First we drop the collection if it already exists.

In [2]:

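collection.drop()  # remove any existing "people" collection
# os.system returns the exit status of the command; 0 means mongoimport succeeded
os.system('mongoimport -d test -c people dummyData.json')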

Out[2]:
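For completeness, the import can also be done from Python instead of with mongoimport. This is a sketch under assumptions: it assumes that dummyData.json holds one JSON document per line (the file layout is not shown in this post), and the helper names read_documents and import_documents are hypothetical. insert_many is the pymongo 3 method for inserting a list of documents.

```python
import json


def read_documents(path):
    """Read one JSON document per non-empty line (assumed file layout)."""
    documents = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                documents.append(json.loads(line))
    return documents


def import_documents(collection, path):
    """Drop the collection, insert all documents from path, return how many."""
    documents = read_documents(path)
    collection.drop()
    if documents:  # insert_many raises an error on an empty list
        collection.insert_many(documents)
    return len(documents)
```

With the collection object from the first cell, the call would be import_documents(collection, 'dummyData.json').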

Check if you can access the data from the MongoDB.

We use find() to get a cursor over the documents in the collection. Let's see who the three youngest persons in this data set are: sort the results by the field "Age" and print the first three documents. Note that the documents have the same structure as those we imported from the JSON file, except that MongoDB has added an "_id" field with a unique value for each document.

In [3]:

cursor = collection.find().sort('Age', pymongo.ASCENDING).limit(3)
for doc in cursor:
    print(doc)

{u'Country': u'Serbia', u'Age': 18, u'_id': ObjectId('58d690f11ac4479b459dfdf6'), u'Name': u'Sawyer, Neve M.', u'Location': u'-34.37446, 174.0838'}
{u'Country': u'Somalia', u'Age': 19, u'_id': ObjectId('58d690f11ac4479b459dfdbc'), u'Name': u'Townsend, Cadman I.', u'Location': u'-87.69188, -144.16138'}
{u'Country': u'Eritrea', u'Age': 20, u'_id': ObjectId('58d690f11ac4479b459dfdde'), u'Name': u'Graham, Emerald O.', u'Location': u'61.35398, 28.04381'}
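To make the effect of sort() and limit() concrete, here is the same selection in plain Python on a few documents copied from the outputs in this post (the "_id" and "Location" fields are left out for brevity). It only illustrates what the query returns, not how MongoDB executes it, and youngest is a hypothetical helper.

```python
people = [
    {"Name": "Holman, Hasad O.", "Country": "China", "Age": 32},
    {"Name": "Graham, Emerald O.", "Country": "Eritrea", "Age": 20},
    {"Name": "Sawyer, Neve M.", "Country": "Serbia", "Age": 18},
    {"Name": "Byrd, Dante A.", "Country": "China", "Age": 43},
    {"Name": "Townsend, Cadman I.", "Country": "Somalia", "Age": 19},
]


def youngest(documents, n):
    """Plain-Python counterpart of find().sort('Age', ASCENDING).limit(n)."""
    return sorted(documents, key=lambda doc: doc["Age"])[:n]


names = [doc["Name"] for doc in youngest(people, 3)]
# names == ['Sawyer, Neve M.', 'Townsend, Cadman I.', 'Graham, Emerald O.']
```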

Aggregation in MongoDB

Here is a small demonstration of the aggregation framework. We want a table with the number of persons in each country and their average age, so we group by country and sort the groups by size. We then load the aggregation results into a pandas DataFrame and use the country as the index.

In [4]:

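from bson.son import SON  # ships with pymongo

pipeline = [
    {"$group": {
        "_id": "$Country",
        "AvgAge": {"$avg": "$Age"},
        "Count": {"$sum": 1},
    }},
    # SON keeps the key order, which matters for $sort
    # (plain dicts are unordered before Python 3.7)
    {"$sort": SON([("Count", -1), ("AvgAge", 1)])},
]
aggResult = collection.aggregate(pipeline)  # returns a cursor
df1 = pd.DataFrame(list(aggResult))  # list() turns the cursor into a list of documents
df1 = df1.set_index("_id")
df1.head()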

Out[4]:

	AvgAge	Count
_id
China	46.250000	4
Antarctica	46.333333	3
Guernsey	48.333333	3
Puerto Rico	26.500000	2
Heard Island and Mcdonald Islands	29.000000	2
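To show what the "$group" and "$sort" stages compute, the following sketch reproduces them in plain Python on a handful of documents taken from the outputs in this post (only the "Country" and "Age" fields are kept). It is an illustration of the result, not of how MongoDB works internally, and group_by_country is a hypothetical helper.

```python
from collections import defaultdict

people = [
    {"Country": "China", "Age": 32},
    {"Country": "Serbia", "Age": 18},
    {"Country": "China", "Age": 43},
    {"Country": "Eritrea", "Age": 20},
    {"Country": "China", "Age": 57},
    {"Country": "Somalia", "Age": 19},
    {"Country": "China", "Age": 53},
]


def group_by_country(documents):
    """Mimic the $group and $sort stages on a list of dicts."""
    ages = defaultdict(list)
    for doc in documents:
        ages[doc["Country"]].append(doc["Age"])
    rows = [
        {"_id": country,
         "AvgAge": sum(values) / float(len(values)),
         "Count": len(values)}
        for country, values in ages.items()
    ]
    # Count descending, then AvgAge ascending, as in the $sort stage
    rows.sort(key=lambda row: (-row["Count"], row["AvgAge"]))
    return rows


result = group_by_country(people)
# result[0] == {"_id": "China", "AvgAge": 46.25, "Count": 4}
```

The first row agrees with the China row in the table above: four persons with an average age of 46.25.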

For simple cases one can either pass a filter document to find(), for example find({"Country": "China"}), or use the "$match" operator in the aggregation framework, like this:

In [5]:

pipeline = [
        {"$match": {"Country":"China"}},
]
aggResult = collection.aggregate(pipeline)
df2 = pd.DataFrame(list(aggResult))
df2.head()

Out[5]:

	Age	Country	Location	Name	_id
0	32	China	39.9127, 116.3833	Holman, Hasad O.	58d690f11ac4479b459dfdb3
1	43	China	31.2, 121.5	Byrd, Dante A.	58d690f11ac4479b459dfdee
2	57	China	45.75, 126.6333	Carney, Tamekah I.	58d690f11ac4479b459dfdf9
3	53	China	40, 95	Mayer, Violet U.	58d690f11ac4479b459dfe06
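The two alternatives mentioned above, find() with a filter document and a "$match" stage, can be put side by side. This is a minimal sketch: people_in_country is a hypothetical helper, and it assumes that collection is a pymongo Collection such as db.people.

```python
def people_in_country(collection, country, use_aggregation=False):
    """Return all documents for one country as a list.

    `collection` is expected to be a pymongo Collection such as db.people.
    """
    query = {"Country": country}
    if use_aggregation:
        # Same filter document, wrapped in a one-stage aggregation pipeline
        cursor = collection.aggregate([{"$match": query}])
    else:
        cursor = collection.find(query)
    return list(cursor)
```

Both variants select the same documents, so pd.DataFrame(people_in_country(collection, "China")) should contain the same four persons as df2.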

Use the MongoDB data

Let's do something with the data from the last aggregation: put the locations on a map. Click on the markers to see the personal details of the four persons located in China.

In [6]:

import folium
print('Folium version ' + folium.__version__)

world_map = folium.Map(location=[35, 100], zoom_start=4)
for i in range(len(df2)):
    # "Location" is a "lat, lon" string; convert it to a pair of floats
    lat, lon = [float(x) for x in df2.Location[i].split(',')]
    # simple_marker is deprecated; create a Marker and add it to the map instead
    folium.Marker(location=[lat, lon],
                  popup=df2.Name[i] + ', age:' + str(df2.Age[i])).add_to(world_map)

world_map

Folium version 0.2.0

Out[6]:

If no map is shown, install folium from a terminal window with one of the following commands and retry:

pip install folium

sudo conda install --channel https://conda.binstar.org/IOOS folium

For more information on how to use maps, color by region, etc., please check out GeoMapsFoliumDemo.
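The "Location" field used for the markers is a string, so it has to be converted to numbers before it can be placed on a map. The following is a small sketch of such a conversion with some validation. parse_location is a hypothetical helper, and it assumes that the string is ordered as latitude first, longitude second, which is how the marker code in this post uses it.

```python
def parse_location(text):
    """Convert a "lat, lon" string into a (lat, lon) tuple of floats."""
    parts = text.split(",")
    if len(parts) != 2:
        raise ValueError("expected 'lat, lon', got: {!r}".format(text))
    lat, lon = float(parts[0]), float(parts[1])
    # Assumed order: latitude first, then longitude
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range: {!r}".format(text))
    return lat, lon
```

For example, parse_location("39.9127, 116.3833") returns (39.9127, 116.3833), which can be passed as the location of a marker.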

Data Scientist Blog

Simple MongoDB demo
