This blog post is a tutorial that introduces the basics of MongoDB and shows how to use it from Python. To accomplish this we will use pymongo to connect Python with MongoDB.
MongoDB is a convenient NoSQL database which stores documents in a JSON-like format and uses a JSON-like syntax to query them. I frequently use MongoDB when dealing with streaming data from sensors or APIs. In big data scenarios I find MongoDB powerful since it supports replication and sharding, which enable high availability and distributing data across machines, but that is a topic for another tutorial!
For more information on MongoDB see https://www.mongodb.com/ .
This post was generated with Jupyter Notebook and is accessible at https://github.com/rsandstroem/IPythonNotebooks/blob/master/MongoDBDemo/MongoDB_demo.ipynb .
For this tutorial we will need MongoDB version 3.0 or later and pymongo to connect to MongoDB from Python. NB: Be careful with data loss if you are upgrading from an older release used in production!
If all is well, "mongo --version" should report version 3.0 or later, and "import pymongo" at a Python prompt should not raise an error.
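The Python half of this check can also be scripted. The sketch below is only an illustration: the helper name `is_installed` is my own and not part of pymongo, and it relies solely on the standard library's importlib.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "pymongo" is the driver this tutorial relies on.
    if is_installed("pymongo"):
        import pymongo
        print("pymongo version", pymongo.__version__)
    else:
        print("pymongo is missing; install it before continuing")
```

This only tells you whether the driver is importable; it says nothing about whether a MongoDB server is running.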
First, import things we will need. Use pymongo to connect to the "test" database. Specify that we want to use the collection "people" in this database.
import os
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import pymongo
from pymongo import MongoClient
print('pymongo version', pymongo.__version__)  # this is the driver version, not the server version
client = MongoClient('localhost', 27017)
db = client.test
collection = db.people
Import data from a JSON file into the collection "people" in the MongoDB database "test". We could do this with pymongo's insert methods, but for simplicity we run "mongoimport" in a shell environment. First we drop the collection if it already exists.
collection.drop()
os.system('mongoimport -d test -c people dummyData.json')
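For comparison, here is a minimal sketch of the pymongo route mentioned above, using `insert_many`. It assumes that dummyData.json holds one JSON document per line (the layout mongoimport reads by default), which this post does not state explicitly. The helper names `parse_json_lines` and `load_into` are hypothetical.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of text lines, one JSON document per line."""
    docs = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            docs.append(json.loads(line))
    return docs


def load_into(collection, docs):
    """Drop the collection, insert docs, and return how many were inserted."""
    collection.drop()
    if not docs:
        # insert_many() rejects an empty list, so return early
        return 0
    result = collection.insert_many(docs)
    return len(result.inserted_ids)


# Possible usage with the objects defined earlier (assumed file layout):
# with open('dummyData.json') as handle:
#     load_into(collection, parse_json_lines(handle))
```

If the file turns out to be a single JSON array instead, `json.load(handle)` would return the list of documents directly.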
We use find() to get a cursor over the documents in the collection. Let's see who the three youngest persons in this data are: sort the results by the field "Age" and print the first three documents. Note the structure of the documents: it is the same as in the JSON file we imported, except that each document now has a unique value in the new "_id" field.
cursor = collection.find().sort('Age', pymongo.ASCENDING).limit(3)
for doc in cursor:
    print(doc)
Here is a small demonstration of the aggregation framework. We want to create a table of the number of persons in each country and their average age. To do it we group by country. We extract the results from MongoDB aggregation into a pandas dataframe, and use the country as index.
pipeline = [
    {"$group": {"_id": "$Country",
                "AvgAge": {"$avg": "$Age"},
                "Count": {"$sum": 1},
                }},
    {"$sort": {"Count": -1, "AvgAge": 1}}
]
aggResult = collection.aggregate(pipeline) # returns a cursor
df1 = pd.DataFrame(list(aggResult)) # use list to turn the cursor to an array of documents
df1 = df1.set_index("_id")
df1.head()
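To make explicit what the pipeline computes, here is a plain-Python sketch of the same grouping and sorting. It runs on a few made-up records (the names, countries and ages are invented for illustration and are not taken from dummyData.json) and assumes every document has both a "Country" and a numeric "Age" field.

```python
from collections import defaultdict


def group_by_country(docs):
    """Mimic the $group and $sort stages: count and mean age per country."""
    ages = defaultdict(list)
    for doc in docs:
        ages[doc["Country"]].append(doc["Age"])
    rows = [
        {"_id": country,
         "AvgAge": sum(values) / float(len(values)),
         "Count": len(values)}
        for country, values in ages.items()
    ]
    # Same ordering as {"$sort": {"Count": -1, "AvgAge": 1}}
    rows.sort(key=lambda row: (-row["Count"], row["AvgAge"]))
    return rows


# Made-up sample documents, for illustration only
sample = [
    {"Name": "A", "Country": "China", "Age": 30},
    {"Name": "B", "Country": "China", "Age": 40},
    {"Name": "C", "Country": "Sweden", "Age": 50},
    {"Name": "D", "Country": "Peru", "Age": 20},
]
rows = group_by_country(sample)
```

Here `rows` lists China first (two persons), then Peru and Sweden, which tie on count and are ordered by ascending average age. For real data the aggregation framework is preferable, since the grouping then happens inside the database rather than in Python.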
For simple cases one can either pass a query filter to find(), for example find({"Country": "China"}), or use the "$match" operator in the aggregation framework, like this:
pipeline = [
    {"$match": {"Country": "China"}},
]
aggResult = collection.aggregate(pipeline)
df2 = pd.DataFrame(list(aggResult))
df2.head()
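The two routes can be put side by side. The sketch below wraps each in a small helper; the function names are hypothetical, and `collection` is expected to be a pymongo collection such as the one defined at the top of this post.

```python
def match_pipeline(country):
    """Build a one-stage aggregation pipeline selecting a single country."""
    return [{"$match": {"Country": country}}]


def find_by_country(collection, country):
    """Return the matching documents as a list, using a find() filter."""
    return list(collection.find({"Country": country}))


def aggregate_by_country(collection, country):
    """Return the same documents via the aggregation framework."""
    return list(collection.aggregate(match_pipeline(country)))


# Possible usage (needs a running MongoDB with the imported data):
# df2 = pd.DataFrame(find_by_country(collection, "China"))
```

Both helpers should return the same documents; find() is the lighter choice when no further pipeline stages are needed.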
Let's do something with the data from the last aggregation: put each person's location on a map. Click on the markers to find the personal details of the four persons located in China.
import folium
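print('Folium version', folium.__version__)
world_map = folium.Map(location=[35, 100],
                       zoom_start=4)
for i in range(len(df2)):
    # simple_marker() was removed from later folium releases; folium.Marker(...).add_to(map) replaces it
    lat, lon = [float(x) for x in df2.Location[i].split(',')]
    folium.Marker(location=[lat, lon],
                  popup=df2.Name[i] + ', age:' + str(df2.Age[i])).add_to(world_map)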
world_map
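The "Location" field is stored as a comma-separated string, so it is worth converting it to numbers before handing it to the map. Below is a small sketch of such a conversion. The helper name `parse_location` is hypothetical, and the latitude-first order is an assumption based on how the values are passed to folium above.

```python
def parse_location(text):
    """Convert a "lat,lon" string into a [lat, lon] list of floats."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 2:
        raise ValueError("expected 'lat,lon', got %r" % (text,))
    # float() raises ValueError itself if a part is not numeric
    return [float(parts[0]), float(parts[1])]


# Possible usage with the dataframe from the previous step:
# coordinates = [parse_location(value) for value in df2.Location]
```

Failing loudly on a malformed value makes it easier to spot bad records than silently placing a marker in the wrong spot.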
In case no map is shown, try the following command from a terminal window and retry:
"pip install folium" or "sudo conda install --channel https://conda.binstar.org/IOOS folium"
For more information on how to use maps, color by region etc, please check out GeoMapsFoliumDemo