
Crack Stats

Using the LinkedIn API with Python to Retrieve Connections' Details

Ever faced the problem of extracting your LinkedIn connections' details one by one? Don't worry: after reading this post you will be able to retrieve all your connections' details in one go. To follow what comes next you need a basic understanding of Python and an account on LinkedIn.

    • Step 1: Sign up at https://www.linkedin.com/secure/developer and click on "Add New Application"
    • Step 2: Fill in the compulsory fields, accept the terms & conditions and click on "Add Application"
    • Step 3: On the next screen you will receive some keys and IDs that we will be using in the next steps. These keys are specific to you and act like "passwords" for your account, so keep them safe
    • Step 4: Open a text editor, copy the code below and paste it into it. Update the consumer_key, consumer_secret, user_token and user_secret with the values you received in Step 3
#Libraries needed for this program
import oauth2 as oauth
import csv
import demjson

#Authentication and connecting with the server
consumer_key = 'XXXXXXXXX'
consumer_secret = 'XXXXXXXXXXXXXXX'
user_token = 'XXXXXXXXXXXXXXXXXXX'
user_secret = 'XXXXXXXXXXXXXXXXXXXXX'
consumer = oauth.Consumer(consumer_key, consumer_secret)
access_token = oauth.Token(key=user_token, secret=user_secret)
client = oauth.Client(consumer, access_token)

#Retrieving your connections' data from the LinkedIn API (JSON format)
resp,content = client.request("https://api.linkedin.com/v1/people/~/connections?format=json", "GET", "")

#Cleaning the data received
data = demjson.decode(content)
connections = data['values']            # one dict per connection
fields = list(connections[0].keys())    # use the first record's keys as CSV headers

#Writing the data to a CSV file
with open("connection_detail.csv", "wb") as out:
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for connection in connections:
        writer.writerow(connection)
  • Step 5: Execute the above code in Python; it will generate a CSV file named "connection_detail.csv" in the working directory that has all your connections' data.
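If the oauth2/demjson stack gives you trouble, roughly the same request can be made with the requests and requests_oauthlib packages. This is a minimal sketch of an alternative, assuming both packages are installed; the endpoint and the four credentials are the same ones used above.

import csv
import requests
from requests_oauthlib import OAuth1

# Sign the request with the four credentials obtained in Step 3
# (consumer key, consumer secret, user token, user secret)
auth = OAuth1('XXXXXXXXX', 'XXXXXXXXXXXXXXX', 'XXXXXXXXXXXXXXXXXXX', 'XXXXXXXXXXXXXXXXXXXXX')
resp = requests.get("https://api.linkedin.com/v1/people/~/connections?format=json", auth=auth)
connections = resp.json().get('values', [])

# Write one CSV row per connection, using the first record's keys as headers
if connections:
    with open("connection_detail.csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=list(connections[0].keys()), extrasaction='ignore')
        writer.writeheader()
        writer.writerows(connections)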

Neo4j: Basics of Cypher

Cypher: Neo4j's Graph Query Language

Cypher is a declarative, SQL-inspired language for Creating (inserting student data into the graph), Finding (retrieving individual students and schools), Querying (discovering related students and schools based on relationships, properties, etc.) and much more in graphs. Cypher allows us to perform actions on the graph database without requiring us to describe exactly how to do it.

Nodes

Cypher uses ASCII art to represent patterns. In Cypher we surround nodes with parentheses, which look like circles, e.g. (node). To refer to a node after defining it, we give it an identifier like (school) for a school or (student) for a student. So, for example, if we want to find all the students and the schools they have studied in, the query will include the identifiers student and school, e.g. in a pattern like (student)-->(school), so we can refer to them later, e.g. to access properties like student.name and school.name.

The more general structure is:

MATCH (node) RETURN node.property

MATCH (node1)-->(node2) 
RETURN node2.propertyA, node2.propertyB

Relationships


The problem with the Cypher query we saw above is that it didn’t mention anything about the relationship between the nodes, so even though we used the identifier student, we may well have gotten back teachers of our schools. So we need to be able to describe the types of relationships in our Cypher queries.

If we wanted to retrieve every student who had studied in a school, we would describe the pattern (student)-[:Student_Of]->(school) to retrieve only nodes that have a relationship of type Student_Of with other nodes (schools); those nodes would then be students, as stated by the Student_Of relationship.

Or generally:

MATCH (node1)-[:REL_TYPE]->(node2)

Sometimes we need access to information about a relationship (e.g. its type or properties). For example, we might want to output the year when the student joined that particular school, and that year would probably be a property of the Student_Of relationship. As with nodes, we can use identifiers for relationships (in front of the :TYPE). If we match (student)-[year_join:Student_Of]->(school) we can output year_join.year for each of the students in all of the schools they studied in.

MATCH (node1)-[rel:TYPE]->(node2) 
RETURN rel.property
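To make this concrete, here is a minimal sketch of running the student/school pattern from Python. It assumes a locally running Neo4j instance and the official neo4j Python driver; the connection URI and credentials are illustrative assumptions, not something prescribed by the queries above.

from neo4j import GraphDatabase

# Connect to a local Neo4j server (adjust the URI and credentials to your setup)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (student)-[year_join:Student_Of]->(school)
RETURN student.name AS student, school.name AS school, year_join.year AS year
"""

with driver.session() as session:
    for record in session.run(query):
        # Each record exposes the RETURN columns by their aliases
        print(record["student"], record["school"], record["year"])

driver.close()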

 

Introduction to Graph Database with Neo4j

Conventional (relational) databases typically consist of many tables. A table is much like a spreadsheet, in that it is made up of rows and columns. All rows have the same columns, and each column contains the data itself; think of them the same way as tables in Excel. The definition of those tables and their columns determines the storage capabilities of the database, whereas the relations between the columns define the kinds of facts that can be stored in such a database.


What is a Graph Database?

We live in a world where different pieces of information are connected to each other in one way or another. A graph database takes this connectivity, or relationship, between different pieces of information as a core aspect of its data model, which helps in efficient storage, processing and querying. While other databases compute relationships at query time, a graph database stores relationships as primary elements. Accessing those already persistent connections is an efficient, constant-time operation and allows you to quickly traverse millions of connections per second. Graph databases perform well in managing highly connected data and complex queries.

  • A Graph Database stores data in a Graph, representing any kind of data
  • The records in a graph database are called Nodes
  • Nodes are connected to each other via directed Relationships
  • Each single Node and Relationship can have named attributes referred to as Properties
  • A Label is used to organize nodes into groups (see the short sketch below)
Image Source: Neo4j.com
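To see how these pieces fit together, here is a tiny, framework-free Python sketch of the property graph model described in the bullets above (the student/school data is made up purely for illustration):

# Nodes: each record carries one or more labels (its groups) and named properties
nodes = {
    1: {"labels": ["Student"], "properties": {"name": "Alice", "age": 22}},
    2: {"labels": ["School"], "properties": {"name": "Springfield High"}},
}

# Relationships: directed, typed, and able to carry properties of their own
relationships = [
    {"start": 1, "end": 2, "type": "Student_Of", "properties": {"year": 2014}},
]

# Traversing a stored relationship is a direct lookup, not a join computed at query time
for rel in relationships:
    student = nodes[rel["start"]]["properties"]["name"]
    school = nodes[rel["end"]]["properties"]["name"]
    print("%s -[:%s]-> %s" % (student, rel["type"], school))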

Sponsored by Neo Technology, Neo4j is an open-source NoSQL graph database implemented in Java and Scala. Neo4j is a database: use it to reliably store information and find it later. Neo4j's data model is a graph, in particular a property graph. Cypher is Neo4j's declarative graph query language: you describe what you are interested in, not how it is acquired. Cypher is meant to be very readable and expressive.

You can download Neo4j from http://neo4j.com/download and install it as a server on all operating systems. By default, the Neo4j server is bundled with a web interface bound to http://localhost:7474.

Installing RHadoop in Cloudera Hadoop Ecosystem

R and Hadoop


Apache Hadoop is developed in Java, and Java is one of the main programming languages for Hadoop. However, if you don't know Java or don't want to work with it, you can still use another language like Python, R or Ruby to write MapReduce (MR) jobs using the streaming API. In this blogpost, I am going to show how to integrate R and Hadoop using the rmr package. (This tutorial assumes that you have a working Hadoop ecosystem with R and RStudio Server installed; if you don't, go through my previous blogpost: http://www.crackstats.in/setting-up-rstudio-server-on-cloudera-quickstart-vms/.)
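Before turning to rmr, it helps to see what a streaming job actually looks like: just a mapper and a reducer that read lines from stdin and write tab-separated key/value pairs to stdout. The word-count sketch below is my own minimal example (not part of the RHadoop setup); such a script would be submitted through the streaming jar referenced by HADOOP_STREAMING in Step 1 below.

#!/usr/bin/env python
# Minimal Hadoop Streaming word count: run as "wordcount.py map" or "wordcount.py reduce"
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer():
    # Hadoop sorts map output by key, so all counts for a word arrive together
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()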

The most common way to link R and Hadoop is to use HDFS (potentially managed by Hive or HBase) as the long-term store for all data, and use MapReduce jobs (potentially submitted from Hive, Pig, or Oozie) to encode, enrich, and sample data sets from HDFS into R. It then allows you to perform complex modeling exercises on a subset of prepared data in R.

Revolution Analytics released RHadoop, allowing integration of R and Hadoop. RHadoop is a collection of three R packages that allow users to manage and analyze data with Hadoop. RHadoop consists of the following packages, which are available for download at https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads:

rmr2 – functions providing Hadoop MapReduce functionality in R
rhdfs – functions providing file management of the HDFS from within R
rhbase – functions providing database management for the HBase distributed database from within R

So let’s begin

Step 1: Open a terminal in the virtual machine and set the following environment variables

export HADOOP_HOME=/usr/lib/hadoop-0.20-mapreduce
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar

Note: Update the path in HADOOP_STREAMING according to your Hadoop version

Step 2: Download the rmr2 package from https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads, open RStudio Server and execute the following code:

install.packages("/home/username/Downloads/rmr2_3.2.0.tar.gz", repos = NULL, type = "source")
Sys.setenv(HADOOP_HOME="/usr/lib/hadoop-0.20-mapreduce")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.5.0.jar")

Step 3: Now check RHadoop's MapReduce capability by executing the following commands

library(rmr2)
small.ints <- to.dfs(1:1000)                           # write the integers 1..1000 to HDFS
out <- mapreduce(input = small.ints,
                 map = function(k, v) keyval(v, v^2))  # MapReduce job that squares each value
df <- as.data.frame(from.dfs(out))                     # read the key/value results back from HDFS
head(df)                                               # first few (value, value^2) pairs

Reference : http://blogr-cs.blogspot.in/2012/12/integration-of-r-rstudio-and-hadoop-in.html

Creative Commons: Protect Your Creative Work

Image Source: http://www.creativecommons.org

Being creative artists, bloggers or innovators, many people face problems like these: how can they protect their work so that it is not copied or used by someone else; how can they let people share their work, or even modify it, as long as they give credit; and how can they keep companies from using it for commercial purposes.

Creative Commons is a nonprofit organization that enables the sharing and use of creativity and knowledge through free legal tools. Its free, easy-to-use copyright licenses provide a simple, standardized way to give the public permission to share and use your creative work, on conditions of your choice. CC licenses let you easily change your copyright terms from the default of "all rights reserved" to "some rights reserved."

Licenses provided by Creative Commons are not an alternative to copyright, instead they work alongside copyright and enable you to modify your copyright terms to best suit your needs.

If you want to give people the right to share, use, and even build upon a work you've created, you should consider publishing it under a Creative Commons (CC) license. CC gives you flexibility (for example, you can choose to allow only non-commercial uses) and protects the people who use your work, so they don't have to worry about copyright infringement, as long as they abide by the conditions you have specified.

There are various conditions that you can include in your license, some of them are:

Attribution-ShareAlike: Including this condition in your license allows someone to remix, tweak, and build upon your work, even for commercial purposes, but they must give you credit and license their new creations under the same terms.

Attribution-NoDerivs: This term allows redistribution, commercial and non-commercial, as long as the work is passed along unchanged and in whole, with credit to you.

Attribution-NonCommercial: This license lets others use, change and build upon your work as long as they are doing it for non-commercial purposes; their new works must acknowledge your contribution, but they don't have to license their derivative works on the same terms.

You can see the various licenses available by visiting this link: http://creativecommons.org/licenses

Image Source: http://creativecommons.org/licenses/

 

Choose a Creative Commons license : http://creativecommons.org/choose/

 

Setting up RStudio Server on Cloudera Quickstart VMs

Image Source: http://www.cloudera.com

Cloudera (www.cloudera.com) provides QuickStart Virtual Machines (VMs) for Apache Hadoop testing. These QuickStart VMs contain a single-node Apache Hadoop cluster, filled with example data, queries, scripts, and Cloudera Manager to manage your cluster. These VMs run on CentOS 6.2 and are available for VMware, VirtualBox, and KVM (all require a 64-bit host operating system).

In this blogpost, I am going to show how to set up R and RStudio Server on the Cloudera QuickStart VM.

For this tutorial, I will be using:
Host OS : Windows 8.1 Professional
Virtualization Software : Oracle Virtual Box
Cloudera Quickstart VM : CDH 5.1x

  • Download and install Oracle Virtual Box from (https://www.virtualbox.org/wiki/Downloads)
  • Also, download Cloudera Quickstart VM from (http://www.cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html)
  • Open the downloaded virtual machine in VirtualBox; after the VM loads successfully you will see the following screen
  • Open the terminal from Applications >> System Tools >> Terminal
  • Type the following commands in the terminal

    $ sudo bash
    $ yum install kernel -y
    $ sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
    $ sudo yum -y install git wget R
    $ sudo ln -s /etc/default/hadoop-0.20-mapreduce /etc/profile.d/hadoop.sh
    $ cat /etc/profile.d/hadoop.sh | sed 's/export //g' > ~/.Renviron
    $ wget http://download2.rstudio.org/rstudio-server-0.97.248-x86_64.rpm
    $ sudo yum install --nogpgcheck rstudio-server-0.97.248-x86_64.rpm
  • The R and RStudio Server installation is now complete. Type "ifconfig" in the terminal and note down your inet (IP) address.
  • You can access RStudio Server in any web browser on your host machine. Just type <inet addr>:8787 in the address bar. For example, if your inet addr is 144.16.192.214, type 144.16.192.214:8787.
    The login credentials are:
    Username: cloudera
    Password: cloudera

Getting Started with Apache Hadoop on Hortonworks Sandbox

Image Source: http://hadoop.apache.org/

What is Apache Hadoop?

Apache Hadoop was created to fulfill the need of processing Big Data. It has been the driving force behind the growth of the big data industry. A wide variety of companies and organizations use Hadoop for both research and production. Hadoop enables businesses to gain insight from massive amounts of structured and unstructured data quickly. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Apache Hadoop has two main subprojects:

  • MapReduce: the framework that understands and assigns work to the nodes in the cluster
  • HDFS (Hadoop Distributed File System): a file system that spans all the nodes in a Hadoop cluster for data storage

In this blogpost, I am going to show how to set up the Hortonworks Sandbox, a portable Hadoop environment that includes the latest developments packaged up in a virtual environment, and how to perform basic tasks on a dataset using Apache Hive, HCatalog, etc.

Setting Up Hortonworks Sandbox

  • Download the Hortonworks Sandbox from http://hortonworks.com/products/hortonworks-sandbox/ and download one of the virtualization environments such as Oracle VirtualBox or VMware Player.
  • Then install the virtualization environment on your computer and open it (I have used VirtualBox for this demonstration).
  • Click on File >> Import Appliance and import the file you downloaded from Hortonworks earlier
  • In the next window set all settings as shown in the image below and click on "Import"
  • After successfully importing the virtual appliance, start the virtual machine. You will see a window like the one below in which you can see all the Sandbox services being loaded.
  • When all the features of the Sandbox are loaded, you will see a window like the one below and the Sandbox is ready to use.
  • Now you can start using Apache Hadoop. You can use Hue, which is a web interface for analyzing data with Apache Hadoop (you can access Hue in your browser at http://127.0.0.1:8000), or you can use Hadoop from the command line.

Apache Hive

The Apache Hive project provides a data warehouse view of the data in HDFS. Using a SQL-like language, Hive lets you create summarizations of your data, perform ad-hoc queries, and analyze large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL.

Performing a Basic Task Using Apache Hive

  • Download a sample dataset (I have used the fuel consumption ratings of cars sold in Canada in the years 1995-1999: http://www.nrcan.gc.ca/sites/www.nrcan.gc.ca/files/oee/files/csv/MY1995-1999%20Fuel%20Consumption%20Ratings.csv)
  • Open Hue (http://127.0.0.1:8000), click on the "File Browser" tab and upload the downloaded file.
    After uploading the file, you can see it in the "File Browser".
    You can also view the content of the file by clicking on it.
  • Then, open the "HCatalog" tab and create a table from the dataset we have just uploaded. Now we are ready to perform queries on the dataset using Hive.
  • Now, switch to the "Hive" tab, where we can write queries on the database.

Sample Query 1: To know the data types of the various columns present in the database.

Query and result (screenshots)
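Since the query screenshot is not reproduced here, note that listing column names and data types in Hive is essentially a DESCRIBE statement. As a rough sketch, the same query can also be run from Python with the PyHive package; the host, port and the table name fuel_ratings below are assumptions, so use the name you gave the table in HCatalog.

from pyhive import hive

# Connect to HiveServer2 in the Sandbox (adjust host/port to your setup)
conn = hive.Connection(host="127.0.0.1", port=10000)
cursor = conn.cursor()

# DESCRIBE lists every column of the table together with its data type
cursor.execute("DESCRIBE fuel_ratings")
for column_name, data_type, comment in cursor.fetchall():
    print(column_name, data_type)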

 

Sample Query 2: To know the fuel consumption of the cars made by "Acura"

Query and result (screenshots)
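Likewise, Sample Query 2 boils down to a SELECT with a filter on the make column. A hedged sketch, again via PyHive and with illustrative table and column names that you should adjust to match your HCatalog table:

from pyhive import hive

conn = hive.Connection(host="127.0.0.1", port=10000)
cursor = conn.cursor()

# Fetch the fuel-consumption columns for all cars whose make is Acura
cursor.execute("SELECT model, fuel_consumption_city, fuel_consumption_hwy "
               "FROM fuel_ratings WHERE upper(make) = 'ACURA'")
for row in cursor.fetchall():
    print(row)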

Analyze Network Data with Wireshark

Image Source: http://www.wireshark.org

 

Wireshark is a free and open-source packet analyzer. It is used for network troubleshooting, analysis, software and communications protocol development and education.  A network packet analyzer is a tool used to analyze what’s going on inside a network cable.

Wireshark captures network packets in real time and tries to display the packet data in as much detail as possible, in a human-readable format.

How to use Wireshark to capture network data?

  • Step 1: Download and install Wireshark from http://www.wireshark.org/download.html
  • Step 2: Open Wireshark on your computer. Then select a network interface from the left side of the window and press Start to begin capturing network data from it.
  • Step 3: As soon as you click Start, Wireshark starts capturing data packets for that interface. Press "Stop" whenever you want Wireshark to stop capturing packets.
  • You will see packets highlighted in green, blue and black. Wireshark uses colors to help you identify the types of traffic at a glance. By default, green is TCP traffic, dark blue is DNS traffic, light blue is UDP traffic.
  • You can save this data in CSV (Comma-Separated Values) format to analyze it later for other purposes (see the sketch below).
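For example, once a capture has been exported as CSV from Wireshark, a few lines of Python are enough to summarize it. This is a minimal sketch that assumes the export kept Wireshark's default column headers, including a "Protocol" column, and that the file was saved as capture.csv:

import csv
from collections import Counter

# Count how many captured packets belong to each protocol
protocol_counts = Counter()
with open("capture.csv", newline="") as f:
    for row in csv.DictReader(f):
        protocol_counts[row["Protocol"]] += 1

for protocol, count in protocol_counts.most_common():
    print(protocol, count)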

Infogr.am: Infographics Made Easy

Image Source: http://infogr.am/

Infographics are one of the most efficient ways to present a conclusion, result or facts. Through text, images and design, infographics represent complex data and results. Infographics have already established themselves as a popular marketing tool.

Infographics are graphic visual representations that present complex information quickly and clearly. Of course, to make a good one you need decent knowledge of colors, typography, sketching and software such as Photoshop or Illustrator, which you certainly can't gain overnight. So what if you need an infographic? Hiring a freelance graphic designer can cost you a lot.

There are a few websites where you can design your infographic easily, like infogr.am, piktochart.com, etc. These websites allow you to select from hundreds of existing templates.

Image Source: infogr.am

 

Then you can upload your own dataset in tabular form or import a spreadsheet.

Upload data

And then you can generate your own infographic.

Result
