Portable data science tools

Toying with the idea of having portable data science tools to bring around. Previously, I have been using
Anaconda on a Windows PC.

76 days of “enforced” summer break without toting along a laptop means I am away from a computer. If I am lucky, I have internet access on a mobile phone, or a desktop with better connectivity but without the data science tools. Wanting to keep learning and developing so my skills do not go rusty, the idea of having portable data science tools to bring around makes ever more sense.

Back in Dublin and in a bit of a slump, I struggled to get my motivation back and was not sure what to do with everything. Worst of all, I felt guilty at the same time. I then decided to revisit those DIY projects I embarked on last year without success. The success of installing an SSD and its performance spurred me on: https://twitter.com/mryap/status/900776007409520642

Jupyter on the Cloud

One option is to use cloud computing. Azure Notebooks looks like a viable option. Jupyter serves as an IDE for both Python and R, which negates the need for separate RStudio and Python IDEs.
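Jupyter supports multiple languages through kernels. As a minimal sketch, if you run your own Jupyter install and want R available next to Python, you can register the R kernel like this (assuming R and Jupyter are already installed):

```r
# From an R console (not inside Jupyter): install the IRkernel package
# and register it, so "R" appears in Jupyter's kernel menu.
install.packages("IRkernel")
IRkernel::installspec()
```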

Jupyter on a USB drive

I have been using WinPython with success. It keeps working when I move WinPython in the following situations:

  • from one portable drive to another (USB to portable drive),
  • from one directory to another on the same machine.

Nothing breaks. In terms of performance, it beats cloud-hosted Jupyter such as Azure Notebooks (my comparison is based on installing Cygwin and running IPython from there). A few notes:

  • to avoid “Error in contrib.url(repos, “source”): trying to use CRAN without setting a mirror”, you need to set a CRAN mirror first (see the snippet after this list)

  • you can use Notepad++ to tinker with the user interface of the Jupyter dashboard
    (https://twitter.com/mryap/status/910825822541447169)

  • Cygwin gives you the advantage of running Linux commands on Windows, letting you act on data and set up a data science environment without installing Ubuntu. This is useful when following good tutorials, where Linux commands are far more common than their Windows equivalents.
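For the CRAN mirror error above, a minimal fix is to point R at a mirror before installing packages; the mirror URL here is just an example, and any CRAN mirror will do:

```r
# Set a default CRAN mirror for this session so install.packages()
# no longer fails with "trying to use CRAN without setting a mirror".
# Add this line to your .Rprofile to make it permanent.
options(repos = c(CRAN = "https://cran.r-project.org"))

install.packages("ggplot2")  # example install that now resolves the mirror
```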

Verdict

Having used RStudio for some time, I really love Jupyter because it allows me to work with both R and Python in the same tool. I don't think I will go back, but I am keeping RStudio around since you can now use R Markdown to create websites. If you want the quickest, easiest way to get a Jupyter notebook up and running, Anaconda is an excellent option. If ultimate portability is all you seek, go for the cloud options. My favourite option is WinPython: https://winpython.github.io/

The USD 15,000 Deep Learning Machine

The Beast

A purpose-built machine to design and train a model that detects and identifies “Empty” and “Occupied” parking spots (a.k.a. deep learning). At USD 15,000 each, the hardware requirements for someone embarking on a deep learning project also call for a deep pocket; even without a monitor, it is bloody expensive for me. After much googling, I came up with a cheaper option.

PCPartPicker part list / Price breakdown by merchant

| Type | Item | Price |
| --- | --- | --- |
| CPU | Intel – Core i5-6600K 3.5GHz Quad-Core Processor | $228.98 @ OutletPC |
| CPU Cooler | Cooler Master – GeminII M4 58.4 CFM Sleeve Bearing CPU Cooler | $32.89 @ OutletPC |
| Motherboard | Asus – B150I PRO GAMING/WIFI/AURA Mini ITX LGA1151 Motherboard | Purchased For $0.00 |
| Memory | Kingston – FURY 16GB (2 x 8GB) DDR4-2400 Memory | $193.78 @ OutletPC |
| Storage | Samsung – 960 EVO 250GB M.2-2280 Solid State Drive | $117.60 @ Amazon |
| Video Card | Gigabyte – GeForce GTX 1070 8GB Mini ITX OC Video Card | $424.98 @ Newegg |
| Case | Cooler Master – Elite 110 Mini ITX Tower Case | $38.89 @ OutletPC |
| Power Supply | Cooler Master – 550W 80+ Bronze Certified Semi-Modular ATX Power Supply | $57.98 @ Newegg |

Prices include shipping, taxes, rebates, and discounts.
Total (before mail-in rebates): $1125.10
Mail-in rebates: -$30.00
Total: $1095.10
Generated by PCPartPicker 2017-09-21 15:55 EDT-0400

To sidetrack a little, here is what I learned:

CPU and Cooling

The CPU must be compatible with the selected chipset while providing sufficient PCIe support. Consumer CPUs are ideal because the application target is highly fault tolerant and the NVIDIA DIGITS DevBox is acting as a workstation instead of a server. Pairing the CPU with effective cooling is crucial for optimal performance, especially when the GPUs are under peak load. I chose the Intel Core i7-5930K CPU with a Corsair Hydro H60 cooler.

Memory and Storage

RAM is important for handling large DNN files and datasets. The Intel Core i7-5930K CPU can stably support up to 64 GB RAM. An Intel Xeon processor can handle more RAM and allow ECC. However, using the Intel Xeon processor will significantly increase the cost.

Chassis, Thermal, and Acoustic Considerations

Acoustics and heat management are major considerations, especially when deploying the NVIDIA DIGITS DevBox in a normal office environment. A chassis that separates the power supply and disks from the heat generated by the CPU and GPUs is ideal.

Power Supply

The power supply should provide enough power to operate the system components along with some headroom to ensure stable operation. The total dissipated power for all of the system components used in a sample build is between 1,200 and 1,300 watts. Our sample build uses an EVGA SuperNOVA 1600W P2 power supply that delivers approximately 90% efficiency at 100% load (1,400 watts), ensuring system stability at peak workloads.
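As a rough sanity check on that headroom, here is a back-of-the-envelope sketch using the sample build's estimates (the figures come from the paragraph above, not from measurements):

```r
# Back-of-the-envelope PSU headroom check for the sample build.
load_watts <- 1300   # upper estimate of total dissipated power
psu_rated  <- 1600   # EVGA SuperNOVA 1600W P2 rated output
efficiency <- 0.90   # roughly 90% efficient at full load

headroom    <- psu_rated - load_watts   # 300 W of spare capacity
utilisation <- load_watts / psu_rated   # ~81% of the PSU's rating
wall_draw   <- load_watts / efficiency  # ~1444 W drawn at the wall

cat(sprintf("Headroom: %.0f W, utilisation: %.0f%%, wall draw: ~%.0f W\n",
            headroom, 100 * utilisation, wall_draw))
```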

Motherboard

Effective deep learning requires multiple GPUs. However, a suitable PCIe topology is critical to using those GPUs efficiently. Synchronous Stochastic Gradient Descent (SGD) for deep learning relies on broadcast communication between the GPUs, and SGD acceleration needs P2P DMAs to work between devices. This means that all GPUs must be on the same I/O hub with very fast PCIe switches. Workstation motherboards based on the Intel X99 chipset with a PLX bridge setup can support four PCIe Generation 3 x16 cards at either full speed or with minimal drop-off.

The sample build used the ASUS X99-E WS workstation motherboard, which supports Intel LGA 2011-v3 CPUs while drawing only 20W.

Measuring non-profit success

Measuring non-profit and government websites is a bit more challenging. You need to figure out your goals and measure how efficient you are in achieving them. Every organization needs money to operate. In a non-commerce environment, the objective won’t be profit, but rather achieving more with less.

For non-profits, here are just some ways to incorporate non-commerce goals into the financial statement (a toy calculation follows the list):

– The cost of attracting volunteers and optimizing to bring that number down

– The cost of attracting donations (how much do you spend for every dollar collected) and optimizing to bring that number down

– For government and advocacy sites: cost per visitor (you want to spread your message further for less) and cost per engagement.
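These ratios are straightforward to compute once you pull spend and outcome figures from your analytics and finance reports. A minimal sketch in R; every number below is made up purely for illustration:

```r
# Hypothetical monthly figures for a small non-profit (illustration only).
volunteer_spend   <- 800     # spent recruiting volunteers
volunteers        <- 40
fundraising_spend <- 2500    # spent on donation campaigns
donations         <- 12500   # dollars collected
site_spend        <- 1500    # spent running an advocacy site
visitors          <- 30000
engagements       <- 1200    # e.g. petition signatures, newsletter sign-ups

cost_per_volunteer  <- volunteer_spend / volunteers      # $20.00
cost_to_raise_1usd  <- fundraising_spend / donations     # $0.20 per $1 raised
cost_per_visitor    <- site_spend / visitors             # $0.05
cost_per_engagement <- site_spend / engagements          # $1.25

cat(sprintf("Cost per volunteer:  $%.2f\n", cost_per_volunteer))
cat(sprintf("Cost to raise $1:    $%.2f\n", cost_to_raise_1usd))
cat(sprintf("Cost per visitor:    $%.3f\n", cost_per_visitor))
cat(sprintf("Cost per engagement: $%.2f\n", cost_per_engagement))
```

Optimising means driving each of these numbers down period over period while the underlying outcomes (volunteers, donations, reach) hold steady or grow.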

Here are some articles that will be useful:

http://www.grokdotcom.com/2009/07/29/turning-web-analytics-into-nonprofit-success/

I Got No Ecommerce. How Do I Measure Success?

Web Analytics Success Measurement For Government Websites

Current Status

I have embarked on a self-learning and development journey. I am currently building a data science platform using open-source tools for my own use, as some vendors charge USD 175 per month and others USD 1,000 – USD 5,000.

A LinkedIn connection commented: “What problem are you trying to solve?”

Good to keep that in mind as I work on this pet project. Actually, it is mostly for the fun of doing it, though at best I might just be reinventing the wheel.