Videos uploaded by user “Stoney Vintson”
PySpark: Python API for Spark
UC Berkeley AmpLab member Josh Rosen presents PySpark, the new Python API for Spark, which is available in release 0.7. This presentation was given at the Spark meetup at Conviva in San Mateo, Ca on Feb 21st 2013. Download here: http://spark-project.org/downloads/
Summary:
00:33 What is Spark?
03:00 What is PySpark?
03:45 Example Word Count
04:35 Demonstration of interactive shell on AWS EC2
06:22 Tracking time elapsed, %time berkeley_pages.count()
06:37 Spark web interface
09:14 Distributing data, sc.parallelize
11:20 API documentation
11:27 Python doctest, create tests from interactive samples
11:58 Example kmeans.py, k-means clustering
12:39 Getting help, help(sc)
13:00 Example wordcount.py
13:18 PySpark implementation details
14:15 PySpark is less than 2K lines including comments
17:18 Pickled objects, RDD[Array[Byte]]
17:44 Batching Pickle to reduce overhead
18:00 Consolidating operations into a single pass when possible
19:27 PySpark roadmap: adding sorting support, file formats such as csv, PyPy JIT
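The "Batching Pickle to reduce overhead" item at 17:44 can be illustrated with a minimal standard-library sketch. This is not PySpark's actual code; the helper names `batched_pickle` and `unbatch` and the batch size are hypothetical, chosen only to show the general idea of serializing several records per byte string instead of one.

```python
import pickle
from itertools import islice


def batched_pickle(items, batch_size):
    """Yield one pickled byte string per group of up to batch_size items.

    Hypothetical helper: grouping records means one pickle call (and one
    byte array) per batch rather than per record.
    """
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield pickle.dumps(batch)


def unbatch(blobs):
    """Reverse batched_pickle: unpickle each blob and yield its records."""
    for blob in blobs:
        yield from pickle.loads(blob)


# Ten (word, count) records packed four per batch.
records = [("word%d" % i, i) for i in range(10)]
blobs = list(batched_pickle(records, batch_size=4))
restored = list(unbatch(blobs))
```

With ten records and a batch size of four, the records travel as three byte strings instead of ten, and unbatching returns them in the original order.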
Views: 48049 Stoney Vintson
Introduction to AmpLab Spark Internals
Matei Zaharia of UC Berkeley's AmpLab presents an introduction to Spark Internals on 2012-12-18 at Yahoo in Sunnyvale, Ca. The presentation is 1 hour 14 minutes long.
Summary:
2:32 Spark Project Goals
4:48 Spark Code base Size
5:59 Code base breakdown by module
8:45 Components
10:41 Example Job
12:03 RDD Graph
14:43 Data Locality
15:48 In More Detail: Life of a Job
16:15 Scheduling Process
27:11 RDD Abstraction
27:52 RDD Interface
29:34 Example: HadoopRDD
30:28 Example: FilteredRDD
31:32 Example: JoinedRDD
32:47 Discussion of source code
38:25 Dependency Types, Narrow and Wide
39:49 DAG Scheduler
40:43 Discussion of source code
42:05 Scheduler Optimizations
45:39 Task Details
51:07 Worker
52:00 Other Components: BlockManager
52:16 Other Components: CommunicationsManager
52:24 Other Components: MapOutputTracker
52:42 Extending Spark
52:53 Extension points: RDD, SchedulerBackend, spark.serializer
53:38 What People Have Done
53:39 Possible Future Extensions
54:15 As an Exercise to prepare for extending Spark
54:50 How to contribute
54:52 Development Process: issue tracking, developer list, master on GitHub; follow code style and add tests
Views: 32504 Stoney Vintson
Spark Streaming:  Large Scale near real-time Stream Processing
Tathagata Das of UC Berkeley AmpLab presents Spark Streaming, which has been released as alpha in release 0.7 of Spark. This presentation was given at the Spark meetup on Feb 21st 2013 at Conviva in San Mateo, Ca. Download: http://spark-project.org/downloads/
Summary:
00:09 Motivation
01:07 Case study: Conviva, Inc.
03:26 Goals
04:04 Existing Streaming Systems
05:07 Storm and Trident
06:40 Discretized Stream Processing: series of very small, deterministic batch jobs
07:52 State between batches in memory, immutable, fault tolerant
08:11 Minimum batch time period from 1/2 second to approximately 1 second
08:46 Visual representation of Discretized Stream Processing
16:32 Fault Recovery
17:02 Fault Recovery is computed in parallel
17:12 Programming Model and DStreams
17:53 DStream Data Sources: HDFS, Kafka, Flume, Twitter, TCP sockets, Akka actor, ZeroMQ
18:34 Transformations of DStreams: RDD-like operations, new window and stateful operations
19:18 Output: HDFS, console, foreach arbitrary operation on every RDD
19:53 Example: 20 most popular hashtags in the last 10 minutes of tweet stream
23:15 Smart window-based reduce
25:24 Sort transform by key on hashtags
27:09 Demo using AWS
29:39 Other Operations: maintaining state, tracking sessions
30:45 Performance: can process 6 GB/sec (60M records/sec) on 100 nodes at sub-second latency (Grep, WordCount)
31:32 Comparison: Spark Streaming 670k records/sec/node, Storm 115k records/sec/node, Apache S4 7.5k records/sec/node
32:30 Fast Fault Recovery: recovers from faults/stragglers within 1 sec
32:53 Real Applications: Conviva real-time monitoring of video metadata
34:05 Real Applications: Mobile Millennium Project, traffic estimation, Markov chain Monte Carlo simulations on GPS observations
35:39 Failure semantics
35:53 Java API for Streaming
36:06 Contributors: 5 from UC Berkeley, 3 external contributors
36:12 Vision: one stop shop, stream processing + ad-hoc queries + batch processing
37:24 Questions
38:00 Strata Conference presentations on Berkeley Data Analytics Stack (BDAS)
38:37 Conclusion, new Streaming guide, Spark Streaming system paper http://tinyurl.com/dstreams
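The "smart window-based reduce" item at 23:15 is commonly understood as updating a windowed aggregate incrementally: add the newest batch and subtract the batch that just left the window, rather than recomputing over every batch. The sketch below shows that idea in plain Python under that assumption; it does not use the Spark Streaming API, and the class name `SlidingWindowCounts` and its methods are hypothetical.

```python
from collections import Counter, deque


class SlidingWindowCounts:
    """Hashtag counts over the last `window_batches` batches (hypothetical)."""

    def __init__(self, window_batches):
        self.window_batches = window_batches
        self.batches = deque()   # per-batch counts still inside the window
        self.totals = Counter()  # running counts for the whole window

    def add_batch(self, hashtags):
        # Add the counts of the newest batch to the running totals.
        counts = Counter(hashtags)
        self.batches.append(counts)
        self.totals.update(counts)
        # Subtract the batch that slid out of the window, if any.
        if len(self.batches) > self.window_batches:
            expired = self.batches.popleft()
            for tag, n in expired.items():
                self.totals[tag] -= n
                if self.totals[tag] == 0:
                    del self.totals[tag]
        return self.totals

    def top(self, n):
        # Most frequent hashtags in the current window.
        return self.totals.most_common(n)


window = SlidingWindowCounts(window_batches=2)
window.add_batch(["#a", "#b", "#a"])
window.add_batch(["#b"])
window.add_batch(["#c", "#b"])  # the first batch now leaves the window
top_tag = window.top(1)
```

Each update touches only the entering and leaving batches, so the cost per step does not grow with the window length.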
Views: 10430 Stoney Vintson
AmpLab GraphX: Graph Analytics on Spark
Joseph Gonzalez and Reynold Xin of the UC Berkeley AmpLab present the GraphX graph analytics API that is implemented using the Spark distributed computing framework. The presentation was given at Flurry in San Francisco, Ca on July 2nd 2013.
Views: 8477 Stoney Vintson
Apache Spark:  Distributed Machine Learning using MLbase
Ameet Talwalkar and Evan Sparks present their work on the MLbase project, which will be a distributed Machine Learning platform on top of Apache Spark. This presentation was given at Twitter on August 6th 2013. http://mlbase.org/ http://www.meetup.com/spark-users/events/129721872/ In this talk we describe our efforts, as part of the MLbase project, to develop a distributed Machine Learning platform on top of Spark. In particular, we present the details of two core components of MLbase, namely MLlib and MLI, which are scheduled for open-source release this summer. MLlib provides a standard Spark library of scalable algorithms for common learning settings such as classification, regression, collaborative filtering and clustering. MLI is a machine learning API that facilitates the development of new ML algorithms and feature extraction methods. As part of our release, we include a library written against the MLI containing standard and experimental ML algorithms, optimization primitives and feature extraction methods. http://spark-project.org/ http://incubator.apache.org/projects/spark.html
Views: 10805 Stoney Vintson
AmpLab Tachyon and Shark update
"Tachyon: Reliable File Sharing at Memory-Speed Across Cluster Frameworks", presented by AmpLab visiting researcher Ali Ghodsi and AmpLab graduate student Haoyuan Li. Also, AmpLab member Reynold Xin gives an update for Shark that includes a feature roadmap for release 0.8. Shark releases will match Spark releases starting with Shark 0.7. This talk was presented at Google Ventures Startup Lab on May 9th 2013. http://tachyon-project.org/ https://github.com/amplab/tachyon
Views: 1492 Stoney Vintson
Data Mining Case Study meetup:  Data Mining Overview
Junling Hu presents a high-level overview of data mining at the "Data Mining Case Study" meetup at the HackerDojo in Mountain View, Ca on Aug 17th 2013.
Views: 1554 Stoney Vintson
Performance Tuning:  Adam Leventhal, DTrace
"DTrace: the Performance Tuner's Swiss Army Knife" Adam Leventhal - Director and Systems Architect, Delphix
Views: 496 Stoney Vintson
Performance Tuning:  Steve Shah, "10 Things I've Learned about TCP"
Steve Shah, Sr Director, Product Management, Citrix Systems, speaks about "10 Things I've Learned about TCP" at the SF Bay Area Large-Scale Production Engineering Meetup held at Yahoo URL Cafe in Sunnyvale, Ca on Thursday 8/23/2012. Length: 23:58 minutes.
Views: 791 Stoney Vintson
AmpLab Spark 0.7 release
UC Berkeley AmpLab member Matei Zaharia gives a quick presentation on the Spark 0.7 release at the Spark meetup. Matei gave this presentation at Conviva on 02-21-2013. Major new features in the 0.7 release are Spark Streaming and PySpark, the Python API for Spark. http://spark-project.org/spark-release-0-7-0/
Summary:
00:27 Agenda: Introduction, Python API, Streaming
00:40 What is Spark?
02:06 Contribution statistics
02:54 List of individual contributors
03:12 New features in this release
Major: Python API (PySpark), Spark Streaming alpha release
Minor: Memory monitoring dashboard, Maven build & Debian packages, RDD checkpointing, Metadata cleanup (TTL), Shuffle speedups, Improved EC2 scripts
05:37 Release of 0.7 from master in a few days
Views: 371 Stoney Vintson
AmpLab Spark:  Deep Dive with Spark Streaming
A newer Spark Streaming presentation from AmpCamp 3 (Summer 2013): https://www.youtube.com/watch?v=_adU0xpFpU8 Presentation slides: http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617 Tathagata Das (TD), a member of the UC Berkeley AmpLab, presents a deep dive with Spark Streaming on June 17th 2013 at Plug and Play in Sunnyvale, California. Presentation slides are not embedded in the video. http://spark-project.org/documentation/
Views: 2683 Stoney Vintson
Intro to RockOurData
RockOurData serves data generalists and covers analysis, machine learning, visualization and sometimes a bit of data pipelines. You can also see blog articles and listen to podcast episodes at rockourdata.com
Views: 25 Stoney Vintson
