Polyglot persistence – http://www.dummies.com/programming/big-data/engineering/big-data-and-polyglot-persistence/
Month: January 2017
1. Design a meeting scheduler – https://www.careercup.com/question?id=5720778041458688
2. How to create a min heap from an array – https://www.youtube.com/watch?v=oAYtNV6vy-k
3. Inserting into a Heap – https://www.google.co.in/#q=inserting+an+element+in+a+heap+complexity
4. Jump Search – http://www.geeksforgeeks.org/jump-search/
5. Bubble Sort – https://www.youtube.com/watch?v=Jdtq5uKz-w4
6. Selection Sort – http://quiz.geeksforgeeks.org/selection-sort/
7. Merge Sort – http://quiz.geeksforgeeks.org/merge-sort/
8. Insertion Sort – http://quiz.geeksforgeeks.org/insertion-sort/
9. Quick Sort – http://quiz.geeksforgeeks.org/quick-sort/
10. Counting Sort – http://www.geeksforgeeks.org/counting-sort/
System Design Interview Questions Links
Elevator Design : http://thought-works.blogspot.in/2012/11/object-oriented-design-for-elevator-in.html
ElasticSearch general considerations:
1. _source field: it's good to store it, since it helps when we need to re-index.
2. Dynamic mappings support: think twice before giving support for querying on new dynamic fields.
3. Should you care about field data?
Elastic Search Mappings
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. For instance, use mappings to define:
- which string fields should be treated as full text fields.
- which fields contain numbers, dates, or geolocations.
- whether the values of all fields in the document should be indexed into the catch-all _all field.
- the format of date values.
- custom rules to control the mapping for dynamically added fields.
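The points above can be sketched as a mapping body. This is only an illustration: the field names (`message`, `status_code`, `created_at`, `location`) and the template name `strings_as_keywords` are hypothetical, the `text`/`keyword` type names assume a 5.x-style mapping, and whether `_all` is available depends on the Elasticsearch version.

```python
import json


def build_example_mapping():
    """Return a hypothetical mapping body for an illustrative document type."""
    return {
        # Catch-all field: disabled here, so field values are not copied into _all.
        "_all": {"enabled": False},
        # Custom rule for dynamically added fields: new string fields
        # are mapped as exact-value keywords instead of full text.
        "dynamic_templates": [
            {
                "strings_as_keywords": {
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }
            }
        ],
        "properties": {
            # String field treated as full text.
            "message": {"type": "text"},
            # Number, date (with an explicit format) and geolocation fields.
            "status_code": {"type": "integer"},
            "created_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
            "location": {"type": "geo_point"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_example_mapping(), indent=2))
```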
3. On the face of it, a search query is parallelized across many shards, but this involves considerable overhead: a coordinating node has to talk to all the other nodes, and the multiple network hops might impact performance.
4. Once the number of shards for an index is chosen, it cannot be changed. So, if the index no longer fits into the designated number of shards, the whole index has to be re-indexed.
In some systems like log management (using Logstash), one way of avoiding 3 and 4 is to have an index per day. In these cases, when we need to search logs for the last x days, we can restrict the query to only those nodes containing the shards of those indexes.
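As a rough sketch of the index-per-day idea, the helper below lists the daily index names that a query for the last x days would need to touch. The function name is hypothetical, and the `logstash-YYYY.MM.DD` naming is an assumption based on the usual Logstash daily index pattern; adjust the prefix and date format to whatever the cluster actually uses.

```python
from datetime import date, timedelta


def daily_index_names(today, days, prefix="logstash-"):
    """Return the daily index names covering the last `days` days, newest first."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return [
        prefix + (today - timedelta(days=offset)).strftime("%Y.%m.%d")
        for offset in range(days)
    ]


if __name__ == "__main__":
    # Indices that a "last 3 days" search would be restricted to.
    print(",".join(daily_index_names(date(2017, 1, 2), 3)))
```

The resulting names can be joined with commas and used as the index part of a multi-index search, so shards of older indices are never consulted.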
In some systems where we need to do user-specific search, we can have an index per user. However, if the number of users is large, even this becomes a bottleneck. So, we can have a shared index but use the routing feature to route user-specific data to specific shards. Also, when the index outgrows a shard, only the shard that holds the data for the specific user needs to be re-indexed.
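The routing idea can be illustrated with a small sketch: a routing value (here a user id) is hashed and taken modulo the number of primary shards, so all documents of one user land on the same shard and a user-specific search only needs that shard. Note that `zlib.crc32` is just a stand-in chosen for this example; Elasticsearch uses its own hash function internally, and `shard_for` is a hypothetical helper.

```python
import zlib


def shard_for(routing_key, num_primary_shards):
    """Pick a shard number for a routing key (illustrative hash, not Elasticsearch's)."""
    if num_primary_shards < 1:
        raise ValueError("num_primary_shards must be at least 1")
    # A stable hash of the routing value, reduced to the range of shard numbers.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    for user in ("user-1", "user-2", "user-3"):
        print(user, "-> shard", shard_for(user, 5))
```

Because the shard is derived from the modulus, changing the number of primary shards would send existing routing values to different shards, which is one way to see why the shard count is fixed once the index is created.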
- Can documents/access be partitioned in a natural way?
- Need to find documents by id (update/delete/get etc.)?
- Know the relevant features (routing, aliases, multi-index search)
- Measure the impact of distributed search
- Indices do not come for free
- Care about field data?
Don'ts for Elasticsearch
- Don't create more shards than you need. More shards enable larger indices and can scale out operations on individual documents. If unsure, overallocate by a little, not by a factor of 3 or 4. Evaluate the number of shards required.
- Don't treat all nodes as equal (master, data, client/aggregator nodes etc.). Have dedicated master nodes. Data nodes are where querying and the actual search happen (loading index data into memory etc.). Also try to distinguish between data nodes and client nodes.
- Don't run wasteful queries – avoid deep pagination queries, and use scan + scroll instead of sorting. Only query indices/shards that may contain hits.
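To see why deep pagination is wasteful, here is a back-of-the-envelope sketch. It assumes the usual distributed search behaviour, where every shard has to produce its top `from + size` hits and the coordinating node then merges the candidates from all shards. The function name is hypothetical.

```python
def deep_pagination_cost(num_shards, page_from, page_size):
    """Estimate how many hits are collected to serve one page of results."""
    if num_shards < 1 or page_from < 0 or page_size < 1:
        raise ValueError("invalid pagination parameters")
    # Each shard must build a sorted list of from + size candidates.
    per_shard = page_from + page_size
    # The coordinating node merges the candidates of every shard.
    return {"per_shard": per_shard, "coordinator": num_shards * per_shard}


if __name__ == "__main__":
    # Page of 10 results starting at offset 10000 on a 5-shard index.
    print(deep_pagination_cost(5, 10000, 10))
```

With 5 shards, an offset of 10000 and a page size of 10, each shard collects 10010 hits and the coordinating node merges 50050, only to return 10 of them.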
Re-indexing would be required if
1. Mapping changes
2. Index/shard reaches its capacity
3. Reduce/Increase number of shards
Factors to consider for re-indexing
- Where is the data that needs to be re-indexed? Is it in an external data source or in the _source field in Elasticsearch?
- Is it OK to have downtime?
- Update API usage
- Disable refresh and decrease number of replicas
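The last point can be sketched as two index settings bodies: one applied to the target index before a bulk re-index, and one to restore normal operation afterwards. The helper names are hypothetical, and the restored values (`1s` refresh, one replica) are assumptions that should be replaced with whatever the index normally uses.

```python
import json


def bulk_reindex_settings():
    """Settings to apply before re-indexing: no periodic refresh, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="1s", number_of_replicas=1):
    """Settings to apply once re-indexing is done (values are assumptions)."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": number_of_replicas,
        }
    }


if __name__ == "__main__":
    print(json.dumps(bulk_reindex_settings()))
    print(json.dumps(restore_settings()))
```

Disabling refresh avoids repeatedly making new segments searchable while documents are still being loaded, and dropping replicas means each document is indexed only once until the replicas are re-enabled.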
Logstash (DataFlow Engine) Key features
- Open-source central processing engine for data logistics
- Construct dataflow pipelines to transform events and route streams
- Data source agnostic
- Horizontally scalable with native buffering out of the box
- Robust plugin ecosystem for integrations and processing
More articles on elastic stack
- Logstash pipeline configuration (is there support for DAG pipelines?)