Twitter

For details, see the Evernote note "九章 new 系统一" (Jiuzhang system design, lesson 1):

summary

Scenario

What level of QPS do we need to handle: 10M or 1M?
- e.g. Twitter
  MAU 330M, DAU 170M (for an SNS company, DAU is typically about half of MAU)
- e.g. Facebook
  MAU 2B
- beyond DAU, there is more to look at (stickiness, habit patterns)
- - refer to: how many http://bit.ly/1Kml0M7
Based on DAU, can we estimate the number of concurrent users (average QPS)?
- DAU * avg_num_request_per_user / num_secs_per_day = 150M * 60 / 86400 (86400 = 3600 * 24) ~ 100k
- the peak would be avg_concur_user * 3 = 300k
- if this is a fast-growing product, we normally also plan for
- - max peak in three months = peak * 2 = 600k
- Read QPS
- - 300k
- Write QPS
- - 5k (estimate)
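
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the function and variable names are made up for illustration, and the inputs (150M DAU, 60 requests per user per day, the x3 and x2 multipliers) are the ones used in these notes.

```python
SECONDS_PER_DAY = 24 * 3600  # 86,400


def average_qps(dau: int, requests_per_user_per_day: int) -> float:
    """Average requests per second if traffic were spread evenly over a day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


# 150M * 60 / 86,400; the notes round this to 100k.
avg_qps = average_qps(150_000_000, 60)

rounded_avg_qps = 100_000                 # rounded value used in the notes
peak_qps = rounded_avg_qps * 3            # peak is assumed to be 3x the average
three_month_peak_qps = peak_qps * 2       # fast-growing product: 2x the peak

print(f"average ~ {avg_qps:,.0f}")
print(f"peak = {peak_qps:,}, three-month peak = {three_month_peak_qps:,}")
```
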
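Why does QPS matter?
- QPS = 100
- - your laptop is enough as the web server
- QPS = 1k
- - one decent web server is roughly enough
  - need to consider the single point of failure
- QPS = 1M
- - need to build a cluster of 1000 web servers
  - need to consider maintenance (what happens when one server goes down?)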
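
The tiers above imply roughly 1k QPS per web server (1M QPS spread over 1000 machines). A minimal sketch of that sizing rule, assuming the per-server capacity really is about 1k QPS (an inferred figure, not a measured one) and using a hypothetical helper name:

```python
import math


def servers_needed(qps: int, qps_per_server: int = 1_000) -> int:
    """Number of web servers needed, assuming each handles qps_per_server.

    The 1k default is an assumption inferred from the tiers in these notes.
    """
    return math.ceil(qps / qps_per_server)


print(servers_needed(100))        # a single machine is enough
print(servers_needed(300_000))    # the estimated peak read QPS
print(servers_needed(1_000_000))  # the 1M QPS tier
```

This ignores redundancy: to cover a single point of failure, even the smallest tier would need at least one spare machine.
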

Service

We should split the entire application into small modules:

User service // SQL database
- register
- login
Tweet service // NoSQL database
- post a tweet
- news feed (updates from everyone I follow)
- timeline (only my own tweets)
Media service // file system, e.g. Amazon S3
- upload image
- upload video (resumable upload: upload to a URL, then pass the URL to the server)
Friendship service // SQL or NoSQL database
- follow
- unfollow

Storage

How to store News Feed?

Pull model and Push model:

Pull model

The Pull model is straightforward: when a user requests the News Feed, read the latest tweets of every user they follow and merge them. Its biggest problem is slow reads, because N DB reads (one per followed user) are slow, especially for a user who follows many accounts. How to alleviate this? Use a cache, and avoid issuing a DB read inside a for loop; use select ... in ... instead.
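
A minimal sketch of the Pull model with one batched `IN` query instead of a query per followed user, plus a naive cache in front of the DB. It uses an in-memory SQLite database for illustration; the table names, column names, and function names are assumptions, not an actual schema, and cache invalidation is deliberately left out.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE friendship (from_user INTEGER, to_user INTEGER);
        CREATE TABLE tweet (id INTEGER PRIMARY KEY, user_id INTEGER,
                            content TEXT, created_at INTEGER);
    """)
    # User 1 follows users 2 and 3; user 4 is not followed by anyone.
    conn.executemany("INSERT INTO friendship VALUES (?, ?)", [(1, 2), (1, 3)])
    conn.executemany(
        "INSERT INTO tweet (user_id, content, created_at) VALUES (?, ?, ?)",
        [(2, "a", 10), (3, "b", 20), (2, "c", 30), (4, "d", 40)],
    )
    return conn


def pull_news_feed(conn: sqlite3.Connection, user_id: int, limit: int = 100) -> list[str]:
    """Two queries in total: one for followings, one IN query for their tweets."""
    followings = [row[0] for row in conn.execute(
        "SELECT to_user FROM friendship WHERE from_user = ?", (user_id,))]
    if not followings:
        return []
    placeholders = ",".join("?" * len(followings))  # "?,?,?" for the IN list
    rows = conn.execute(
        f"SELECT content FROM tweet WHERE user_id IN ({placeholders}) "
        "ORDER BY created_at DESC LIMIT ?",
        (*followings, limit),
    )
    return [row[0] for row in rows]


feed_cache: dict[tuple[int, int], list[str]] = {}


def cached_news_feed(conn: sqlite3.Connection, user_id: int, limit: int = 100) -> list[str]:
    """Check the cache first and hit the DB only on a miss (no invalidation here)."""
    key = (user_id, limit)
    if key not in feed_cache:
        feed_cache[key] = pull_news_feed(conn, user_id, limit)
    return feed_cache[key]


conn = make_demo_db()
print(cached_news_feed(conn, 1))  # newest first, user 4's tweet excluded
```

A real cache would need to be refreshed or invalidated when a followed user posts; that part is omitted to keep the sketch short.
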

Push model

We need to create a news feed table.

Steps:

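Keep a list for each user that stores their News Feed.
When a user posts a tweet, push that tweet into the News Feed list of each follower, one by one (fanout).
When a user wants to view the News Feed, just read the latest 100 entries from their News Feed list.
The biggest problem of the Push model is that the number of followers can be very large, so for a given tweet the delivery delay can be long.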
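
The steps above can be sketched in memory as follows. This is an illustration only: the dictionaries stand in for the friendship table and the news feed table, and all names are hypothetical.

```python
from collections import defaultdict

FEED_READ_LIMIT = 100  # "read the latest 100 entries"

followers: dict[int, set[int]] = defaultdict(set)    # author_id -> follower ids
news_feeds: dict[int, list[str]] = defaultdict(list)  # user_id -> pushed tweets, oldest first


def follow(follower_id: int, author_id: int) -> None:
    followers[author_id].add(follower_id)


def post_tweet(author_id: int, tweet: str) -> None:
    """Fanout: one write per follower, so the cost grows with the follower count."""
    for follower_id in followers[author_id]:
        news_feeds[follower_id].append(tweet)


def read_news_feed(user_id: int, limit: int = FEED_READ_LIMIT) -> list[str]:
    """Reading is cheap: take the last `limit` entries, newest first."""
    feed = news_feeds.get(user_id, [])
    return feed[-limit:][::-1]


follow(2, 1)
follow(3, 1)
post_tweet(1, "t1")
post_tweet(1, "t2")
print(read_news_feed(2))
```

The loop in `post_tweet` is where the delay comes from: an author with millions of followers triggers millions of writes for a single tweet.
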

Scale

step1: Optimize

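Fix design flaws (solve problems)
- Fix the Pull model's flaws
- - the slowest part happens on the user's read request (it costs the user waiting time)
  - - add a cache in front of DB access
    - cache each user's timeline
    - cache each user's news feed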
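- Fix the Push model's flaws
- - it uses more storage space
  - - compared with the Pull model keeping the News Feed in memory,
    - the Push model keeping the News Feed on disk is not a big deal
    - Disk is cheap
  - Inactive users
  - - a major flaw of Push: if there are many inactive users, Push wastes a lot of resources
    - Rank followers by weight
  - number of followers >> number of followings
  - - e.g. we started with the Push model but found it too slow: the fanout operation would take hours!
    - tradeoff: Push + Pull
    - first try to optimize with minimal changes to the existing model
    - - e.g. add a few more machines dedicated to Push tasks. Problem solved!
    - estimate long-term growth and evaluate whether switching the whole model is worth it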
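  - Optimization combining Push and Pull
    - ordinary users still Push
    - mark users like Lady Gaga as superstar users
    - for superstar users, do not Push to their followers' News Feeds
    - when a user needs the feed, fetch from the superstars' Timelines and merge the result into the News Feed
  - How to define a superstar?
    - followers > 1M
    - select count(*) … is very inefficient
  - how to improve?
    - add one attribute to the user table: is_superstar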
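
A minimal in-memory sketch of the Push + Pull combination described above, assuming a stored superstar flag (here a plain set standing in for the `is_superstar` column) rather than counting followers on every post. All other names are hypothetical, and tweets are modeled as `(timestamp, author_id, text)` tuples so they can be ordered by time.

```python
import heapq
from collections import defaultdict
from itertools import chain

Tweet = tuple[int, int, str]  # (timestamp, author_id, text)

followers: dict[int, set[int]] = defaultdict(set)    # author -> followers
followings: dict[int, set[int]] = defaultdict(set)   # user -> authors they follow
superstars: set[int] = set()                         # stands in for user.is_superstar
timelines: dict[int, list[Tweet]] = defaultdict(list)   # each author's own tweets
news_feeds: dict[int, list[Tweet]] = defaultdict(list)  # pushed tweets per user


def follow(follower_id: int, author_id: int) -> None:
    followers[author_id].add(follower_id)
    followings[follower_id].add(author_id)


def post_tweet(author_id: int, timestamp: int, text: str) -> None:
    tweet = (timestamp, author_id, text)
    timelines[author_id].append(tweet)
    if author_id in superstars:
        return  # superstar: skip fanout, followers pull at read time
    for follower_id in followers[author_id]:
        news_feeds[follower_id].append(tweet)  # ordinary user: push


def read_news_feed(user_id: int, limit: int = 100) -> list[Tweet]:
    """Merge the pushed feed with the timelines of followed superstars."""
    star_tweets = chain.from_iterable(
        timelines[author] for author in followings[user_id] if author in superstars
    )
    # Largest timestamps first, i.e. newest tweets first.
    return heapq.nlargest(limit, chain(news_feeds[user_id], star_tweets))


superstars.add(9)
follow(1, 9)
follow(1, 2)
post_tweet(2, 1, "hi")
post_tweet(9, 2, "star")
post_tweet(2, 3, "again")
print(read_news_feed(1))
```

Checking a flag is a constant-time lookup per post, whereas a follower count via select count(*) … would scan the friendship data each time.
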

summary:

The Push model requires fewer resources and less code, and suits products with low real-time requirements.

But when users have many followers and post many tweets, it's better to use the Pull model.

A transition process is needed: one that solves the current problem while keeping the changes as small as possible.
