Mailbrew Diary #3: Newsletter Generation

September 18, 2019

Users will experience Mailbrew mainly through the newsletters we send them. It's super-important to get them right, from look and content to timing of delivery.

When creating a newsletter, users define its schedule and the sources that populate it with content. I decided to model both sources and schedules as flexible JSON objects on the newsletter model in our database (Postgres has great native support for JSON; you can even query it). This will allow easy expansion of both source and schedule types in the future (we plan to add a lot of them) and will make updating them from our frontend React app as simple as posting to a REST endpoint.
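To make this concrete, here's a minimal sketch of what such a model could look like in Django (the field and model names are my assumptions, not the actual schema):

from django.contrib.postgres.fields import JSONField  # Postgres-native JSON
from django.db import models

class Newsletter(models.Model):
    # Hypothetical fields; the real schema surely has more.
    title = models.CharField(max_length=200)
    schedule = JSONField(default=dict)  # e.g. {"type": "monthly", ...}
    sources = JSONField(default=list)   # e.g. [{"type": "rss", ...}]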

Schedules

For the MVP, we plan to support daily, weekly and monthly schedules. The most complex schedule is the monthly one. Here is an example:

{
  "type": "monthly",
  "day_of_month": 15,
  "hour": 10,
  "minute": 0
}

This schedule describes a newsletter that is received monthly on the 15th at 10am.
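As a sketch of how a schedule like this could be resolved into a concrete send time (this helper is hypothetical, handles only the monthly type, and ignores months shorter than day_of_month for brevity):

from datetime import datetime

def next_monthly_send(schedule, now=None):
    """Return the next datetime matching a monthly schedule dict."""
    now = now or datetime.utcnow()
    candidate = now.replace(
        day=schedule["day_of_month"],
        hour=schedule["hour"],
        minute=schedule["minute"],
        second=0,
        microsecond=0,
    )
    if candidate <= now:
        # This month's send has already passed: move to next month.
        if candidate.month == 12:
            candidate = candidate.replace(year=candidate.year + 1, month=1)
        else:
            candidate = candidate.replace(month=candidate.month + 1)
    return candidate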

Sources

For the MVP, we plan to support Reddit and RSS as sources.

A source describes where the content for a given newsletter comes from. A collection of sources forms the recipe for creating the newsletter. Here is an example of our RSS source:

{
  "type": "rss",
  "feeds": [
    {
      "title": "Daring Fireball",
      "url": "https://daringfireball.net/feeds/main"
    },
    {
      "title": "Mac Stories",
      "url": "https://www.macstories.net/feed/"
    }
  ]
}

Similarly, the Reddit source describes the subreddits to include, how many posts to include per subreddit, and a mode (top of the day, week, or month).
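Following the same pattern, a Reddit source could look something like this (the exact field names here are my guess):

{
  "type": "reddit",
  "mode": "top_week",
  "posts_per_subreddit": 5,
  "subreddits": ["programming", "apple"]
}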

Pipeline

The pipeline contains all the steps needed for newsletter generation. I broke it down into small pieces with single responsibilities to make it more scalable, and implemented a locking mechanism through Redis that allows hundreds of workers to run the same pipeline without overlapping. By acquiring a distributed lock before generating or sending, workers will never process the same newsletter issue twice.
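Here's a minimal sketch of such a lock using redis-py's atomic SET with NX and EX (the key names are illustrative):

import redis

r = redis.Redis()

def acquire_issue_lock(issue_id, ttl=600):
    """Claim a newsletter issue; returns falsy if another worker holds it."""
    # SET ... NX EX: succeeds only if the key doesn't already exist,
    # and expires automatically so a crashed worker can't hold it forever.
    return r.set(f"lock:issue:{issue_id}", "1", nx=True, ex=ttl)

if acquire_issue_lock(42):
    pass  # safe to generate/send issue 42 here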

The first step in the pipeline runs every 10 minutes. It scans all newsletters, parses their schedules, and determines when the next issue should be published. If that's within the next 4 hours, a task is queued to generate it (we use Celery to handle our async tasks).
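A rough sketch of this periodic task, reusing the hypothetical Newsletter model and schedule helper from above (real code would dispatch on all schedule types, not just monthly):

from datetime import datetime, timedelta
from celery import shared_task

@shared_task
def scan_newsletters():
    """Runs every 10 minutes: queue generation for issues due soon."""
    for newsletter in Newsletter.objects.all():
        next_send = next_monthly_send(newsletter.schedule)
        if next_send - datetime.utcnow() <= timedelta(hours=4):
            generate_newsletter.delay(newsletter.id)  # see next step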

This newsletter generation task fetches the content from the sources and structures it in a source-agnostic format that is then passed to our templating engine (Django's) to generate the HTML that makes up the newsletter. We save this in two places: Redis (with a 5-hour expiration), to be quickly retrieved when sending out the newsletter, and Amazon S3. We don't store the HTML in the DB (that's a mistake I already made with Unreadit: HTML takes a shit-ton of space, and db space is expensive). Amazon S3 is easy to use and will also let us embed the generated newsletter in the web app.
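Sketching the generation task under the same assumptions (the fetching helper, template name, and bucket name are all hypothetical):

import boto3
import redis
from celery import shared_task
from django.template.loader import render_to_string

r = redis.Redis()
s3 = boto3.client("s3")

@shared_task
def generate_newsletter(newsletter_id):
    newsletter = Newsletter.objects.get(id=newsletter_id)
    # Each source contributes items in the same source-agnostic shape.
    items = fetch_items(newsletter.sources)  # hypothetical fetcher
    html = render_to_string("newsletter.html", {"items": items})
    # Redis copy for fast retrieval at send time, 5-hour expiration.
    r.set(f"issue:html:{newsletter_id}", html, ex=5 * 3600)
    # Durable copy on S3, also used to embed the issue in the web app.
    s3.put_object(
        Bucket="mailbrew-issues",  # hypothetical bucket
        Key=f"{newsletter_id}.html",
        Body=html.encode(),
        ContentType="text/html",
    )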

There is one last step in the pipeline: it looks at all generated but unpublished newsletters and, when the time comes, schedules the tasks to send them via Amazon SES, a super-cheap email service by Amazon with good deliverability that charges $1 per 10,000 emails.
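A minimal sketch of the send itself with boto3's SES client (the sender address, subject, and region are placeholders):

import boto3
import redis

r = redis.Redis()
ses = boto3.client("ses", region_name="us-east-1")

def send_issue(newsletter_id, recipient):
    # The generation task cached the rendered HTML in Redis.
    html = r.get(f"issue:html:{newsletter_id}").decode()
    ses.send_email(
        Source="hello@mailbrew.com",  # placeholder sender
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": "Your Mailbrew issue"},
            "Body": {"Html": {"Data": html}},
        },
    )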


I will probably write a follow-up to see how this plan holds up on the battleground once we launch the service. If you have feedback on my choices, feel free to hit me up on Twitter.