2009/01/16

Twitter on Merb : Merb flat app Walkthrough





Description


Motivation

I’m a newbie in the Merb community. Since “the merger”, I have needed a correct and reliable reference, and so have other newbies in the community. I started translating merb-book into Korean, but that was not enough for us. I thought learning Merb would be easier if I built an app that I really needed, such as one that tracks recent changes. That was easy, since there are good tutorials and tools. Scraping the nightly version of the README files could be helpful too, because the Merb core team members are too busy to write guides and tutorials.

I have set aside scraping the README files for now. I just aggregate the Merb commit RSS feeds and send them to Twitter. I use TwitterFox as the Twitter client on my desktop. Whenever a new message arrives, it shows the message for a brief moment. It's really handy.



Generating a flat app


$ merb-gen flat MyApp
$ tree
|-- README.txt
|-- Rakefile
|-- application.rb
|-- config
|   |-- framework.rb
|   `-- init.rb
|-- gems
|-- spec
|   `-- spec_helper.rb
|-- tasks
|   `-- merb.thor
|       |-- app_script.rb
|       |-- common.rb
|       |-- gem_ext.rb
|       |-- main.thor
|       |-- ops.rb
|       `-- utils.rb
`-- views
    `-- foo.html.erb
$ git init
$ capify .


This is what I did first. As you can see, it is really small and shows the basic structure of how Merb works. If you want to learn more about the internals of Merb, visit the merb-internals-handbook (http://github.com/michaelklishin/merb-internals-handbook).


Using Model in a flat app

I edited the config/framework.rb file because I could not use the twitter_search gem to verify whether a message is unique. The same is true of the Twitter timeline API. It seems that they use a cache. Therefore I had to add a model to check whether the message I'm sending is unique.

Merb::Config[:framework] = {
  :application => Merb.root / "application.rb",
  :config => [Merb.root / "config", nil],
  :public => [Merb.root / "public", nil],
  :model => Merb.root / "models",
  :view => Merb.root / "views"
}


First, I added the line :model => Merb.root / "models" to the config/framework.rb file. Then I declared the dependencies in the init.rb file.
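The `/` in `Merb.root / "models"` joins path segments. Here is a minimal sketch of the idea, assuming the operator behaves like `File.join`. The `String#/` definition and the `/srv/my_app` root below are stand-ins for illustration only; in a real app Merb supplies the operator and `Merb.root`.

```ruby
# Stand-in for the String#/ helper a Merb app gets from its framework.
# Assumption: it joins path segments the way File.join does.
class String
  def /(other)
    File.join(self, other.to_s)
  end
end

root = "/srv/my_app" # made-up application root

framework = {
  :application => root / "application.rb",
  :model       => root / "models",
  :view        => root / "views"
}

puts framework[:model]
```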

$merb-gen model tweet

# referenced API.
# http://merbapi.com/classes/Merb/BootLoader/BuildFramework.html
class Tweet
  include DataMapper::Resource

  property :id, Serial
  property :message, Text, :length => 255, :nullable => false, :unique => true
  property :name, String, :length => 20, :nullable => false
  property :category, String, :length => 20
  property :link, String
  property :date, String, :nullable => false, :index => true
  property :created_at, DateTime
  property :updated_at, DateTime
end


$rake db:automigrate
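
The reason for the model is that a message should reach Twitter only once. The sketch below shows that idea with an in-memory store. `SeenMessages` is a hypothetical stand-in for the Tweet table, where the `:unique => true` option on the message property plays this role.

```ruby
require 'set'

# Hypothetical in-memory stand-in for the Tweet table.
class SeenMessages
  def initialize
    @messages = Set.new
  end

  # Returns true only the first time a message is seen, in the same way
  # that saving a Tweet fails when its message is already stored.
  def save(message)
    !@messages.add?(message).nil?
  end
end

store = SeenMessages.new
incoming = ["Fix router spec", "Bump version", "Fix router spec"]

# Only messages that save successfully would be handed on to Twitter.
to_post = incoming.select { |message| store.save(message) }
puts to_post.inspect
```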


Scraping with nokogiri

Since a Merb flat app has a single controller file, it was easy to start. But the Twitter API and nokogiri took some time to learn. I was not familiar with nokogiri, so I googled and found this tutorial (http://www.robertsosinski.com/2008/12/08/scraping-pages-made-easy-with-ruby-and-nokogiri/).
I also wanted to use a well-designed web site template, so I picked one of the recommended templates from Smashing Magazine (http://www.smashingmagazine.com/2008/12/01/100-free-high-quality-xhtmlcss-templates/).

There were 3 or 4 Ruby libraries for Twitter, and twitter4r has more documentation than the others. It took a long time to discover that the timeline API and the search API use a cache; I had simply forgotten about caching. Anyway, I installed the necessary gems.

$sudo gem install twitter4r twitter nokogiri shorturl

Then I edited the index.html.erb file. The controller only needs the render method. I wanted to use the feedtools and feedupdater gems, but there was not enough documentation to refer to and, as merb-book says, I had better not use code that I don't understand. Scraping pages with nokogiri is easy, but it makes my app fragile.

def self.scrape
  merb = 'http://github.com/wycats/merb/commits/1.0.x'
  negotiation = 'http://github.com/wycats/rails/commits/master'
  blog = 'http://yehudakatz.com'
  self.get_commits_and_bark(merb, "edgemerb")
  self.get_commits_and_bark(negotiation, "wycats_rails")
  self.get_blog_and_bark(blog)
end

# parsing github commits
def self.get_commits_and_bark(url, category)
  temp = []
  @doc = Nokogiri::HTML(open(url))
  @doc.css('div.human').each do |h|
    h.search('div.message').each {|k| @message = k.content}
    h.search('div.name').each do |k|
      @name = (k.content).to_s.strip.gsub(" (author)", "").gsub("(committer)","")
    end
    h.search('div.date').each do |k|
      @date = k.content
      @new_date = Time.parse(@date.strip)
    end
    h.search('pre a[href]').each do |k|
      @link = (k.attributes)
      begin
        @new_link = ShortURL.shorten("http://github.com" + @link.to_s.strip.gsub("href", ''), :rubyurl)
      rescue => e
        @new_link = ShortURL.shorten("http://github.com" + @link.to_s.strip.gsub("href", ''), :tinyurl)
      rescue Timeout::Error => e
        @new_link = "http://github.com" + @link.to_s.strip.gsub("href", '')
      end
    end
    temp << {:message => @message, :name => @name, :date => @new_date, :link => @new_link, :category => category}
  end
  temp.slice!(4, temp.length)
  temp.sort_by{|a| a[:date]}.each do |data|
    self.save_this(data)
  end
end

# parsing a blog
def self.get_blog_and_bark(url)
  temp = []
  @doc = Nokogiri::HTML(open(url))
  @doc.css('div.entry').each do |h|
    @name = "Katz"
    h.search('h3.entrytitle').each {|k| @title = k.content }
    h.search('h3 a[href]').each do |k|
      @link = (k.attributes)
      begin
        @new_link = ShortURL.shorten(@link.to_s.strip.gsub("href", '').gsub('relbookmark',''), :rubyurl)
      rescue => e
        @new_link = ShortURL.shorten(@link.to_s.strip.gsub("href", '').gsub('relbookmark',''), :tinyurl)
      rescue Timeout::Error => e
        @new_link = @link.to_s.strip.gsub("href", '').gsub('relbookmark','')
      end
    end
    h.search('div.meta').each do |k|
      @date = k.attributes
      @new_date = Time.parse(@date.to_s.strip.gsub(" Posted", "").gsub(" Comments(","").gsub(")",""))
    end
    temp << {:message => @title, :name => @name, :date => @new_date, :category => "Katz's blog", :link => @new_link}
  end
  temp.slice!(4, temp.length)
  temp.sort_by{|a| a[:date]}.each do |data|
    self.save_this(data)
  end
end

def self.save_this(data)
  @tweet = self.new(data)
  if @tweet.save
    puts "succeed to save them"
    begin
      self.post(data)
    rescue Twitter::RESTError => re
      unless re.code == "403"
        puts re
      end
    end #begin-rescue end
  else
    puts "failed to save them"
  end
  return @tweet
end

def self.post(tweet)
  @poster = String.new
  @poster = '[' + tweet[:category] + ']' + tweet[:message] + '(' + tweet[:name] + ')' + tweet[:link]
  STARLING.set('twitter_post', {:send => @poster})
end


I know the code above is not that beautiful, but I hope it helps other Rubyists who want an example of scraping with nokogiri. As you can see above, k.attributes, which nokogiri returns for each node found by the search method, is a Hash. This is why I used the to_s method.
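
The link handling above tries one URL shortener, then a second, and finally keeps the long link. That fallback chain can be sketched on its own as follows. `shorten_with_fallback` and the two lambdas are hypothetical; in the app each callable would wrap a `ShortURL.shorten` call for one service.

```ruby
require 'timeout'

# Try each shortener in order and fall back to the original URL when all
# of them fail. The shorteners are plain callables, so the sketch runs
# without network access.
def shorten_with_fallback(url, shorteners)
  shorteners.each do |shortener|
    begin
      return shortener.call(url)
    rescue Timeout::Error, StandardError
      next # this service failed, try the next one
    end
  end
  url # nothing worked, keep the long link
end

# Made-up shorteners for illustration only.
failing = lambda { |url| raise Timeout::Error, "service timed out" }
working = lambda { |url| "http://short.example/abc" }

long_url = "http://github.com/wycats/merb/commit/123"
shortened = shorten_with_fallback(long_url, [failing, working])
untouched = shorten_with_fallback(long_url, [failing, failing])

puts shortened
puts untouched
```

Putting the services in one list gives every attempt its own rescue, so a failure in the second service cannot escape the way it could from inside a rescue clause.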


Scheduling background works



Posting a message to Twitter was tricky for me. Because of Twitter's increasing popularity, posting through the API was sometimes too fast for Twitter. Posting messages takes a long time, and that code block needs to sleep between posts. In the initial code, I wrapped the block with run_later. But I found that I needed to schedule the tasks. There are several ways to schedule daemons. In version 0.0.1.4 (Twitter-on-Merb), I used crontab and rake, but the code was dirty and did not work as I expected. I talked about this with lakteck via Twitter, and I remembered that I had seen rufus-scheduler at igvita.com. It was really easy to implement a scheduler with the rufus-scheduler gem. The scheduler scrapes the web sites every 10 minutes. I also made a Starling consumer daemon, using "Advanced Rails Recipes #42" as a reference.

config/init.rb
...
require 'memcache'
require 'rufus/scheduler'
require 'fileutils'

Merb::BootLoader.before_app_loads do
  system 'starling -d -P log/pids/starling.pid -q log/starling_queue'
end

Merb::BootLoader.after_app_loads do
  STARLING = MemCache.new('127.0.0.1:22122')
  if Merb.environment == 'production' && (not File.exist?('tmp/scheduler.lock'))
    FileUtils.touch('tmp/scheduler.lock')
    scheduler ||= Rufus::Scheduler.start_new
    scheduler.every "10m" do
      self.scrape
      # puts "hello! Task Invoked: #{Time.now}"
    end
  end
  if Module.constants.include?('Mongrel') then
    class Mongrel::HttpServer
      alias :old_graceful_shutdown :graceful_shutdown
      def graceful_shutdown
        FileUtils.rm('tmp/scheduler.lock', :force => true)
        old_graceful_shutdown
      end
    end
  end
end
...
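
The lock file in the init file keeps a second server process from starting its own scheduler. Checking with `File.exist?` and then touching the file leaves a short window in which two processes can both pass the check. One way to close that window is to create the file with `File::EXCL`, so the check and the creation happen in one step. This is a sketch of that alternative, not the code the app uses, and `acquire_scheduler_lock` is a hypothetical helper.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: returns true if this process created the lock file
# and false if the file already existed.
def acquire_scheduler_lock(path)
  File.open(path, File::WRONLY | File::CREAT | File::EXCL) do |f|
    f.write(Process.pid.to_s) # record the owner for debugging
  end
  true
rescue Errno::EEXIST
  false
end

lock_dir  = Dir.mktmpdir
lock_path = File.join(lock_dir, 'scheduler.lock')

first  = acquire_scheduler_lock(lock_path) # this caller may start the scheduler
second = acquire_scheduler_lock(lock_path) # a later caller must not

puts "first: #{first}, second: #{second}"
FileUtils.rm_rf(lock_dir)
```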


lib/daemons/starling_consumer.rb file
#!/usr/bin/env ruby
require 'rubygems'

$running = true
Signal.trap("TERM") do
  $running = false
end

require 'twitter/console'
require 'memcache'
gem('twitter4r', '0.3.0')
require 'twitter'
require 'time'
require 'open-uri'
require 'daemons'

sleep_time = 0
starling = MemCache.new('127.0.0.1:22122')
config_file = File.join( File.dirname(__FILE__) + '/../../', 'config', 'twitter.yml')

while($running) do
  message = starling.get('twitter_post')
  if message
    @post_client = Twitter::Client.from_config(config_file, 'user' )
    begin
      status = @post_client.status(:post, message[:send])
    rescue Exception
      print 'X'
      starling.set('twitter_post', message)
      sleep_time = 30
    end
  else #no work
    # print '.'
    sleep_time = 5
  end
  $stdout.flush
  sleep sleep_time
end #loop end
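
The core of the consumer loop is: take a message, try to post it, and put it back with a longer pause if posting fails. The sketch below isolates one pass of that logic so it can run without Starling or Twitter. `consume_once`, the Array used as a queue and the `flaky_poster` lambda are all stand-ins for illustration. Returning 0 after a successful post is this sketch's own choice.

```ruby
# One pass of the consumer loop. Returns the number of seconds to sleep
# before the next pass. `queue` is a plain Array standing in for Starling
# and `poster` is any callable standing in for the Twitter client.
def consume_once(queue, poster)
  message = queue.shift
  return 5 if message.nil?   # nothing to do, poll again soon

  begin
    poster.call(message[:send])
    0                        # posted, look for more work right away
  rescue StandardError
    queue.push(message)      # keep the message for a later retry
    30                       # back off before trying again
  end
end

queue    = [{ :send => "[edgemerb] Fix router spec" }]
posted   = []
attempts = 0

# Made-up poster that fails on its first call and succeeds afterwards.
flaky_poster = lambda do |text|
  attempts += 1
  raise "Twitter is over capacity" if attempts == 1
  posted << text
end

delays = []
3.times { delays << consume_once(queue, flaky_poster) }

puts delays.inspect
puts posted.inspect
```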


lib/daemons/starling_consumer_ctl.rb file
require 'rubygems'
require 'daemons'

starling_client = File.join(File.dirname(__FILE__), "starling_consumer.rb")
options = {
  :dir_mode => :normal,
  :dir => File.join(File.dirname(__FILE__), "..", "..", "log"),
  :ARGV => ARGV,
  :log_output => true,
  :multiple => false,
  :backtrace => true,
  :monitor => false
}
Daemons.run(starling_client, options)


config/consumer.god file. Daemons are started after "cap deploy".
MERB_ROOT='/home/deploy/repos/twitter/current'

def generic_monitoring(w, options = {})
  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.interval = 10.seconds
      c.running = false
    end
  end

  w.restart_if do |restart|
    restart.condition(:memory_usage) do |c|
      c.above = options[:memory_limit]
      c.times = [3, 5] # 3 out of 5 intervals
    end
    restart.condition(:cpu_usage) do |c|
      c.above = options[:cpu_limit]
      c.times = 5
    end
  end

  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minute
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 2.hours
    end
  end
end

God.watch do |w|
  w.name = "consumer"
  w.interval = 60.seconds
  w.group = "twitter"
  # the control script is the lib/daemons/starling_consumer_ctl.rb file shown above
  w.start = "/bin/bash -c 'cd #{MERB_ROOT}; ruby lib/daemons/starling_consumer_ctl.rb start'"
  w.restart = "/bin/bash -c 'cd #{MERB_ROOT}; ruby lib/daemons/starling_consumer_ctl.rb restart'"
  w.stop = "/bin/bash -c 'cd #{MERB_ROOT}; ruby lib/daemons/starling_consumer_ctl.rb stop'"
  w.start_grace = 60.seconds
  w.restart_grace = 60.seconds
  w.pid_file = "#{MERB_ROOT}/log/consumer.pid"
  w.behavior(:clean_pid_file)
  generic_monitoring(w, :cpu_limit => 70.percent, :memory_limit => 18.megabytes)
end


Cache

My flat app was too slow, and it was time to implement caching. I used the following as a reference.
Here's the quick introduction from the original gist document:

== Quick intro
With fragment caching, you can mix dynamic and static content.

With action caching, the whole template is cached
but the before filters are still processed.

With page caching, the whole template is put in html files in a special
directory in order to be handled directly without triggering Merb.

by Dan Kubb on July 22, 2008


# application.rb
class TwitterOnMerb < Merb::Controller
  cache :commits, :about

  def _template_location(action, type = nil, controller = controller_name)
    controller == "layout" ? "layout.#{action}.#{type}" : "#{action}.#{type}"
  end

  def index
    render
  end

  def commits
    render
  end

  def about
    render
  end
end

#===== in commits.html.erb ==========================
...
<ul>
  <%- fetch_partial 'recent_commits', :updated_at=> (Tweet.recent.created_at rescue nil) %>
</ul>
...
#====== _recent_commits.html.erb partial ===============
<%- Tweet.recent.all(:limit => 20).each do |tweet| %>
  <li>
    <b id="<%= tweet.category%>"><%= tweet.category%></b>
    <a href='<%=tweet.link%>'><%= tweet.message %></a><br/>
    <font><%= relative_date(Time.parse(tweet.date)) %></font><br/>
  </li>
<% end-%>
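
The partial above is fetched together with an :updated_at value. The sketch below illustrates why a time stamp like that is useful, assuming it becomes part of the cache key: a newer stamp misses the cache and forces a fresh render. `FragmentCache` is a hypothetical in-memory class written for this example and is not the merb-cache implementation.

```ruby
# Hypothetical in-memory fragment cache. The key combines the fragment
# name with a version stamp.
class FragmentCache
  attr_reader :renders

  def initialize
    @store = {}
    @renders = 0
  end

  # Returns the cached fragment, or renders it with the given block and
  # stores the result when the key has not been seen before.
  def fetch(name, version)
    key = [name, version]
    @store.fetch(key) do
      @renders += 1
      @store[key] = yield
    end
  end
end

cache = FragmentCache.new
render_list = lambda { "<li>latest commits</li>" }

cache.fetch('recent_commits', 1) { render_list.call } # rendered
cache.fetch('recent_commits', 1) { render_list.call } # served from the cache
html = cache.fetch('recent_commits', 2) { render_list.call } # new stamp, rendered again

puts cache.renders
```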


But my init file was wrong at that time and threw errors whenever I ran "rake db:automigrate" after I set up the cache (http://gist.github.com/44702). I really didn't know how to fix this. At first, I guessed that the DataMapper version could be the cause, but after updating the gems the errors were still there. Fortunately, I had recently come to know Guillaume Maury (giom) via Twitter, and he helped me get out of this. The code was not familiar to me, although I had thought I knew how to set up the cache.
The code was already in his article (http://gom-jabbar.org/articles/2008/12/14/example-nginx-configuration-for-merb-with-page-caching-using-the-file-store). I run my apps on Passenger, so I had simply skipped it. :) I highly recommend his blog if you are interested in Merb 2 or Rails 3. Here is the snippet written by him. (The initial flat app with a plain dependency line did not work.)


I also referred to this README:
http://blog.evanweaver.com/files/doc/fauna/memcached/files/README.html

$ wget http://download.tangent.org/libmemcached-0.25.tar.gz
$ tar -xzvf libmemcached-0.25.tar.gz
$ cd libmemcached-0.25
$ ./configure
$ make && sudo make install
$ sudo gem install memcached
$ irb
> require 'rubygems'
> require 'memcached'


Deploy

$cap deploy:setup
$scp -r database.yml to server
$scp -r twitter.yml to server

$rake db:create
$rake db:automigrate
$cap deploy

This is my deploy.rb file.
set :application, "YOUR APP NAME"
set :domain, "YOUR IP ADDRESS"
set :user, 'deploy'
set :use_sudo, false
set :scm, :git
set :runner, user
set :ssh_options, { :forward_agent => true }
default_run_options[:pty] = true
set :port, PORTNUMBER #ssh port
set :repository, "ssh://deploy@YOUR IP ADDRESS/DIRECTORY TO YOUR GIT REPOSITORY"
set :deploy_to, "/DEPLOY DIRECTORY/#{application}"
set :deploy_via, :remote_cache
set :copy_exclude, ['.git', 'Capfile', 'config/deploy.rb']

role :app, application #or ip address
role :web, application #or ip address
role :db, application, :primary => true

namespace :deploy do
  desc "Link in the production extras and Migrate the Database"
  task :after_update_code, :roles => :app do
    run "ln -nfs #{shared_path}/database.yml #{release_path}/config/database.yml"
    run "ln -nfs #{shared_path}/twitter.yml #{release_path}/config/twitter.yml"
    run "ln -nfs #{shared_path}/log #{release_path}/log"
    #if you use ActiveRecord, migrate the DB
    #deploy.migrate
  end
end

namespace :deploy do
  [:start, :stop, :restart].each do |action|
    task action, :roles => :app do
      sudo "/usr/sbin/monit #{action.to_s} all -g merb_#{monit_group}"
    end
  end
  task :spinner, :roles => :app do
    start
  end
end

namespace :god do
  desc "restart god "
  task :restart, :roles=>:app do
    run "god stop twitter"
    run "god load #{current_path}/config/consumer.god"
    run "god start twitter"
    # run "god -c #{current_path}/config/consumer.god"
  end
end

namespace :god do
  desc "stop god "
  task :stop, :roles=>:app do
    run "god stop twitter"
  end
end

namespace :god do
  desc "start god "
  task :start, :roles=>:app do
    run "god start twitter"
  end
end

after "deploy", "deploy:after_update_code"
after "deploy:update_code", "deploy:after_update_code"
after "deploy:restart", "god:restart"

Tip

If you have problems after deployment, check that the config.ru file exists (it must be 'config.ru', not 'Config.ru') and that the permissions are correct (chmod -R 755).
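
The check described in this tip can be scripted. The following is only a sketch: `rackup_problems` is a hypothetical helper, and a temporary directory stands in for the deployed application root. It compares file names exactly because a plain existence test can succeed for 'Config.ru' on a case-insensitive file system.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical post-deploy check: returns a list of problems found with
# config.ru in the given application root.
def rackup_problems(app_root)
  problems = []
  unless Dir.children(app_root).include?('config.ru')
    problems << "config.ru is missing"
    return problems
  end
  path = File.join(app_root, 'config.ru')
  problems << "config.ru is not world-readable" unless File.world_readable?(path)
  problems
end

app_root = Dir.mktmpdir # stands in for the release directory

before = rackup_problems(app_root)

File.write(File.join(app_root, 'config.ru'), "# rackup file\n")
File.chmod(0755, File.join(app_root, 'config.ru'))
after = rackup_problems(app_root)

puts before.inspect
puts after.inspect
FileUtils.rm_rf(app_root)
```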