scaling booklist

i used to hate reading. in fact, i probably didn't read a single book between 2019 and 2024. but when a good friend gifted me a copy of setting the table by danny meyer in december of 2023, i decided to pick it up. i ended up devouring it in a weekend.

i realized i actually liked reading, but hated wasting my time on the wrong books. after that experience, i started making a point to ask the people i admired for their book recommendations. a year later, i'd made it through 30+ books and loved every single one.

the internet has no shortage of book recommendations. in fact, there are dozens of websites that consolidate expert book recommendations, but most of them are painfully slow, impossible to search, cluttered with intrusive ads, and overall pretty janky. booklist started as a weekend project to solve this problem. a few months later, it's turned into one of the largest collections of expert book recommendations on the internet and the hardest performance problem i’ve ever tackled.

how it works

booklist is a curated collection of the most frequently recommended books on the internet. it can be used to help you discover what to read next or to explore the relationships hidden in the data (e.g. who has the most similar reading tastes? what are the most popular books or genres? etc.).

data collection

i started by gathering a list of websites to scrape. i ended up with over a dozen - all with different structures and layouts. i wrote one massive stagehand script that takes in a website as a command line argument, identifies all of the recommenders on the page, visits each of their pages, and scrapes all of their book recommendations. i also included a number of personal sources from friends and founders.

here's what the stagehand code looks like:

import { z } from "zod";
import type { Page } from "@browserbasehq/stagehand";

// extract book recommendations from the page
async function extractBookRecommendations(page: Page, personName?: string) {
  const instruction = `Look for a list or collection of book recommendations on the page. For each book found:
    1. The title should be a proper book title
    2. The author should be the actual writer of the book (not ${personName})
    3. Skip items without both title and author
    4. Skip items where the author name matches ${personName}`;

  // expected shape of the extracted data
  const schema = z.object({
    books: z.array(
      z.object({
        title: z.string(),
        author: z.string(),
      })
    ),
  });

  // run stagehand's text-based extraction with the instruction + schema
  const result = await page.extract({
    instruction,
    schema,
    useTextExtract: true,
  });

  return result.books;
}
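
the driver around that extractor is roughly this shape (a simplified sketch; the real script also handles retries and each site's quirks):

import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

// crawl a single source site passed on the command line:
// find every recommender, then scrape their book recommendations
async function scrapeSite(url: string) {
  const stagehand = new Stagehand({ env: "LOCAL" });
  await stagehand.init();
  const page = stagehand.page;

  await page.goto(url);

  // identify the recommenders on the page and the links to their pages
  const { recommenders } = await page.extract({
    instruction:
      "Find every person whose book recommendations are listed on this site, with a link to their page.",
    schema: z.object({
      recommenders: z.array(z.object({ name: z.string(), url: z.string() })),
    }),
  });

  for (const recommender of recommenders) {
    await page.goto(recommender.url);
    const books = await extractBookRecommendations(page, recommender.name);
    // ...hand the books off to the enrichment + storage steps below
  }

  await stagehand.close();
}

scrapeSite(process.argv[2]);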

data processing

once a person or book is scraped, i kick off a set of enrichment steps to add metadata and context.

for each person:

  • a stagehand script that finds the most relevant website for them (often wikipedia or twitter/x)
  • a braintrust function to generate a brief description and classify their type (investor, journalist, etc.)
  • a function to generate an embedding for the description using openai

and for each book:

  • a stagehand script that finds the book on amazon
  • a braintrust function to generate a description and classify the genre (fiction, non-fiction, etc.)
  • a function to generate embeddings for the title, author, and description using openai (sketched below)
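
the embedding step is the simplest of the three. here's a minimal sketch using the openai sdk (the model name and helper are illustrative, not my exact code):

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// generate a 1536-dimension embedding for a piece of text
// (any 1536-dim model fits the schema below; this model name is an assumption)
async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return response.data[0].embedding;
}

// books get one embedding per field
async function embedBookFields(book: { title: string; author: string; description: string }) {
  const [title_embedding, author_embedding, description_embedding] = await Promise.all([
    embed(book.title),
    embed(book.author),
    embed(book.description),
  ]);
  return { title_embedding, author_embedding, description_embedding };
}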

data storage

everything is stored in supabase (books, people, recommendations).

create table "public"."books" (
    "id" uuid not null default uuid_generate_v4(),
    "title" text not null,
    "author" text not null,
    "description" text,
    "genre" text[] not null,
    "created_at" timestamp,
    "updated_at" timestamp,
    "amazon_url" text,
    "title_embedding" vector(1536),
    "author_embedding" vector(1536),
    "description_embedding" vector(1536),
    "similar_books" jsonb,
    "recommendation_percentile" numeric
);

create table "public"."people" (
    "id" uuid not null default uuid_generate_v4(),
    "full_name" text not null,
    "created_at" timestamp,
    "updated_at" timestamp,
    "url" text,
    "type" text,
    "description" text,
    "description_embedding" vector,
    "similar_people" jsonb,
    "recommendation_percentile" numeric
);

create table "public"."recommendations" (
    "id" uuid not null default uuid_generate_v4(),
    "person_id" uuid not null,
    "book_id" uuid not null,
    "source" text not null,
    "source_link" text,
    "created_at" timestamp,
    "updated_at" timestamp
);

when inserting new books, i check for potential duplicates by comparing embeddings in an rpc function. if a match is found, i skip inserting the book and just add the new recommender to the people table and a new row to the recommendations table. if not, i insert into all three.
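
from the scraper's side, the check looks roughly like this (the match_books rpc name and the 0.9 threshold are illustrative; the real rpc presumably does a pgvector similarity search over the stored embeddings):

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// look for an existing book whose embedding is close enough to the scraped one
// ("match_books" and the threshold are assumptions for this sketch)
async function findDuplicateBook(titleEmbedding: number[]) {
  const { data, error } = await supabase.rpc("match_books", {
    query_embedding: titleEmbedding,
    match_threshold: 0.9,
    match_count: 1,
  });
  if (error) throw error;
  return data?.[0] ?? null; // existing book row, or null if the book is new
}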

making it fast

after scraping hundreds, then thousands, then tens of thousands of books, i found myself running into some serious performance issues.

data fetching

my first instinct was to make the site fully static. with force-static, the data would be fetched from supabase once at build time, meaning the same cached content would be served to all users. however, i eventually reached vercel's isr page size limit when the recommendations table grew too large. i was trying to pre-load 40mb, but the limit is ~19mb.

v0 and windsurf recommended switching to dynamic rendering or implementing pagination to reduce the page size. however, i was determined to keep the site static without pagination or lazy loading. my solution was slightly unconventional but straightforward:

  • single dump at build time: at next build, i query all three supabase tables and dump the results into multiple json files under /public/data/ (see the sketch after this list).
  • initial load: on page load, the app immediately fetches just the first 50 books and recommenders, allowing the page to render instantly.
  • immediate swr fetch for remaining data: once the first 50 records load, swr quickly fetches the full dataset in the background, followed by extended book details like descriptions and related titles.
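
the dump script itself is nothing fancy. it's roughly something like this, run as part of next build (a simplified sketch; in practice the queries are paginated past supabase's default row limit and an extended-details file is written too):

import { createClient } from "@supabase/supabase-js";
import { mkdir, writeFile } from "node:fs/promises";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

// query supabase once at build time and dump static json files under /public/data/
// (the /booklist prefix in the client urls below is the site's base path, an assumption here)
async function dumpData() {
  await mkdir("public/data", { recursive: true });

  const { data: books } = await supabase
    .from("books")
    .select("id, title, author, genre, recommendation_percentile");

  const { data: recommenders } = await supabase
    .from("people")
    .select("id, full_name, type, recommendation_percentile");

  await writeFile("public/data/books-initial.json", JSON.stringify(books?.slice(0, 50)));
  await writeFile("public/data/books-essential.json", JSON.stringify(books));
  await writeFile("public/data/recommenders-initial.json", JSON.stringify(recommenders?.slice(0, 50)));
  await writeFile("public/data/recommenders.json", JSON.stringify(recommenders));
}

dumpData();

on the client, the swr hooks that consume these files look like this:
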
// load first 50 books and recommenders immediately
const { data: initialBooks } = useSWR<EssentialBook[]>(
  "/booklist/data/books-initial.json",
  fetcher,
);

const { data: initialRecommenders } = useSWR<FormattedRecommender[]>(
  "/booklist/data/recommenders-initial.json",
  fetcher,
);

// quickly load remaining books and recommenders after initial render
const { data: allEssentialBooks } = useSWR<EssentialBook[]>(
  initialBooks ? "/booklist/data/books-essential.json" : null,
  fetcher,
);

const { data: allRecommenders } = useSWR<FormattedRecommender[]>(
  initialRecommenders ? "/booklist/data/recommenders.json" : null,
  fetcher,
);

// load extended details once essential data is available
const { data: extendedData } = useSWR<ExtendedBook[]>(
  initialBooks ? "/booklist/data/books-extended.json" : null,
  fetcher,
);

// merge essential and extended data
const essentialBooks = allEssentialBooks || initialBooks;
const recommenders = allRecommenders || initialRecommenders;

const books = essentialBooks?.map((book) => {
  const extended = extendedData?.find((e) => e.id === book.id);
  return {
    ...book,
    ...(extended || { related_books: [], similar_books: [] }),
  };
});

// skeleton loader until initial data arrives
if (!initialBooks || !initialRecommenders) {
  return <GridSkeleton />;
}

// render the full grid
return <BookList initialBooks={books || []} initialRecommenders={recommenders || []} />;

grid

to render tens of thousands of rows without overloading the main thread, i used fast-grid, an open-source, dom-based table built for large datasets by my friend gabriel. fast-grid works as follows:

  • multi-threaded sort/filter: row ordering and filtering run in a web worker using a SharedArrayBuffer, so the ui thread stays free for rendering.
  • custom virtualization: a small pool of row elements is reused while scrolling, making it possible to “show” millions of rows with only a few dozen in the dom.
  • pre-loaded rows: rows are prepared before they enter the viewport, avoiding visible pop-ins.
  • mobile support: custom touch scrolling bypasses the browser’s 15 million-pixel height limit and delivers a consistent 60 fps on older phones.

search, sort, & filter

the combination of semantic search, column filtering, and dynamic sorting proved to be another challenge. the system had to feel instant despite handling complex state updates and large datasets:

  • semantic search caching: search results (matching row ids from semantic vector searches in supabase) are cached in localStorage. repeated queries pull directly from the cache, preventing unnecessary network requests (sketched after this list).
  • debounced inputs: both search queries and column-filter inputs are debounced (150ms), limiting the number of renders triggered by rapid input.
  • url-synced state: search queries, filter states, and sort parameters are synced to the url after debounced input settles, enabling persistent and shareable views without excessive updates.
  • efficient state management: search results are stored as sets of ids in memory, allowing immediate client-side filtering and sorting without additional data fetching.
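
as a sketch, the debouncing and the semantic-search cache look roughly like this (the cache key format is illustrative, and fetchIds stands in for the supabase vector search):

import { useEffect, useState } from "react";

// debounce a rapidly-changing value (search query or column filter) by 150ms
function useDebouncedValue<T>(value: T, delay = 150): T {
  const [debounced, setDebounced] = useState(value);
  useEffect(() => {
    const id = setTimeout(() => setDebounced(value), delay);
    return () => clearTimeout(id);
  }, [value, delay]);
  return debounced;
}

// cache semantic search results (matching row ids) in localStorage so repeated
// queries skip the network entirely; fetchIds stands in for the supabase vector search
async function cachedSemanticSearch(
  query: string,
  fetchIds: (q: string) => Promise<string[]>
): Promise<Set<string>> {
  const cacheKey = `semantic-search:${query}`;
  const cached = localStorage.getItem(cacheKey);
  if (cached) return new Set<string>(JSON.parse(cached));

  const ids = await fetchIds(query);
  localStorage.setItem(cacheKey, JSON.stringify(ids));
  // keeping results as a set of ids means filtering and sorting stay client-side
  return new Set(ids);
}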

closing thoughts

as the cost and effort of building continue to approach zero, i'm trying to be more intentional with what i build and put out into the world. anyone with a good idea can use tools like v0, windsurf, or cursor to build it. i think this is awesome and have no doubt that we'll benefit greatly from having a more diverse set of creators at the helm. however, we'll likely see a lot more slop.

in many ways, booklist forced me to go back to basics. the interface very likely could have been one-shotted, but most of the real work happened behind the scenes: building and cleaning the dataset, architecting for scale, and making it fast. if the database were half the size, i probably wouldn’t have had to think through any of it. but dealing with scale pushed me to understand the mechanics more deeply and come up with a more thoughtful solution. which, in the end, was my main goal.

you can find the code for the scraper and the site on my github. feel free to clone it to build your own or submit a pr if you have any suggestions!

thanks to ankur, eden, gabriel, ishaan, rishab, shreyas, soham, and the stagehand team for their help and support with this project.